• Open

    Importance of methodological choices in data manipulation for validating epileptic seizure detection models. (arXiv:2302.10672v1 [cs.LG])
    Epilepsy is a chronic neurological disorder that affects a significant portion of the human population and imposes serious risks in the daily life of patients. Despite advances in machine learning and IoT, small, nonstigmatizing wearable devices for continuous monitoring and detection in outpatient environments are not yet available. Part of the reason is the complexity of epilepsy itself, including highly imbalanced data, multimodal nature, and very subject-specific signatures. However, another problem is the heterogeneity of methodological approaches in research, leading to slower progress, difficulty comparing results, and low reproducibility. Therefore, this article identifies a wide range of methodological decisions that must be made and reported when training and evaluating the performance of epilepsy detection systems. We characterize the influence of individual choices using a typical ensemble random-forest model and the publicly available CHB-MIT database, providing a broader picture of each decision and giving good-practice recommendations, based on our experience, where possible.  ( 2 min )
    Diffusion Probabilistic Models for Graph-Structured Prediction. (arXiv:2302.10506v1 [cs.LG])
    This paper studies graph-structured prediction for supervised learning on graphs with node-wise or edge-wise target dependencies. To solve this problem, recent works investigated combining graph neural networks (GNNs) with conventional structured prediction algorithms like conditional random fields. However, in this work, we pursue an alternative direction building on the recent successes of diffusion probabilistic models (DPMs). That is, we propose a new framework using DPMs to make graph-structured predictions. In the fully supervised setting, our DPM captures the target dependencies by iteratively updating each target estimate based on the estimates of nearby targets. We also propose a variational expectation maximization algorithm to train our DPM in the semi-supervised setting. Extensive experiments verify that our framework consistently outperforms existing neural structured prediction models on inductive and transductive node classification. We also demonstrate the competitive performance of our framework for algorithmic reasoning tasks.  ( 2 min )
    Conditioning Hierarchical Reinforcement Learning on Flexible Constraints. (arXiv:2302.10639v1 [cs.AI])
    Safety in goal directed Reinforcement Learning (RL) settings has typically been handled through constraints over trajectories and have demonstrated good performance in primarily short horizon tasks (goal is not too far away). In this paper, we are specifically interested in the problem of solving temporally extended decision making problems such as (1) robots that have to clean different areas in a house while avoiding slippery and unsafe areas (e.g., stairs) and retaining enough charge to move to a charging dock; (2) autonomous electric vehicles that have to reach a far away destination while having to optimize charging locations along the way; in the presence of complex safety constraints. Our key contribution is a (safety) Constrained Planning with Reinforcement Learning (CoP-RL) mechanism that combines a high-level constrained planning agent (which computes a reward maximizing path from a given start to a far away goal state while satisfying cost constraints) with a low-level goal conditioned RL agent (which estimates cost and reward values to move between nearby states). A major advantage of CoP-RL is that it can handle constraints on the cost value distribution (e.g., on Conditional Value at Risk, CVaR, and also on expected value). We perform extensive experiments with different types of safety constraints to demonstrate the utility of our approach over leading best approaches in constrained and hierarchical RL.  ( 2 min )
    CADIS: Handling Cluster-skewed Non-IID Data in Federated Learning with Clustered Aggregation and Knowledge DIStilled Regularization. (arXiv:2302.10413v1 [cs.LG])
    Federated learning enables edge devices to train a global model collaboratively without exposing their data. Despite achieving outstanding advantages in computing efficiency and privacy protection, federated learning faces a significant challenge when dealing with non-IID data, i.e., data generated by clients that are typically not independent and identically distributed. In this paper, we tackle a new type of Non-IID data, called cluster-skewed non-IID, discovered in actual data sets. The cluster-skewed non-IID is a phenomenon in which clients can be grouped into clusters with similar data distributions. By performing an in-depth analysis of the behavior of a classification model's penultimate layer, we introduce a metric that quantifies the similarity between two clients' data distributions without violating their privacy. We then propose an aggregation scheme that guarantees equality between clusters. In addition, we offer a novel local training regularization based on the knowledge-distillation technique that reduces the overfitting problem at clients and dramatically boosts the training scheme's performance. We theoretically prove the superiority of the proposed aggregation over the benchmark FedAvg. Extensive experimental results on both standard public datasets and our in-house real-world dataset demonstrate that the proposed approach improves accuracy by up to 16% compared to the FedAvg algorithm.  ( 2 min )
    LU-Net: Invertible Neural Networks Based on Matrix Factorization. (arXiv:2302.10524v1 [cs.LG])
    LU-Net is a simple and fast architecture for invertible neural networks (INN) that is based on the factorization of quadratic weight matrices $\mathsf{A=LU}$, where $\mathsf{L}$ is a lower triangular matrix with ones on the diagonal and $\mathsf{U}$ an upper triangular matrix. Instead of learning a fully occupied matrix $\mathsf{A}$, we learn $\mathsf{L}$ and $\mathsf{U}$ separately. If combined with an invertible activation function, such layers can easily be inverted whenever the diagonal entries of $\mathsf{U}$ are different from zero. Also, the computation of the determinant of the Jacobian matrix of such layers is cheap. Consequently, the LU architecture allows for cheap computation of the likelihood via the change of variables formula and can be trained according to the maximum likelihood principle. In our numerical experiments, we test the LU-net architecture as generative model on several academic datasets. We also provide a detailed comparison with conventional invertible neural networks in terms of performance, training as well as run time.  ( 2 min )
    Certified Defences Against Adversarial Patch Attacks on Semantic Segmentation. (arXiv:2209.05980v2 [cs.CV] UPDATED)
    Adversarial patch attacks are an emerging security threat for real world deep learning applications. We present Demasked Smoothing, the first approach (up to our knowledge) to certify the robustness of semantic segmentation models against this threat model. Previous work on certifiably defending against patch attacks has mostly focused on image classification task and often required changes in the model architecture and additional training which is undesirable and computationally expensive. In Demasked Smoothing, any segmentation model can be applied without particular training, fine-tuning, or restriction of the architecture. Using different masking strategies, Demasked Smoothing can be applied both for certified detection and certified recovery. In extensive experiments we show that Demasked Smoothing can on average certify 64% of the pixel predictions for a 1% patch in the detection task and 48% against a 0.5% patch for the recovery task on the ADE20K dataset.  ( 2 min )
    Multiagent Inverse Reinforcement Learning via Theory of Mind Reasoning. (arXiv:2302.10238v1 [cs.AI])
    To understand how people interact with each other in collaborative settings, especially in situations where individuals know little about their teammates, Multiagent Inverse Reinforcement Learning (MIRL) aims to infer the reward functions guiding the behavior of each individual given trajectories of a team's behavior during task performance. Unlike current MIRL approaches, team members \emph{are not} assumed to know each other's goals a priori, rather they collaborate by adapting to the goals of others perceived by observing their behavior, all while jointly performing a task. To address this problem, we propose a novel approach to MIRL via Theory of Mind (MIRL-ToM). For each agent, we first use ToM reasoning to estimate a posterior distribution over baseline reward profiles given their demonstrated behavior. We then perform MIRL via decentralized equilibrium by employing single-agent Maximum Entropy IRL to infer a reward function for each agent, where we simulate the behavior of other teammates according to the time-varying distribution over profiles. We evaluate our approach in a simulated 2-player search-and-rescue operation where the goal of the agents, playing different roles, is to search for and evacuate victims in the environment. Results show that the choice of baseline profiles is paramount to the recovery of ground-truth rewards, and MIRL-ToM is able to recover the rewards used by agents interacting with either known and unknown teammates.  ( 2 min )
    Neural Collapse Inspired Attraction-Repulsion-Balanced Loss for Imbalanced Learning. (arXiv:2204.08735v3 [cs.LG] UPDATED)
    Class imbalance distribution widely exists in real-world engineering. However, the mainstream optimization algorithms that seek to minimize error will trap the deep learning model in sub-optimums when facing extreme class imbalance. It seriously harms the classification precision, especially on the minor classes. The essential reason is that the gradients of the classifier weights are imbalanced among the components from different classes. In this paper, we propose Attraction-Repulsion-Balanced Loss (ARB-Loss) to balance the different components of the gradients. We perform experiments on the large-scale classification and segmentation datasets and our ARB-Loss can achieve state-of-the-art performance via only one-stage training instead of 2-stage learning like nowadays SOTA works.  ( 2 min )
    Learning from Label Proportions with Instance-wise Consistency. (arXiv:2203.12836v2 [cs.LG] UPDATED)
    Learning from Label Proportions (LLP) is a weakly supervised learning method that aims to perform instance classification from training data consisting of pairs of bags containing multiple instances and the class label proportions within the bags. Previous studies on multiclass LLP can be divided into two categories according to the learning task: per-instance label classification and per-bag label proportion estimation. However, these methods often results in high variance estimates of the risk when applied to complex models, or lack statistical learning theory arguments. To address this issue, we propose new learning methods based on statistical learning theory for both per-instance and per-bag policies. We demonstrate that the proposed methods are respectively risk-consistent and classifier-consistent in an instance-wise manner, and analyze the estimation error bounds. Additionally, we present a heuristic approximation method that utilizes an existing method for regressing label proportions to reduce the computational complexity of the proposed methods. Through benchmark experiments, we demonstrated the effectiveness of the proposed methods.  ( 2 min )
    Exploring Local Norms in Exp-concave Statistical Learning. (arXiv:2302.10726v1 [cs.LG])
    We consider the problem of stochastic convex optimization with exp-concave losses using Empirical Risk Minimization in a convex class. Answering a question raised in several prior works, we provide a $O( d / n + \log( 1 / \delta) / n )$ excess risk bound valid for a wide class of bounded exp-concave losses, where $d$ is the dimension of the convex reference set, $n$ is the sample size, and $\delta$ is the confidence level. Our result is based on a unified geometric assumption on the gradient of losses and the notion of local norms.
    Valid Inference for Machine Learning Model Parameters. (arXiv:2302.10840v1 [stat.ML])
    The parameters of a machine learning model are typically learned by minimizing a loss function on a set of training data. However, this can come with the risk of overtraining; in order for the model to generalize well, it is of great importance that we are able to find the optimal parameter for the model on the entire population -- not only on the given training sample. In this paper, we construct valid confidence sets for this optimal parameter of a machine learning model, which can be generated using only the training data without any knowledge of the population. We then show that studying the distribution of this confidence set allows us to assign a notion of confidence to arbitrary regions of the parameter space, and we demonstrate that this distribution can be well-approximated using bootstrapping techniques.
    SF2Former: Amyotrophic Lateral Sclerosis Identification From Multi-center MRI Data Using Spatial and Frequency Fusion Transformer. (arXiv:2302.10859v1 [eess.IV])
    Amyotrophic Lateral Sclerosis (ALS) is a complex neurodegenerative disorder involving motor neuron degeneration. Significant research has begun to establish brain magnetic resonance imaging (MRI) as a potential biomarker to diagnose and monitor the state of the disease. Deep learning has turned into a prominent class of machine learning programs in computer vision and has been successfully employed to solve diverse medical image analysis tasks. However, deep learning-based methods applied to neuroimaging have not achieved superior performance in ALS patients classification from healthy controls due to having insignificant structural changes correlated with pathological features. Therefore, the critical challenge in deep models is to determine useful discriminative features with limited training data. By exploiting the long-range relationship of image features, this study introduces a framework named SF2Former that leverages vision transformer architecture's power to distinguish the ALS subjects from the control group. To further improve the network's performance, spatial and frequency domain information are combined because MRI scans are captured in the frequency domain before being converted to the spatial domain. The proposed framework is trained with a set of consecutive coronal 2D slices, which uses the pre-trained weights on ImageNet by leveraging transfer learning. Finally, a majority voting scheme has been employed to those coronal slices of a particular subject to produce the final classification decision. Our proposed architecture has been thoroughly assessed with multi-modal neuroimaging data using two well-organized versions of the Canadian ALS Neuroimaging Consortium (CALSNIC) multi-center datasets. The experimental results demonstrate the superiority of our proposed strategy in terms of classification accuracy compared with several popular deep learning-based techniques.
    KG-ECO: Knowledge Graph Enhanced Entity Correction for Query Rewriting. (arXiv:2302.10454v1 [cs.CL])
    Query Rewriting (QR) plays a critical role in large-scale dialogue systems for reducing frictions. When there is an entity error, it imposes extra challenges for a dialogue system to produce satisfactory responses. In this work, we propose KG-ECO: Knowledge Graph enhanced Entity COrrection for query rewriting, an entity correction system with corrupt entity span detection and entity retrieval/re-ranking functionalities.To boost the model performance, we incorporate Knowledge Graph (KG) to provide entity structural information (neighboring entities encoded by graph neural networks) and textual information (KG entity descriptions encoded by RoBERTa). Experimental results show that our approach yields a clear performance gain over two baselines: utterance level QR and entity correction without utilizing KG information. The proposed system is particularly effective for few-shot learning cases where target entities are rarely seen in training or there is a KG relation between the target entity and other contextual entities in the query.
    On the Behaviour of Pulsed Qubits and their Application to Feed Forward Networks. (arXiv:2302.10467v1 [quant-ph])
    In the last two decades, the combination of machine learning and quantum computing has been an ever-growing topic of interest but, to this date, the limitations of quantum computing hardware have somewhat restricted the use of complex multi-qubit operations for machine learning. In this paper, we capitalize on the cyclical nature of quantum state probabilities observed on pulsed qubits to propose a single-qubit feed forward block whose architecture allows for classical parameters to be used in a way similar to classical neural networks. To do this, we modulate the pulses exciting qubits to induce superimposed rotations around the Bloch Sphere. The approach presented here has the advantage of employing a single qubit per block. Thus, it is linear with respect to the number of blocks, not polynomial with respect to the number of neurons as opposed to the majority of methods elsewhere. Further, since it employs classical parameters, a large number of iterations and updates at training can be effected without dwelling on coherence times and the gradients can be reused and stored if necessary. We also show how an analogy can be drawn to neural networks using sine-squared activation functions and illustrate how the feed-forward block presented here may be used and implemented on pulse-enabled quantum computers.
    Diffusion Models and Semi-Supervised Learners Benefit Mutually with Few Labels. (arXiv:2302.10586v1 [cs.CV])
    We propose a three-stage training strategy called dual pseudo training (DPT) for conditional image generation and classification in semi-supervised learning. First, a classifier is trained on partially labeled data and predicts pseudo labels for all data. Second, a conditional generative model is trained on all data with pseudo labels and generates pseudo images given labels. Finally, the classifier is trained on real data augmented by pseudo images with labels. We demonstrate large-scale diffusion models and semi-supervised learners benefit mutually with a few labels via DPT. In particular, on the ImageNet 256x256 generation benchmark, DPT can generate realistic, diverse, and semantically correct images with very few labels. With two (i.e., < 0.2%) and five (i.e., < 0.4%) labels per class, DPT achieves an FID of 3.44 and 3.37 respectively, outperforming strong diffusion models with full labels, such as IDDPM, CDM, ADM, and LDM. Besides, DPT outperforms competitive semi-supervised baselines substantially on ImageNet classification benchmarks with one, two, and five labels per class, achieving state-of-the-art top-1 accuracies of 59.0 (+2.8), 69.5 (+3.0), and 73.6 (+1.2) respectively.
    Deep Reinforcement Learning for Robotic Pushing and Picking in Cluttered Environment. (arXiv:2302.10717v1 [cs.RO])
    In this paper, a novel robotic grasping system is established to automatically pick up objects in cluttered scenes. A composite robotic hand composed of a suction cup and a gripper is designed for grasping the object stably. The suction cup is used for lifting the object from the clutter first and the gripper for grasping the object accordingly. We utilize the affordance map to provide pixel-wise lifting point candidates for the suction cup. To obtain a good affordance map, the active exploration mechanism is introduced to the system. An effective metric is designed to calculate the reward for the current affordance map, and a deep Q-Network (DQN) is employed to guide the robotic hand to actively explore the environment until the generated affordance map is suitable for grasping. Experimental results have demonstrated that the proposed robotic grasping system is able to greatly increase the success rate of the robotic grasping in cluttered scenes.
    Improving Pareto Front Learning via Multi-Sample Hypernetworks. (arXiv:2212.01130v2 [cs.LG] UPDATED)
    Pareto Front Learning (PFL) was recently introduced as an effective approach to obtain a mapping function from a given trade-off vector to a solution on the Pareto front, which solves the multi-objective optimization (MOO) problem. Due to the inherent trade-off between conflicting objectives, PFL offers a flexible approach in many scenarios in which the decision makers can not specify the preference of one Pareto solution over another, and must switch between them depending on the situation. However, existing PFL methods ignore the relationship between the solutions during the optimization process, which hinders the quality of the obtained front. To overcome this issue, we propose a novel PFL framework namely PHN-HVI, which employs a hypernetwork to generate multiple solutions from a set of diverse trade-off preferences and enhance the quality of the Pareto front by maximizing the Hypervolume indicator defined by these solutions. The experimental results on several MOO machine learning tasks show that the proposed framework significantly outperforms the baselines in producing the trade-off Pareto front.  ( 2 min )
    A Statistically-Based Approach to Feedforward Neural Network Model Selection. (arXiv:2207.04248v4 [stat.ME] UPDATED)
    Feedforward neural networks (FNNs) can be viewed as non-linear regression models, where covariates enter the model through a combination of weighted summations and non-linear functions. Although these models have some similarities to the models typically used in statistical modelling, the majority of neural network research has been conducted outside of the field of statistics. This has resulted in a lack of statistically-based methodology, and, in particular, there has been little emphasis on model parsimony. Determining the input layer structure is analogous to variable selection, while the structure for the hidden layer relates to model complexity. In practice, neural network model selection is often carried out by comparing models using out-of-sample performance. However, in contrast, the construction of an associated likelihood function opens the door to information-criteria-based variable and architecture selection. A novel model selection method, which performs both input- and hidden-node selection, is proposed using the Bayesian information criterion (BIC) for FNNs. The choice of BIC over out-of-sample performance as the model selection objective function leads to an increased probability of recovering the true model, while parsimoniously achieving favourable out-of-sample performance. Simulation studies are used to evaluate and justify the proposed method, and applications on real data are investigated.  ( 2 min )
    NeuralStagger: accelerating physics-constrained neural PDE solver with spatial-temporal decomposition. (arXiv:2302.10255v1 [cs.LG])
    Neural networks have shown great potential in accelerating the solution of partial differential equations (PDEs). Recently, there has been a growing interest in introducing physics constraints into training neural PDE solvers to reduce the use of costly data and improve the generalization ability. However, these physics constraints, based on certain finite dimensional approximations over the function space, must resolve the smallest scaled physics to ensure the accuracy and stability of the simulation, resulting in high computational costs from large input, output, and neural networks. This paper proposes a general acceleration methodology called NeuralStagger by spatially and temporally decomposing the original learning tasks into several coarser-resolution subtasks. We define a coarse-resolution neural solver for each subtask, which requires fewer computational resources, and jointly train them with the vanilla physics-constrained loss by simply arranging their outputs to reconstruct the original solution. Due to the perfect parallelism between them, the solution is achieved as fast as a coarse-resolution neural solver. In addition, the trained solvers bring the flexibility of simulating with multiple levels of resolution. We demonstrate the successful application of NeuralStagger on 2D and 3D fluid dynamics simulations, which leads to an additional $10\sim100\times$ speed-up. Moreover, the experiment also shows that the learned model could be well used for optimal control.
    Kernel-Based Distributed Q-Learning: A Scalable Reinforcement Learning Approach for Dynamic Treatment Regimes. (arXiv:2302.10434v1 [cs.LG])
    In recent years, large amounts of electronic health records (EHRs) concerning chronic diseases, such as cancer, diabetes, and mental disease, have been collected to facilitate medical diagnosis. Modeling the dynamic properties of EHRs related to chronic diseases can be efficiently done using dynamic treatment regimes (DTRs), which are a set of sequential decision rules. While Reinforcement learning (RL) is a widely used method for creating DTRs, there is ongoing research in developing RL algorithms that can effectively handle large amounts of data. In this paper, we present a novel approach, a distributed Q-learning algorithm, for generating DTRs. The novelties of our research are as follows: 1) From a methodological perspective, we present a novel and scalable approach for generating DTRs by combining distributed learning with Q-learning. The proposed approach is specifically designed to handle large amounts of data and effectively generate DTRs. 2) From a theoretical standpoint, we provide generalization error bounds for the proposed distributed Q-learning algorithm, which are derived within the framework of statistical learning theory. These bounds quantify the relationships between sample size, prediction accuracy, and computational burden, providing insights into the performance of the algorithm. 3) From an applied perspective, we demonstrate the effectiveness of our proposed distributed Q-learning algorithm for DTRs by applying it to clinical cancer treatments. The results show that our algorithm outperforms both traditional linear Q-learning and commonly used deep Q-learning in terms of both prediction accuracy and computation cost.
    Deep Generative Neural Embeddings for High Dimensional Data Visualization. (arXiv:2302.10801v1 [cs.LG])
    We propose a visualization technique that utilizes neural network embeddings and a generative network to reconstruct original data. This method allows for independent manipulation of individual image embeddings through its non-parametric structure, providing more flexibility than traditional autoencoder approaches. We have evaluated the effectiveness of this technique in data visualization and compared it to t-SNE and VAE methods. Furthermore, we have demonstrated the scalability of our method through visualizations on the ImageNet dataset. Our technique has potential applications in human-in-the-loop training, as it allows for independent editing of embedding locations without affecting the optimization process.
    Learning Gradually Non-convex Image Priors Using Score Matching. (arXiv:2302.10502v1 [cs.LG])
    In this paper, we propose a unified framework of denoising score-based models in the context of graduated non-convex energy minimization. We show that for sufficiently large noise variance, the associated negative log density -- the energy -- becomes convex. Consequently, denoising score-based models essentially follow a graduated non-convexity heuristic. We apply this framework to learning generalized Fields of Experts image priors that approximate the joint density of noisy images and their associated variances. These priors can be easily incorporated into existing optimization algorithms for solving inverse problems and naturally implement a fast and robust graduated non-convexity mechanism.
    $\omega$PAP Spaces: Reasoning Denotationally About Higher-Order, Recursive Probabilistic and Differentiable Programs. (arXiv:2302.10636v1 [cs.PL])
    We introduce a new setting, the category of $\omega$PAP spaces, for reasoning denotationally about expressive differentiable and probabilistic programming languages. Our semantics is general enough to assign meanings to most practical probabilistic and differentiable programs, including those that use general recursion, higher-order functions, discontinuous primitives, and both discrete and continuous sampling. But crucially, it is also specific enough to exclude many pathological denotations, enabling us to establish new results about both deterministic differentiable programs and probabilistic programs. In the deterministic setting, we prove very general correctness theorems for automatic differentiation and its use within gradient descent. In the probabilistic setting, we establish the almost-everywhere differentiability of probabilistic programs' trace density functions, and the existence of convenient base measures for density computation in Monte Carlo inference. In some cases these results were previously known, but required detailed proofs with an operational flavor; by contrast, all our proofs work directly with programs' denotations.
    RealFusion: 360{\deg} Reconstruction of Any Object from a Single Image. (arXiv:2302.10663v1 [cs.CV])
    We consider the problem of reconstructing a full 360{\deg} photographic model of an object from a single image of it. We do so by fitting a neural radiance field to the image, but find this problem to be severely ill-posed. We thus take an off-the-self conditional image generator based on diffusion and engineer a prompt that encourages it to ``dream up'' novel views of the object. Using an approach inspired by DreamFields and DreamFusion, we fuse the given input view, the conditional prior, and other regularizers in a final, consistent reconstruction. We demonstrate state-of-the-art reconstruction results on benchmark images when compared to prior methods for monocular 3D reconstruction of objects. Qualitatively, our reconstructions provide a faithful match of the input view and a plausible extrapolation of its appearance and 3D shape, including to the side of the object not visible in the image.
    Contrastive Learning and the Emergence of Attributes Associations. (arXiv:2302.10763v1 [cs.CV])
    In response to an object presentation, supervised learning schemes generally respond with a parsimonious label. Upon a similar presentation we humans respond again with a label, but are flooded, in addition, by a myriad of associations. A significant portion of these consist of the presented object attributes. Contrastive learning is a semi-supervised learning scheme based on the application of identity preserving transformations on the object input representations. It is conjectured in this work that these same applied transformations preserve, in addition to the identity of the presented object, also the identity of its semantically meaningful attributes. The corollary of this is that the output representations of such a contrastive learning scheme contain valuable information not only for the classification of the presented object, but also for the presence or absence decision of any attribute of interest. Simulation results which demonstrate this idea and the feasibility of this conjecture are presented.
    Higher-order Sparse Convolutions in Graph Neural Networks. (arXiv:2302.10505v1 [cs.LG])
    Graph Neural Networks (GNNs) have been applied to many problems in computer sciences. Capturing higher-order relationships between nodes is crucial to increase the expressive power of GNNs. However, existing methods to capture these relationships could be infeasible for large-scale graphs. In this work, we introduce a new higher-order sparse convolution based on the Sobolev norm of graph signals. Our Sparse Sobolev GNN (S-SobGNN) computes a cascade of filters on each layer with increasing Hadamard powers to get a more diverse set of functions, and then a linear combination layer weights the embeddings of each filter. We evaluate S-SobGNN in several applications of semi-supervised learning. S-SobGNN shows competitive performance in all applications as compared to several state-of-the-art methods.
    SurvLIMEpy: A Python package implementing SurvLIME. (arXiv:2302.10571v1 [stat.ML])
    In this paper we present SurvLIMEpy, an open-source Python package that implements the SurvLIME algorithm. This method allows to compute local feature importance for machine learning algorithms designed for modelling Survival Analysis data. Our implementation takes advantage of the parallelisation paradigm as all computations are performed in a matrix-wise fashion which speeds up execution time. Additionally, SurvLIMEpy assists the user with visualization tools to better understand the result of the algorithm. The package supports a wide variety of survival models, from the Cox Proportional Hazards Model to deep learning models such as DeepHit or DeepSurv. Two types of experiments are presented in this paper. First, by means of simulated data, we study the ability of the algorithm to capture the importance of the features. Second, we use three open source survival datasets together with a set of survival algorithms in order to demonstrate how SurvLIMEpy behaves when applied to different models.
    UAV Path Planning Employing MPC- Reinforcement Learning Method for search and rescue mission. (arXiv:2302.10669v1 [cs.LG])
    In this paper, we tackle the problem of Unmanned Aerial (UA V) path planning in complex and uncertain environments by designing a Model Predictive Control (MPC), based on a Long-Short-Term Memory (LSTM) network integrated into the Deep Deterministic Policy Gradient algorithm. In the proposed solution, LSTM-MPC operates as a deterministic policy within the DDPG network, and it leverages a predicting pool to store predicted future states and actions for improved robustness and efficiency. The use of the predicting pool also enables the initialization of the critic network, leading to improved convergence speed and reduced failure rate compared to traditional reinforcement learning and deep reinforcement learning methods. The effectiveness of the proposed solution is evaluated by numerical simulations.
    ChatGPT: Jack of all trades, master of none. (arXiv:2302.10724v1 [cs.CL])
    OpenAI has released the Chat Generative Pre-trained Transformer (ChatGPT) and revolutionized the approach in artificial intelligence to human-model interaction. The first contact with the chatbot reveals its ability to provide detailed and precise answers in various areas. There are several publications on ChatGPT evaluation, testing its effectiveness on well-known natural language processing (NLP) tasks. However, the existing studies are mostly non-automated and tested on a very limited scale. In this work, we examined ChatGPT's capabilities on 25 diverse analytical NLP tasks, most of them subjective even to humans, such as sentiment analysis, emotion recognition, offensiveness and stance detection, natural language inference, word sense disambiguation, linguistic acceptability and question answering. We automated ChatGPT's querying process and analyzed more than 38k responses. Our comparison of its results with available State-of-the-Art (SOTA) solutions showed that the average loss in quality of the ChatGPT model was about 25% for zero-shot and few-shot evaluation. We showed that the more difficult the task (lower SOTA performance), the higher the ChatGPT loss. It especially refers to pragmatic NLP problems like emotion recognition. We also tested the ability of personalizing ChatGPT responses for selected subjective tasks via Random Contextual Few-Shot Personalization, and we obtained significantly better user-based predictions. Additional qualitative analysis revealed a ChatGPT bias, most likely due to the rules imposed on human trainers by OpenAI. Our results provide the basis for a fundamental discussion of whether the high quality of recent predictive NLP models can indicate a tool's usefulness to society and how the learning and validation procedures for such systems should be established.
    Differentiable Multi-Target Causal Bayesian Experimental Design. (arXiv:2302.10607v1 [cs.LG])
    We introduce a gradient-based approach for the problem of Bayesian optimal experimental design to learn causal models in a batch setting -- a critical component for causal discovery from finite data where interventions can be costly or risky. Existing methods rely on greedy approximations to construct a batch of experiments while using black-box methods to optimize over a single target-state pair to intervene with. In this work, we completely dispose of the black-box optimization techniques and greedy heuristics and instead propose a conceptually simple end-to-end gradient-based optimization procedure to acquire a set of optimal intervention target-state pairs. Such a procedure enables parameterization of the design space to efficiently optimize over a batch of multi-target-state interventions, a setting which has hitherto not been explored due to its complexity. We demonstrate that our proposed method outperforms baselines and existing acquisition strategies in both single-target and multi-target settings across a number of synthetic datasets.
    Density Ratio Estimation and Neyman Pearson Classification with Missing Data. (arXiv:2302.10655v1 [stat.ML])
    Density Ratio Estimation (DRE) is an important machine learning technique with many downstream applications. We consider the challenge of DRE with missing not at random (MNAR) data. In this setting, we show that using standard DRE methods leads to biased results while our proposal (M-KLIEP), an adaptation of the popular DRE procedure KLIEP, restores consistency. Moreover, we provide finite sample estimation error bounds for M-KLIEP, which demonstrate minimax optimality with respect to both sample size and worst-case missingness. We then adapt an important downstream application of DRE, Neyman-Pearson (NP) classification, to this MNAR setting. Our procedure both controls Type I error and achieves high power, with high probability. Finally, we demonstrate promising empirical performance both synthetic data and real-world data with simulated missingness.
    DrasCLR: A Self-supervised Framework of Learning Disease-related and Anatomy-specific Representation for 3D Medical Images. (arXiv:2302.10390v1 [cs.CV])
    Large-scale volumetric medical images with annotation are rare, costly, and time prohibitive to acquire. Self-supervised learning (SSL) offers a promising pre-training and feature extraction solution for many downstream tasks, as it only uses unlabeled data. Recently, SSL methods based on instance discrimination have gained popularity in the medical imaging domain. However, SSL pre-trained encoders may use many clues in the image to discriminate an instance that are not necessarily disease-related. Moreover, pathological patterns are often subtle and heterogeneous, requiring the ability of the desired method to represent anatomy-specific features that are sensitive to abnormal changes in different body parts. In this work, we present a novel SSL framework, named DrasCLR, for 3D medical imaging to overcome these challenges. We propose two domain-specific contrastive learning strategies: one aims to capture subtle disease patterns inside a local anatomical region, and the other aims to represent severe disease patterns that span larger regions. We formulate the encoder using conditional hyper-parameterized network, in which the parameters are dependant on the anatomical location, to extract anatomically sensitive features. Extensive experiments on large-scale computer tomography (CT) datasets of lung images show that our method improves the performance of many downstream prediction and segmentation tasks. The patient-level representation improves the performance of the patient survival prediction task. We show how our method can detect emphysema subtypes via dense prediction. We demonstrate that fine-tuning the pre-trained model can significantly reduce annotation efforts without sacrificing emphysema detection accuracy. Our ablation study highlights the importance of incorporating anatomical context into the SSL framework.
    Creating Disasters: Recession Forecasting with GAN-Generated Synthetic Time Series Data. (arXiv:2302.10490v1 [cs.LG])
    A common problem when forecasting rare events, such as recessions, is limited data availability. Recent advancements in deep learning and generative adversarial networks (GANs) make it possible to produce high-fidelity synthetic data in large quantities. This paper uses a model called DoppelGANger, a GAN tailored to producing synthetic time series data, to generate synthetic Treasury yield time series and associated recession indicators. It is then shown that short-range forecasting performance for Treasury yields is improved for models trained on synthetic data relative to models trained only on real data. Finally, synthetic recession conditions are produced and used to train classification models to predict the probability of a future recession. It is shown that training models on synthetic recessions can improve a model's ability to predict future recessions over a model trained only on real data.
    Reentry Risk and Safety Assessment of Spacecraft Debris Based on Machine Learning. (arXiv:2302.10530v1 [cs.LG])
    Uncontrolled spacecraft will disintegrate and generate a large amount of debris in the reentry process, and ablative debris may cause potential risks to the safety of human life and property on the ground. Therefore, predicting the landing points of spacecraft debris and forecasting the degree of risk of debris to human life and property is very important. In view that it is difficult to predict the process of reentry process and the reentry point in advance, and the debris generated from reentry disintegration may cause ground damage for the uncontrolled space vehicle on expiration of service. In this paper, we adopt the object-oriented approach to consider the spacecraft and its disintegrated components as consisting of simple basic geometric models, and introduce three machine learning models: the support vector regression (SVR), decision tree regression (DTR) and multilayer perceptron (MLP) to predict the velocity, longitude and latitude of spacecraft debris landing points for the first time. Then, we compare the prediction accuracy of the three models. Furthermore, we define the reentry risk and the degree of danger, and we calculate the risk level for each spacecraft debris and make warnings accordingly. The experimental results show that the proposed method can obtain high accuracy prediction results in at least 15 seconds and make safety level warning more real-time.
    Tree-Based Machine Learning Methods For Vehicle Insurance Claims Size Prediction. (arXiv:2302.10612v1 [cs.LG])
    Vehicle insurance claims size prediction needs methods to efficiently handle these claims. Machine learning (ML) is one of the methods that solve this problem. Tree-based ensemble learning algorithms are highly effective and widely used ML methods. This study considers how vehicle insurance providers incorporate ML methods in their companies and explores how the models can be applied to insurance big data. We utilize various tree-based ML methods, such as bagging, random forest, and gradient boosting, to determine the relative importance of predictors in predicting claims size and to explore the relationships between claims size and predictors. Furthermore, we evaluate and compare these models' performances. The results show that tree-based ensemble methods are better than the classical least square method. Keywords: claims size prediction; machine learning; tree-based ensemble methods; vehicle insurance.
    Directive Explanations for Monitoring the Risk of Diabetes Onset: Introducing Directive Data-Centric Explanations and Combinations to Support What-If Explorations. (arXiv:2302.10671v1 [cs.HC])
    Explainable artificial intelligence is increasingly used in machine learning (ML) based decision-making systems in healthcare. However, little research has compared the utility of different explanation methods in guiding healthcare experts for patient care. Moreover, it is unclear how useful, understandable, actionable and trustworthy these methods are for healthcare experts, as they often require technical ML knowledge. This paper presents an explanation dashboard that predicts the risk of diabetes onset and explains those predictions with data-centric, feature-importance, and example-based explanations. We designed an interactive dashboard to assist healthcare experts, such as nurses and physicians, in monitoring the risk of diabetes onset and recommending measures to minimize risk. We conducted a qualitative study with 11 healthcare experts and a mixed-methods study with 45 healthcare experts and 51 diabetic patients to compare the different explanation methods in our dashboard in terms of understandability, usefulness, actionability, and trust. Results indicate that our participants preferred our representation of data-centric explanations that provide local explanations with a global overview over other methods. Therefore, this paper highlights the importance of visually directive data-centric explanation method for assisting healthcare experts to gain actionable insights from patient health records. Furthermore, we share our design implications for tailoring the visual representation of different explanation methods for healthcare experts.
    Binding-and-folding recognition of an intrinsically disordered protein using online learning molecular dynamics. (arXiv:2302.10348v1 [q-bio.BM])
    Intrinsically disordered proteins participate in many biological processes by folding upon binding with other proteins. However, coupled folding and binding processes are not well understood from an atomistic point of view. One of the main questions is whether folding occurs prior to or after binding. Here we use a novel unbiased high-throughput adaptive sampling approach to reconstruct the binding and folding between the disordered transactivation domain of \mbox{c-Myb} and the KIX domain of the CREB-binding protein. The reconstructed long-term dynamical process highlights the binding of a short stretch of amino acids on \mbox{c-Myb} as a folded $\alpha$-helix. Leucine residues, specially Leu298 to Leu302, establish initial native contacts that prime the binding and folding of the rest of the peptide, with a mixture of conformational selection on the N-terminal region with an induced fit of the C-terminal.
    FedSpeed: Larger Local Interval, Less Communication Round, and Higher Generalization Accuracy. (arXiv:2302.10429v1 [cs.LG])
    Federated learning is an emerging distributed machine learning framework which jointly trains a global model via a large number of local devices with data privacy protections. Its performance suffers from the non-vanishing biases introduced by the local inconsistent optimal and the rugged client-drifts by the local over-fitting. In this paper, we propose a novel and practical method, FedSpeed, to alleviate the negative impacts posed by these problems. Concretely, FedSpeed applies the prox-correction term on the current local updates to efficiently reduce the biases introduced by the prox-term, a necessary regularizer to maintain the strong local consistency. Furthermore, FedSpeed merges the vanilla stochastic gradient with a perturbation computed from an extra gradient ascent step in the neighborhood, thereby alleviating the issue of local over-fitting. Our theoretical analysis indicates that the convergence rate is related to both the communication rounds $T$ and local intervals $K$ with a upper bound $\small \mathcal{O}(1/T)$ if setting a proper local interval. Moreover, we conduct extensive experiments on the real-world dataset to demonstrate the efficiency of our proposed FedSpeed, which performs significantly faster and achieves the state-of-the-art (SOTA) performance on the general FL experimental settings than several baselines including FedAvg, FedProx, FedCM, FedAdam, SCAFFOLD, FedDyn, FedADMM, etc.
    Generalization Bounds for Adversarial Contrastive Learning. (arXiv:2302.10633v1 [cs.LG])
    Deep networks are well-known to be fragile to adversarial attacks, and adversarial training is one of the most popular methods used to train a robust model. To take advantage of unlabeled data, recent works have applied adversarial training to contrastive learning (Adversarial Contrastive Learning; ACL for short) and obtain promising robust performance. However, the theory of ACL is not well understood. To fill this gap, we leverage the Rademacher complexity to analyze the generalization performance of ACL, with a particular focus on linear models and multi-layer neural networks under $\ell_p$ attack ($p \ge 1$). Our theory shows that the average adversarial risk of the downstream tasks can be upper bounded by the adversarial unsupervised risk of the upstream task. The experimental results validate our theory.
    Unpaired Translation from Semantic Label Maps to Images by Leveraging Domain-Specific Simulations. (arXiv:2302.10698v1 [cs.CV])
    Photorealistic image generation from simulated label maps are necessitated in several contexts, such as for medical training in virtual reality. With conventional deep learning methods, this task requires images that are paired with semantic annotations, which typically are unavailable. We introduce a contrastive learning framework for generating photorealistic images from simulated label maps, by learning from unpaired sets of both. Due to potentially large scene differences between real images and label maps, existing unpaired image translation methods lead to artifacts of scene modification in synthesized images. We utilize simulated images as surrogate targets for a contrastive loss, while ensuring consistency by utilizing features from a reverse translation network. Our method enables bidirectional label-image translations, which is demonstrated in a variety of scenarios and datasets, including laparoscopy, ultrasound, and driving scenes. By comparing with state-of-the-art unpaired translation methods, our proposed method is shown to generate realistic and scene-accurate translations.
    Variational Autoencoding Neural Operators. (arXiv:2302.10351v1 [cs.LG])
    Unsupervised learning with functional data is an emerging paradigm of machine learning research with applications to computer vision, climate modeling and physical systems. A natural way of modeling functional data is by learning operators between infinite dimensional spaces, leading to discretization invariant representations that scale independently of the sample grid resolution. Here we present Variational Autoencoding Neural Operators (VANO), a general strategy for making a large class of operator learning architectures act as variational autoencoders. For this purpose, we provide a novel rigorous mathematical formulation of the variational objective in function spaces for training. VANO first maps an input function to a distribution over a latent space using a parametric encoder and then decodes a sample from the latent distribution to reconstruct the input, as in classic variational autoencoders. We test VANO with different model set-ups and architecture choices for a variety of benchmarks. We start from a simple Gaussian random field where we can analytically track what the model learns and progressively transition to more challenging benchmarks including modeling phase separation in Cahn-Hilliard systems and real world satellite data for measuring Earth surface deformation.
    FedST: Federated Shapelet Transformation for Interpretable Time Series Classification. (arXiv:2302.10631v1 [cs.LG])
    This paper studies how to develop accurate and interpretable time series classification (TSC) models with the help of external data in a privacy-preserving federated learning (FL) scenario. To the best of our knowledge, we are the first to study on this essential topic. Achieving this goal requires us to seamlessly integrate the techniques from multiple fields including Data Mining, Machine Learning, and Security. In this paper, we formulate the problem and identify the interpretability constraints under the FL setting. We systematically investigate existing TSC solutions for the centralized scenario and propose FedST, a novel FL-enabled TSC framework based on a shapelet transformation method. We recognize the federated shapelet search step as the kernel of FedST. Thus, we design FedSS-B, a basic protocol for the FedST kernel that we prove to be secure and accurate. Further, we identify the efficiency bottlenecks of the basic protocol and propose optimizations tailored for the FL setting for acceleration. Our theoretical analysis shows that the proposed optimizations are secure and more efficient. We conduct extensive experiments using both synthetic and real-world datasets. Empirical results show that our FedST solution is effective in terms of TSC accuracy, and the proposed optimizations can achieve three orders of magnitude of speedup.
    $PC^2$: Projection-Conditioned Point Cloud Diffusion for Single-Image 3D Reconstruction. (arXiv:2302.10668v1 [cs.CV])
    Reconstructing the 3D shape of an object from a single RGB image is a long-standing and highly challenging problem in computer vision. In this paper, we propose a novel method for single-image 3D reconstruction which generates a sparse point cloud via a conditional denoising diffusion process. Our method takes as input a single RGB image along with its camera pose and gradually denoises a set of 3D points, whose positions are initially sampled randomly from a three-dimensional Gaussian distribution, into the shape of an object. The key to our method is a geometrically-consistent conditioning process which we call projection conditioning: at each step in the diffusion process, we project local image features onto the partially-denoised point cloud from the given camera pose. This projection conditioning process enables us to generate high-resolution sparse geometries that are well-aligned with the input image, and can additionally be used to predict point colors after shape reconstruction. Moreover, due to the probabilistic nature of the diffusion process, our method is naturally capable of generating multiple different shapes consistent with a single input image. In contrast to prior work, our approach not only performs well on synthetic benchmarks, but also gives large qualitative improvements on complex real-world data.
    Data-driven prognostics based on time-frequency analysis and symbolic recurrent neural network for fuel cells under dynamic load. (arXiv:2302.10771v1 [cs.LG])
    Data-centric prognostics is beneficial to improve the reliability and safety of proton exchange membrane fuel cell (PEMFC). For the prognostics of PEMFC operating under dynamic load, the challenges come from extracting degradation features, improving prediction accuracy, expanding the prognostics horizon, and reducing computational cost. To address these issues, this work proposes a data-driven PEMFC prognostics approach, in which Hilbert-Huang transform is used to extract health indicator in dynamic operating conditions and symbolic-based gated recurrent unit model is used to enhance the accuracy of life prediction. Comparing with other state-of-the-art methods, the proposed data-driven prognostics approach provides a competitive prognostics horizon with lower computational cost. The prognostics performance shows consistency and generalizability under different failure threshold settings.
    SolidGen: An Autoregressive Model for Direct B-rep Synthesis. (arXiv:2203.13944v2 [cs.LG] UPDATED)
    The Boundary representation (B-rep) format is the de-facto shape representation in computer-aided design (CAD) to model solid and sheet objects. Recent approaches to generating CAD models have focused on learning sketch-and-extrude modeling sequences that are executed by a solid modeling kernel in postprocess to recover a B-rep. In this paper we present a new approach that enables learning from and synthesizing B-reps without the need for supervision through CAD modeling sequence data. Our method SolidGen, is an autoregressive neural network that models the B-rep directly by predicting the vertices, edges, and faces using Transformer-based and pointer neural networks. Key to achieving this is our Indexed Boundary Representation that references B-rep vertices, edges and faces in a well-defined hierarchy to capture the geometric and topological relations suitable for use with machine learning. SolidGen can be easily conditioned on contexts e.g., class labels, images, and voxels thanks to its probabilistic modeling of the B-rep distribution. We demonstrate qualitatively, quantitatively, and through perceptual evaluation by human subjects that SolidGen can produce high quality, realistic CAD models.  ( 2 min )
    Deep Reinforcement Learning Based on Local GNN for Goal-conditioned Deformable Object Rearranging. (arXiv:2302.10446v1 [cs.RO])
    Object rearranging is one of the most common deformable manipulation tasks, where the robot needs to rearrange a deformable object into a goal configuration. Previous studies focus on designing an expert system for each specific task by model-based or data-driven approaches and the application scenarios are therefore limited. Some research has been attempting to design a general framework to obtain more advanced manipulation capabilities for deformable rearranging tasks, with lots of progress achieved in simulation. However, transferring from simulation to reality is difficult due to the limitation of the end-to-end CNN architecture. To address these challenges, we design a local GNN (Graph Neural Network) based learning method, which utilizes two representation graphs to encode keypoints detected from images. Self-attention is applied for graph updating and cross-attention is applied for generating manipulation actions. Extensive experiments have been conducted to demonstrate that our framework is effective in multiple 1-D (rope, rope ring) and 2-D (cloth) rearranging tasks in simulation and can be easily transferred to a real robot by fine-tuning a keypoint detector.  ( 2 min )
    FrankenSplit: Saliency Guided Neural Feature Compression with Shallow Variational Bottleneck Injection. (arXiv:2302.10681v1 [eess.IV])
    Lightweight neural networks exchange fast inference for predictive strength. Conversely, large deep neural networks have low prediction error but incur prolonged inference times and high energy consumption on resource-constrained devices. This trade-off is unacceptable for latency-sensitive and performance-critical applications. Offloading inference tasks to a server is unsatisfactory due to the inevitable network congestion by high-dimensional data competing for limited bandwidth and leaving valuable client-side resources idle. This work demonstrates why existing methods cannot adequately address the need for high-performance inference in mobile edge computing. Then, we show how to overcome current limitations by introducing a novel training method to reduce bandwidth consumption in Machine-to-Machine communication and a generalizable design heuristic for resource-conscious compression models. We extensively evaluate our proposed method against a wide range of baselines for latency and compressive strength in an environment with asymmetric resource distribution between edge devices and servers. Despite our edge-oriented lightweight encoder, our method achieves considerably better compression rates.
    Structured Bayesian Compression for Deep Neural Networks Based on The Turbo-VBI Approach. (arXiv:2302.10483v1 [cs.LG])
    With the growth of neural network size, model compression has attracted increasing interest in recent research. As one of the most common techniques, pruning has been studied for a long time. By exploiting the structured sparsity of the neural network, existing methods can prune neurons instead of individual weights. However, in most existing pruning methods, surviving neurons are randomly connected in the neural network without any structure, and the non-zero weights within each neuron are also randomly distributed. Such irregular sparse structure can cause very high control overhead and irregular memory access for the hardware and even increase the neural network computational complexity. In this paper, we propose a three-layer hierarchical prior to promote a more regular sparse structure during pruning. The proposed three-layer hierarchical prior can achieve per-neuron weight-level structured sparsity and neuron-level structured sparsity. We derive an efficient Turbo-variational Bayesian inferencing (Turbo-VBI) algorithm to solve the resulting model compression problem with the proposed prior. The proposed Turbo-VBI algorithm has low complexity and can support more general priors than existing model compression algorithms. Simulation results show that our proposed algorithm can promote a more regular structure in the pruned neural networks while achieving even better performance in terms of compression rate and inferencing accuracy compared with the baselines.
    On Inductive Biases for Machine Learning in Data Constrained Settings. (arXiv:2302.10692v1 [cs.LG])
    Learning with limited data is one of the biggest problems of machine learning. Current approaches to this issue consist in learning general representations from huge amounts of data before fine-tuning the model on a small dataset of interest. While such technique, coined transfer learning, is very effective in domains such as computer vision or natural langage processing, it does not yet solve common problems of deep learning such as model interpretability or the overall need for data. This thesis explores a different answer to the problem of learning expressive models in data constrained settings: instead of relying on big datasets to learn neural networks, we will replace some modules by known functions reflecting the structure of the data. Very often, these functions will be drawn from the rich literature of kernel methods. Indeed, many kernels can reflect the underlying structure of the data, thus sparing learning parameters to some extent. Our approach falls under the hood of "inductive biases", which can be defined as hypothesis on the data at hand restricting the space of models to explore during learning. We demonstrate the effectiveness of this approach in the context of sequences, such as sentences in natural language or protein sequences, and graphs, such as molecules. We also highlight the relationship between our work and recent advances in deep learning. Additionally, we study convex machine learning models. Here, rather than proposing new models, we wonder which proportion of the samples in a dataset is really needed to learn a "good" model. More precisely, we study the problem of safe sample screening, i.e, executing simple tests to discard uninformative samples from a dataset even before fitting a machine learning model, without affecting the optimal model. Such techniques can be used to prune datasets or mine for rare samples.
    Don't guess what's true: choose what's optimal. A probability transducer for machine-learning classifiers. (arXiv:2302.10578v1 [cs.LG])
    In fields such as medicine and drug discovery, the ultimate goal of a classification is not to guess a class, but to choose the optimal course of action among a set of possible ones, usually not in one-one correspondence with the set of classes. This decision-theoretic problem requires sensible probabilities for the classes. Probabilities conditional on the features are computationally almost impossible to find in many important cases. The main idea of the present work is to calculate probabilities conditional not on the features, but on the trained classifier's output. This calculation is cheap, needs to be made only once, and provides an output-to-probability "transducer" that can be applied to all future outputs of the classifier. In conjunction with problem-dependent utilities, the probabilities of the transducer allow us to find the optimal choice among the classes or among a set of more general decisions, by means of expected-utility maximization. This idea is demonstrated in a simplified drug-discovery problem with a highly imbalanced dataset. The transducer and utility maximization together always lead to improved results, sometimes close to theoretical maximum, for all sets of problem-dependent utilities. The one-time-only calculation of the transducer also provides, automatically: (i) a quantification of the uncertainty about the transducer itself; (ii) the expected utility of the augmented algorithm (including its uncertainty), which can be used for algorithm selection; (iii) the possibility of using the algorithm in a "generative mode", useful if the training dataset is biased.
    FedSDG-FS: Efficient and Secure Feature Selection for Vertical Federated Learning. (arXiv:2302.10417v1 [cs.LG])
    Vertical Federated Learning (VFL) enables multiple data owners, each holding a different subset of features about largely overlapping sets of data sample(s), to jointly train a useful global model. Feature selection (FS) is important to VFL. It is still an open research problem as existing FS works designed for VFL either assumes prior knowledge on the number of noisy features or prior knowledge on the post-training threshold of useful features to be selected, making them unsuitable for practical applications. To bridge this gap, we propose the Federated Stochastic Dual-Gate based Feature Selection (FedSDG-FS) approach. It consists of a Gaussian stochastic dual-gate to efficiently approximate the probability of a feature being selected, with privacy protection through Partially Homomorphic Encryption without a trusted third-party. To reduce overhead, we propose a feature importance initialization method based on Gini impurity, which can accomplish its goals with only two parameter transmissions between the server and the clients. Extensive experiments on both synthetic and real-world datasets show that FedSDG-FS significantly outperforms existing approaches in terms of achieving accurate selection of high-quality features as well as building global models with improved performance.
    Classy Ensemble: A Novel Ensemble Algorithm for Classification. (arXiv:2302.10580v1 [cs.LG])
    We present Classy Ensemble, a novel ensemble-generation algorithm for classification tasks, which aggregates models through a weighted combination of per-class accuracy. Tested over 153 machine learning datasets we demonstrate that Classy Ensemble outperforms two other well-known aggregation algorithms -- order-based pruning and clustering-based pruning -- as well as the recently introduced lexigarden ensemble generator. We also show preliminary results for deep networks.
    On discrete symmetries of robotics systems: A group-theoretic and data-driven analysis. (arXiv:2302.10433v1 [cs.RO])
    In this work, we study discrete morphological symmetries of dynamical systems, a predominant feature in animal biology and robotic systems, expressed when the system's morphology has one or more planes of symmetry describing the duplication and balanced distribution of body parts. These morphological symmetries imply that the system's dynamics are symmetric (or approximately symmetric), which in turn imprints symmetries in optimal control policies and in all proprioceptive and exteroceptive measurements related to the evolution of the system's dynamics. For data-driven methods, symmetry represents an inductive bias that justifies data augmentation and the construction of symmetric function approximators. To this end, we use group theory to present a theoretical and practical framework allowing for (1) the identification of the system's morphological symmetry group $\G$, (2) data-augmentation of proprioceptive and exteroceptive measurements, and (3) the exploitation of data symmetries through the use of $\G$-equivariant/invariant neural networks, for which we present experimental results on synthetic and real-world applications, demonstrating how symmetry constraints lead to better sample efficiency and generalization while reducing the number of trainable parameters.
    Variance reduced Shapley value estimation for trustworthy data valuation. (arXiv:2210.16835v4 [stat.ML] UPDATED)
    Data valuation, especially quantifying data value in algorithmic prediction and decision-making, is a fundamental problem in data trading scenarios. The most widely used method is to define the data Shapley and approximate it by means of the permutation sampling algorithm. To make up for the large estimation variance of the permutation sampling that hinders the development of the data marketplace, we propose a more robust data valuation method using stratified sampling, named variance reduced data Shapley (VRDS for short). We theoretically show how to stratify, how many samples are taken at each stratum, and the sample complexity analysis of VRDS. Finally, the effectiveness of VRDS is illustrated in different types of datasets and data removal applications.  ( 2 min )
    When are Post-hoc Conceptual Explanantions Identifiable?. (arXiv:2206.13872v3 [stat.ML] UPDATED)
    Interest in understanding and factorizing learned embedding spaces through conceptual explanations is steadily growing. When no human concept labels are available, concept discovery methods search trained embedding spaces for interpretable concepts like object shape or color that can be used to provide post-hoc explanations for decisions. Unlike previous work, we argue that concept discovery should be identifiable, meaning that a number of known concepts can be provably recovered to guarantee reliability of the explanations. As a starting point, we explicitly make the connection between concept discovery and classical methods like Principal Component Analysis and Independent Component Analysis by showing that they can recover independent concepts with non-Gaussian distributions. For dependent concepts, we propose two novel approaches that exploit functional compositionality properties of image-generating processes. Our provably identifiable concept discovery methods substantially outperform competitors on a battery of experiments including hundreds of trained models and dependent concepts, where they exhibit up to 29 % better alignment with the ground truth. Our results provide a rigorous foundation for reliable concept discovery without human labels.  ( 2 min )
    Online Symbolic Regression with Informative Query. (arXiv:2302.10539v1 [cs.LG])
    Symbolic regression, the task of extracting mathematical expressions from the observed data $\{ \vx_i, y_i \}$, plays a crucial role in scientific discovery. Despite the promising performance of existing methods, most of them conduct symbolic regression in an \textit{offline} setting. That is, they treat the observed data points as given ones that are simply sampled from uniform distributions without exploring the expressive potential of data. However, for real-world scientific problems, the data used for symbolic regression are usually actively obtained by doing experiments, which is an \textit{online} setting. Thus, how to obtain informative data that can facilitate the symbolic regression process is an important problem that remains challenging. In this paper, we propose QUOSR, a \textbf{qu}ery-based framework for \textbf{o}nline \textbf{s}ymbolic \textbf{r}egression that can automatically obtain informative data in an iterative manner. Specifically, at each step, QUOSR receives historical data points, generates new $\vx$, and then queries the symbolic expression to get the corresponding $y$, where the $(\vx, y)$ serves as new data points. This process repeats until the maximum number of query steps is reached. To make the generated data points informative, we implement the framework with a neural network and train it by maximizing the mutual information between generated data points and the target expression. Through comprehensive experiments, we show that QUOSR can facilitate modern symbolic regression methods by generating informative data.
    Scalable Infomin Learning. (arXiv:2302.10701v1 [cs.LG])
    The task of infomin learning aims to learn a representation with high utility while being uninformative about a specified target, with the latter achieved by minimising the mutual information between the representation and the target. It has broad applications, ranging from training fair prediction models against protected attributes, to unsupervised learning with disentangled representations. Recent works on infomin learning mainly use adversarial training, which involves training a neural network to estimate mutual information or its proxy and thus is slow and difficult to optimise. Drawing on recent advances in slicing techniques, we propose a new infomin learning approach, which uses a novel proxy metric to mutual information. We further derive an accurate and analytically computable approximation to this proxy metric, thereby removing the need of constructing neural network-based mutual information estimators. Experiments on algorithmic fairness, disentangled representation learning and domain adaptation verify that our method can effectively remove unwanted information with limited time budget.
    An Efficient Two-stage Gradient Boosting Framework for Short-term Traffic State Estimation. (arXiv:2302.10400v1 [cs.LG])
    Real-time traffic state estimation is essential for intelligent transportation systems. The NeurIPS 2022 Traffic4cast challenge provides an excellent testbed for benchmarking short-term traffic state estimation approaches. This technical report describes our solution to this challenge. In particular, we present an efficient two-stage gradient boosting framework for short-term traffic state estimation. The first stage derives the month, day of the week, and time slot index based on the sparse loop counter data, and the second stage predicts the future traffic states based on the sparse loop counter data and the derived month, day of the week, and time slot index. Experimental results demonstrate that our two-stage gradient boosting framework achieves strong empirical performance, achieving third place in both the core and the extended challenges while remaining highly efficient. The source code for this technical report is available at \url{https://github.com/YichaoLu/Traffic4cast2022}.
    Climate Model Driven Seasonal Forecasting Approach with Deep Learning. (arXiv:2302.10480v1 [cs.LG])
    Understanding seasonal climatic conditions is critical for better management of resources such as water, energy and agriculture. Recently, there has been a great interest in utilizing the power of artificial intelligence methods in climate studies. This paper presents a cutting-edge deep learning model (UNet++) trained by state-of-the-art global CMIP6 models to forecast global temperatures a month ahead using the ERA5 reanalysis dataset. ERA5 dataset was also used for finetuning as well performance analysis in the validation dataset. Three different setups (CMIP6; CMIP6 + elevation; CMIP6 + elevation + ERA5 finetuning) were used with both UNet and UNet++ algorithms resulting in six different models. For each model 14 different sequential and non-sequential temporal settings were used. The Mean Absolute Error (MAE) analysis revealed that UNet++ with CMIP6 with elevation and ERA5 finetuning model with "Year 3 Month 2" temporal case provided the best outcome with an MAE of 0.7. Regression analysis over the validation dataset between the ERA5 data values and the corresponding AI model predictions revealed slope and $R^2$ values close to 1 suggesting a very good agreement. The AI model predicts significantly better than the mean CMIP6 ensemble between 2016 and 2021. Both models predict the summer months more accurately than the winter months.
    Speech Privacy Leakage from Shared Gradients in Distributed Learning. (arXiv:2302.10441v1 [cs.LG])
    Distributed machine learning paradigms, such as federated learning, have been recently adopted in many privacy-critical applications for speech analysis. However, such frameworks are vulnerable to privacy leakage attacks from shared gradients. Despite extensive efforts in the image domain, the exploration of speech privacy leakage from gradients is quite limited. In this paper, we explore methods for recovering private speech/speaker information from the shared gradients in distributed learning settings. We conduct experiments on a keyword spotting model with two different types of speech features to quantify the amount of leaked information by measuring the similarity between the original and recovered speech signals. We further demonstrate the feasibility of inferring various levels of side-channel information, including speech content and speaker identity, under the distributed learning framework without accessing the user's data.
    Interval Type-2 Fuzzy Neural Networks for Multi-Label Classification. (arXiv:2302.10430v1 [cs.LG])
    Prediction of multi-dimensional labels plays an important role in machine learning problems. We found that the classical binary labels could not reflect the contents and their relationships in an instance. Hence, we propose a multi-label classification model based on interval type-2 fuzzy logic. In the proposed model, we use a deep neural network to predict the type-1 fuzzy membership of an instance and another one to predict the fuzzifiers of the membership to generate interval type-2 fuzzy memberships. We also propose a loss function to measure the similarities between binary labels in datasets and interval type-2 fuzzy memberships generated by our model. The experiments validate that our approach outperforms baselines on multi-label classification benchmarks.
    Reinforcement Learning in a Birth and Death Process: Breaking the Dependence on the State Space. (arXiv:2302.10667v1 [cs.LG])
    In this paper, we revisit the regret of undiscounted reinforcement learning in MDPs with a birth and death structure. Specifically, we consider a controlled queue with impatient jobs and the main objective is to optimize a trade-off between energy consumption and user-perceived performance. Within this setting, the \emph{diameter} $D$ of the MDP is $\Omega(S^S)$, where $S$ is the number of states. Therefore, the existing lower and upper bounds on the regret at time$T$, of order $O(\sqrt{DSAT})$ for MDPs with $S$ states and $A$ actions, may suggest that reinforcement learning is inefficient here. In our main result however, we exploit the structure of our MDPs to show that the regret of a slightly-tweaked version of the classical learning algorithm {\sc Ucrl2} is in fact upper bounded by $\tilde{\mathcal{O}}(\sqrt{E_2AT})$ where $E_2$ is related to the weighted second moment of the stationary measure of a reference policy. Importantly, $E_2$ is bounded independently of $S$. Thus, our bound is asymptotically independent of the number of states and of the diameter. This result is based on a careful study of the number of visits performed by the learning algorithm to the states of the MDP, which is highly non-uniform.
    Robust Meta Learning for Image based tasks. (arXiv:2301.12698v2 [cs.CV] UPDATED)
    A machine learning model that generalizes well should obtain low errors on unseen test examples. Thus, if we learn an optimal model in training data, it could have better generalization performance in testing tasks. However, learning such a model is not possible in standard machine learning frameworks as the distribution of the test data is unknown. To tackle this challenge, we propose a novel robust meta-learning method, which is more robust to the image-based testing tasks which is unknown and has distribution shifts with training tasks. Our robust meta-learning method can provide robust optimal models even when data from each distribution are scarce. In experiments, we demonstrate that our algorithm not only has better generalization performance but also robust to different unknown testing tasks.  ( 2 min )
    Time to Embrace Natural Language Processing (NLP)-based Digital Pathology: Benchmarking NLP- and Convolutional Neural Network-based Deep Learning Pipelines. (arXiv:2302.10406v1 [cs.CL])
    NLP-based computer vision models, particularly vision transformers, have been shown to outperform CNN models in many imaging tasks. However, most digital pathology artificial-intelligence models are based on CNN architectures, probably owing to a lack of data regarding NLP models for pathology images. In this study, we developed digital pathology pipelines to benchmark the five most recently proposed NLP models (vision transformer (ViT), Swin Transformer, MobileViT, CMT, and Sequencer2D) and four popular CNN models (ResNet18, ResNet50, MobileNetV2, and EfficientNet) to predict biomarkers in colorectal cancer (microsatellite instability, CpG island methylator phenotype, and BRAF mutation). Hematoxylin and eosin-stained whole-slide images from Molecular and Cellular Oncology and The Cancer Genome Atlas were used as training and external validation datasets, respectively. Cross-study external validations revealed that the NLP-based models significantly outperformed the CNN-based models in biomarker prediction tasks, improving the overall prediction and precision up to approximately 10% and 26%, respectively. Notably, compared with existing models in the current literature using large training datasets, our NLP models achieved state-of-the-art predictions for all three biomarkers using a relatively small training dataset, suggesting that large training datasets are not a prerequisite for NLP models or transformers, and NLP may be more suitable for clinical studies in which small training datasets are commonly collected. The superior performance of Sequencer2D suggests that further research and innovation on both transformer and bidirectional long short-term memory architectures are warranted in the field of digital pathology. NLP models can replace classic CNN architectures and become the new workhorse backbone in the field of digital pathology.
    Transformed Distribution Matching for Missing Value Imputation. (arXiv:2302.10363v1 [cs.LG])
    We study the problem of imputing missing values in a dataset, which has important applications in many domains. The key to missing value imputation is to capture the data distribution with incomplete samples and impute the missing values accordingly. In this paper, by leveraging the fact that any two batches of data with missing values come from the same data distribution, we propose to impute the missing values of two batches of samples by transforming them into a latent space through deep invertible functions and matching them distributionally. To learn the transformations and impute the missing values simultaneously, a simple and well-motivated algorithm is proposed. Extensive experiments over a large number of datasets and competing benchmark algorithms show that our method achieves state-of-the-art performance.
    Benchmarking energy consumption and latency for neuromorphic computing in condensed matter and particle physics. (arXiv:2209.10481v2 [cs.ET] UPDATED)
    The massive use of artificial neural networks (ANNs), increasingly popular in many areas of scientific computing, rapidly increases the energy consumption of modern high-performance computing systems. An appealing and possibly more sustainable alternative is provided by novel neuromorphic paradigms, which directly implement ANNs in hardware. However, little is known about the actual benefits of running ANNs on neuromorphic hardware for use cases in scientific computing. Here we present a methodology for measuring the energy cost and compute time for inference tasks with ANNs on conventional hardware. In addition, we have designed an architecture for these tasks and estimate the same metrics based on a state-of-the-art analog in-memory computing (AIMC) platform, one of the key paradigms in neuromorphic computing. Both methodologies are compared for a use case in quantum many-body physics in two dimensional condensed matter systems and for anomaly detection at 40 MHz rates at the Large Hadron Collider in particle physics. We find that AIMC can achieve up to one order of magnitude shorter computation times than conventional hardware, at an energy cost that is up to three orders of magnitude smaller. This suggests great potential for faster and more sustainable scientific computing with neuromorphic hardware.
    Improving Recommendation Fairness via Data Augmentation. (arXiv:2302.06333v2 [cs.IR] UPDATED)
    Collaborative filtering based recommendation learns users' preferences from all users' historical behavior data, and has been popular to facilitate decision making. R Recently, the fairness issue of recommendation has become more and more essential. A recommender system is considered unfair when it does not perform equally well for different user groups according to users' sensitive attributes~(e.g., gender, race). Plenty of methods have been proposed to alleviate unfairness by optimizing a predefined fairness goal or changing the distribution of unbalanced training data. However, they either suffered from the specific fairness optimization metrics or relied on redesigning the current recommendation architecture. In this paper, we study how to improve recommendation fairness from the data augmentation perspective. The recommendation model amplifies the inherent unfairness of imbalanced training data. We augment imbalanced training data towards balanced data distribution to improve fairness. The proposed framework is generally applicable to any embedding-based recommendation, and does not need to pre-define a fairness metric. Extensive experiments on two real-world datasets clearly demonstrate the superiority of our proposed framework. We publish the source code at https://github.com/newlei/FDA.
    Watch and Match: Supercharging Imitation with Regularized Optimal Transport. (arXiv:2206.15469v2 [cs.RO] UPDATED)
    Imitation learning holds tremendous promise in learning policies efficiently for complex decision making problems. Current state-of-the-art algorithms often use inverse reinforcement learning (IRL), where given a set of expert demonstrations, an agent alternatively infers a reward function and the associated optimal policy. However, such IRL approaches often require substantial online interactions for complex control problems. In this work, we present Regularized Optimal Transport (ROT), a new imitation learning algorithm that builds on recent advances in optimal transport based trajectory-matching. Our key technical insight is that adaptively combining trajectory-matching rewards with behavior cloning can significantly accelerate imitation even with only a few demonstrations. Our experiments on 20 visual control tasks across the DeepMind Control Suite, the OpenAI Robotics Suite, and the Meta-World Benchmark demonstrate an average of 7.8X faster imitation to reach 90% of expert performance compared to prior state-of-the-art methods. On real-world robotic manipulation, with just one demonstration and an hour of online training, ROT achieves an average success rate of 90.1% across 14 tasks.
    Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform. (arXiv:2210.15975v2 [eess.AS] UPDATED)
    We propose a lightweight end-to-end text-to-speech model using multi-band generation and inverse short-time Fourier transform. Our model is based on VITS, a high-quality end-to-end text-to-speech model, but adopts two changes for more efficient inference: 1) the most computationally expensive component is partially replaced with a simple inverse short-time Fourier transform, and 2) multi-band generation, with fixed or trainable synthesis filters, is used to generate waveforms. Unlike conventional lightweight models, which employ optimization or knowledge distillation separately to train two cascaded components, our method enjoys the full benefits of end-to-end optimization. Experimental results show that our model synthesized speech as natural as that synthesized by VITS, while achieving a real-time factor of 0.066 on an Intel Core i7 CPU, 4.1 times faster than VITS. Moreover, a smaller version of the model significantly outperformed a lightweight baseline model with respect to both naturalness and inference speed. Code and audio samples are available from https://github.com/MasayaKawamura/MB-iSTFT-VITS.
    SimPer: Simple Self-Supervised Learning of Periodic Targets. (arXiv:2210.03115v2 [cs.LG] UPDATED)
    From human physiology to environmental evolution, important processes in nature often exhibit meaningful and strong periodic or quasi-periodic changes. Due to their inherent label scarcity, learning useful representations for periodic tasks with limited or no supervision is of great benefit. Yet, existing self-supervised learning (SSL) methods overlook the intrinsic periodicity in data, and fail to learn representations that capture periodic or frequency attributes. In this paper, we present SimPer, a simple contrastive SSL regime for learning periodic information in data. To exploit the periodic inductive bias, SimPer introduces customized augmentations, feature similarity measures, and a generalized contrastive loss for learning efficient and robust periodic representations. Extensive experiments on common real-world tasks in human behavior analysis, environmental sensing, and healthcare domains verify the superior performance of SimPer compared to state-of-the-art SSL methods, highlighting its intriguing properties including better data efficiency, robustness to spurious correlations, and generalization to distribution shifts. Code and data are available at: https://github.com/YyzHarry/SimPer.  ( 2 min )
    Estimating long-term causal effects from short-term experiments and long-term observational data with unobserved confounding. (arXiv:2302.10625v1 [stat.ML])
    Understanding and quantifying cause and effect is an important problem in many domains. The generally-agreed solution to this problem is to perform a randomised controlled trial. However, even when randomised controlled trials can be performed, they usually have relatively short duration's due to cost considerations. This makes learning long-term causal effects a very challenging task in practice, since the long-term outcome is only observed after a long delay. In this paper, we study the identification and estimation of long-term treatment effects when both experimental and observational data are available. Previous work provided an estimation strategy to determine long-term causal effects from such data regimes. However, this strategy only works if one assumes there are no unobserved confounders in the observational data. In this paper, we specifically address the challenging case where unmeasured confounders are present in the observational data. Our long-term causal effect estimator is obtained by combining regression residuals with short-term experimental outcomes in a specific manner to create an instrumental variable, which is then used to quantify the long-term causal effect through instrumental variable regression. We prove this estimator is unbiased, and analytically study its variance. In the context of the front-door causal structure, this provides a new causal estimator, which may be of independent interest. Finally, we empirically test our approach on synthetic-data, as well as real-data from the International Stroke Trial.  ( 2 min )
    Explain Influence Maximization with Sobol Indices. (arXiv:2207.07833v2 [cs.SI] UPDATED)
    Due to its vast application on online social networks, Influence Maximization (IM) has garnered considerable attention over the last couple of decades. Current IM research lacks human-comprehensible explanations of how the seed set results in the influence effect, hence reducing the trustworthiness of existing solutions despite their applicability. Due to the intricacy of IM, the majority of current research concentrate on estimating first-order spreading power and often is regard the interplay between flows dispersed from different seeds. This study uses Sobol indices, the cornerstone of variance-based sensitivity analysis, to decompose the influence effect to individual seeds and their interactions. The Sobol indices are tailored for IM contexts by modeling the seed selection as binary variables. This explanation method is universally applicable to all network types, IM techniques, and diffusion models. Based on the explanation method, a general framework dubbed SobolIM is proposed to improve the performance of current IM studies by over-selecting nodes followed by an elimination strategy. Experiments on synthetic and real-world graphs demonstrate that the explanation of the impact effect can dependably identify the key high-order interaction between seeds across a variety of networks and IM methods. SobolIM is empirically proved to be superior on effectiveness and competitive on efficiency.
    Self-supervised learning of Split Invariant Equivariant representations. (arXiv:2302.10283v1 [cs.CV])
    Recent progress has been made towards learning invariant or equivariant representations with self-supervised learning. While invariant methods are evaluated on large scale datasets, equivariant ones are evaluated in smaller, more controlled, settings. We aim at bridging the gap between the two in order to learn more diverse representations that are suitable for a wide range of tasks. We start by introducing a dataset called 3DIEBench, consisting of renderings from 3D models over 55 classes and more than 2.5 million images where we have full control on the transformations applied to the objects. We further introduce a predictor architecture based on hypernetworks to learn equivariant representations with no possible collapse to invariance. We introduce SIE (Split Invariant-Equivariant) which combines the hypernetwork-based predictor with representations split in two parts, one invariant, the other equivariant, to learn richer representations. We demonstrate significant performance gains over existing methods on equivariance related tasks from both a qualitative and quantitative point of view. We further analyze our introduced predictor and show how it steers the learned latent space. We hope that both our introduced dataset and approach will enable learning richer representations without supervision in more complex scenarios.  ( 2 min )
    $\{\text{PF}\}^2$ES: Parallel Feasible Pareto Frontier Entropy Search for Multi-Objective Bayesian Optimization. (arXiv:2204.05411v2 [cs.LG] UPDATED)
    We present Parallel Feasible Pareto Frontier Entropy Search ($\{\text{PF}\}^2$ES) -- a novel information-theoretic acquisition function for multi-objective Bayesian optimization supporting unknown constraints and batch query. Due to the complexity of characterizing the mutual information between candidate evaluations and (feasible) Pareto frontiers, existing approaches must either employ crude approximations that significantly hamper their performance or rely on expensive inference schemes that substantially increase the optimization's computational overhead. By instead using a variational lower bound, $\{\text{PF}\}^2$ES provides a low-cost and accurate estimate of the mutual information. We benchmark $\{\text{PF}\}^2$ES against other information-theoretic acquisition functions, demonstrating its competitive performance for optimization across synthetic and real-world design problems.
    Tracr: Compiled Transformers as a Laboratory for Interpretability. (arXiv:2301.05062v2 [cs.LG] UPDATED)
    Interpretability research aims to build tools for understanding machine learning (ML) models. However, such tools are inherently hard to evaluate because we do not have ground truth information about how ML models actually work. In this work, we propose to build transformer models manually as a testbed for interpretability research. We introduce Tracr, a "compiler" for translating human-readable programs into weights of a transformer model. Tracr takes code written in RASP, a domain-specific language (Weiss et al. 2021), and translates it into weights for a standard, decoder-only, GPT-like transformer architecture. We use Tracr to create a range of ground truth transformers that implement programs including computing token frequencies, sorting, and Dyck-n parenthesis checking, among others. To enable the broader research community to explore and use compiled models, we provide an open-source implementation of Tracr at https://github.com/deepmind/tracr.  ( 2 min )
    GAUCHE: A Library for Gaussian Processes in Chemistry. (arXiv:2212.04450v2 [physics.chem-ph] UPDATED)
    We introduce GAUCHE, a library for GAUssian processes in CHEmistry. Gaussian processes have long been a cornerstone of probabilistic machine learning, affording particular advantages for uncertainty quantification and Bayesian optimisation. Extending Gaussian processes to chemical representations, however, is nontrivial, necessitating kernels defined over structured inputs such as graphs, strings and bit vectors. By defining such kernels in GAUCHE, we seek to open the door to powerful tools for uncertainty quantification and Bayesian optimisation in chemistry. Motivated by scenarios frequently encountered in experimental chemistry, we showcase applications for GAUCHE in molecular discovery and chemical reaction optimisation. The codebase is made available at https://github.com/leojklarner/gauche
    Unsupervised Seismic Footprint Removal With Physical Prior Augmented Deep Autoencoder. (arXiv:2302.10756v1 [cs.CV])
    Seismic acquisition footprints appear as stably faint and dim structures and emerge fully spatially coherent, causing inevitable damage to useful signals during the suppression process. Various footprint removal methods, including filtering and sparse representation (SR), have been reported to attain promising results for surmounting this challenge. However, these methods, e.g., SR, rely solely on the handcrafted image priors of useful signals, which is sometimes an unreasonable demand if complex geological structures are contained in the given seismic data. As an alternative, this article proposes a footprint removal network (dubbed FR-Net) for the unsupervised suppression of acquired footprints without any assumptions regarding valuable signals. The key to the FR-Net is to design a unidirectional total variation (UTV) model for footprint acquisition according to the intrinsically directional property of noise. By strongly regularizing a deep convolutional autoencoder (DCAE) using the UTV model, our FR-Net transforms the DCAE from an entirely data-driven model to a \textcolor{black}{prior-augmented} approach, inheriting the superiority of the DCAE and our footprint model. Subsequently, the complete separation of the footprint noise and useful signals is projected in an unsupervised manner, specifically by optimizing the FR-Net via the backpropagation (BP) algorithm. We provide qualitative and quantitative evaluations conducted on three synthetic and field datasets, demonstrating that our FR-Net surpasses the previous state-of-the-art (SOTA) methods.  ( 2 min )
    Improving Sample Efficiency in Evolutionary RL Using Off-Policy Ranking. (arXiv:2208.10583v2 [cs.LG] UPDATED)
    Evolution Strategy (ES) is a powerful black-box optimization technique based on the idea of natural evolution. In each of its iterations, a key step entails ranking candidate solutions based on some fitness score. For an ES method in Reinforcement Learning (RL), this ranking step requires evaluating multiple policies. This is presently done via on-policy approaches: each policy's score is estimated by interacting several times with the environment using that policy. This leads to a lot of wasteful interactions since, once the ranking is done, only the data associated with the top-ranked policies is used for subsequent learning. To improve sample efficiency, we propose a novel off-policy alternative for ranking, based on a local approximation for the fitness function. We demonstrate our idea in the context of a state-of-the-art ES method called the Augmented Random Search (ARS). Simulations in MuJoCo tasks show that, compared to the original ARS, our off-policy variant has similar running times for reaching reward thresholds but needs only around 70% as much data. It also outperforms the recent Trust Region ES. We believe our ideas should be extendable to other ES methods as well.
    Text-Derived Knowledge Helps Vision: A Simple Cross-modal Distillation for Video-based Action Anticipation. (arXiv:2210.05991v2 [cs.CV] UPDATED)
    Anticipating future actions in a video is useful for many autonomous and assistive technologies. Most prior action anticipation work treat this as a vision modality problem, where the models learn the task information primarily from the video features in the action anticipation datasets. However, knowledge about action sequences can also be obtained from external textual data. In this work, we show how knowledge in pretrained language models can be adapted and distilled into vision-based action anticipation models. We show that a simple distillation technique can achieve effective knowledge transfer and provide consistent gains on a strong vision model (Anticipative Vision Transformer) for two action anticipation datasets (3.5% relative gain on EGTEA-GAZE+ and 7.2% relative gain on EPIC-KITCHEN 55), giving a new state-of-the-art result.
    Entire Space Counterfactual Learning: Tuning, Analytical Properties and Industrial Applications. (arXiv:2210.11039v2 [cs.LG] UPDATED)
    As a basic research problem for building effective recommender systems, post-click conversion rate (CVR) estimation has long been plagued by sample selection bias and data sparsity issues. To address the data sparsity issue, prevalent methods based on entire space multi-task model leverage the sequential pattern of user actions, i.e. exposure $\rightarrow$ click $\rightarrow$ conversion to construct auxiliary learning tasks. However, they still fall short of guaranteeing the unbiasedness of CVR estimates. This paper theoretically demonstrates two defects of these entire space multi-task models: (1) inherent estimation bias (IEB) for CVR estimation, where the CVR estimate is inherently higher than the ground truth; (2) potential independence priority (PIP) for CTCVR estimation, where the causality from click to conversion might be overlooked. This paper further proposes a principled method named entire space counterfactual multi-task model (ESCM$^2$), which employs a counterfactual risk minimizer to handle both IEB and PIP issues at once. To demonstrate the effectiveness of the proposed method, this paper explores its parameter tuning in practice, derives its analytic properties, and showcases its effectiveness in industrial CVR estimation, where ESCM$^2$ can effectively alleviate the intrinsic IEB and PIP issues and outperform baseline models.
    Understanding new tasks through the lens of training data via exponential tilting. (arXiv:2205.13577v2 [cs.LG] UPDATED)
    Deploying machine learning models to new tasks is a major challenge despite the large size of the modern training datasets. However, it is conceivable that the training data can be reweighted to be more representative of the new (target) task. We consider the problem of reweighing the training samples to gain insights into the distribution of the target task. Specifically, we formulate a distribution shift model based on the exponential tilt assumption and learn train data importance weights minimizing the KL divergence between labeled train and unlabeled target datasets. The learned train data weights can then be used for downstream tasks such as target performance evaluation, fine-tuning, and model selection. We demonstrate the efficacy of our method on Waterbirds and Breeds benchmarks.
    A Comprehensive Review of Data-Driven Co-Speech Gesture Generation. (arXiv:2301.05339v2 [cs.GR] UPDATED)
    Gestures that accompany speech are an essential part of natural and efficient embodied human communication. The automatic generation of such co-speech gestures is a long-standing problem in computer animation and is considered an enabling technology in film, games, virtual social spaces, and for interaction with social robots. The problem is made challenging by the idiosyncratic and non-periodic nature of human co-speech gesture motion, and by the great diversity of communicative functions that gestures encompass. Gesture generation has seen surging interest recently, owing to the emergence of more and larger datasets of human gesture motion, combined with strides in deep-learning-based generative models, that benefit from the growing availability of data. This review article summarizes co-speech gesture generation research, with a particular focus on deep generative models. First, we articulate the theory describing human gesticulation and how it complements speech. Next, we briefly discuss rule-based and classical statistical gesture synthesis, before delving into deep learning approaches. We employ the choice of input modalities as an organizing principle, examining systems that generate gestures from audio, text, and non-linguistic input. We also chronicle the evolution of the related training data sets in terms of size, diversity, motion quality, and collection method. Finally, we identify key research challenges in gesture generation, including data availability and quality; producing human-like motion; grounding the gesture in the co-occurring speech in interaction with other speakers, and in the environment; performing gesture evaluation; and integration of gesture synthesis into applications. We highlight recent approaches to tackling the various key challenges, as well as the limitations of these approaches, and point toward areas of future development.
    Generative De Novo Protein Design with Global Context. (arXiv:2204.10673v2 [q-bio.BM] UPDATED)
    The linear sequence of amino acids determines protein structure and function. Protein design, known as the inverse of protein structure prediction, aims to obtain a novel protein sequence that will fold into the defined structure. Recent works on computational protein design have studied designing sequences for the desired backbone structure with local positional information and achieved competitive performance. However, similar local environments in different backbone structures may result in different amino acids, indicating that protein structure's global context matters. Thus, we propose the Global-Context Aware generative de novo protein design method (GCA), consisting of local and global modules. While local modules focus on relationships between neighbor amino acids, global modules explicitly capture non-local contexts. Experimental results demonstrate that the proposed GCA method outperforms state-of-the-arts on de novo protein design. Our code and pretrained model will be released.
    A Detailed Study of Interpretability of Deep Neural Network based Top Taggers. (arXiv:2210.04371v3 [hep-ex] UPDATED)
    Recent developments in the methods of explainable AI (XAI) allow researchers to explore the inner workings of deep neural networks (DNNs), revealing crucial information about input-output relationships and realizing how data connects with machine learning models. In this paper we explore interpretability of DNN models designed to identify jets coming from top quark decay in high energy proton-proton collisions at the Large Hadron Collider (LHC). We review a subset of existing top tagger models and explore different quantitative methods to identify which features play the most important roles in identifying the top jets. We also investigate how and why feature importance varies across different XAI metrics, how correlations among features impact their explainability, and how latent space representations encode information as well as correlate with physically meaningful quantities. Our studies uncover some major pitfalls of existing XAI methods and illustrate how they can be overcome to obtain consistent and meaningful interpretation of these models. We additionally illustrate the activity of hidden layers as Neural Activation Pattern (NAP) diagrams and demonstrate how they can be used to understand how DNNs relay information across the layers and how this understanding can help to make such models significantly simpler by allowing effective model reoptimization and hyperparameter tuning. These studies not only facilitate a methodological approach to interpreting models but also unveil new insights about what these models learn. Incorporating these observations into augmented model design, we propose the Particle Flow Interaction Network (PFIN) model and demonstrate how interpretability-inspired model augmentation can improve top tagging performance.
    History Compression via Language Models in Reinforcement Learning. (arXiv:2205.12258v4 [cs.LG] UPDATED)
    In a partially observable Markov decision process (POMDP), an agent typically uses a representation of the past to approximate the underlying MDP. We propose to utilize a frozen Pretrained Language Transformer (PLT) for history representation and compression to improve sample efficiency. To avoid training of the Transformer, we introduce FrozenHopfield, which automatically associates observations with pretrained token embeddings. To form these associations, a modern Hopfield network stores these token embeddings, which are retrieved by queries that are obtained by a random but fixed projection of observations. Our new method, HELM, enables actor-critic network architectures that contain a pretrained language Transformer for history representation as a memory module. Since a representation of the past need not be learned, HELM is much more sample efficient than competitors. On Minigrid and Procgen environments HELM achieves new state-of-the-art results. Our code is available at https://github.com/ml-jku/helm.
    A Meta-Reinforcement Learning Algorithm for Causal Discovery. (arXiv:2207.08457v2 [cs.LG] UPDATED)
    Causal discovery is a major task with the utmost importance for machine learning since causal structures can enable models to go beyond pure correlation-based inference and significantly boost their performance. However, finding causal structures from data poses a significant challenge both in computational effort and accuracy, let alone its impossibility without interventions in general. In this paper, we develop a meta-reinforcement learning algorithm that performs causal discovery by learning to perform interventions such that it can construct an explicit causal graph. Apart from being useful for possible downstream applications, the estimated causal graph also provides an explanation for the data-generating process. In this article, we show that our algorithm estimates a good graph compared to the SOTA approaches, even in environments whose underlying causal structure is previously unseen. Further, we make an ablation study that shows how learning interventions contribute to the overall performance of our approach. We conclude that interventions indeed help boost the performance, efficiently yielding an accurate estimate of the causal structure of a possibly unseen environment.
    Hidden Heterogeneity: When to Choose Similarity-Based Calibration. (arXiv:2202.01840v2 [cs.LG] UPDATED)
    Trustworthy classifiers are essential to the adoption of machine learning predictions in many real-world settings. The predicted probability of possible outcomes can inform high-stakes decision making, particularly when assessing the expected value of alternative decisions or the risk of bad outcomes. These decisions require well-calibrated probabilities, not just the correct prediction of the most likely class. Black-box classifier calibration methods can improve the reliability of a classifier's output without requiring retraining. However, these methods are unable to detect subpopulations where calibration could also improve prediction accuracy. Such subpopulations are said to exhibit "hidden heterogeneity" (HH), because the original classifier did not detect them. This paper proposes a quantitative measure for HH. It also introduces two similarity-weighted calibration methods that can address HH by adapting locally to each test item: SWC weights the calibration set by similarity to the test item, and SWC-HH explicitly incorporates hidden heterogeneity to filter the calibration set. Experiments show that the improvements in calibration achieved by similarity-based calibration methods correlate with the amount of HH present and, given sufficient calibration data, generally exceed calibration achieved by global methods. HH can therefore serve as a useful diagnostic tool for identifying when local calibration methods would be beneficial.
    Wassmap: Wasserstein Isometric Mapping for Image Manifold Learning. (arXiv:2204.06645v3 [cs.LG] UPDATED)
    In this paper, we propose Wasserstein Isometric Mapping (Wassmap), a nonlinear dimensionality reduction technique that provides solutions to some drawbacks in existing global nonlinear dimensionality reduction algorithms in imaging applications. Wassmap represents images via probability measures in Wasserstein space, then uses pairwise Wasserstein distances between the associated measures to produce a low-dimensional, approximately isometric embedding. We show that the algorithm is able to exactly recover parameters of some image manifolds including those generated by translations or dilations of a fixed generating measure. Additionally, we show that a discrete version of the algorithm retrieves parameters from manifolds generated from discrete measures by providing a theoretical bridge to transfer recovery results from functional data to discrete data. Testing of the proposed algorithms on various image data manifolds show that Wassmap yields good embeddings compared with other global and local techniques.
    Generalized Gumbel-Softmax Gradient Estimator for Various Discrete Random Variables. (arXiv:2003.01847v3 [cs.LG] UPDATED)
    Estimating the gradients of stochastic nodes is one of the crucial research questions in the deep generative modeling community, which enables the gradient descent optimization on neural network parameters. This estimation problem becomes further complex when we regard the stochastic nodes to be discrete because pathwise derivative techniques cannot be applied. Hence, the stochastic gradient estimation of discrete distributions requires either a score function method or continuous relaxation of the discrete random variables. This paper proposes a general version of the Gumbel-Softmax estimator with continuous relaxation, and this estimator is able to relax the discreteness of probability distributions including more diverse types, other than categorical and Bernoulli. In detail, we utilize the truncation of discrete random variables and the Gumbel-Softmax trick with a linear transformation for the relaxed reparameterization. The proposed approach enables the relaxed discrete random variable to be reparameterized and to backpropagated through a large scale stochastic computational graph. Our experiments consist of (1) synthetic data analyses, which show the efficacy of our methods; and (2) applications on VAE and topic model, which demonstrate the value of the proposed estimation in practices.
    On Calibrating Diffusion Probabilistic Models. (arXiv:2302.10688v1 [cs.LG])
    Recently, diffusion probabilistic models (DPMs) have achieved promising results in diverse generative tasks. A typical DPM framework includes a forward process that gradually diffuses the data distribution and a reverse process that recovers the data distribution from time-dependent data scores. In this work, we observe that the stochastic reverse process of data scores is a martingale, from which concentration bounds and the optional stopping theorem for data scores can be derived. Then, we discover a simple way for calibrating an arbitrary pretrained DPM, with which the score matching loss can be reduced and the lower bounds of model likelihood can consequently be increased. We provide general calibration guidelines under various model parametrizations. Our calibration method is performed only once and the resulting models can be used repeatedly for sampling. We conduct experiments on multiple datasets to empirically validate our proposal. Our code is at https://github.com/thudzj/Calibrated-DPMs.  ( 2 min )
    Meta-Uncertainty in Bayesian Model Comparison. (arXiv:2210.07278v3 [stat.ML] UPDATED)
    Bayesian model comparison (BMC) offers a principled probabilistic approach to study and rank competing models. In standard BMC, we construct a discrete probability distribution over the set of possible models, conditional on the observed data of interest. These posterior model probabilities (PMPs) are measures of uncertainty, but -- when derived from a finite number of observations -- are also uncertain themselves. In this paper, we conceptualize distinct levels of uncertainty which arise in BMC. We explore a fully probabilistic framework for quantifying meta-uncertainty, resulting in an applied method to enhance any BMC workflow. Drawing on both Bayesian and frequentist techniques, we represent the uncertainty over the uncertain PMPs via meta-models which combine simulated and observed data into a predictive distribution for PMPs on new data. We demonstrate the utility of the proposed method in the context of conjugate Bayesian regression, likelihood-based inference with Markov chain Monte Carlo, and simulation-based inference with neural networks.
    PrecTime: A Deep Learning Architecture for Precise Time Series Segmentation in Industrial Manufacturing Operations. (arXiv:2302.10182v1 [cs.LG])
    The fourth industrial revolution creates ubiquitous sensor data in production plants. To generate maximum value out of these data, reliable and precise time series-based machine learning methods like temporal neural networks are needed. This paper proposes a novel sequence-to-sequence deep learning architecture for time series segmentation called PrecTime which tries to combine the concepts and advantages of sliding window and dense labeling approaches. The general-purpose architecture is evaluated on a real-world industry dataset containing the End-of-Line testing sensor data of hydraulic pumps. We are able to show that PrecTime outperforms five implemented state-of-the-art baseline networks based on multiple metrics. The achieved segmentation accuracy of around 96% shows that PrecTime can achieve results close to human intelligence in operational state segmentation within a testing cycle.  ( 2 min )
    A Dynamic Temporal Self-attention Graph Convolutional Network for Traffic Prediction. (arXiv:2302.10428v1 [cs.LG])
    Accurate traffic prediction in real time plays an important role in Intelligent Transportation System (ITS) and travel navigation guidance. There have been many attempts to predict short-term traffic status which consider the spatial and temporal dependencies of traffic information such as temporal graph convolutional network (T-GCN) model and convolutional long short-term memory (Conv-LSTM) model. However, most existing methods use simple adjacent matrix consisting of 0 and 1 to capture the spatial dependence which can not meticulously describe the urban road network topological structure and the law of dynamic change with time. In order to tackle the problem, this paper proposes a dynamic temporal self-attention graph convolutional network (DT-SGN) model which considers the adjacent matrix as a trainable attention score matrix and adapts network parameters to different inputs. Specially, self-attention graph convolutional network (SGN) is chosen to capture the spatial dependence and the dynamic gated recurrent unit (Dynamic-GRU) is chosen to capture temporal dependence and learn dynamic changes of input data. Experiments demonstrate the superiority of our method over state-of-art model-driven model and data-driven models on real-world traffic datasets.  ( 2 min )
    Diversify and Disambiguate: Learning From Underspecified Data. (arXiv:2202.03418v3 [cs.LG] UPDATED)
    Many datasets are underspecified: there exist multiple equally viable solutions to a given task. Underspecification can be problematic for methods that learn a single hypothesis because different functions that achieve low training loss can focus on different predictive features and thus produce widely varying predictions on out-of-distribution data. We propose DivDis, a simple two-stage framework that first learns a diverse collection of hypotheses for a task by leveraging unlabeled data from the test distribution. We then disambiguate by selecting one of the discovered hypotheses using minimal additional supervision, in the form of additional labels or inspection of function visualization. We demonstrate the ability of DivDis to find hypotheses that use robust features in image classification and natural language processing problems with underspecification.
    SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning. (arXiv:2207.04606v4 [cs.LG] UPDATED)
    Sparse tensors are rapidly becoming critical components of modern deep learning workloads. However, developing high-performance sparse operators can be difficult and tedious, and existing vendor libraries cannot satisfy the escalating demands from new operators. Sparse tensor compilers simplify the development of operators, but efficient sparse compilation for deep learning remains challenging because a single sparse format cannot maximize hardware efficiency, and single-shot compilers cannot keep up with latest hardware and system advances. In this paper, we observe that the key to addressing both these challenges is to leverage composable formats and composable transformations. We propose SparseTIR, a sparse tensor compilation abstraction that offers composable formats and composable transformations for deep learning workloads. SparseTIR constructs a search space over these composable components for performance tuning. With these improvements, SparseTIR obtains consistent performance speedups vs vendor libraries on GPUs for single operators: 1.20-2.34x for GNN operators, 1.05-2.98x for sparse attention operators, and 0.56-7.45x for sparse convolution operators. SparseTIR also accelerates end-to-end GNNs by 1.08-1.52x for GraphSAGE training, and 4.20-40.18x for RGCN inference.
    On Bridging the Gap between Mean Field and Finite Width in Deep Random Neural Networks with Batch Normalization. (arXiv:2205.13076v3 [cs.LG] UPDATED)
    Mean field theory is widely used in the theoretical studies of neural networks. In this paper, we analyze the role of depth in the concentration of mean-field predictions, specifically for deep multilayer perceptron (MLP) with batch normalization (BN) at initialization. By scaling the network width to infinity, it is postulated that the mean-field predictions suffer from layer-wise errors that amplify with depth. We demonstrate that BN stabilizes the distribution of representations that avoids the error propagation of mean-field predictions. This stabilization, which is characterized by a geometric mixing property, allows us to establish concentration bounds for mean field predictions in infinitely-deep neural networks with a finite width.
    Federated Learning for ASR based on Wav2vec 2.0. (arXiv:2302.10790v1 [eess.AS])
    This paper presents a study on the use of federated learning to train an ASR model based on a wav2vec 2.0 model pre-trained by self supervision. Carried out on the well-known TED-LIUM 3 dataset, our experiments show that such a model can obtain, with no use of a language model, a word error rate of 10.92% on the official TED-LIUM 3 test set, without sharing any data from the different users. We also analyse the ASR performance for speakers depending to their participation to the federated learning. Since federated learning was first introduced for privacy purposes, we also measure its ability to protect speaker identity. To do that, we exploit an approach to analyze information contained in exchanged models based on a neural network footprint on an indicator dataset. This analysis is made layer-wise and shows which layers in an exchanged wav2vec 2.0 based model bring the speaker identity information.
    GDBN: a Graph Neural Network Approach to Dynamic Bayesian Network. (arXiv:2302.10804v1 [cs.LG])
    Identifying causal relations among multi-variate time series is one of the most important elements towards understanding the complex mechanisms underlying the dynamic system. It provides critical tools for forecasting, simulations and interventions in science and business analytics. In this paper, we proposed a graph neural network approach with score-based method aiming at learning a sparse DAG that captures the causal dependencies in a discretized time temporal graph. We demonstrate methods with graph neural network significantly outperformed other state-of-the-art methods with dynamic bayesian networking inference. In addition, from the experiments, the structural causal model can be more accurate than a linear SCM discovered by the methods such as Notears.
    AdaGDA: Faster Adaptive Gradient Descent Ascent Methods for Minimax Optimization. (arXiv:2106.16101v6 [math.OC] UPDATED)
    In the paper, we propose a class of faster adaptive Gradient Descent Ascent (GDA) methods for solving the nonconvex-strongly-concave minimax problems by using the unified adaptive matrices, which include almost all existing coordinate-wise and global adaptive learning rates. In particular, we provide an effective convergence analysis framework for our adaptive GDA methods. Specifically, we propose a fast Adaptive Gradient Descent Ascent (AdaGDA) method based on the basic momentum technique, which reaches a lower gradient complexity of $\tilde{O}(\kappa^4\epsilon^{-4})$ for finding an $\epsilon$-stationary point without large batches, which improves the existing results of the adaptive GDA methods by a factor of $O(\sqrt{\kappa})$. Moreover, we propose an accelerated version of AdaGDA (VR-AdaGDA) method based on the momentum-based variance reduced technique, which achieves a lower gradient complexity of $\tilde{O}(\kappa^{4.5}\epsilon^{-3})$ for finding an $\epsilon$-stationary point without large batches, which improves the existing results of the adaptive GDA methods by a factor of $O(\epsilon^{-1})$. Moreover, we prove that our VR-AdaGDA method can reach the best known gradient complexity of $\tilde{O}(\kappa^{3}\epsilon^{-3})$ with the mini-batch size $O(\kappa^3)$. The experiments on policy evaluation and fair classifier learning tasks are conducted to verify the efficiency of our new algorithms.
    Deterministic training of generative autoencoders using invertible layers. (arXiv:2205.09546v4 [stat.ML] UPDATED)
    In this work, we provide a deterministic alternative to the stochastic variational training of generative autoencoders. We refer to these new generative autoencoders as AutoEncoders within Flows (AEF), since the encoder and decoder are defined as affine layers of an overall invertible architecture. This results in a deterministic encoding of the data, as opposed to the stochastic encoding of VAEs. The paper introduces two related families of AEFs. The first family relies on a partition of the ambient space and is trained by exact maximum-likelihood. The second family exploits a deterministic expansion of the ambient space and is trained by maximizing the log-probability in this extended space. This latter case leaves complete freedom in the choice of encoder, decoder and prior architectures, making it a drop-in replacement for the training of existing VAEs and VAE-style models. We show that these AEFs can have strikingly higher performance than architecturally identical VAEs in terms of log-likelihood and sample quality, especially for low dimensional latent spaces. Importantly, we show that AEF samples are substantially sharper than VAE samples.
    Minimax-Bayes Reinforcement Learning. (arXiv:2302.10831v1 [cs.LG])
    While the Bayesian decision-theoretic framework offers an elegant solution to the problem of decision making under uncertainty, one question is how to appropriately select the prior distribution. One idea is to employ a worst-case prior. However, this is not as easy to specify in sequential decision making as in simple statistical estimation problems. This paper studies (sometimes approximate) minimax-Bayes solutions for various reinforcement learning problems to gain insights into the properties of the corresponding priors and policies. We find that while the worst-case prior depends on the setting, the corresponding minimax policies are more robust than those that assume a standard (i.e. uniform) prior.
    Internal Wasserstein Distance for Adversarial Attack and Defense. (arXiv:2103.07598v4 [cs.LG] UPDATED)
    Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks that would trigger misclassification of DNNs but may be imperceptible to human perception. Adversarial defense has been an important way to improve the robustness of DNNs. Existing attack methods often construct adversarial examples relying on some metrics like the $\ell_p$ distance to perturb samples. However, these metrics can be insufficient to conduct adversarial attacks due to their limited perturbations. In this paper, we propose a new internal Wasserstein distance (IWD) to capture the semantic similarity of two samples, and thus it helps to obtain larger perturbations than currently used metrics such as the $\ell_p$ distance. We then apply the internal Wasserstein distance to perform adversarial attack and defense. In particular, we develop a novel attack method relying on IWD to calculate the similarities between an image and its adversarial examples. In this way, we can generate diverse and semantically similar adversarial examples that are more difficult to defend by existing defense methods. Moreover, we devise a new defense method relying on IWD to learn robust models against unseen adversarial examples. We provide both thorough theoretical and empirical evidence to support our methods.
    Quantile Bandits for Best Arms Identification. (arXiv:2010.11568v3 [cs.LG] UPDATED)
    We consider a variant of the best arm identification task in stochastic multi-armed bandits. Motivated by risk-averse decision-making problems, our goal is to identify a set of $m$ arms with the highest $\tau$-quantile values within a fixed budget. We prove asymmetric two-sided concentration inequalities for order statistics and quantiles of random variables that have non-decreasing hazard rate, which may be of independent interest. With these inequalities, we analyse a quantile version of Successive Accepts and Rejects (Q-SAR). We derive an upper bound for the probability of arm misidentification, the first justification of a quantile based algorithm for fixed budget multiple best arms identification. We show illustrative experiments for best arm identification.
    Physics-Informed Long Short-Term Memory for Forecasting and Reconstruction of Chaos. (arXiv:2302.10779v1 [cs.LG])
    We present the Physics-Informed Long Short-Term Memory (PI-LSTM) network to reconstruct and predict the evolution of unmeasured variables in a chaotic system. The training is constrained by a regularization term, which penalizes solutions that violate the system's governing equations. The network is showcased on the Lorenz-96 model, a prototypical chaotic dynamical system, for a varying number of variables to reconstruct. First, we show the PI-LSTM architecture and explain how to constrain the differential equations, which is a non-trivial task in LSTMs. Second, the PI-LSTM is numerically evaluated in the long-term autonomous evolution to study its ergodic properties. We show that it correctly predicts the statistics of the unmeasured variables, which cannot be achieved without the physical constraint. Third, we compute the Lyapunov exponents of the network to infer the key stability properties of the chaotic system. For reconstruction purposes, adding the physics-informed loss qualitatively enhances the dynamical behaviour of the network, compared to a data-driven only training. This is quantified by the agreement of the Lyapunov exponents. This work opens up new opportunities for state reconstruction and learning of the dynamics of nonlinear systems.
    A Unifying Perspective on Multi-Calibration: Unleashing Game Dynamics for Multi-Objective Learning. (arXiv:2302.10863v1 [cs.LG])
    We provide a unifying framework for the design and analysis of multi-calibrated and moment-multi-calibrated predictors. Placing the multi-calibration problem in the general setting of \emph{multi-objective learning} -- where learning guarantees must hold simultaneously over a set of distributions and loss functions -- we exploit connections to game dynamics to obtain state-of-the-art guarantees for a diverse set of multi-calibration learning problems. In addition to shedding light on existing multi-calibration guarantees, and greatly simplifying their analysis, our approach yields a $1/\epsilon^2$ improvement in the number of oracle calls compared to the state-of-the-art algorithm of Jung et al. 2021 for learning deterministic moment-calibrated predictors and an exponential improvement in $k$ compared to the state-of-the-art algorithm of Gopalan et al. 2022 for learning a $k$-class multi-calibrated predictor. Beyond multi-calibration, we use these game dynamics to address existing and emerging considerations in the study of group fairness and multi-distribution learning.
    Combining Blockchain and Biometrics: A Survey on Technical Aspects and a First Legal Analysis. (arXiv:2302.10883v1 [cs.CV])
    Biometric recognition as a unique, hard-to-forge, and efficient way of identification and verification has become an indispensable part of the current digital world. The fast evolution of this technology has been a strong incentive for integrating it into many applications. Meanwhile, blockchain, the very attractive decentralized ledger technology, has been widely received both by the research and industry in the past years and it is being increasingly deployed nowadays in many different applications, such as money transfer, IoT, healthcare, or logistics. Recently, researchers have started to speculate what would be the pros and cons and what would be the best applications when these two technologies cross paths. This paper provides a survey of technical literature research on the combination of blockchain and biometrics and includes a first legal analysis of this integration to shed light on challenges and potentials. While this combination is still in its infancy and a growing body of literature discusses specific blockchain applications and solutions in an advanced technological set-up, this paper presents a holistic understanding of blockchains applicability in the biometric sector. This study demonstrates that combining blockchain and biometrics would be beneficial for novel applications in biometrics such as the PKI mechanism, distributed trusted service, and identity management. However, blockchain networks at their current stage are not efficient and economical for real-time applications. From a legal point of view, the allocation of accountability remains a main issue, while other difficulties remain, such as conducting a proper Data Protection Impact Assessment. Finally, it supplies technical and legal recommendations to reap the benefits and mitigate the risks of the combination.
    Robust Mean Estimation Without a Mean: Dimension-Independent Error in Polynomial Time for Symmetric Distributions. (arXiv:2302.10844v1 [cs.DS])
    In this work, we study the problem of robustly estimating the mean/location parameter of distributions without moment bounds. For a large class of distributions satisfying natural symmetry constraints we give a sequence of algorithms that can efficiently estimate its location without incurring dimension-dependent factors in the error. Concretely, suppose an adversary can arbitrarily corrupt an $\varepsilon$-fraction of the observed samples. For every $k \in \mathbb{N}$, we design an estimator using time and samples $\tilde{O}({d^k})$ such that the dependence of the error on the corruption level $\varepsilon$ is an additive factor of $O(\varepsilon^{1-\frac{1}{2k}})$. The dependence on other problem parameters is also nearly optimal. Our class contains products of arbitrary symmetric one-dimensional distributions as well as elliptical distributions, a vast generalization of the Gaussian distribution. Examples include product Cauchy distributions and multi-variate $t$-distributions. In particular, even the first moment might not exist. We provide the first efficient algorithms for this class of distributions. Previously, such results where only known under boundedness assumptions on the moments of the distribution and in particular, are provably impossible in the absence of symmetry [KSS18, CTBJ22]. For the class of distributions we consider, all previous estimators either require exponential time or incur error depending on the dimension. Our algorithms are based on a generalization of the filtering technique [DK22]. We show how this machinery can be combined with Huber-loss-based approach to work with projections of the noise. Moreover, we show how sum-of-squares proofs can be used to obtain algorithmic guarantees even for distributions without first moment. We believe that this approach may find other application in future works.
    Heterogeneous Treatment Effect Estimation using machine learning for Healthcare application: tutorial and benchmark. (arXiv:2109.12769v5 [cs.LG] UPDATED)
    Developing new drugs for target diseases is a time-consuming and expensive task, drug repurposing has become a popular topic in the drug development field. As much health claim data become available, many studies have been conducted on the data. The real-world data is noisy, sparse, and has many confounding factors. In addition, many studies have shown that drugs effects are heterogeneous among the population. Lots of advanced machine learning models about estimating heterogeneous treatment effects (HTE) have emerged in recent years, and have been applied to in econometrics and machine learning communities. These studies acknowledge medicine and drug development as the main application area, but there has been limited translational research from the HTE methodology to drug development. We aim to introduce the HTE methodology to the healthcare area and provide feasibility consideration when translating the methodology with benchmark experiments on healthcare administrative claim data. Also, we want to use benchmark experiments to show how to interpret and evaluate the model when it is applied to healthcare research. By introducing the recent HTE techniques to a broad readership in biomedical informatics communities, we expect to promote the wide adoption of causal inference using machine learning. We also expect to provide the feasibility of HTE for personalized drug effectiveness.
    Leveraging the Graph Structure of Neural Network Training Dynamics. (arXiv:2111.05410v2 [cs.LG] UPDATED)
    Understanding the training dynamics of deep neural networks (DNNs) is important as it can lead to improved training efficiency and task performance. Recent works have demonstrated that representing the wirings of static graph cannot capture how DNNs change over the course of training. Thus, in this work, we propose a compact, expressive temporal graph framework that effectively captures the dynamics of many workhorse architectures in computer vision. Specifically, it extracts an informative summary of graph properties (e.g., eigenvector centrality) over a sequence of DNN graphs obtained during training. We demonstrate that our framework captures useful dynamics by accurately predicting trained, task performance when using a summary over early training epochs (<5) across four different architectures and two image datasets. Moreover, by using a novel, highly-scalable DNN graph representation, we also show that the proposed framework captures generalizable dynamics as summaries extracted from smaller-width networks are effective when evaluated on larger widths.
    Provable Copyright Protection for Generative Models. (arXiv:2302.10870v1 [cs.LG])
    There is a growing concern that learned conditional generative models may output samples that are substantially similar to some copyrighted data $C$ that was in their training set. We give a formal definition of $\textit{near access-freeness (NAF)}$ and prove bounds on the probability that a model satisfying this definition outputs a sample similar to $C$, even if $C$ is included in its training set. Roughly speaking, a generative model $p$ is $\textit{$k$-NAF}$ if for every potentially copyrighted data $C$, the output of $p$ diverges by at most $k$-bits from the output of a model $q$ that $\textit{did not access $C$ at all}$. We also give generative model learning algorithms, which efficiently modify the original generative model learning algorithm in a black box manner, that output generative models with strong bounds on the probability of sampling protected content. Furthermore, we provide promising experiments for both language (transformers) and image (diffusion) generative models, showing minimal degradation in output quality while ensuring strong protections against sampling protected content.
    ALANNO: An Active Learning Annotation System for Mortals. (arXiv:2211.06224v2 [cs.LG] UPDATED)
    Supervised machine learning has become the cornerstone of today's data-driven society, increasing the need for labeled data. However, the process of acquiring labels is often expensive and tedious. One possible remedy is to use active learning (AL) -- a special family of machine learning algorithms designed to reduce labeling costs. Although AL has been successful in practice, a number of practical challenges hinder its effectiveness and are often overlooked in existing AL annotation tools. To address these challenges, we developed ALANNO, an open-source annotation system for NLP tasks equipped with features to make AL effective in real-world annotation projects. ALANNO facilitates annotation management in a multi-annotator setup and supports a variety of AL methods and underlying models, which are easily configurable and extensible.  ( 2 min )
    TherapyView: Visualizing Therapy Sessions with Temporal Topic Modeling and AI-Generated Arts. (arXiv:2302.10845v1 [cs.CL])
    We present the TherapyView, a demonstration system to help therapists visualize the dynamic contents of past treatment sessions, enabled by the state-of-the-art neural topic modeling techniques to analyze the topical tendencies of various psychiatric conditions and deep learning-based image generation engine to provide a visual summary. The system incorporates temporal modeling to provide a time-series representation of topic similarities at a turn-level resolution and AI-generated artworks given the dialogue segments to provide a concise representations of the contents covered in the session, offering interpretable insights for therapists to optimize their strategies and enhance the effectiveness of psychotherapy. This system provides a proof of concept of AI-augmented therapy tools with e in-depth understanding of the patient's mental state and enabling more effective treatment.
    A Note on Noisy Reservoir Computation. (arXiv:2302.10862v1 [cs.LG])
    In this note we extend the definition of the Information Processing Capacity (IPC) by Dambre et al [1] to include the effects of stochastic reservoir dynamics. We quantify the degradation of the IPC in the presence of this noise. [1] Dambre et al. Scientific Reports 2, 514, (2012)
    Benchmarking sparse system identification with low-dimensional chaos. (arXiv:2302.10787v1 [cs.LG])
    Sparse system identification is the data-driven process of obtaining parsimonious differential equations that describe the evolution of a dynamical system, balancing model complexity and accuracy. There has been rapid innovation in system identification across scientific domains, but there remains a gap in the literature for large-scale methodological comparisons that are evaluated on a variety of dynamical systems. In this work, we systematically benchmark sparse regression variants by utilizing the dysts standardized database of chaotic systems. In particular, we demonstrate how this open-source tool can be used to quantitatively compare different methods of system identification. To illustrate how this benchmark can be utilized, we perform a large comparison of four algorithms for solving the sparse identification of nonlinear dynamics (SINDy) optimization problem, finding strong performance of the original algorithm and a recent mixed-integer discrete algorithm. In all cases, we used ensembling to improve the noise robustness of SINDy and provide statistical comparisons. In addition, we show very compelling evidence that the weak SINDy formulation provides significant improvements over the traditional method, even on clean data. Lastly, we investigate how Pareto-optimal models generated from SINDy algorithms depend on the properties of the equations, finding that the performance shows no significant dependence on a set of dynamical properties that quantify the amount of chaos, scale separation, degree of nonlinearity, and the syntactic complexity.
    Federated Gradient Matching Pursuit. (arXiv:2302.10755v1 [cs.LG])
    Traditional machine learning techniques require centralizing all training data on one server or data hub. Due to the development of communication technologies and a huge amount of decentralized data on many clients, collaborative machine learning has become the main interest while providing privacy-preserving frameworks. In particular, federated learning (FL) provides such a solution to learn a shared model while keeping training data at local clients. On the other hand, in a wide range of machine learning and signal processing applications, the desired solution naturally has a certain structure that can be framed as sparsity with respect to a certain dictionary. This problem can be formulated as an optimization problem with sparsity constraints and solving it efficiently has been one of the primary research topics in the traditional centralized setting. In this paper, we propose a novel algorithmic framework, federated gradient matching pursuit (FedGradMP), to solve the sparsity constrained minimization problem in the FL setting. We also generalize our algorithms to accommodate various practical FL scenarios when only a subset of clients participate per round, when the local model estimation at clients could be inexact, or when the model parameters are sparse with respect to general dictionaries. Our theoretical analysis shows the linear convergence of the proposed algorithms. A variety of numerical experiments are conducted to demonstrate the great potential of the proposed framework -- fast convergence both in communication rounds and computation time for many important scenarios without sophisticated parameter tuning.
    A New Baseline for GreenAI: Finding the Optimal Sub-Network via Layer and Channel Pruning. (arXiv:2302.10798v1 [cs.LG])
    The concept of Green AI has been gaining attention within the deep learning community given the recent trend of ever larger and more complex neural network models. Some large models have billions of parameters causing the training time to take up to hundreds of GPU/TPU-days. The estimated energy consumption can be comparable to the annual total energy consumption of a standard household. Existing solutions to reduce the computational burden usually involve pruning the network parameters, however, they often create extra overhead either by iterative training and fine-tuning for static pruning or repeated computation of a dynamic pruning graph. We propose a new parameter pruning strategy that finds the effective group of lightweight sub-networks that minimizes the energy cost while maintaining comparable performances to the full network on given downstream tasks. Our proposed pruning scheme is green-oriented, such that the scheme only requires one-off training to discover the optimal static sub-networks by dynamic pruning methods. The pruning scheme consists of a lightweight, differentiable, and binarized gating module and novel loss functions to uncover sub-networks with user-defined sparsity. Our method enables pruning and training simultaneously, which saves energy in both the training and inference phases and avoids extra computational overhead from gating modules at inference time. Our results on CIFAR-10 and CIFAR-100 suggest that our scheme can remove ~50% of connections in deep networks with <1% reduction in classification accuracy. Compared to other related pruning methods, our method has a lower accuracy drop for equivalent reductions in computational costs.
    Hybridization of K-means with improved firefly algorithm for automatic clustering in high dimension. (arXiv:2302.10765v1 [cs.LG])
    K-means Clustering is the most well-known partitioning algorithm among all clustering, by which we can partition the data objects very easily in to more than one clusters. However, for K-means to choose an appropriate number of clusters without any prior domain knowledge about the dataset is challenging, especially in high-dimensional data objects. Hence, we have implemented the Silhouette and Elbow methods with PCA to find an optimal number of clusters. Also, previously, so many meta-heuristic swarm intelligence algorithms inspired by nature have been employed to handle the automatic data clustering problem. Firefly is efficient and robust for automatic clustering. However, in the Firefly algorithm, the entire population is automatically subdivided into sub-populations that decrease the convergence rate speed and trapping to local minima in high-dimensional optimization problems. Thus, our study proposed an enhanced firefly, i.e., a hybridized K-means with an ODFA model for automatic clustering. The experimental part shows output and graphs of the Silhouette and Elbow methods as well as the Firefly algorithm
    SparCA: Sparse Compressed Agglomeration for Feature Extraction and Dimensionality Reduction. (arXiv:2302.10776v1 [cs.LG])
    The most effective dimensionality reduction procedures produce interpretable features from the raw input space while also providing good performance for downstream supervised learning tasks. For many methods, this requires optimizing one or more hyperparameters for a specific task, which can limit generalizability. In this study we propose sparse compressed agglomeration (SparCA), a novel dimensionality reduction procedure that involves a multistep hierarchical feature grouping, compression, and feature selection process. We demonstrate the characteristics and performance of the SparCA method across heterogenous synthetic and real-world datasets, including images, natural language, and single cell gene expression data. Our results show that SparCA is applicable to a wide range of data types, produces highly interpretable features, and shows compelling performance on downstream supervised learning tasks without the need for hyperparameter tuning.
    Provably Efficient Exploration in Quantum Reinforcement Learning with Logarithmic Worst-Case Regret. (arXiv:2302.10796v1 [quant-ph])
    While quantum reinforcement learning (RL) has attracted a surge of attention recently, its theoretical understanding is limited. In particular, it remains elusive how to design provably efficient quantum RL algorithms that can address the exploration-exploitation trade-off. To this end, we propose a novel UCRL-style algorithm that takes advantage of quantum computing for tabular Markov decision processes (MDPs) with $S$ states, $A$ actions, and horizon $H$, and establish an $\mathcal{O}(\mathrm{poly}(S, A, H, \log T))$ worst-case regret for it, where $T$ is the number of episodes. Furthermore, we extend our results to quantum RL with linear function approximation, which is capable of handling problems with large state spaces. Specifically, we develop a quantum algorithm based on value target regression (VTR) for linear mixture MDPs with $d$-dimensional linear representation and prove that it enjoys $\mathcal{O}(\mathrm{poly}(d, H, \log T))$ regret. Our algorithms are variants of UCRL/UCRL-VTR algorithms in classical RL, which also leverage a novel combination of lazy updating mechanisms and quantum estimation subroutines. This is the key to breaking the $\Omega(\sqrt{T})$-regret barrier in classical RL. To the best of our knowledge, this is the first work studying the online exploration in quantum RL with provable logarithmic worst-case regret.
    Offline Reinforcement Learning for Mixture-of-Expert Dialogue Management. (arXiv:2302.10850v1 [cs.LG])
    Reinforcement learning (RL) has shown great promise for developing dialogue management (DM) agents that are non-myopic, conduct rich conversations, and maximize overall user satisfaction. Despite recent developments in RL and language models (LMs), using RL to power conversational chatbots remains challenging, in part because RL requires online exploration to learn effectively, whereas collecting novel human-bot interactions can be expensive and unsafe. This issue is exacerbated by the combinatorial action spaces facing these algorithms, as most LM agents generate responses at the word level. We develop a variety of RL algorithms, specialized to dialogue planning, that leverage recent Mixture-of-Expert Language Models (MoE-LMs) -- models that capture diverse semantics, generate utterances reflecting different intents, and are amenable for multi-turn DM. By exploiting MoE-LM structure, our methods significantly reduce the size of the action space and improve the efficacy of RL-based DM. We evaluate our methods in open-domain dialogue to demonstrate their effectiveness w.r.t.\ the diversity of intent in generated utterances and overall DM performance.
    A General-Purpose Transferable Predictor for Neural Architecture Search. (arXiv:2302.10835v1 [cs.LG])
    Understanding and modelling the performance of neural architectures is key to Neural Architecture Search (NAS). Performance predictors have seen widespread use in low-cost NAS and achieve high ranking correlations between predicted and ground truth performance in several NAS benchmarks. However, existing predictors are often designed based on network encodings specific to a predefined search space and are therefore not generalizable to other search spaces or new architecture families. In this paper, we propose a general-purpose neural predictor for NAS that can transfer across search spaces, by representing any given candidate Convolutional Neural Network (CNN) with a Computation Graph (CG) that consists of primitive operators. We further combine our CG network representation with Contrastive Learning (CL) and propose a graph representation learning procedure that leverages the structural information of unlabeled architectures from multiple families to train CG embeddings for our performance predictor. Experimental results on NAS-Bench-101, 201 and 301 demonstrate the efficacy of our scheme as we achieve strong positive Spearman Rank Correlation Coefficient (SRCC) on every search space, outperforming several Zero-Cost Proxies, including Synflow and Jacov, which are also generalizable predictors across search spaces. Moreover, when using our proposed general-purpose predictor in an evolutionary neural architecture search algorithm, we can find high-performance architectures on NAS-Bench-101 and find a MobileNetV3 architecture that attains 79.2% top-1 accuracy on ImageNet.
    Localizing the Origin of Idiopathic Ventricular Arrhythmia from ECG Using an Attention-Based Recurrent Convolutional Neural Network. (arXiv:2302.10824v1 [eess.SP])
    Idiopathic ventricular arrhythmia (IVAs) is extra abnormal heartbeats disturbing the regular heart rhythm that can become fatal if left untreated. Cardiac catheter ablation is the standard approach to treat IVAs, however, a crucial prerequisite for the ablation is the localization of IVAs' origin. The current IVA localization techniques are invasive, rely on expert interpretation, or are inaccurate. In this study, we developed a new deep-learning algorithm that can automatically identify the origin of IVAs from ECG signals without the need for expert manual analysis. Our developed deep learning algorithm was comprised of a spatial fusion to extract the most informative features from multichannel ECG data, temporal modeling to capture the evolving pattern of the ECG time series, and an attention mechanism to weigh the most important temporal features and improve the model interpretability. The algorithm was validated on a 12-lead ECG dataset collected from 334 patients (230 females) who experienced IVA and successfully underwent a catheter ablation procedure that determined IVA's exact origins. The proposed method achieved an area under the curve of 93%, an accuracy of 94%, a sensitivity of 97%, a precision of 95%, and an F1 score of 96% in locating the origin of IVAs and outperformed existing automatic and semi-automatic algorithms. The proposed method shows promise toward automatic and noninvasive evaluation of IVA patients before cardiac catheter ablation.
    Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models. (arXiv:2209.13325v3 [cs.LG] UPDATED)
    Transformer architecture has become the fundamental element of the widespread natural language processing~(NLP) models. With the trends of large NLP models, the increasing memory and computation costs hinder their efficient deployment on resource-limited devices. Therefore, transformer quantization attracts wide research interest. Recent work recognizes that structured outliers are the critical bottleneck for quantization performance. However, their proposed methods increase the computation overhead and still leave the outliers there. To fundamentally address this problem, this paper delves into the inherent inducement and importance of the outliers. We discover that $\boldsymbol \gamma$ in LayerNorm (LN) acts as a sinful amplifier for the outliers, and the importance of outliers varies greatly where some outliers provided by a few tokens cover a large area but can be clipped sharply without negative impacts. Motivated by these findings, we propose an outlier suppression framework including two components: Gamma Migration and Token-Wise Clipping. The Gamma Migration migrates the outlier amplifier to subsequent modules in an equivalent transformation, contributing to a more quantization-friendly model without any extra burden. The Token-Wise Clipping takes advantage of the large variance of token range and designs a token-wise coarse-to-fine pipeline, obtaining a clipping range with minimal final quantization loss in an efficient way. This framework effectively suppresses the outliers and can be used in a plug-and-play mode. Extensive experiments prove that our framework surpasses the existing works and, for the first time, pushes the 6-bit post-training BERT quantization to the full-precision (FP) level. Our code is available at https://github.com/wimh966/outlier_suppression.
    Backtracking Counterfactuals. (arXiv:2211.00472v2 [cs.AI] UPDATED)
    Counterfactual reasoning -- envisioning hypothetical scenarios, or possible worlds, where some circumstances are different from what (f)actually occurred (counter-to-fact) -- is ubiquitous in human cognition. Conventionally, counterfactually-altered circumstances have been treated as "small miracles" that locally violate the laws of nature while sharing the same initial conditions. In Pearl's structural causal model (SCM) framework this is made mathematically rigorous via interventions that modify the causal laws while the values of exogenous variables are shared. In recent years, however, this purely interventionist account of counterfactuals has increasingly come under scrutiny from both philosophers and psychologists. Instead, they suggest a backtracking account of counterfactuals, according to which the causal laws remain unchanged in the counterfactual world; differences to the factual world are instead "backtracked" to altered initial conditions (exogenous variables). In the present work, we explore and formalise this alternative mode of counterfactual reasoning within the SCM framework. Despite ample evidence that humans backtrack, the present work constitutes, to the best of our knowledge, the first general account and algorithmisation of backtracking counterfactuals. We discuss our backtracking semantics in the context of related literature and draw connections to recent developments in explainable artificial intelligence (XAI).
    Interpreting wealth distribution via poverty map inference using multimodal data. (arXiv:2302.10793v1 [cs.LG])
    Poverty maps are essential tools for governments and NGOs to track socioeconomic changes and adequately allocate infrastructure and services in places in need. Sensor and online crowd-sourced data combined with machine learning methods have provided a recent breakthrough in poverty map inference. However, these methods do not capture local wealth fluctuations, and are not optimized to produce accountable results that guarantee accurate predictions to all sub-populations. Here, we propose a pipeline of machine learning models to infer the mean and standard deviation of wealth across multiple geographically clustered populated places, and illustrate their performance in Sierra Leone and Uganda. These models leverage seven independent and freely available feature sources based on satellite images, and metadata collected via online crowd-sourcing and social media. Our models show that combined metadata features are the best predictors of wealth in rural areas, outperforming image-based models, which are the best for predicting the highest wealth quintiles. Our results recover the local mean and variation of wealth, and correctly capture the positive yet non-monotonous correlation between them. We further demonstrate the capabilities and limitations of model transfer across countries and the effects of data recency and other biases. Our methodology provides open tools to build towards more transparent and interpretable models to help governments and NGOs to make informed decisions based on data availability, urbanization level, and poverty thresholds.
    Eagle: Large-Scale Learning of Turbulent Fluid Dynamics with Mesh Transformers. (arXiv:2302.10803v1 [cs.LG])
    Estimating fluid dynamics is classically done through the simulation and integration of numerical models solving the Navier-Stokes equations, which is computationally complex and time-consuming even on high-end hardware. This is a notoriously hard problem to solve, which has recently been addressed with machine learning, in particular graph neural networks (GNN) and variants trained and evaluated on datasets of static objects in static scenes with fixed geometry. We attempt to go beyond existing work in complexity and introduce a new model, method and benchmark. We propose EAGLE, a large-scale dataset of 1.1 million 2D meshes resulting from simulations of unsteady fluid dynamics caused by a moving flow source interacting with nonlinear scene structure, comprised of 600 different scenes of three different types. To perform future forecasting of pressure and velocity on the challenging EAGLE dataset, we introduce a new mesh transformer. It leverages node clustering, graph pooling and global attention to learn long-range dependencies between spatially distant data points without needing a large number of iterations, as existing GNN methods do. We show that our transformer outperforms state-of-the-art performance on, both, existing synthetic and real datasets and on EAGLE. Finally, we highlight that our approach learns to attend to airflow, integrating complex information in a single iteration.
    Hyena Hierarchy: Towards Larger Convolutional Language Models. (arXiv:2302.10866v1 [cs.LG])
    Recent advances in deep learning have relied heavily on the use of large Transformers due to their ability to learn at scale. However, the core building block of Transformers, the attention operator, exhibits quadratic cost in sequence length, limiting the amount of context accessible. Existing subquadratic methods based on low-rank and sparse approximations need to be combined with dense attention layers to match Transformers, indicating a gap in capability. In this work, we propose Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating. In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods, matching attention-based models. We set a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets (WikiText103 and The Pile), reaching Transformer quality with a 20% reduction in training compute required at sequence length 2K. Hyena operators are twice as fast as highly optimized attention at sequence length 8K, and 100x faster at sequence length 64K.
    Exploring Wav2vec 2.0 fine-tuning for improved speech emotion recognition. (arXiv:2110.06309v3 [eess.AS] UPDATED)
    While Wav2Vec 2.0 has been proposed for speech recognition (ASR), it can also be used for speech emotion recognition (SER); its performance can be significantly improved using different fine-tuning strategies. Two baseline methods, vanilla fine-tuning (V-FT) and task adaptive pretraining (TAPT) are first presented. We show that V-FT is able to outperform state-of-the-art models on the IEMOCAP dataset. TAPT, an existing NLP fine-tuning strategy, further improves the performance on SER. We also introduce a novel fine-tuning method termed P-TAPT, which modifies the TAPT objective to learn contextualized emotion representations. Experiments show that P-TAPT performs better than TAPT, especially under low-resource settings. Compared to prior works in this literature, our top-line system achieved a 7.4\% absolute improvement in unweighted accuracy (UA) over the state-of-the-art performance on IEMOCAP. Our code is publicly available.
    Linear Convergence of Natural Policy Gradient Methods with Log-Linear Policies. (arXiv:2210.01400v3 [cs.LG] UPDATED)
    We consider infinite-horizon discounted Markov decision processes and study the convergence rates of the natural policy gradient (NPG) and the Q-NPG methods with the log-linear policy class. Using the compatible function approximation framework, both methods with log-linear policies can be written as inexact versions of the policy mirror descent (PMD) method. We show that both methods attain linear convergence rates and $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexities using a simple, non-adaptive geometrically increasing step size, without resorting to entropy or other strongly convex regularization. Lastly, as a byproduct, we obtain sublinear convergence rates for both methods with arbitrary constant step size.
    Evaluating the effect of data augmentation and BALD heuristics on distillation of Semantic-KITTI dataset. (arXiv:2302.10679v1 [cs.CV])
    Active Learning (AL) has remained relatively unexplored for LiDAR perception tasks in autonomous driving datasets. In this study we evaluate Bayesian active learning methods applied to the task of dataset distillation or core subset selection (subset with near equivalent performance as full dataset). We also study the effect of application of data augmentation (DA) within Bayesian AL based dataset distillation. We perform these experiments on the full Semantic-KITTI dataset. We extend our study over our existing work only on 1/4th of the same dataset. Addition of DA and BALD have a negative impact over the labeling efficiency and thus the capacity to distill datasets. We demonstrate key issues in designing a functional AL framework and finally conclude with a review of challenges in real world active learning.
    A Survey of Trustworthy Federated Learning with Perspectives on Security, Robustness, and Privacy. (arXiv:2302.10637v1 [cs.LG])
    Trustworthy artificial intelligence (AI) technology has revolutionized daily life and greatly benefited human society. Among various AI technologies, Federated Learning (FL) stands out as a promising solution for diverse real-world scenarios, ranging from risk evaluation systems in finance to cutting-edge technologies like drug discovery in life sciences. However, challenges around data isolation and privacy threaten the trustworthiness of FL systems. Adversarial attacks against data privacy, learning algorithm stability, and system confidentiality are particularly concerning in the context of distributed training in federated learning. Therefore, it is crucial to develop FL in a trustworthy manner, with a focus on security, robustness, and privacy. In this survey, we propose a comprehensive roadmap for developing trustworthy FL systems and summarize existing efforts from three key aspects: security, robustness, and privacy. We outline the threats that pose vulnerabilities to trustworthy federated learning across different stages of development, including data processing, model training, and deployment. To guide the selection of the most appropriate defense methods, we discuss specific technical solutions for realizing each aspect of Trustworthy FL (TFL). Our approach differs from previous work that primarily discusses TFL from a legal perspective or presents FL from a high-level, non-technical viewpoint.
    Clustered Data Sharing for Non-IID Federated Learning over Wireless Networks. (arXiv:2302.10747v1 [cs.LG])
    Federated Learning (FL) is a novel distributed machine learning approach to leverage data from Internet of Things (IoT) devices while maintaining data privacy. However, the current FL algorithms face the challenges of non-independent and identically distributed (non-IID) data, which causes high communication costs and model accuracy declines. To address the statistical imbalances in FL, we propose a clustered data sharing framework which spares the partial data from cluster heads to credible associates through device-to-device (D2D) communication. Moreover, aiming at diluting the data skew on nodes, we formulate the joint clustering and data sharing problem based on the privacy-preserving constrained graph. To tackle the serious coupling of decisions on the graph, we devise a distribution-based adaptive clustering algorithm (DACA) basing on three deductive cluster-forming conditions, which ensures the maximum yield of data sharing. The experiments show that the proposed framework facilitates FL on non-IID datasets with better convergence and model accuracy under a limited communication environment.
    MP-Rec: Hardware-Software Co-Design to Enable Multi-Path Recommendation. (arXiv:2302.10872v1 [cs.AR])
    Deep learning recommendation systems serve personalized content under diverse tail-latency targets and input-query loads. In order to do so, state-of-the-art recommendation models rely on terabyte-scale embedding tables to learn user preferences over large bodies of contents. The reliance on a fixed embedding representation of embedding tables not only imposes significant memory capacity and bandwidth requirements but also limits the scope of compatible system solutions. This paper challenges the assumption of fixed embedding representations by showing how synergies between embedding representations and hardware platforms can lead to improvements in both algorithmic- and system performance. Based on our characterization of various embedding representations, we propose a hybrid embedding representation that achieves higher quality embeddings at the cost of increased memory and compute requirements. To address the system performance challenges of the hybrid representation, we propose MP-Rec -- a co-design technique that exploits heterogeneity and dynamic selection of embedding representations and underlying hardware platforms. On real system hardware, we demonstrate how matching custom accelerators, i.e., GPUs, TPUs, and IPUs, with compatible embedding representations can lead to 16.65x performance speedup. Additionally, in query-serving scenarios, MP-Rec achieves 2.49x and 3.76x higher correct prediction throughput and 0.19% and 0.22% better model quality on a CPU-GPU system for the Kaggle and Terabyte datasets, respectively.
    Trading Off Privacy, Utility and Efficiency in Federated Learning. (arXiv:2209.00230v3 [cs.LG] UPDATED)
    Federated learning (FL) enables participating parties to collaboratively build a global model with boosted utility without disclosing private data information. Appropriate protection mechanisms have to be adopted to fulfill the opposing requirements in preserving \textit{privacy} and maintaining high model \textit{utility}. In addition, it is a mandate for a federated learning system to achieve high \textit{efficiency} in order to enable large-scale model training and deployment. We propose a unified federated learning framework that reconciles horizontal and vertical federated learning. Based on this framework, we formulate and quantify the trade-offs between privacy leakage, utility loss, and efficiency reduction, which leads us to the No-Free-Lunch (NFL) theorem for the federated learning system. NFL indicates that it is unrealistic to expect an FL algorithm to simultaneously provide excellent privacy, utility, and efficiency in certain scenarios. We then analyze the lower bounds for the privacy leakage, utility loss and efficiency reduction for several widely-adopted protection mechanisms including \textit{Randomization}, \textit{Homomorphic Encryption}, \textit{Secret Sharing} and \textit{Compression}. Our analysis could serve as a guide for selecting protection parameters to meet particular requirements.
    Growing Steerable Neural Cellular Automata. (arXiv:2302.10197v1 [cs.NE])
    Neural Cellular Automata (NCA) models have shown remarkable capacity for pattern formation and complex global behaviors stemming from local coordination. However, in the original implementation of NCA, cells are incapable of adjusting their own orientation, and it is the responsibility of the model designer to orient them externally. A recent isotropic variant of NCA (Growing Isotropic Neural Cellular Automata) makes the model orientation-independent - cells can no longer tell up from down, nor left from right - by removing its dependency on perceiving the gradient of spatial states in its neighborhood. In this work, we revisit NCA with a different approach: we make each cell responsible for its own orientation by allowing it to "turn" as determined by an adjustable internal state. The resulting Steerable NCA contains cells of varying orientation embedded in the same pattern. We observe how, while Isotropic NCA are orientation-agnostic, Steerable NCA have chirality: they have a predetermined left-right symmetry. We therefore show that we can train Steerable NCA in similar but simpler ways than their Isotropic variant by: (1) breaking symmetries using only two seeds, or (2) introducing a rotation-invariant training objective and relying on asynchronous cell updates to break the up-down symmetry of the system.
    Instance-wise or Class-wise? A Tale of Neighbor Shapley for Concept-based Explanation. (arXiv:2109.01369v6 [cs.LG] UPDATED)
    Deep neural networks have demonstrated remarkable performance in many data-driven and prediction-oriented applications, and sometimes even perform better than humans. However, their most significant drawback is the lack of interpretability, which makes them less attractive in many real-world applications. When relating to the moral problem or the environmental factors that are uncertain such as crime judgment, financial analysis, and medical diagnosis, it is essential to mine the evidence for the model's prediction (interpret model knowledge) to convince humans. Thus, investigating how to interpret model knowledge is of paramount importance for both academic research and real applications.
    Evaluating the Effectiveness of Pre-trained Language Models in Predicting the Helpfulness of Online Product Reviews. (arXiv:2302.10199v1 [cs.CL])
    Businesses and customers can gain valuable information from product reviews. The sheer number of reviews often necessitates ranking them based on their potential helpfulness. However, only a few reviews ever receive any helpfulness votes on online marketplaces. Sorting all reviews based on the few existing votes can cause helpful reviews to go unnoticed because of the limited attention span of readers. The problem of review helpfulness prediction is even more important for higher review volumes, and newly written reviews or launched products. In this work we compare the use of RoBERTa and XLM-R language models to predict the helpfulness of online product reviews. The contributions of our work in relation to literature include extensively investigating the efficacy of state-of-the-art language models -- both monolingual and multilingual -- against a robust baseline, taking ranking metrics into account when assessing these approaches, and assessing multilingual models for the first time. We employ the Amazon review dataset for our experiments. According to our study on several product categories, multilingual and monolingual pre-trained language models outperform the baseline that utilizes random forest with handcrafted features as much as 23% in RMSE. Pre-trained language models reduce the need for complex text feature engineering. However, our results suggest that pre-trained multilingual models may not be used for fine-tuning only one language. We assess the performance of language models with and without additional features. Our results show that including additional features like product rating by the reviewer can further help the predictive methods.
    Can Large Language Models Change User Preference Adversarially?. (arXiv:2302.10291v1 [cs.CL])
    Pretrained large language models (LLMs) are becoming increasingly powerful and ubiquitous in mainstream applications such as being a personal assistant, a dialogue model, etc. As these models become proficient in deducing user preferences and offering tailored assistance, there is an increasing concern about the ability of these models to influence, modify and in the extreme case manipulate user preference adversarially. The issue of lack of interpretability in these models in adversarial settings remains largely unsolved. This work tries to study adversarial behavior in user preferences from the lens of attention probing, red teaming and white-box analysis. Specifically, it provides a bird's eye view of existing literature, offers red teaming samples for dialogue models like ChatGPT and GODEL and probes the attention mechanism in the latter for non-adversarial and adversarial settings.
    Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation. (arXiv:2302.09664v2 [cs.CL] UPDATED)
    We introduce a method to measure uncertainty in large language models. For tasks like question answering, it is essential to know when we can trust the natural language outputs of foundation models. We show that measuring uncertainty in natural language is challenging because of "semantic equivalence" -- different sentences can mean the same thing. To overcome these challenges we introduce semantic entropy -- an entropy which incorporates linguistic invariances created by shared meanings. Our method is unsupervised, uses only a single model, and requires no modifications to off-the-shelf language models. In comprehensive ablation studies we show that the semantic entropy is more predictive of model accuracy on question answering data sets than comparable baselines.
    VoxSRC 2022: The Fourth VoxCeleb Speaker Recognition Challenge. (arXiv:2302.10248v1 [cs.SD])
    This paper summarises the findings from the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22), which was held in conjunction with INTERSPEECH 2022. The goal of this challenge was to evaluate how well state-of-the-art speaker recognition systems can diarise and recognise speakers from speech obtained "in the wild". The challenge consisted of: (i) the provision of publicly available speaker recognition and diarisation data from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a public challenge and hybrid workshop held at INTERSPEECH 2022. We describe the four tracks of our challenge along with the baselines, methods, and results. We conclude with a discussion on the new domain-transfer focus of VoxSRC-22, and on the progression of the challenge from the previous three editions.
    Deep Reinforcement Learning for Cost-Effective Medical Diagnosis. (arXiv:2302.10261v1 [cs.LG])
    Dynamic diagnosis is desirable when medical tests are costly or time-consuming. In this work, we use reinforcement learning (RL) to find a dynamic policy that selects lab test panels sequentially based on previous observations, ensuring accurate testing at a low cost. Clinical diagnostic data are often highly imbalanced; therefore, we aim to maximize the $F_1$ score instead of the error rate. However, optimizing the non-concave $F_1$ score is not a classic RL problem, thus invalidates standard RL methods. To remedy this issue, we develop a reward shaping approach, leveraging properties of the $F_1$ score and duality of policy optimization, to provably find the set of all Pareto-optimal policies for budget-constrained $F_1$ score maximization. To handle the combinatorially complex state space, we propose a Semi-Model-based Deep Diagnosis Policy Optimization (SM-DDPO) framework that is compatible with end-to-end training and online learning. SM-DDPO is tested on diverse clinical tasks: ferritin abnormality detection, sepsis mortality prediction, and acute kidney injury diagnosis. Experiments with real-world data validate that SM-DDPO trains efficiently and identifies all Pareto-front solutions. Across all tasks, SM-DDPO is able to achieve state-of-the-art diagnosis accuracy (in some cases higher than conventional methods) with up to $85\%$ reduction in testing cost. The code is available at [https://github.com/Zheng321/Blood_Panel].
    AttentionMixer: An Accurate and Interpretable Framework for Process Monitoring. (arXiv:2302.10426v1 [cs.AI])
    An accurate and explainable automatic monitoring system is critical for the safety of high efficiency energy conversion plants that operate under extreme working condition. Nonetheless, currently available data-driven monitoring systems often fall short in meeting the requirements for either high-accuracy or interpretability, which hinders their application in practice. To overcome this limitation, a data-driven approach, AttentionMixer, is proposed under a generalized message passing framework, with the goal of establishing an accurate and interpretable radiation monitoring framework for energy conversion plants. To improve the model accuracy, the first technical contribution involves the development of spatial and temporal adaptive message passing blocks, which enable the capture of spatial and temporal correlations, respectively; the two blocks are cascaded through a mixing operator. To enhance the model interpretability, the second technical contribution involves the implementation of a sparse message passing regularizer, which eliminates spurious and noisy message passing routes. The effectiveness of the AttentionMixer approach is validated through extensive evaluations on a monitoring benchmark collected from the national radiation monitoring network for nuclear power plants, resulting in enhanced monitoring accuracy and interpretability in practice.
    DTAAD: Dual Tcn-Attention Networks for Anomaly Detection in Multivariate Time Series Data. (arXiv:2302.10753v1 [cs.LG])
    Anomaly detection techniques enable effective anomaly detection and diagnosis in multi-variate time series data, which are of major significance for today's industrial applications. However, establishing an anomaly detection system that can be rapidly and accurately located is a challenging problem due to the lack of outlier tags, the high dimensional complexity of the data, memory bottlenecks in the actual hardware, and the need for fast reasoning. We have proposed an anomaly detection and diagnosis model--DTAAD in this paper, based on Transformer and Dual TCN. Our overall model will be an integrated design in which AR combines AE structures, introducing scaling methods and feedback mechanisms to improve prediction accuracy and expand correlation differences. The Dual TCN-Attention Network(DTA) constructed by us only uses a single layer of Transformer encoder in our baseline experiment, which belongs to an ultra-lightweight model. Our extensive experiments on six publicly datasets validate that DTAAD exceeds current most advanced baseline methods in both detection and diagnostic performance. Specifically, DTAAD improved F1 scores by $8.38\%$, and reduced training time by $99\%$ compared to baseline. The code and training scripts are publicly on GitHub at https://github.com/Yu-Lingrui/DTAAD.
    Crop mapping in the small sample/no sample case: an approach using a two-level cascade classifier and integrating domain knowledge. (arXiv:2302.10270v1 [cs.CV])
    Mapping crops using remote sensing technology is important for food security and land management. Machine learning-based methods has become a popular approach for crop mapping in recent years. However, the key to machine learning, acquiring ample and accurate samples, is usually time-consuming and laborious. To solve this problem, a crop mapping method in the small sample/no sample case that integrating domain knowledge and using a cascaded classification framework that combine a weak classifier learned from samples with strong features and a strong classifier trained by samples with weak feature was proposed. First, based on the domain knowledge of various crops, a low-capacity classifier such as decision tree was applied to acquire those pixels with distinctive features and complete observation sequences as "strong feature" samples. Then, to improve the representativeness of these samples, sample augmentation strategy that artificially remove the observations of "strong feature" samples according to the average valid observation proportion in target area was applied. Finally, based on the original samples and augmented samples, a large-capacity classifier such as random forest was trained for crop mapping. The method achieved an overall accuracy of 82% in the MAP crop recognition competition held by Syngenta Group, China in 2021 (third prize, ranked fourth). This method integrates domain knowledge to overcome the difficulties of sample acquisition, providing a convenient, fast and accurate solution for crop mapping.
    A Comparative Analysis of CNN-Based Pretrained Models for the Detection and Prediction of Monkeypox. (arXiv:2302.10277v1 [cs.CV])
    Monkeypox is a rare disease that raised concern among medical specialists following the convi-19 pandemic. It's concerning since monkeypox is difficult to diagnose early on because of symptoms that are similar to chickenpox and measles. Furthermore, because this is a rare condition, there is a knowledge gap among healthcare professionals. As a result, there is an urgent need for a novel technique to combat and anticipate the disease in the early phases of individual virus infection. Multiple CNN-based pre-trained models, including VGG-16, VGG-19, Restnet50, Inception-V3, Densnet, Xception, MobileNetV2, Alexnet, Lenet, and majority Voting, were employed in classification in this study. For this study, multiple data sets were combined, such as monkeypox vs chickenpox, monkeypox versus measles, monkeypox versus normal, and monkeypox versus all diseases. Majority voting performed 97% in monkeypox vs chickenpox, Xception achieved 79% in monkeypox against measles, MobileNetV2 scored 96% in monkeypox vs normal, and Lenet performed 80% in monkeypox versus all.
    Scalable Batch-Mode Deep Bayesian Active Learning via Equivalence Class Annealing. (arXiv:2112.13737v3 [cs.LG] UPDATED)
    Active learning has demonstrated data efficiency in many fields. Existing active learning algorithms, especially in the context of batch-mode deep Bayesian active models, rely heavily on the quality of uncertainty estimations of the model, and are often challenging to scale to large batches. In this paper, we propose Batch-BALanCe, a scalable batch-mode active learning algorithm, which combines insights from decision-theoretic active learning, combinatorial information measure, and diversity sampling. At its core, Batch-BALanCe relies on a novel decision-theoretic acquisition function that facilitates differentiation among different equivalence classes. Intuitively, each equivalence class consists of hypotheses (e.g., posterior samples of deep neural networks) with similar predictions, and Batch-BALanCe adaptively adjusts the size of the equivalence classes as learning progresses. To scale up the computation of queries to large batches, we further propose an efficient batch-mode acquisition procedure, which aims to maximize a novel information measure defined through the acquisition function. We show that our algorithm can effectively handle realistic multi-class classification tasks, and achieves compelling performance on several benchmark datasets for active learning under both low- and large-batch regimes. Reference code is released at https://github.com/zhangrenyuuchicago/BALanCe.
    Understanding Edge-of-Stability Training Dynamics with a Minimalist Example. (arXiv:2210.03294v2 [cs.LG] UPDATED)
    Recently, researchers observed that gradient descent for deep neural networks operates in an ``edge-of-stability'' (EoS) regime: the sharpness (maximum eigenvalue of the Hessian) is often larger than stability threshold $2/\eta$ (where $\eta$ is the step size). Despite this, the loss oscillates and converges in the long run, and the sharpness at the end is just slightly below $2/\eta$. While many other well-understood nonconvex objectives such as matrix factorization or two-layer networks can also converge despite large sharpness, there is often a larger gap between sharpness of the endpoint and $2/\eta$. In this paper, we study EoS phenomenon by constructing a simple function that has the same behavior. We give rigorous analysis for its training dynamics in a large local region and explain why the final converging point has sharpness close to $2/\eta$. Globally we observe that the training dynamics for our example has an interesting bifurcating behavior, which was also observed in the training of neural nets.
    Are we certain it's anomalous?. (arXiv:2211.09224v2 [cs.LG] UPDATED)
    The progress in modelling time series and, more generally, sequences of structured-data has recently revamped research in anomaly detection. The task stands for identifying abnormal behaviours in financial series, IT systems, aerospace measurements, and the medical domain, where anomaly detection may aid in isolating cases of depression and attend the elderly. Anomaly detection in time series is a complex task since anomalies are rare due to highly non-linear temporal correlations and since the definition of anomalous is sometimes subjective. Here we propose the novel use of Hyperbolic uncertainty for Anomaly Detection (HypAD). HypAD learns self-supervisedly to reconstruct the input signal. We adopt best practices from the state-of-the-art to encode the sequence by an LSTM, jointly learnt with a decoder to reconstruct the signal, with the aid of GAN critics. Uncertainty is estimated end-to-end by means of a hyperbolic neural network. By using uncertainty, HypAD may assess whether it is certain about the input signal but it fails to reconstruct it because this is anomalous; or whether the reconstruction error does not necessarily imply anomaly, as the model is uncertain, e.g. a complex but regular input signal. The novel key idea is that a detectable anomaly is one where the model is certain but it predicts wrongly. HypAD outperforms the current state-of-the-art for univariate anomaly detection on established benchmarks based on data from NASA, Yahoo, Numenta, Amazon, Twitter. It also yields state-of-the-art performance on a multivariate dataset of anomaly activities in elderly home residences, and it outperforms the baseline on SWaT. Overall, HypAD yields the lowest false alarms at the best performance rate, thanks to successfully identifying detectable anomalies.
    A Review of Probabilistic Control and Majorization of Optimal Control. (arXiv:2205.03279v3 [cs.LG] UPDATED)
    In probabilistic control a controller is designed by matching modelled with some arbitrary but desired closed-loop system trajectory distribution. In thisworkwe reviewseveral productive approaches to measure the proximity between probable and desired behaviour. We then illustrate how the associated optimization problems solve into uncertain policies. Our main result is to show that these probabilistic control objectives majorize conventional, stochastic and risk sensitive, optimal control objectives. This observation allows us to identify two probabilistic fixed point iterations that converge to the deterministic optimal control policies. Based on these insights we discuss directions for future algorithmic development and point out some remaining challenges.
    Dateformer: Time-modeling Transformer for Longer-term Series Forecasting. (arXiv:2207.05397v2 [cs.LG] UPDATED)
    Transformers have demonstrated impressive strength in long-term series forecasting. Existing prediction research mostly focused on mapping past short sub-series (lookback window) to future series (forecast window). The longer training dataset time series will be discarded, once training is completed. Models can merely rely on lookback window information for inference, which impedes models from analyzing time series from a global perspective. And these windows used by Transformers are quite narrow because they must model each time-step therein. Under this point-wise processing style, broadening windows will rapidly exhaust their model capacity. This, for fine-grained time series, leads to a bottleneck in information input and prediction output, which is mortal to long-term series forecasting. To overcome the barrier, we propose a brand-new methodology to utilize Transformer for time series forecasting. Specifically, we split time series into patches by day and reform point-wise to patch-wise processing, which considerably enhances the information input and output of Transformers. To further help models leverage the whole training set's global information during inference, we distill the information, store it in time representations, and replace series with time representations as the main modeling entities. Our designed time-modeling Transformer -- Dateformer yields state-of-the-art accuracy on 7 real-world datasets with a 33.6\% relative improvement and extends the maximum forecast range to half-year.
    On Robust Numerical Solver for ODE via Self-Attention Mechanism. (arXiv:2302.10184v1 [cs.LG])
    With the development of deep learning techniques, AI-enhanced numerical solvers are expected to become a new paradigm for solving differential equations due to their versatility and effectiveness in alleviating the accuracy-speed trade-off in traditional numerical solvers. However, this paradigm still inevitably requires a large amount of high-quality data, whose acquisition is often very expensive in natural science and engineering problems. Therefore, in this paper, we explore training efficient and robust AI-enhanced numerical solvers with a small data size by mitigating intrinsic noise disturbances. We first analyze the ability of the self-attention mechanism to regulate noise in supervised learning and then propose a simple-yet-effective numerical solver, AttSolver, which introduces an additive self-attention mechanism to the numerical solution of differential equations based on the dynamical system perspective of the residual neural network. Our results on benchmarks, ranging from high-dimensional problems to chaotic systems, demonstrate the effectiveness of AttSolver in generally improving the performance of existing traditional numerical solvers without any elaborated model crafting. Finally, we analyze the convergence, generalization, and robustness of the proposed method experimentally and theoretically.
    Dual Representation Learning for One-Step Clustering of Multi-View Data. (arXiv:2208.14450v2 [cs.LG] UPDATED)
    Multi-view data are commonly encountered in data mining applications. Effective extraction of information from multi-view data requires specific design of clustering methods to cater for data with multiple views, which is non-trivial and challenging. In this paper, we propose a novel one-step multi-view clustering method by exploiting the dual representation of both the common and specific information of different views. The motivation originates from the rationale that multi-view data contain not only the consistent knowledge between views but also the unique knowledge of each view. Meanwhile, to make the representation learning more specific to the clustering task, a one-step learning framework is proposed to integrate representation learning and clustering partition as a whole. With this framework, the representation learning and clustering partition mutually benefit each other, which effectively improve the clustering performance. Results from extensive experiments conducted on benchmark multi-view datasets clearly demonstrate the superiority of the proposed method.
    Understanding the effect of varying amounts of replay per step. (arXiv:2302.10311v1 [cs.LG])
    Model-based reinforcement learning uses models to plan, where the predictions and policies of an agent can be improved by using more computation without additional data from the environment, thereby improving sample efficiency. However, learning accurate estimates of the model is hard. Subsequently, the natural question is whether we can get similar benefits as planning with model-free methods. Experience replay is an essential component of many model-free algorithms enabling sample-efficient learning and stability by providing a mechanism to store past experiences for further reuse in the gradient computational process. Prior works have established connections between models and experience replay by planning with the latter. This involves increasing the number of times a mini-batch is sampled and used for updates at each step (amount of replay per step). We attempt to exploit this connection by doing a systematic study on the effect of varying amounts of replay per step in a well-known model-free algorithm: Deep Q-Network (DQN) in the Mountain Car environment. We empirically show that increasing replay improves DQN's sample efficiency, reduces the variation in its performance, and makes it more robust to change in hyperparameters. Altogether, this takes a step toward a better algorithm for deployment.
    Transfer Ranking in Finance: Applications to Cross-Sectional Momentum with Data Scarcity. (arXiv:2208.09968v3 [q-fin.TR] UPDATED)
    Cross-sectional strategies are a classical and popular trading style, with recent high performing variants incorporating sophisticated neural architectures. While these strategies have been applied successfully to data-rich settings involving mature assets with long histories, deploying them on instruments with limited samples generally produce over-fitted models with degraded performance. In this paper, we introduce Fused Encoder Networks -- a novel and hybrid parameter-sharing transfer ranking model. The model fuses information extracted using an encoder-attention module operated on a source dataset with a similar but separate module focused on a smaller target dataset of interest. This mitigates the issue of models with poor generalisability that are a consequence of training on scarce target data. Additionally, the self-attention mechanism enables interactions among instruments to be accounted for, not just at the loss level during model training, but also at inference time. Focusing on momentum applied to the top ten cryptocurrencies by market capitalisation as a demonstrative use-case, the Fused Encoder Networks outperforms the reference benchmarks on most performance measures, delivering a three-fold boost in the Sharpe ratio over classical momentum as well as an improvement of approximately 50% against the best benchmark model without transaction costs. It continues outperforming baselines even after accounting for the high transaction costs associated with trading cryptocurrencies.
    Uncertainty-Aware Reward-based Deep Reinforcement Learning for Intent Analysis of Social Media Information. (arXiv:2302.10195v1 [cs.CL])
    Due to various and serious adverse impacts of spreading fake news, it is often known that only people with malicious intent would propagate fake news. However, it is not necessarily true based on social science studies. Distinguishing the types of fake news spreaders based on their intent is critical because it will effectively guide how to intervene to mitigate the spread of fake news with different approaches. To this end, we propose an intent classification framework that can best identify the correct intent of fake news. We will leverage deep reinforcement learning (DRL) that can optimize the structural representation of each tweet by removing noisy words from the input sequence when appending an actor to the long short-term memory (LSTM) intent classifier. Policy gradient DRL model (e.g., REINFORCE) can lead the actor to a higher delayed reward. We also devise a new uncertainty-aware immediate reward using a subjective opinion that can explicitly deal with multidimensional uncertainty for effective decision-making. Via 600K training episodes from a fake news tweets dataset with an annotated intent class, we evaluate the performance of uncertainty-aware reward in DRL. Evaluation results demonstrate that our proposed framework efficiently reduces the number of selected words to maintain a high 95\% multi-class accuracy.
    Hierarchical Perception Adversarial Learning Framework for Compressed Sensing MRI. (arXiv:2302.10309v1 [eess.IV])
    The long acquisition time has limited the accessibility of magnetic resonance imaging (MRI) because it leads to patient discomfort and motion artifacts. Although several MRI techniques have been proposed to reduce the acquisition time, compressed sensing in magnetic resonance imaging (CS-MRI) enables fast acquisition without compromising SNR and resolution. However, existing CS-MRI methods suffer from the challenge of aliasing artifacts. This challenge results in the noise-like textures and missing the fine details, thus leading to unsatisfactory reconstruction performance. To tackle this challenge, we propose a hierarchical perception adversarial learning framework (HP-ALF). HP-ALF can perceive the image information in the hierarchical mechanism: image-level perception and patch-level perception. The former can reduce the visual perception difference in the entire image, and thus achieve aliasing artifact removal. The latter can reduce this difference in the regions of the image, and thus recover fine details. Specifically, HP-ALF achieves the hierarchical mechanism by utilizing multilevel perspective discrimination. This discrimination can provide the information from two perspectives (overall and regional) for adversarial learning. It also utilizes a global and local coherent discriminator to provide structure information to the generator during training. In addition, HP-ALF contains a context-aware learning block to effectively exploit the slice information between individual images for better reconstruction performance. The experiments validated on three datasets demonstrate the effectiveness of HP-ALF and its superiority to the comparative methods.
    Unsupervised Learning on a DIET: Datum IndEx as Target Free of Self-Supervision, Reconstruction, Projector Head. (arXiv:2302.10260v1 [cs.AI])
    Costly, noisy, and over-specialized, labels are to be set aside in favor of unsupervised learning if we hope to learn cheap, reliable, and transferable models. To that end, spectral embedding, self-supervised learning, or generative modeling have offered competitive solutions. Those methods however come with numerous challenges \textit{e.g.} estimating geodesic distances, specifying projector architectures and anti-collapse losses, or specifying decoder architectures and reconstruction losses. In contrast, we introduce a simple explainable alternative -- coined \textbf{DIET} -- to learn representations from unlabeled data, free of those challenges. \textbf{DIET} is blatantly simple: take one's favorite classification setup and use the \textbf{D}atum \textbf{I}nd\textbf{E}x as its \textbf{T}arget class, \textit{i.e. each sample is its own class}, no further changes needed. \textbf{DIET} works without a decoder/projector network, is not based on positive pairs nor reconstruction, introduces no hyper-parameters, and works out-of-the-box across datasets and architectures. Despite \textbf{DIET}'s simplicity, the learned representations are of high-quality and often on-par with the state-of-the-art \textit{e.g.} using a linear classifier on top of DIET's learned representation reaches $71.4\%$ on CIFAR100 with a Resnet101, $52.5\%$ on TinyImagenet with a Resnext50.
    Neural Algorithmic Reasoning with Causal Regularisation. (arXiv:2302.10258v1 [cs.LG])
    Recent work on neural algorithmic reasoning has investigated the reasoning capabilities of neural networks, effectively demonstrating they can learn to execute classical algorithms on unseen data coming from the train distribution. However, the performance of existing neural reasoners significantly degrades on out-of-distribution (OOD) test data, where inputs have larger sizes. In this work, we make an important observation: there are many \emph{different} inputs for which an algorithm will perform certain intermediate computations \emph{identically}. This insight allows us to develop data augmentation procedures that, given an algorithm's intermediate trajectory, produce inputs for which the target algorithm would have \emph{exactly} the same next trajectory step. Then, we employ a causal framework to design a corresponding self-supervised objective, and we prove that it improves the OOD generalisation capabilities of the reasoner. We evaluate our method on the CLRS algorithmic reasoning benchmark, where we show up to 3$\times$ improvements on the OOD test data.
    MAC-PO: Multi-Agent Experience Replay via Collective Priority Optimization. (arXiv:2302.10418v1 [cs.LG])
    Experience replay is crucial for off-policy reinforcement learning (RL) methods. By remembering and reusing the experiences from past different policies, experience replay significantly improves the training efficiency and stability of RL algorithms. Many decision-making problems in practice naturally involve multiple agents and require multi-agent reinforcement learning (MARL) under centralized training decentralized execution paradigm. Nevertheless, existing MARL algorithms often adopt standard experience replay where the transitions are uniformly sampled regardless of their importance. Finding prioritized sampling weights that are optimized for MARL experience replay has yet to be explored. To this end, we propose \name, which formulates optimal prioritized experience replay for multi-agent problems as a regret minimization over the sampling weights of transitions. Such optimization is relaxed and solved using the Lagrangian multiplier approach to obtain the close-form optimal sampling weights. By minimizing the resulting policy regret, we can narrow the gap between the current policy and a nominal optimal policy, thus acquiring an improved prioritization scheme for multi-agent tasks. Our experimental results on Predator-Prey and StarCraft Multi-Agent Challenge environments demonstrate the effectiveness of our method, having a better ability to replay important transitions and outperforming other state-of-the-art baselines.  ( 2 min )
    Online Evolutionary Neural Architecture Search for Multivariate Non-Stationary Time Series Forecasting. (arXiv:2302.10347v1 [cs.LG])
    Time series forecasting (TSF) is one of the most important tasks in data science given the fact that accurate time series (TS) predictive models play a major role across a wide variety of domains including finance, transportation, health care, and power systems. Real-world utilization of machine learning (ML) typically involves (pre-)training models on collected, historical data and then applying them to unseen data points. However, in real-world applications, time series data streams are usually non-stationary and trained ML models usually, over time, face the problem of data or concept drift. To address this issue, models must be periodically retrained or redesigned, which takes significant human and computational resources. Additionally, historical data may not even exist to re-train or re-design model with. As a result, it is highly desirable that models are designed and trained in an online fashion. This work presents the Online NeuroEvolution-based Neural Architecture Search (ONE-NAS) algorithm, which is a novel neural architecture search method capable of automatically designing and dynamically training recurrent neural networks (RNNs) for online forecasting tasks. Without any pre-training, ONE-NAS utilizes populations of RNNs that are continuously updated with new network structures and weights in response to new multivariate input data. ONE-NAS is tested on real-world, large-scale multivariate wind turbine data as well as the univariate Dow Jones Industrial Average (DJIA) dataset. Results demonstrate that ONE-NAS outperforms traditional statistical time series forecasting methods, including online linear regression, fixed long short-term memory (LSTM) and gated recurrent unit (GRU) models trained online, as well as state-of-the-art, online ARIMA strategies.  ( 2 min )
    Take Me Home: Reversing Distribution Shifts using Reinforcement Learning. (arXiv:2302.10341v1 [cs.LG])
    Deep neural networks have repeatedly been shown to be non-robust to the uncertainties of the real world. Even subtle adversarial attacks and naturally occurring distribution shifts wreak havoc on systems relying on deep neural networks. In response to this, current state-of-the-art techniques use data-augmentation to enrich the training distribution of the model and consequently improve robustness to natural distribution shifts. We propose an alternative approach that allows the system to recover from distribution shifts online. Specifically, our method applies a sequence of semantic-preserving transformations to bring the shifted data closer in distribution to the training set, as measured by the Wasserstein distance. We formulate the problem of sequence selection as an MDP, which we solve using reinforcement learning. To aid in our estimates of Wasserstein distance, we employ dimensionality reduction through orthonormal projection. We provide both theoretical and empirical evidence that orthonormal projection preserves characteristics of the data at the distributional level. Finally, we apply our distribution shift recovery approach to the ImageNet-C benchmark for distribution shifts, targeting shifts due to additive noise and image histogram modifications. We demonstrate an improvement in average accuracy up to 14.21% across a variety of state-of-the-art ImageNet classifiers.  ( 2 min )
    Quantum Machine Learning hyperparameter search. (arXiv:2302.10298v1 [cs.LG])
    This paper presents a quantum-based Fourier-regression approach for machine learning hyperparameter optimization applied to a benchmark of models trained on a dataset related to a forecast problem in the airline industry. Our approach utilizes the Fourier series method to represent the hyperparameter search space, which is then optimized using quantum algorithms to find the optimal set of hyperparameters for a given machine learning model. Our study evaluates the proposed method on a benchmark of models trained to predict a forecast problem in the airline industry using a standard HyperParameter Optimizer (HPO). The results show that our approach outperforms traditional hyperparameter optimization methods in terms of accuracy and convergence speed for the given search space. Our study provides a new direction for future research in quantum-based machine learning hyperparameter optimization.  ( 2 min )
    Active Learning with Positive and Negative Pairwise Feedback. (arXiv:2302.10295v1 [cs.LG])
    In this paper, we propose a generic framework for active clustering with queries for pairwise similarities between objects. First, the pairwise similarities can be any positive or negative number, yielding full flexibility in the type of feedback that a user/annotator can provide. Second, the process of querying pairwise similarities is separated from the clustering algorithm, leading to more flexibility in how the query strategies can be constructed. Third, the queries are robust to noise by allowing multiple queries for the same pairwise similarity (i.e., a non-persistent noise model is assumed). Finally, the number of clusters is automatically identified based on the currently known pairwise similarities. In addition, we propose and analyze a number of novel query strategies suited to this active clustering framework. We demonstrate the effectiveness of our framework and the proposed query strategies via several experimental studies.  ( 2 min )
    Paparazzi: A Deep Dive into the Capabilities of Language and Vision Models for Grounding Viewpoint Descriptions. (arXiv:2302.10282v1 [cs.CV])
    Existing language and vision models achieve impressive performance in image-text understanding. Yet, it is an open question to what extent they can be used for language understanding in 3D environments and whether they implicitly acquire 3D object knowledge, e.g. about different views of an object. In this paper, we investigate whether a state-of-the-art language and vision model, CLIP, is able to ground perspective descriptions of a 3D object and identify canonical views of common objects based on text queries. We present an evaluation framework that uses a circling camera around a 3D object to generate images from different viewpoints and evaluate them in terms of their similarity to natural language descriptions. We find that a pre-trained CLIP model performs poorly on most canonical views and that fine-tuning using hard negative sampling and random contrasting yields good results even under conditions with little available training data.  ( 2 min )
    Link Prediction on Latent Heterogeneous Graphs. (arXiv:2302.10432v1 [cs.LG])
    On graph data, the multitude of node or edge types gives rise to heterogeneous information networks (HINs). To preserve the heterogeneous semantics on HINs, the rich node/edge types become a cornerstone of HIN representation learning. However, in real-world scenarios, type information is often noisy, missing or inaccessible. Assuming no type information is given, we define a so-called latent heterogeneous graph (LHG), which carries latent heterogeneous semantics as the node/edge types cannot be observed. In this paper, we study the challenging and unexplored problem of link prediction on an LHG. As existing approaches depend heavily on type-based information, they are suboptimal or even inapplicable on LHGs. To address the absence of type information, we propose a model named LHGNN, based on the novel idea of semantic embedding at node and path levels, to capture latent semantics on and between nodes. We further design a personalization function to modulate the heterogeneous contexts conditioned on their latent semantics w.r.t. the target node, to enable finer-grained aggregation. Finally, we conduct extensive experiments on four benchmark datasets, and demonstrate the superior performance of LHGNN.  ( 2 min )
    Unsupervised Out-of-Distribution Detection with Diffusion Inpainting. (arXiv:2302.10326v1 [cs.CV])
    Unsupervised out-of-distribution detection (OOD) seeks to identify out-of-domain data by learning only from unlabeled in-domain data. We present a novel approach for this task - Lift, Map, Detect (LMD) - that leverages recent advancement in diffusion models. Diffusion models are one type of generative models. At their core, they learn an iterative denoising process that gradually maps a noisy image closer to their training manifolds. LMD leverages this intuition for OOD detection. Specifically, LMD lifts an image off its original manifold by corrupting it, and maps it towards the in-domain manifold with a diffusion model. For an out-of-domain image, the mapped image would have a large distance away from its original manifold, and LMD would identify it as OOD accordingly. We show through extensive experiments that LMD achieves competitive performance across a broad variety of datasets.  ( 2 min )
    Mean Parity Fair Regression in RKHS. (arXiv:2302.10409v1 [stat.ML])
    We study the fair regression problem under the notion of Mean Parity (MP) fairness, which requires the conditional mean of the learned function output to be constant with respect to the sensitive attributes. We address this problem by leveraging reproducing kernel Hilbert space (RKHS) to construct the functional space whose members are guaranteed to satisfy the fairness constraints. The proposed functional space suggests a closed-form solution for the fair regression problem that is naturally compatible with multiple sensitive attributes. Furthermore, by formulating the fairness-accuracy tradeoff as a relaxed fair regression problem, we derive a corresponding regression function that can be implemented efficiently and provides interpretable tradeoffs. More importantly, under some mild assumptions, the proposed method can be applied to regression problems with a covariance-based notion of fairness. Experimental results on benchmark datasets show the proposed methods achieve competitive and even superior performance compared with several state-of-the-art methods.  ( 2 min )
    Faster high-accuracy log-concave sampling via algorithmic warm starts. (arXiv:2302.10249v1 [math.ST])
    Understanding the complexity of sampling from a strongly log-concave and log-smooth distribution $\pi$ on $\mathbb{R}^d$ to high accuracy is a fundamental problem, both from a practical and theoretical standpoint. In practice, high-accuracy samplers such as the classical Metropolis-adjusted Langevin algorithm (MALA) remain the de facto gold standard; and in theory, via the proximal sampler reduction, it is understood that such samplers are key for sampling even beyond log-concavity (in particular, for distributions satisfying isoperimetric assumptions). In this work, we improve the dimension dependence of this sampling problem to $\tilde{O}(d^{1/2})$, whereas the previous best result for MALA was $\tilde{O}(d)$. This closes the long line of work on the complexity of MALA, and moreover leads to state-of-the-art guarantees for high-accuracy sampling under strong log-concavity and beyond (thanks to the aforementioned reduction). Our starting point is that the complexity of MALA improves to $\tilde{O}(d^{1/2})$, but only under a warm start (an initialization with constant R\'enyi divergence w.r.t. $\pi$). Previous algorithms took much longer to find a warm start than to use it, and closing this gap has remained an important open problem in the field. Our main technical contribution settles this problem by establishing the first $\tilde{O}(d^{1/2})$ R\'enyi mixing rates for the discretized underdamped Langevin diffusion. For this, we develop new differential-privacy-inspired techniques based on R\'enyi divergences with Orlicz--Wasserstein shifts, which allow us to sidestep longstanding challenges for proving fast convergence of hypocoercive differential equations.  ( 2 min )
    Heterogeneous Social Event Detection via Hyperbolic Graph Representations. (arXiv:2302.10362v1 [cs.SI])
    Social events reflect the dynamics of society and, here, natural disasters and emergencies receive significant attention. The timely detection of these events can provide organisations and individuals with valuable information to reduce or avoid losses. However, due to the complex heterogeneities of the content and structure of social media, existing models can only learn limited information; large amounts of semantic and structural information are ignored. In addition, due to high labour costs, it is rare for social media datasets to include high-quality labels, which also makes it challenging for models to learn information from social media. In this study, we propose two hyperbolic graph representation-based methods for detecting social events from heterogeneous social media environments. For cases where a dataset has labels, we designed a Hyperbolic Social Event Detection (HSED) model that converts complex social information into a unified social message graph. This model addresses the heterogeneity of social media, and, with this graph, the information in social media can be used to capture structural information based on the properties of hyperbolic space. For cases where the dataset is unlabelled, we designed an Unsupervised Hyperbolic Social Event Detection (UHSED). This model is based on the HSED model but includes graph contrastive learning to make it work in unlabelled scenarios. Extensive experiments demonstrate the superiority of the proposed approaches.  ( 2 min )
    Analyzing Multimodal Objectives Through the Lens of Generative Diffusion Guidance. (arXiv:2302.10305v1 [cs.CV])
    Recent years have witnessed astonishing advances in the field of multimodal representation learning, with contrastive learning being the cornerstone for major breakthroughs. Latest works delivered further improvements by incorporating different objectives such as masked modeling and captioning into the frameworks, but our understanding on how these objectives facilitate learning remains vastly incomplete. In this paper, we leverage the fact that classifier-guided diffusion models generate images that reflect the semantic signals provided by the classifier to study the characteristics of multimodal learning objectives. Specifically, we compare contrastive, matching and captioning loss in terms of their semantic signals, and introduce a simple baseline that not only supports our analyses but also improves the quality of generative guidance in a straightforward manner.  ( 2 min )
    Model-based feature selection for neural networks: A mixed-integer programming approach. (arXiv:2302.10344v1 [math.OC])
    In this work, we develop a novel input feature selection framework for ReLU-based deep neural networks (DNNs), which builds upon a mixed-integer optimization approach. While the method is generally applicable to various classification tasks, we focus on finding input features for image classification for clarity of presentation. The idea is to use a trained DNN, or an ensemble of trained DNNs, to identify the salient input features. The input feature selection is formulated as a sequence of mixed-integer linear programming (MILP) problems that find sets of sparse inputs that maximize the classification confidence of each category. These ''inverse'' problems are regularized by the number of inputs selected for each category and by distribution constraints. Numerical results on the well-known MNIST and FashionMNIST datasets show that the proposed input feature selection allows us to drastically reduce the size of the input to $\sim$15\% while maintaining a good classification accuracy. This allows us to design DNNs with significantly fewer connections, reducing computational effort and producing DNNs that are more robust towards adversarial attacks.  ( 2 min )
    On Function-Coupled Watermarks for Deep Neural Networks. (arXiv:2302.10296v1 [cs.CV])
    Well-performed deep neural networks (DNNs) generally require massive labelled data and computational resources for training. Various watermarking techniques are proposed to protect such intellectual properties (IPs), wherein the DNN providers implant secret information into the model so that they can later claim IP ownership by retrieving their embedded watermarks with some dedicated trigger inputs. While promising results are reported in the literature, existing solutions suffer from watermark removal attacks, such as model fine-tuning and model pruning. In this paper, we propose a novel DNN watermarking solution that can effectively defend against the above attacks. Our key insight is to enhance the coupling of the watermark and model functionalities such that removing the watermark would inevitably degrade the model's performance on normal inputs. To this end, unlike previous methods relying on secret features learnt from out-of-distribution data, our method only uses features learnt from in-distribution data. Specifically, on the one hand, we propose to sample inputs from the original training dataset and fuse them as watermark triggers. On the other hand, we randomly mask model weights during training so that the information of our embedded watermarks spreads in the network. By doing so, model fine-tuning/pruning would not forget our function-coupled watermarks. Evaluation results on various image classification tasks show a 100\% watermark authentication success rate under aggressive watermark removal attacks, significantly outperforming existing solutions. Code is available: https://github.com/cure-lab/Function-Coupled-Watermark.  ( 2 min )
    From seeing to remembering: Images with harder-to-reconstruct representations leave stronger memory traces. (arXiv:2302.10392v1 [q-bio.NC])
    Much of what we remember is not due to intentional selection, but simply a by-product of perceiving. This raises a foundational question about the architecture of the mind: How does perception interface with and influence memory? Here, inspired by a classic proposal relating perceptual processing to memory durability, the level-of-processing theory, we present a sparse coding model for compressing feature embeddings of images, and show that the reconstruction residuals from this model predict how well images are encoded into memory. In an open memorability dataset of scene images, we show that reconstruction error not only explains memory accuracy but also response latencies during retrieval, subsuming, in the latter case, all of the variance explained by powerful vision-only models. We also confirm a prediction of this account with 'model-driven psychophysics'. This work establishes reconstruction error as a novel signal interfacing perception and memory, possibly through adaptive modulation of perceptual processing.  ( 2 min )
    Adaptive Sparse Gaussian Process. (arXiv:2302.10325v1 [cs.LG])
    Adaptive learning is necessary for non-stationary environments where the learning machine needs to forget past data distribution. Efficient algorithms require a compact model update to not grow in computational burden with the incoming data and with the lowest possible computational cost for online parameter updating. Existing solutions only partially cover these needs. Here, we propose the first adaptive sparse Gaussian Process (GP) able to address all these issues. We first reformulate a variational sparse GP algorithm to make it adaptive through a forgetting factor. Next, to make the model inference as simple as possible, we propose updating a single inducing point of the sparse GP model together with the remaining model parameters every time a new sample arrives. As a result, the algorithm presents a fast convergence of the inference process, which allows an efficient model update (with a single inference iteration) even in highly non-stationary environments. Experimental results demonstrate the capabilities of the proposed algorithm and its good performance in modeling the predictive posterior in mean and confidence interval estimation compared to state-of-the-art approaches.  ( 2 min )
    Route, Interpret, Repeat: Blurring the Line Between Post hoc Explainability and Interpretable Models. (arXiv:2302.10289v1 [cs.LG])
    The current approach to ML model design is either to choose a flexible Blackbox model and explain it post hoc or to start with an interpretable model. Blackbox models are flexible but difficult to explain, whereas interpretable models are designed to be explainable. However, developing interpretable models necessitates extensive ML knowledge, and the resulting models tend to be less flexible, offering potentially subpar performance compared to their Blackbox equivalents. This paper aims to blur the distinction between a post hoc explanation of a BlackBox and constructing interpretable models. We propose beginning with a flexible BlackBox model and gradually \emph{carving out} a mixture of interpretable models and a \emph{residual network}. Our design identifies a subset of samples and \emph{routes} them through the interpretable models. The remaining samples are routed through a flexible residual network. We adopt First Order Logic (FOL) as the interpretable model's backbone, which provides basic reasoning on concepts retrieved from the BlackBox model. On the residual network, we repeat the method until the proportion of data explained by the residual network falls below a desired threshold. Our approach offers several advantages. First, the mixture of interpretable and flexible residual networks results in almost no compromise in performance. Second, the route, interpret, and repeat approach yields a highly flexible interpretable model. Our extensive experiment demonstrates the performance of the model on various datasets. We show that by editing the FOL model, we can fix the shortcut learned by the original BlackBox model. Finally, our method provides a framework for a hybrid symbolic-connectionist network that is simple to train and adaptable to many applications.  ( 2 min )
    Criminal Investigation Tracker with Suspect Prediction using Machine Learning. (arXiv:2302.10423v1 [cs.LG])
    An automated approach to identifying offenders in Sri Lanka would be better than the current system. Obtaining information from eyewitnesses is one of the less reliable approaches and procedures still in use today. Automated criminal identification has the ability to save lives, notwithstanding Sri Lankan culture's lack of awareness of the issue. Using cutting-edge technology like biometrics to finish this task would be the most accurate strategy. The most notable outcomes will be obtained by applying fingerprint and face recognition as biometric techniques. The main responsibilities will be image optimization and criminality. CCTV footage may be used to identify a person's fingerprint, identify a person's face, and identify crimes involving weapons. Additionally, we unveil a notification system and condense the police report to Additionally, to make it simpler for police officers to understand the essential points of the crime, we develop a notification system and condense the police report. Additionally, if an incident involving a weapon is detected, an automated notice of the crime with all the relevant facts is sent to the closest police station. The summarization of the police report is what makes this the most original. In order to improve the efficacy of the overall image, the system will quickly and precisely identify the full crime scene, identify, and recognize the suspects using their faces and fingerprints, and detect firearms. This study provides a novel approach for crime prediction based on real-world data, and criminality incorporation. A crime or occurrence should be reported to the appropriate agencies, and the suggested web application should be improved further to offer a workable channel of communication.  ( 3 min )
    Hadamard Layer to Improve Semantic Segmentation. (arXiv:2302.10318v1 [cs.CV])
    The Hadamard Layer, a simple and computationally efficient way to improve results in semantic segmentation tasks, is presented. This layer has no free parameters that require to be trained. Therefore it does not increase the number of model parameters, and the extra computational cost is marginal. Experimental results show that the new Hadamard layer substantially improves the performance of the investigated models (variants of the Pix2Pix model). The performance's improvement can be explained by the Hadamard layer forcing the network to produce an internal encoding of the classes so that all bins are active. Therefore, the network computation is more distributed. In a sort that the Hadamard layer requires that to change the predicted class, it is necessary to modify $2^{k-1}$ bins, assuming $k$ bins in the encoding. A specific loss function allows a stable and fast training convergence.  ( 2 min )
    Gaussian processes at the Helm(holtz): A more fluid model for ocean currents. (arXiv:2302.10364v1 [stat.ME])
    Oceanographers are interested in predicting ocean currents and identifying divergences in a current vector field based on sparse observations of buoy velocities. Since we expect current dynamics to be smooth but highly non-linear, Gaussian processes (GPs) offer an attractive model. But we show that applying a GP with a standard stationary kernel directly to buoy data can struggle at both current prediction and divergence identification -- due to some physically unrealistic prior assumptions. To better reflect known physical properties of currents, we propose to instead put a standard stationary kernel on the divergence and curl-free components of a vector field obtained through a Helmholtz decomposition. We show that, because this decomposition relates to the original vector field just via mixed partial derivatives, we can still perform inference given the original data with only a small constant multiple of additional computational expense. We illustrate the benefits of our method on synthetic and real ocean data.  ( 2 min )
    Computation of conditional expectations with guarantees. (arXiv:2112.01804v3 [stat.CO] UPDATED)
    Theoretically, the conditional expectation of a square-integrable random variable $Y$ given a $d$-dimensional random vector $X$ can be obtained by minimizing the mean squared distance between $Y$ and $f(X)$ over all Borel measurable functions $f \colon \mathbb{R}^d \to \mathbb{R}$. However, in many applications this minimization problem cannot be solved exactly, and instead, a numerical method which computes an approximate minimum over a suitable subfamily of Borel functions has to be used. The quality of the result depends on the adequacy of the subfamily and the performance of the numerical method. In this paper, we derive an expected value representation of the minimal mean squared distance which in many applications can efficiently be approximated with a standard Monte Carlo average. This enables us to provide guarantees for the accuracy of any numerical approximation of a given conditional expectation. We illustrate the method by assessing the quality of approximate conditional expectations obtained by linear, polynomial and neural network regression in different concrete examples.
    Context-Aware Timewise VAEs for Real-Time Vehicle Trajectory Prediction. (arXiv:2302.10873v1 [cs.CV])
    Real-time, accurate prediction of human steering behaviors has wide applications, from developing intelligent traffic systems to deploying autonomous driving systems in both real and simulated worlds. In this paper, we present ContextVAE, a context-aware approach for multi-modal vehicle trajectory prediction. Built upon the backbone architecture of a timewise variational autoencoder, ContextVAE employs a dual attention mechanism for observation encoding that accounts for the environmental context information and the dynamic agents' states in a unified way. By utilizing features extracted from semantic maps during agent state encoding, our approach takes into account both the social features exhibited by agents on the scene and the physical environment constraints to generate map-compliant and socially-aware trajectories. We perform extensive testing on the nuScenes prediction challenge, Lyft Level 5 dataset and Waymo Open Motion Dataset to show the effectiveness of our approach and its state-of-the-art performance. In all tested datasets, ContextVAE models are fast to train and provide high-quality multi-modal predictions in real-time.
    KG-Hub -- Building and Exchanging Biological Knowledge Graphs. (arXiv:2302.10800v1 [q-bio.QM])
    Knowledge graphs (KGs) are a powerful approach for integrating heterogeneous data and making inferences in biology and many other domains, but a coherent solution for constructing, exchanging, and facilitating the downstream use of knowledge graphs is lacking. Here we present KG-Hub, a platform that enables standardized construction, exchange, and reuse of knowledge graphs. Features include a simple, modular extract-transform-load (ETL) pattern for producing graphs compliant with Biolink Model (a high-level data model for standardizing biological data), easy integration of any OBO (Open Biological and Biomedical Ontologies) ontology, cached downloads of upstream data sources, versioned and automatically updated builds with stable URLs, web-browsable storage of KG artifacts on cloud infrastructure, and easy reuse of transformed subgraphs across projects. Current KG-Hub projects span use cases including COVID-19 research, drug repurposing, microbial-environmental interactions, and rare disease research. KG-Hub is equipped with tooling to easily analyze and manipulate knowledge graphs. KG-Hub is also tightly integrated with graph machine learning (ML) tools which allow automated graph machine learning, including node embeddings and training of models for link prediction and node classification.
    A Generative Adversarial Network for Climate Tipping Point Discovery (TIP-GAN). (arXiv:2302.10274v1 [cs.LG])
    We propose a new Tipping Point Generative Adversarial Network (TIP-GAN) for better characterizing potential climate tipping points in Earth system models. We describe an adversarial game to explore the parameter space of these models, detect upcoming tipping points, and discover the drivers of tipping points. In this setup, a set of generators learn to construct model configurations that will invoke a climate tipping point. The discriminator learns to identify which generators are generating each model configuration and whether a given configuration will lead to a tipping point. The discriminator is trained using an oracle (a surrogate climate model) to test if a generated model configuration leads to a tipping point or not. We demonstrate the application of this GAN to invoke the collapse of the Atlantic Meridional Overturning Circulation (AMOC). We share experimental results of modifying the loss functions and the number of generators to exploit the area of uncertainty in model state space near a climate tipping point. In addition, we show that our trained discriminator can predict AMOC collapse with a high degree of accuracy without the use of the oracle. This approach could generalize to other tipping points, and could augment climate modeling research by directing users interested in studying tipping points to parameter sets likely to induce said tipping points in their computationally intensive climate models.
    Diagnosis of Covid-19 Via Patient Breath Data Using Artificial Intelligence. (arXiv:2302.10180v1 [cs.LG])
    Using machine learning algorithms for the rapid diagnosis and detection of the COVID-19 pandemic and isolating the patients from crowded environments are very important to controlling the epidemic. This study aims to develop a point-of-care testing (POCT) system that can detect COVID-19 by detecting volatile organic compounds (VOCs) in a patient's exhaled breath using the Gradient Boosted Trees Learner Algorithm. 294 breath samples were collected from 142 patients at Istanbul Medipol Mega Hospital between December 2020 and March 2021. 84 cases out of 142 resulted in negatives, and 58 cases resulted in positives. All these breath samples have been converted into numeric values through five air sensors. 10% of the data have been used for the validation of the model, while 75% of the test data have been used for training an AI model to predict the coronavirus presence. 25% have been used for testing. The SMOTE oversampling method was used to increase the training set size and reduce the imbalance of negative and positive classes in training and test data. Different machine learning algorithms have also been tried to develop the e-nose model. The test results have suggested that the Gradient Boosting algorithm created the best model. The Gradient Boosting model provides 95% recall when predicting COVID-19 positive patients and 96% accuracy when predicting COVID-19 negative patients.
    Spatio-Temporal Denoising Graph Autoencoders with Data Augmentation for Photovoltaic Timeseries Data Imputation. (arXiv:2302.10860v1 [cs.LG])
    The integration of the global Photovoltaic (PV) market with real time data-loggers has enabled large scale PV data analytical pipelines for power forecasting and long-term reliability assessment of PV fleets. Nevertheless, the performance of PV data analysis heavily depends on the quality of PV timeseries data. This paper proposes a novel Spatio-Temporal Denoising Graph Autoencoder (STD-GAE) framework to impute missing PV Power Data. STD-GAE exploits temporal correlation, spatial coherence, and value dependencies from domain knowledge to recover missing data. Experimental results show that STD-GAE can achieve a gain of 43.14% in imputation accuracy and remains less sensitive to missing rate, different seasons, and missing scenarios, compared with state-of-the-art data imputation methods such as MIDA and LRTC-TNN.
    A Novel Noise Injection-based Training Scheme for Better Model Robustness. (arXiv:2302.10802v1 [cs.LG])
    Noise injection-based method has been shown to be able to improve the robustness of artificial neural networks in previous work. In this work, we propose a novel noise injection-based training scheme for better model robustness. Specifically, we first develop a likelihood ratio method to estimate the gradient with respect to both synaptic weights and noise levels for stochastic gradient descent training. Then, we design an approximation for the vanilla noise injection-based training method to reduce memory and improve computational efficiency. Next, we apply our proposed scheme to spiking neural networks and evaluate the performance of classification accuracy and robustness on MNIST and Fashion-MNIST datasets. Experiment results show that our proposed method achieves a much better performance on adversarial robustness and slightly better performance on original accuracy, compared with the conventional gradient-based training method.
    Understanding Practices, Challenges, and Opportunities for User-Engaged Algorithm Auditing in Industry Practice. (arXiv:2210.03709v4 [cs.HC] UPDATED)
    Recent years have seen growing interest among both researchers and practitioners in user-engaged approaches to algorithm auditing, which directly engage users in detecting problematic behaviors in algorithmic systems. However, we know little about industry practitioners' current practices and challenges around user-engaged auditing, nor what opportunities exist for them to better leverage such approaches in practice. To investigate, we conducted a series of interviews and iterative co-design activities with practitioners who employ user-engaged auditing approaches in their work. Our findings reveal several challenges practitioners face in appropriately recruiting and incentivizing user auditors, scaffolding user audits, and deriving actionable insights from user-engaged audit reports. Furthermore, practitioners shared organizational obstacles to user-engaged auditing, surfacing a complex relationship between practitioners and user auditors. Based on these findings, we discuss opportunities for future HCI research to help realize the potential (and the mitigate risks) of user-engaged auditing in industry practice.
    Intrinsic fluctuations of reinforcement learning promote cooperation. (arXiv:2209.01013v2 [cs.LG] UPDATED)
    In this work, we ask for and answer what makes classical temporal-difference reinforcement learning with epsilon-greedy strategies cooperative. Cooperating in social dilemma situations is vital for animals, humans, and machines. While evolutionary theory revealed a range of mechanisms promoting cooperation, the conditions under which agents learn to cooperate are contested. Here, we demonstrate which and how individual elements of the multi-agent learning setting lead to cooperation. We use the iterated Prisoner's dilemma with one-period memory as a testbed. Each of the two learning agents learns a strategy that conditions the following action choices on both agents' action choices of the last round. We find that next to a high caring for future rewards, a low exploration rate, and a small learning rate, it is primarily intrinsic stochastic fluctuations of the reinforcement learning process which double the final rate of cooperation to up to 80%. Thus, inherent noise is not a necessary evil of the iterative learning process. It is a critical asset for the learning of cooperation. However, we also point out the trade-off between a high likelihood of cooperative behavior and achieving this in a reasonable amount of time. Our findings are relevant for purposefully designing cooperative algorithms and regulating undesired collusive effects.
    Some Fundamental Aspects about Lipschitz Continuity of Neural Network Functions. (arXiv:2302.10886v1 [cs.LG])
    Lipschitz continuity is a simple yet pivotal functional property of any predictive model that lies at the core of its robustness, generalisation, and adversarial vulnerability. Our aim is to thoroughly investigate and characterise the Lipschitz behaviour of the functions learned via neural networks. Despite the significant tightening of the bounds in the recent years, precisely estimating the Lipschitz constant continues to be a practical challenge and tight theoretical analyses, similarly, remain intractable. Therefore, we shift our perspective and instead attempt to uncover insights about the nature of Lipschitz constant of neural networks functions -- by relying on the simplest and most general upper and lower bounds. We carry out an empirical investigation in a range of different settings (architectures, losses, optimisers, label noise, etc.), which reveals several fundamental and intriguing traits of the Lipschitz continuity of neural networks functions, In particular, we identify a remarkable double descent trend in both upper and lower bounds to the Lipschitz constant which tightly aligns with the typical double descent trend in the test loss.
    On the Importance of Sign Labeling: The Hamburg Sign Language Notation System Case Study. (arXiv:2302.10768v1 [cs.LG])
    Labeling is the cornerstone of supervised machine learning, which has been exploited in a plethora of various applications, with sign language recognition being one of them. However, such algorithms must be fed with a huge amount of consistently labeled data during the training process to elaborate a well-generalizing model. In addition, there is a great need for an automated solution that works with any nationally diversified sign language. Although there are language-agnostic transcription systems, such as the Hamburg Sign Language Notation System (HamNoSys) that describe the signer's initial position and body movement instead of the glosses' meanings, there are still issues with providing accurate and reliable labels for every real-world use case. In this context, the industry relies heavily on manual attribution and labeling of the available video data. In this work, we tackle this issue and thoroughly analyze the HamNoSys labels provided by various maintainers of open sign language corpora in five sign languages, in order to examine the challenges encountered in labeling video data. We also investigate the consistency and objectivity of HamNoSys-based labels for the purpose of training machine learning models. Our findings provide valuable insights into the limitations of the current labeling methods and pave the way for future research on developing more accurate and efficient solutions for sign language recognition.  ( 2 min )
    Repeated Bilateral Trade Against a Smoothed Adversary. (arXiv:2302.10805v1 [cs.LG])
    We study repeated bilateral trade where an adaptive $\sigma$-smooth adversary generates the valuations of sellers and buyers. We provide a complete characterization of the regret regimes for fixed-price mechanisms under different feedback models in the two cases where the learner can post either the same or different prices to buyers and sellers. We begin by showing that the minimax regret after $T$ rounds is of order $\sqrt{T}$ in the full-feedback scenario. Under partial feedback, any algorithm that has to post the same price to buyers and sellers suffers worst-case linear regret. However, when the learner can post two different prices at each round, we design an algorithm enjoying regret of order $T^{3/4}$ ignoring log factors. We prove that this rate is optimal by presenting a surprising $T^{3/4}$ lower bound, which is the main technical contribution of the paper.  ( 2 min )
    Potential Penetrative Pass (P3). (arXiv:2302.10760v1 [cs.LG])
    To score goals in football, a team needs to move forward on the pitch and there are various ways to do so. Depending on the game plan & philosophy; some teams prefer to play long balls from either wings or defense. Others, prefer to penetrate in depth with passes and outplay the opponent players. To objectively & in an automated way evaluate how teams play penetrative passes compared to the number of times they had the potential to do so, the "Potential Penetrative Pass (P3)" concept is presented here.  ( 2 min )
    Online estimation methods for irregular autoregressive models. (arXiv:2302.10785v1 [cs.LG])
    In the last decades, due to the huge technological growth observed, it has become increasingly common that a collection of temporal data rapidly accumulates in vast amounts. This provides an opportunity for extracting valuable information through the estimation of increasingly precise models. But at the same time it imposes the challenge of continuously updating the models as new data become available. Currently available methods for addressing this problem, the so-called online learning methods, use current parameter estimations and novel data to update the estimators. These approaches avoid using the full raw data and speeding up the computations. In this work we consider three online learning algorithms for parameters estimation in the context of time series models. In particular, the methods implemented are: gradient descent, Newton-step and Kalman filter recursions. These algorithms are applied to the recently developed irregularly observed autoregressive (iAR) model. The estimation accuracy of the proposed methods is assessed by means of Monte Carlo experiments. The results obtained show that the proposed online estimation methods allow for a precise estimation of the parameters that generate the data both for the regularly and irregularly observed time series. These online approaches are numerically efficient, allowing substantial computational time savings. Moreover, we show that the proposed methods are able to adapt the parameter estimates quickly when the time series behavior changes, unlike batch estimation methods.  ( 2 min )
    Effect of temporal resolution on the reproduction of chaotic dynamics via reservoir computing. (arXiv:2302.10761v1 [cs.LG])
    Reservoir computing is a machine learning paradigm that uses a structure called a reservoir, which has nonlinearities and short-term memory. In recent years, reservoir computing has expanded to new functions such as the autonomous generation of chaotic time series, as well as time series prediction and classification. Furthermore, novel possibilities have been demonstrated, such as inferring the existence of previously unseen attractors. Sampling, in contrast, has a strong influence on such functions. Sampling is indispensable in a physical reservoir computer that uses an existing physical system as a reservoir because the use of an external digital system for the data input is usually inevitable. This study analyzes the effect of sampling on the ability of reservoir computing to autonomously regenerate chaotic time series. We found, as expected, that excessively coarse sampling degrades the system performance, but also that excessively dense sampling is unsuitable. Based on quantitative indicators that capture the local and global characteristics of attractors, we identify a suitable window of the sampling frequency and discuss its underlying mechanisms.  ( 2 min )
    Managing multi-facet bias in collaborative filtering recommender systems. (arXiv:2302.10575v1 [cs.IR])
    Due to the extensive growth of information available online, recommender systems play a more significant role in serving people's interests. Traditional recommender systems mostly use an accuracy-focused approach to produce recommendations. Today's research suggests that this single-dimension approach can lead the system to be biased against a series of items with certain attributes. Biased recommendations across groups of items can endanger the interests of item providers along with causing user dissatisfaction with the system. This study aims to manage a new type of intersectional bias regarding the geographical origin and popularity of items in the output of state-of-the-art collaborative filtering recommender algorithms. We introduce an algorithm called MFAIR, a multi-facet post-processing bias mitigation algorithm to alleviate these biases. Extensive experiments on two real-world datasets of movies and books, enriched with the items' continents of production, show that the proposed algorithm strikes a reasonable balance between accuracy and both types of the mentioned biases. According to the results, our proposed approach outperforms a well-known competitor with no or only a slight loss of efficiency.  ( 2 min )
    Regret Analysis of Online LQR Control via Trajectory Prediction and Tracking: Extended Version. (arXiv:2302.10411v1 [math.OC])
    In this paper, we propose and analyze a new method for online linear quadratic regulator (LQR) control with a priori unknown time-varying cost matrices. The cost matrices are revealed sequentially with the potential for future values to be previewed over a short window. Our novel method involves using the available cost matrices to predict the optimal trajectory, and a tracking controller to drive the system towards it. We adopted the notion of dynamic regret to measure the performance of this proposed online LQR control method, with our main result being that the (dynamic) regret of our method is upper bounded by a constant. Moreover, the regret upper bound decays exponentially with the preview window length, and is extendable to systems with disturbances. We show in simulations that our proposed method offers improved performance compared to other previously proposed online LQR methods.  ( 2 min )
    The Gaussian kernel on the circle and spaces that admit isometric embeddings of the circle. (arXiv:2302.10623v1 [cs.LG])
    On Euclidean spaces, the Gaussian kernel is one of the most widely used kernels in applications. It has also been used on non-Euclidean spaces, where it is known that there may be (and often are) scale parameters for which it is not positive definite. Hope remains that this kernel is positive definite for many choices of parameter. However, we show that the Gaussian kernel is not positive definite on the circle for any choice of parameter. This implies that on metric spaces in which the circle can be isometrically embedded, such as spheres, projective spaces and Grassmannians, the Gaussian kernel is not positive definite for any parameter.  ( 2 min )
    Deep reinforced learning heuristic tested on spin-glass ground states: The larger picture. (arXiv:2302.10848v1 [cond-mat.dis-nn])
    In Changjun Fan et al. [Nature Communications https://doi.org/10.1038/s41467-023-36363-w (2023)], the authors present a deep reinforced learning approach to augment combinatorial optimization heuristics. In particular, they present results for several spin glass ground state problems, for which instances on non-planar networks are generally NP-hard, in comparison with several Monte Carlo based methods, such as simulated annealing (SA) or parallel tempering (PT). Indeed, those results demonstrate that the reinforced learning improves the results over those obtained with SA or PT, or at least allows for reduced runtimes for the heuristics before results of comparable quality have been obtained relative to those other methods. To facilitate the conclusion that their method is ''superior'', the authors pursue two basic strategies: (1) A commercial GUROBI solver is called on to procure a sample of exact ground states as a testbed to compare with, and (2) a head-to-head comparison between the heuristics is given for a sample of larger instances where exact ground states are hard to ascertain. Here, we put these studies into a larger context, showing that the claimed superiority is at best marginal for smaller samples and becomes essentially irrelevant with respect to any sensible approximation of true ground states in the larger samples. For example, this method becomes irrelevant as a means to determine stiffness exponents $\theta$ in $d>2$, as mentioned by the authors, where the problem is not only NP-hard but requires the subtraction of two almost equal ground-state energies and systemic errors in each of $\approx 1\%$ found here are unacceptable. This larger picture on the method arises from a straightforward finite-size corrections study over the spin glass ensembles the authors employ, using data that has been available for decades.  ( 3 min )
    Characterizing the Optimal 0-1 Loss for Multi-class Classification with a Test-time Attacker. (arXiv:2302.10722v1 [cs.LG])
    Finding classifiers robust to adversarial examples is critical for their safe deployment. Determining the robustness of the best possible classifier under a given threat model for a given data distribution and comparing it to that achieved by state-of-the-art training methods is thus an important diagnostic tool. In this paper, we find achievable information-theoretic lower bounds on loss in the presence of a test-time attacker for multi-class classifiers on any discrete dataset. We provide a general framework for finding the optimal 0-1 loss that revolves around the construction of a conflict hypergraph from the data and adversarial constraints. We further define other variants of the attacker-classifier game that determine the range of the optimal loss more efficiently than the full-fledged hypergraph construction. Our evaluation shows, for the first time, an analysis of the gap to optimal robustness for classifiers in the multi-class setting on benchmark datasets.  ( 2 min )
    MaskedKD: Efficient Distillation of Vision Transformers with Masked Images. (arXiv:2302.10494v1 [cs.LG])
    Knowledge distillation is a popular and effective regularization technique for training lightweight models, but it also adds significant overhead to the training cost. The drawback is most pronounced when we use large-scale models as our teachers, such as vision transformers (ViTs). We present MaskedKD, a simple yet effective method for reducing the training cost of ViT distillation. MaskedKD masks a fraction of image patch tokens fed to the teacher to save the teacher inference cost. The tokens to mask are determined based on the last layer attention score of the student model, to which we provide the full image. Without requiring any architectural change of the teacher or making sacrifices in the student performance, MaskedKD dramatically reduces the computations and time required for distilling ViTs. We demonstrate that MaskedKD can save up to $50\%$ of the cost of running inference on the teacher model without any performance drop on the student, leading to approximately $28\%$ drop in the teacher and student compute combined.  ( 2 min )
    Variational Boosted Soft Trees. (arXiv:2302.10706v1 [cs.LG])
    Gradient boosting machines (GBMs) based on decision trees consistently demonstrate state-of-the-art results on regression and classification tasks with tabular data, often outperforming deep neural networks. However, these models do not provide well-calibrated predictive uncertainties, which prevents their use for decision making in high-risk applications. The Bayesian treatment is known to improve predictive uncertainty calibration, but previously proposed Bayesian GBM methods are either computationally expensive, or resort to crude approximations. Variational inference is often used to implement Bayesian neural networks, but is difficult to apply to GBMs, because the decision trees used as weak learners are non-differentiable. In this paper, we propose to implement Bayesian GBMs using variational inference with soft decision trees, a fully differentiable alternative to standard decision trees introduced by Irsoy et al. Our experiments demonstrate that variational soft trees and variational soft GBMs provide useful uncertainty estimates, while retaining good predictive performance. The proposed models show higher test likelihoods when compared to the state-of-the-art Bayesian GBMs in 7/10 tabular regression datasets and improved out-of-distribution detection in 5/10 datasets.  ( 2 min )
    MalProtect: Stateful Defense Against Adversarial Query Attacks in ML-based Malware Detection. (arXiv:2302.10739v1 [cs.LG])
    ML models are known to be vulnerable to adversarial query attacks. In these attacks, queries are iteratively perturbed towards a particular class without any knowledge of the target model besides its output. The prevalence of remotely-hosted ML classification models and Machine-Learning-as-a-Service platforms means that query attacks pose a real threat to the security of these systems. To deal with this, stateful defenses have been proposed to detect query attacks and prevent the generation of adversarial examples by monitoring and analyzing the sequence of queries received by the system. Several stateful defenses have been proposed in recent years. However, these defenses rely solely on similarity or out-of-distribution detection methods that may be effective in other domains. In the malware detection domain, the methods to generate adversarial examples are inherently different, and therefore we find that such detection mechanisms are significantly less effective. Hence, in this paper, we present MalProtect, which is a stateful defense against query attacks in the malware detection domain. MalProtect uses several threat indicators to detect attacks. Our results show that it reduces the evasion rate of adversarial query attacks by 80+\% in Android and Windows malware, across a range of attacker scenarios. In the first evaluation of its kind, we show that MalProtect outperforms prior stateful defenses, especially under the peak adversarial threat.  ( 2 min )
    Utilizing Domain Knowledge: Robust Machine Learning for Building Energy Prediction with Small, Inconsistent Datasets. (arXiv:2302.10784v1 [cs.LG])
    The demand for a huge amount of data for machine learning (ML) applications is currently a bottleneck in an empirically dominated field. We propose a method to combine prior knowledge with data-driven methods to significantly reduce their data dependency. In this study, component-based machine learning (CBML) as the knowledge-encoded data-driven method is examined in the context of energy-efficient building engineering. It encodes the abstraction of building structural knowledge as semantic information in the model organization. We design a case experiment to understand the efficacy of knowledge-encoded ML in sparse data input (1% - 0.0125% sampling rate). The result reveals its three advanced features compared with pure ML methods: 1. Significant improvement in the robustness of ML to extremely small-size and inconsistent datasets; 2. Efficient data utilization from different entities' record collections; 3. Characteristics of accepting incomplete data with high interpretability and reduced training time. All these features provide a promising path to alleviating the deployment bottleneck of data-intensive methods and contribute to efficient real-world data usage. Moreover, four necessary prerequisites are summarized in this study that ensures the target scenario benefits by combining prior knowledge and ML generalization.  ( 2 min )
    Distributed Learning in Heterogeneous Environment: federated learning with adaptive aggregation and computation reduction. (arXiv:2302.10757v1 [cs.LG])
    Although federated learning has achieved many breakthroughs recently, the heterogeneous nature of the learning environment greatly limits its performance and hinders its real-world applications. The heterogeneous data, time-varying wireless conditions and computing-limited devices are three main challenges, which often result in an unstable training process and degraded accuracy. Herein, we propose strategies to address these challenges. Targeting the heterogeneous data distribution, we propose a novel adaptive mixing aggregation (AMA) scheme that mixes the model updates from previous rounds with current rounds to avoid large model shifts and thus, maintain training stability. We further propose a novel staleness-based weighting scheme for the asynchronous model updates caused by the dynamic wireless environment. Lastly, we propose a novel CPU-friendly computation-reduction scheme based on transfer learning by sharing the feature extractor (FES) and letting the computing-limited devices update only the classifier. The simulation results show that the proposed framework outperforms existing state-of-the-art solutions and increases the test accuracy, and training stability by up to 2.38%, 93.10% respectively. Additionally, the proposed framework can tolerate communication delay of up to 15 rounds under a moderate delay environment without significant accuracy degradation.  ( 2 min )
    Exploring the Effect of Multi-step Ascent in Sharpness-Aware Minimization. (arXiv:2302.10181v1 [cs.LG])
    Recently, Sharpness-Aware Minimization (SAM) has shown state-of-the-art performance by seeking flat minima. To minimize the maximum loss within a neighborhood in the parameter space, SAM uses an ascent step, which perturbs the weights along the direction of gradient ascent with a given radius. While single-step or multi-step can be taken during ascent steps, previous studies have shown that multi-step ascent SAM rarely improves generalization performance. However, this phenomenon is particularly interesting because the multi-step ascent is expected to provide a better approximation of the maximum neighborhood loss. Therefore, in this paper, we analyze the effect of the number of ascent steps and investigate the difference between both single-step ascent SAM and multi-step ascent SAM. We identify the effect of the number of ascent on SAM optimization and reveal that single-step ascent SAM and multi-step ascent SAM exhibit distinct loss landscapes. Based on these observations, we finally suggest a simple modification that can mitigate the inefficiency of multi-step ascent SAM.  ( 2 min )
    LMPDNet: TOF-PET list-mode image reconstruction using model-based deep learning method. (arXiv:2302.10481v1 [eess.IV])
    The integration of Time-of-Flight (TOF) information in the reconstruction process of Positron Emission Tomography (PET) yields improved image properties. However, implementing the cutting-edge model-based deep learning methods for TOF-PET reconstruction is challenging due to the substantial memory requirements. In this study, we present a novel model-based deep learning approach, LMPDNet, for TOF-PET reconstruction from list-mode data. We address the issue of real-time parallel computation of the projection matrix for list-mode data, and propose an iterative model-based module that utilizes a dedicated network model for list-mode data. Our experimental results indicate that the proposed LMPDNet outperforms traditional iteration-based TOF-PET list-mode reconstruction algorithms. Additionally, we compare the spatial and temporal consumption of list-mode data and sinogram data in model-based deep learning methods, demonstrating the superiority of list-mode data in model-based TOF-PET reconstruction.  ( 2 min )
    Weather2K: A Multivariate Spatio-Temporal Benchmark Dataset for Meteorological Forecasting Based on Real-Time Observation Data from Ground Weather Stations. (arXiv:2302.10493v1 [cs.LG])
    Weather forecasting is one of the cornerstones of meteorological work. In this paper, we present a new benchmark dataset named Weather2K, which aims to make up for the deficiencies of existing weather forecasting datasets in terms of real-time, reliability, and diversity, as well as the key bottleneck of data quality. To be specific, our Weather2K is featured from the following aspects: 1) Reliable and real-time data. The data is hourly collected from 2,130 ground weather stations covering an area of 6 million square kilometers. 2) Multivariate meteorological variables. 20 meteorological factors and 3 constants for position information are provided with a length of 40,896 time steps. 3) Applicable to diverse tasks. We conduct a set of baseline tests on time series forecasting and spatio-temporal forecasting. To the best of our knowledge, our Weather2K is the first attempt to tackle weather forecasting task by taking full advantage of the strengths of observation data from ground weather stations. Based on Weather2K, we further propose Meteorological Factors based Multi-Graph Convolution Network (MFMGCN), which can effectively construct the intrinsic correlation among geographic locations based on meteorological factors. Sufficient experiments show that MFMGCN improves both the forecasting performance and temporal robustness. We hope our Weather2K can significantly motivate researchers to develop efficient and accurate algorithms to advance the task of weather forecasting. The dataset can be available at https://github.com/bycnfz/weather2k/.  ( 2 min )
    Replicable Clustering. (arXiv:2302.10359v1 [cs.LG])
    In this paper, we design replicable algorithms in the context of statistical clustering under the recently introduced notion of replicability. A clustering algorithm is replicable if, with high probability, it outputs the exact same clusters after two executions with datasets drawn from the same distribution when its internal randomness is shared across the executions. We propose such algorithms for the statistical $k$-medians, statistical $k$-means, and statistical $k$-centers problems by utilizing approximation routines for their combinatorial counterparts in a black-box manner. In particular, we demonstrate a replicable $O(1)$-approximation algorithm for statistical Euclidean $k$-medians ($k$-means) with $\operatorname{poly}(d)$ sample complexity. We also describe a $O(1)$-approximation algorithm with an additional $O(1)$-additive error for statistical Euclidean $k$-centers, albeit with $\exp(d)$ sample complexity.  ( 2 min )
    Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation. (arXiv:2302.10322v1 [cs.LG])
    Skip connections and normalisation layers form two standard architectural components that are ubiquitous for the training of Deep Neural Networks (DNNs), but whose precise roles are poorly understood. Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them, using insights from wide NN kernel theory to improve signal propagation in vanilla DNNs (which we define as networks without skips or normalisation). However, these approaches are incompatible with the self-attention layers present in transformers, whose kernels are intrinsically more complicated to analyse and control. And so the question remains: is it possible to train deep vanilla transformers? We answer this question in the affirmative by designing several approaches that use combinations of parameter initialisations, bias matrices and location-dependent rescaling to achieve faithful signal propagation in vanilla transformers. Our methods address various intricacies specific to signal propagation in transformers, including the interaction with positional encoding and causal masking. In experiments on WikiText-103 and C4, our approaches enable deep transformers without normalisation to train at speeds matching their standard counterparts, and deep vanilla transformers to reach the same performance as standard ones after about 5 times more iterations.  ( 2 min )
    Potential-based reward shaping for learning to play text-based adventure games. (arXiv:2302.10720v1 [cs.LG])
    Text-based games are a popular testbed for language-based reinforcement learning (RL). In previous work, deep Q-learning is commonly used as the learning agent. Q-learning algorithms are challenging to apply to complex real-world domains due to, for example, their instability in training. Therefore, in this paper, we adapt the soft-actor-critic (SAC) algorithm to the text-based environment. To deal with sparse extrinsic rewards from the environment, we combine it with a potential-based reward shaping technique to provide more informative (dense) reward signals to the RL agent. We apply our method to play difficult text-based games. The SAC method achieves higher scores than the Q-learning methods on many games with only half the number of training steps. This shows that it is well-suited for text-based games. Moreover, we show that the reward shaping technique helps the agent to learn the policy faster and achieve higher scores. In particular, we consider a dynamically learned value function as a potential function for shaping the learner's original sparse reward signals.  ( 2 min )
    Multimodal Trajectory Prediction: A Survey. (arXiv:2302.10463v1 [cs.RO])
    Trajectory prediction is an important task to support safe and intelligent behaviours in autonomous systems. Many advanced approaches have been proposed over the years with improved spatial and temporal feature extraction. However, human behaviour is naturally multimodal and uncertain: given the past trajectory and surrounding environment information, an agent can have multiple plausible trajectories in the future. To tackle this problem, an essential task named multimodal trajectory prediction (MTP) has recently been studied, which aims to generate a diverse, acceptable and explainable distribution of future predictions for each agent. In this paper, we present the first survey for MTP with our unique taxonomies and comprehensive analysis of frameworks, datasets and evaluation metrics. In addition, we discuss multiple future directions that can help researchers develop novel multimodal trajectory prediction systems.  ( 2 min )
    Active Learning in Brain Tumor Segmentation with Uncertainty Sampling, Annotation Redundancy Restriction, and Data Initialization. (arXiv:2302.10185v1 [cs.CV])
    Deep learning models have demonstrated great potential in medical 3D imaging, but their development is limited by the expensive, large volume of annotated data required. Active learning (AL) addresses this by training a model on a subset of the most informative data samples without compromising performance. We compared different AL strategies and propose a framework that minimizes the amount of data needed for state-of-the-art performance. 638 multi-institutional brain tumor MRI images were used to train a 3D U-net model and compare AL strategies. We investigated uncertainty sampling, annotation redundancy restriction, and initial dataset selection techniques. Uncertainty estimation techniques including Bayesian estimation with dropout, bootstrapping, and margins sampling were compared to random query. Strategies to avoid annotation redundancy by removing similar images within the to-be-annotated subset were considered as well. We determined the minimum amount of data necessary to achieve similar performance to the model trained on the full dataset ({\alpha} = 0.1). A variance-based selection strategy using radiomics to identify the initial training dataset is also proposed. Bayesian approximation with dropout at training and testing showed similar results to that of the full data model with less than 20% of the training data (p=0.293) compared to random query achieving similar performance at 56.5% of the training data (p=0.814). Annotation redundancy restriction techniques achieved state-of-the-art performance at approximately 40%-50% of the training data. Radiomics dataset initialization had higher Dice with initial dataset sizes of 20 and 80 images, but improvements were not significant. In conclusion, we investigated various AL strategies with dropout uncertainty estimation achieving state-of-the-art performance with the least annotated data.  ( 3 min )
    Differentiable Bootstrap Particle Filters for Regime-Switching Models. (arXiv:2302.10319v1 [eess.SP])
    Differentiable particle filters are an emerging class of particle filtering methods that use neural networks to construct and learn parametric state-space models. In real-world applications, both the state dynamics and measurements can switch between a set of candidate models. For instance, in target tracking, vehicles can idle, move through traffic, or cruise on motorways, and measurements are collected in different geographical or weather conditions. This paper proposes a new differentiable particle filter for regime-switching state-space models. The method can learn a set of unknown candidate dynamic and measurement models and track the state posteriors. We evaluate the performance of the novel algorithm in relevant models, showing its great performance compared to other competitive algorithms.  ( 2 min )
    Optical Transformers. (arXiv:2302.10360v1 [cs.ET])
    The rapidly increasing size of deep-learning models has caused renewed and growing interest in alternatives to digital computers to dramatically reduce the energy cost of running state-of-the-art neural networks. Optical matrix-vector multipliers are best suited to performing computations with very large operands, which suggests that large Transformer models could be a good target for optical computing. To test this idea, we performed small-scale optical experiments with a prototype accelerator to demonstrate that Transformer operations can run on optical hardware despite noise and errors. Using simulations, validated by our experiments, we then explored the energy efficiency of optical implementations of Transformers and identified scaling laws for model performance with respect to optical energy usage. We found that the optical energy per multiply-accumulate (MAC) scales as $\frac{1}{d}$ where $d$ is the Transformer width, an asymptotic advantage over digital systems. We conclude that with well-engineered, large-scale optical hardware, it may be possible to achieve a $100 \times$ energy-efficiency advantage for running some of the largest current Transformer models, and that if both the models and the optical hardware are scaled to the quadrillion-parameter regime, optical computers could have a $>8,000\times$ energy-efficiency advantage over state-of-the-art digital-electronic processors that achieve 300 fJ/MAC. We analyzed how these results motivate and inform the construction of future optical accelerators along with optics-amenable deep-learning approaches. With assumptions about future improvements to electronics and Transformer quantization techniques (5$\times$ cheaper memory access, double the digital--analog conversion efficiency, and 4-bit precision), we estimated that optical computers' advantage against current 300-fJ/MAC digital processors could grow to $>100,000\times$.  ( 2 min )
    HierCat: Hierarchical Query Categorization from Weakly Supervised Data at Facebook Marketplace. (arXiv:2302.10527v1 [cs.IR])
    Query categorization at customer-to-customer e-commerce plat- forms like Facebook Marketplace is challenging due to the vague- ness of search intent, noise in real-world data, and imbalanced training data across languages. Its deployment also needs to con- sider challenges in scalability and downstream integration in order to translate modeling advances into better search result relevance. In this paper we present HierCat, the query categorization system at Facebook Marketplace. HierCat addresses these challenges by leveraging multi-task pre-training of dual-encoder architectures with a hierarchical inference step to effectively learn from weakly supervised training data mined from searcher engagement. We show that HierCat not only outperforms popular methods in offline experiments, but also leads to 1.4% improvement in NDCG and 4.3% increase in searcher engagement at Facebook Marketplace Search in online A/B testing.  ( 2 min )
    Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency. (arXiv:2302.10371v1 [cs.LG])
    Recently, several studies (Zhou et al., 2021a; Zhang et al., 2021b; Kim et al., 2021; Zhou and Gu, 2022) have provided variance-dependent regret bounds for linear contextual bandits, which interpolates the regret for the worst-case regime and the deterministic reward regime. However, these algorithms are either computationally intractable or unable to handle unknown variance of the noise. In this paper, we present a novel solution to this open problem by proposing the first computationally efficient algorithm for linear bandits with heteroscedastic noise. Our algorithm is adaptive to the unknown variance of noise and achieves an $\tilde{O}(d \sqrt{\sum_{k = 1}^K \sigma_k^2} + d)$ regret, where $\sigma_k^2$ is the variance of the noise at the round $k$, $d$ is the dimension of the contexts and $K$ is the total number of rounds. Our results are based on an adaptive variance-aware confidence set enabled by a new Freedman-type concentration inequality for self-normalized martingales and a multi-layer structure to stratify the context vectors into different layers with different uniform upper bounds on the uncertainty. Furthermore, our approach can be extended to linear mixture Markov decision processes (MDPs) in reinforcement learning. We propose a variance-adaptive algorithm for linear mixture MDPs, which achieves a problem-dependent horizon-free regret bound that can gracefully reduce to a nearly constant regret for deterministic MDPs. Unlike existing nearly minimax optimal algorithms for linear mixture MDPs, our algorithm does not require explicit variance estimation of the transitional probabilities or the use of high-order moment estimators to attain horizon-free regret. We believe the techniques developed in this paper can have independent value for general online decision making problems.  ( 2 min )
    Causal Razors. (arXiv:2302.10331v1 [cs.LG])
    When performing causal discovery, assumptions have to be made on how the true causal mechanism corresponds to the underlying joint probability distribution. These assumptions are labeled as causal razors in this work. We review numerous causal razors that appeared in the literature, and offer a comprehensive logical comparison of them. In particular, we scrutinize an unpopular causal razor, namely parameter minimality, in multinomial causal models and its logical relations with other well-studied causal razors. Our logical result poses a dilemma in selecting a reasonable scoring criterion for score-based casual search algorithms.  ( 2 min )
    A Dynamic Feedforward Control Strategy for Energy-efficient Building System Operation. (arXiv:2302.10179v1 [cs.LG])
    The development of current building energy system operation has benefited from: 1. Informational support from the optimal design through simulation or first-principles models; 2. System load and energy prediction through machine learning (ML). Through the literature review, we note that in current control strategies and optimization algorithms, most of them rely on receiving information from real-time feedback or using only predictive signals based on ML data fitting. They do not fully utilize dynamic building information. In other words, embedding dynamic prior knowledge from building system characteristics simultaneously for system control draws less attention. In this context, we propose an engineer-friendly control strategy framework. The framework is integrated with a feedforward loop that embedded a dynamic building environment with leading and lagging system information involved: The simulation combined with system characteristic information is imported to the ML predictive algorithms. ML generates step-ahead information by rolling-window feed-in of simulation output to minimize the errors of its forecasting predecessor in a loop and achieve an overall optimal. We tested it in a case for heating system control with typical control strategies, which shows our framework owns a further energy-saving potential of 15%.  ( 2 min )
    Meta-World Conditional Neural Processes. (arXiv:2302.10320v1 [cs.LG])
    We propose Meta-World Conditional Neural Processes (MW-CNP), a conditional world model generator that leverages sample efficiency and scalability of Conditional Neural Processes to enable an agent to sample from its own "hallucination". We intend to reduce the agent's interaction with the target environment at test time as much as possible. To reduce the number of samples required at test time, we first obtain a latent representation of the transition dynamics from a single rollout from the test environment with hidden parameters. Then, we obtain rollouts for few-shot learning by interacting with the "hallucination" generated by the meta-world model. Using the world model representation from MW-CNP, the meta-RL agent can adapt to an unseen target environment with significantly fewer samples collected from the target environment compared to the baselines. We emphasize that the agent does not have access to the task parameters throughout training and testing, and MW-CNP is trained on offline interaction data logged during meta-training.  ( 2 min )
    Classification with Trust: A Supervised Approach based on Sequential Ellipsoidal Partitioning. (arXiv:2302.10487v1 [cs.LG])
    Standard metrics of performance of classifiers, such as accuracy and sensitivity, do not reveal the trust or confidence in the predicted labels of data. While other metrics such as the computed probability of a label or the signed distance from a hyperplane can act as a trust measure, these are subjected to heuristic thresholds. This paper presents a convex optimization-based supervised classifier that sequentially partitions a dataset into several ellipsoids, where each ellipsoid contains nearly all points of the same label. By stating classification rules based on this partitioning, Bayes' formula is then applied to calculate a trust score to a label assigned to a test datapoint determined from these rules. The proposed Sequential Ellipsoidal Partitioning Classifier (SEP-C) exposes dataset irregularities, such as degree of overlap, without requiring a separate exploratory data analysis. The rules of classification, which are free of hyperparameters, are also not affected by class-imbalance, the underlying data distribution, or number of features. SEP-C does not require the use of non-linear kernels when the dataset is not linearly separable. The performance, and comparison with other methods, of SEP-C is demonstrated on the XOR-problem, circle dataset, and other open-source datasets.  ( 2 min )
    Multivariate Systemic Risk Measures and Deep Learning Algorithms. (arXiv:2302.10183v1 [cs.LG])
    In this work we propose deep learning-based algorithms for the computation of systemic shortfall risk measures defined via multivariate utility functions. We discuss the key related theoretical aspects, with a particular focus on the fairness properties of primal optima and associated risk allocations. The algorithms we provide allow for learning primal optimizers, optima for the dual representation and corresponding fair risk allocations. We test our algorithms by comparison to a benchmark model, based on a paired exponential utility function, for which we can provide explicit formulas. We also show evidence of convergence in a case for which explicit formulas are not available.  ( 2 min )
  • Open

    Noise-Augmented $\ell_0$ Regularization of Tensor Regression with Tucker Decomposition. (arXiv:2302.10775v1 [stat.ML])
    Tensor data are multi-dimension arrays. Low-rank decomposition-based regression methods with tensor predictors exploit the structural information in tensor predictors while significantly reducing the number of parameters in tensor regression. We propose a method named NA$_0$CT$^2$ (Noise Augmentation for $\ell_0$ regularization on Core Tensor in Tucker decomposition) to regularize the parameters in tensor regression (TR), coupled with Tucker decomposition. We establish theoretically that NA$_0$CT$^2$ achieves exact $\ell_0$ regularization in linear TR and generalized linear TR on the core tensor from the Tucker decomposition. To our knowledge, NA$_0$CT$^2$ is the first Tucker decomposition-based regularization method in TR to achieve $\ell_0$ in core tensor. NA$_0$CT$^2$ is implemented through an iterative procedure and involves two simple steps in each iteration -- generating noisy data based on the core tensor from the Tucker decomposition of the updated parameter estimate and running a regular GLM on noise-augmented data on vectorized predictors. We demonstrate the implementation of NA$_0$CT$^2$ and its $\ell_0$ regularization effect in both simulation studies and real data applications. The results suggest that NA$_0$CT$^2$ improves predictions compared to other decomposition-based TR approaches, with or without regularization and it also helps to identify important predictors though not designed for that purpose.  ( 2 min )
    Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency. (arXiv:2302.10371v1 [cs.LG])
    Recently, several studies (Zhou et al., 2021a; Zhang et al., 2021b; Kim et al., 2021; Zhou and Gu, 2022) have provided variance-dependent regret bounds for linear contextual bandits, which interpolates the regret for the worst-case regime and the deterministic reward regime. However, these algorithms are either computationally intractable or unable to handle unknown variance of the noise. In this paper, we present a novel solution to this open problem by proposing the first computationally efficient algorithm for linear bandits with heteroscedastic noise. Our algorithm is adaptive to the unknown variance of noise and achieves an $\tilde{O}(d \sqrt{\sum_{k = 1}^K \sigma_k^2} + d)$ regret, where $\sigma_k^2$ is the variance of the noise at the round $k$, $d$ is the dimension of the contexts and $K$ is the total number of rounds. Our results are based on an adaptive variance-aware confidence set enabled by a new Freedman-type concentration inequality for self-normalized martingales and a multi-layer structure to stratify the context vectors into different layers with different uniform upper bounds on the uncertainty. Furthermore, our approach can be extended to linear mixture Markov decision processes (MDPs) in reinforcement learning. We propose a variance-adaptive algorithm for linear mixture MDPs, which achieves a problem-dependent horizon-free regret bound that can gracefully reduce to a nearly constant regret for deterministic MDPs. Unlike existing nearly minimax optimal algorithms for linear mixture MDPs, our algorithm does not require explicit variance estimation of the transitional probabilities or the use of high-order moment estimators to attain horizon-free regret. We believe the techniques developed in this paper can have independent value for general online decision making problems.  ( 2 min )
    Reinforcement Learning in a Birth and Death Process: Breaking the Dependence on the State Space. (arXiv:2302.10667v1 [cs.LG])
    In this paper, we revisit the regret of undiscounted reinforcement learning in MDPs with a birth and death structure. Specifically, we consider a controlled queue with impatient jobs and the main objective is to optimize a trade-off between energy consumption and user-perceived performance. Within this setting, the \emph{diameter} $D$ of the MDP is $\Omega(S^S)$, where $S$ is the number of states. Therefore, the existing lower and upper bounds on the regret at time$T$, of order $O(\sqrt{DSAT})$ for MDPs with $S$ states and $A$ actions, may suggest that reinforcement learning is inefficient here. In our main result however, we exploit the structure of our MDPs to show that the regret of a slightly-tweaked version of the classical learning algorithm {\sc Ucrl2} is in fact upper bounded by $\tilde{\mathcal{O}}(\sqrt{E_2AT})$ where $E_2$ is related to the weighted second moment of the stationary measure of a reference policy. Importantly, $E_2$ is bounded independently of $S$. Thus, our bound is asymptotically independent of the number of states and of the diameter. This result is based on a careful study of the number of visits performed by the learning algorithm to the states of the MDP, which is highly non-uniform.  ( 2 min )
    Density Ratio Estimation and Neyman Pearson Classification with Missing Data. (arXiv:2302.10655v1 [stat.ML])
    Density Ratio Estimation (DRE) is an important machine learning technique with many downstream applications. We consider the challenge of DRE with missing not at random (MNAR) data. In this setting, we show that using standard DRE methods leads to biased results while our proposal (M-KLIEP), an adaptation of the popular DRE procedure KLIEP, restores consistency. Moreover, we provide finite sample estimation error bounds for M-KLIEP, which demonstrate minimax optimality with respect to both sample size and worst-case missingness. We then adapt an important downstream application of DRE, Neyman-Pearson (NP) classification, to this MNAR setting. Our procedure both controls Type I error and achieves high power, with high probability. Finally, we demonstrate promising empirical performance both synthetic data and real-world data with simulated missingness.
    When are Post-hoc Conceptual Explanantions Identifiable?. (arXiv:2206.13872v3 [stat.ML] UPDATED)
    Interest in understanding and factorizing learned embedding spaces through conceptual explanations is steadily growing. When no human concept labels are available, concept discovery methods search trained embedding spaces for interpretable concepts like object shape or color that can be used to provide post-hoc explanations for decisions. Unlike previous work, we argue that concept discovery should be identifiable, meaning that a number of known concepts can be provably recovered to guarantee reliability of the explanations. As a starting point, we explicitly make the connection between concept discovery and classical methods like Principal Component Analysis and Independent Component Analysis by showing that they can recover independent concepts with non-Gaussian distributions. For dependent concepts, we propose two novel approaches that exploit functional compositionality properties of image-generating processes. Our provably identifiable concept discovery methods substantially outperform competitors on a battery of experiments including hundreds of trained models and dependent concepts, where they exhibit up to 29 % better alignment with the ground truth. Our results provide a rigorous foundation for reliable concept discovery without human labels.
    Meta-Uncertainty in Bayesian Model Comparison. (arXiv:2210.07278v3 [stat.ML] UPDATED)
    Bayesian model comparison (BMC) offers a principled probabilistic approach to study and rank competing models. In standard BMC, we construct a discrete probability distribution over the set of possible models, conditional on the observed data of interest. These posterior model probabilities (PMPs) are measures of uncertainty, but -- when derived from a finite number of observations -- are also uncertain themselves. In this paper, we conceptualize distinct levels of uncertainty which arise in BMC. We explore a fully probabilistic framework for quantifying meta-uncertainty, resulting in an applied method to enhance any BMC workflow. Drawing on both Bayesian and frequentist techniques, we represent the uncertainty over the uncertain PMPs via meta-models which combine simulated and observed data into a predictive distribution for PMPs on new data. We demonstrate the utility of the proposed method in the context of conjugate Bayesian regression, likelihood-based inference with Markov chain Monte Carlo, and simulation-based inference with neural networks.
    Diversify and Disambiguate: Learning From Underspecified Data. (arXiv:2202.03418v3 [cs.LG] UPDATED)
    Many datasets are underspecified: there exist multiple equally viable solutions to a given task. Underspecification can be problematic for methods that learn a single hypothesis because different functions that achieve low training loss can focus on different predictive features and thus produce widely varying predictions on out-of-distribution data. We propose DivDis, a simple two-stage framework that first learns a diverse collection of hypotheses for a task by leveraging unlabeled data from the test distribution. We then disambiguate by selecting one of the discovered hypotheses using minimal additional supervision, in the form of additional labels or inspection of function visualization. We demonstrate the ability of DivDis to find hypotheses that use robust features in image classification and natural language processing problems with underspecification.
    Some Fundamental Aspects about Lipschitz Continuity of Neural Network Functions. (arXiv:2302.10886v1 [cs.LG])
    Lipschitz continuity is a simple yet pivotal functional property of any predictive model that lies at the core of its robustness, generalisation, and adversarial vulnerability. Our aim is to thoroughly investigate and characterise the Lipschitz behaviour of the functions learned via neural networks. Despite the significant tightening of the bounds in the recent years, precisely estimating the Lipschitz constant continues to be a practical challenge and tight theoretical analyses, similarly, remain intractable. Therefore, we shift our perspective and instead attempt to uncover insights about the nature of Lipschitz constant of neural networks functions -- by relying on the simplest and most general upper and lower bounds. We carry out an empirical investigation in a range of different settings (architectures, losses, optimisers, label noise, etc.), which reveals several fundamental and intriguing traits of the Lipschitz continuity of neural networks functions, In particular, we identify a remarkable double descent trend in both upper and lower bounds to the Lipschitz constant which tightly aligns with the typical double descent trend in the test loss.
    Exploring Local Norms in Exp-concave Statistical Learning. (arXiv:2302.10726v1 [cs.LG])
    We consider the problem of stochastic convex optimization with exp-concave losses using Empirical Risk Minimization in a convex class. Answering a question raised in several prior works, we provide a $O( d / n + \log( 1 / \delta) / n )$ excess risk bound valid for a wide class of bounded exp-concave losses, where $d$ is the dimension of the convex reference set, $n$ is the sample size, and $\delta$ is the confidence level. Our result is based on a unified geometric assumption on the gradient of losses and the notion of local norms.
    Generalization Bounds for Adversarial Contrastive Learning. (arXiv:2302.10633v1 [cs.LG])
    Deep networks are well-known to be fragile to adversarial attacks, and adversarial training is one of the most popular methods used to train a robust model. To take advantage of unlabeled data, recent works have applied adversarial training to contrastive learning (Adversarial Contrastive Learning; ACL for short) and obtain promising robust performance. However, the theory of ACL is not well understood. To fill this gap, we leverage the Rademacher complexity to analyze the generalization performance of ACL, with a particular focus on linear models and multi-layer neural networks under $\ell_p$ attack ($p \ge 1$). Our theory shows that the average adversarial risk of the downstream tasks can be upper bounded by the adversarial unsupervised risk of the upstream task. The experimental results validate our theory.
    Tracr: Compiled Transformers as a Laboratory for Interpretability. (arXiv:2301.05062v2 [cs.LG] UPDATED)
    Interpretability research aims to build tools for understanding machine learning (ML) models. However, such tools are inherently hard to evaluate because we do not have ground truth information about how ML models actually work. In this work, we propose to build transformer models manually as a testbed for interpretability research. We introduce Tracr, a "compiler" for translating human-readable programs into weights of a transformer model. Tracr takes code written in RASP, a domain-specific language (Weiss et al. 2021), and translates it into weights for a standard, decoder-only, GPT-like transformer architecture. We use Tracr to create a range of ground truth transformers that implement programs including computing token frequencies, sorting, and Dyck-n parenthesis checking, among others. To enable the broader research community to explore and use compiled models, we provide an open-source implementation of Tracr at https://github.com/deepmind/tracr.
    Wassmap: Wasserstein Isometric Mapping for Image Manifold Learning. (arXiv:2204.06645v3 [cs.LG] UPDATED)
    In this paper, we propose Wasserstein Isometric Mapping (Wassmap), a nonlinear dimensionality reduction technique that provides solutions to some drawbacks in existing global nonlinear dimensionality reduction algorithms in imaging applications. Wassmap represents images via probability measures in Wasserstein space, then uses pairwise Wasserstein distances between the associated measures to produce a low-dimensional, approximately isometric embedding. We show that the algorithm is able to exactly recover parameters of some image manifolds including those generated by translations or dilations of a fixed generating measure. Additionally, we show that a discrete version of the algorithm retrieves parameters from manifolds generated from discrete measures by providing a theoretical bridge to transfer recovery results from functional data to discrete data. Testing of the proposed algorithms on various image data manifolds show that Wassmap yields good embeddings compared with other global and local techniques.
    History Compression via Language Models in Reinforcement Learning. (arXiv:2205.12258v4 [cs.LG] UPDATED)
    In a partially observable Markov decision process (POMDP), an agent typically uses a representation of the past to approximate the underlying MDP. We propose to utilize a frozen Pretrained Language Transformer (PLT) for history representation and compression to improve sample efficiency. To avoid training of the Transformer, we introduce FrozenHopfield, which automatically associates observations with pretrained token embeddings. To form these associations, a modern Hopfield network stores these token embeddings, which are retrieved by queries that are obtained by a random but fixed projection of observations. Our new method, HELM, enables actor-critic network architectures that contain a pretrained language Transformer for history representation as a memory module. Since a representation of the past need not be learned, HELM is much more sample efficient than competitors. On Minigrid and Procgen environments HELM achieves new state-of-the-art results. Our code is available at https://github.com/ml-jku/helm.
    SurvLIMEpy: A Python package implementing SurvLIME. (arXiv:2302.10571v1 [stat.ML])
    In this paper we present SurvLIMEpy, an open-source Python package that implements the SurvLIME algorithm. This method allows to compute local feature importance for machine learning algorithms designed for modelling Survival Analysis data. Our implementation takes advantage of the parallelisation paradigm as all computations are performed in a matrix-wise fashion which speeds up execution time. Additionally, SurvLIMEpy assists the user with visualization tools to better understand the result of the algorithm. The package supports a wide variety of survival models, from the Cox Proportional Hazards Model to deep learning models such as DeepHit or DeepSurv. Two types of experiments are presented in this paper. First, by means of simulated data, we study the ability of the algorithm to capture the importance of the features. Second, we use three open source survival datasets together with a set of survival algorithms in order to demonstrate how SurvLIMEpy behaves when applied to different models.
    Generalized Gumbel-Softmax Gradient Estimator for Various Discrete Random Variables. (arXiv:2003.01847v3 [cs.LG] UPDATED)
    Estimating the gradients of stochastic nodes is one of the crucial research questions in the deep generative modeling community, which enables the gradient descent optimization on neural network parameters. This estimation problem becomes further complex when we regard the stochastic nodes to be discrete because pathwise derivative techniques cannot be applied. Hence, the stochastic gradient estimation of discrete distributions requires either a score function method or continuous relaxation of the discrete random variables. This paper proposes a general version of the Gumbel-Softmax estimator with continuous relaxation, and this estimator is able to relax the discreteness of probability distributions including more diverse types, other than categorical and Bernoulli. In detail, we utilize the truncation of discrete random variables and the Gumbel-Softmax trick with a linear transformation for the relaxed reparameterization. The proposed approach enables the relaxed discrete random variable to be reparameterized and to backpropagated through a large scale stochastic computational graph. Our experiments consist of (1) synthetic data analyses, which show the efficacy of our methods; and (2) applications on VAE and topic model, which demonstrate the value of the proposed estimation in practices.
    Entire Space Counterfactual Learning: Tuning, Analytical Properties and Industrial Applications. (arXiv:2210.11039v2 [cs.LG] UPDATED)
    As a basic research problem for building effective recommender systems, post-click conversion rate (CVR) estimation has long been plagued by sample selection bias and data sparsity issues. To address the data sparsity issue, prevalent methods based on entire space multi-task model leverage the sequential pattern of user actions, i.e. exposure $\rightarrow$ click $\rightarrow$ conversion to construct auxiliary learning tasks. However, they still fall short of guaranteeing the unbiasedness of CVR estimates. This paper theoretically demonstrates two defects of these entire space multi-task models: (1) inherent estimation bias (IEB) for CVR estimation, where the CVR estimate is inherently higher than the ground truth; (2) potential independence priority (PIP) for CTCVR estimation, where the causality from click to conversion might be overlooked. This paper further proposes a principled method named entire space counterfactual multi-task model (ESCM$^2$), which employs a counterfactual risk minimizer to handle both IEB and PIP issues at once. To demonstrate the effectiveness of the proposed method, this paper explores its parameter tuning in practice, derives its analytic properties, and showcases its effectiveness in industrial CVR estimation, where ESCM$^2$ can effectively alleviate the intrinsic IEB and PIP issues and outperform baseline models.
    Causal Razors. (arXiv:2302.10331v1 [cs.LG])
    When performing causal discovery, assumptions have to be made on how the true causal mechanism corresponds to the underlying joint probability distribution. These assumptions are labeled as causal razors in this work. We review numerous causal razors that appeared in the literature, and offer a comprehensive logical comparison of them. In particular, we scrutinize an unpopular causal razor, namely parameter minimality, in multinomial causal models and its logical relations with other well-studied causal razors. Our logical result poses a dilemma in selecting a reasonable scoring criterion for score-based casual search algorithms.
    Mean Parity Fair Regression in RKHS. (arXiv:2302.10409v1 [stat.ML])
    We study the fair regression problem under the notion of Mean Parity (MP) fairness, which requires the conditional mean of the learned function output to be constant with respect to the sensitive attributes. We address this problem by leveraging reproducing kernel Hilbert space (RKHS) to construct the functional space whose members are guaranteed to satisfy the fairness constraints. The proposed functional space suggests a closed-form solution for the fair regression problem that is naturally compatible with multiple sensitive attributes. Furthermore, by formulating the fairness-accuracy tradeoff as a relaxed fair regression problem, we derive a corresponding regression function that can be implemented efficiently and provides interpretable tradeoffs. More importantly, under some mild assumptions, the proposed method can be applied to regression problems with a covariance-based notion of fairness. Experimental results on benchmark datasets show the proposed methods achieve competitive and even superior performance compared with several state-of-the-art methods.
    Neural Collapse Inspired Attraction-Repulsion-Balanced Loss for Imbalanced Learning. (arXiv:2204.08735v3 [cs.LG] UPDATED)
    Class imbalance distribution widely exists in real-world engineering. However, the mainstream optimization algorithms that seek to minimize error will trap the deep learning model in sub-optimums when facing extreme class imbalance. It seriously harms the classification precision, especially on the minor classes. The essential reason is that the gradients of the classifier weights are imbalanced among the components from different classes. In this paper, we propose Attraction-Repulsion-Balanced Loss (ARB-Loss) to balance the different components of the gradients. We perform experiments on the large-scale classification and segmentation datasets and our ARB-Loss can achieve state-of-the-art performance via only one-stage training instead of 2-stage learning like nowadays SOTA works.
    Backtracking Counterfactuals. (arXiv:2211.00472v2 [cs.AI] UPDATED)
    Counterfactual reasoning -- envisioning hypothetical scenarios, or possible worlds, where some circumstances are different from what (f)actually occurred (counter-to-fact) -- is ubiquitous in human cognition. Conventionally, counterfactually-altered circumstances have been treated as "small miracles" that locally violate the laws of nature while sharing the same initial conditions. In Pearl's structural causal model (SCM) framework this is made mathematically rigorous via interventions that modify the causal laws while the values of exogenous variables are shared. In recent years, however, this purely interventionist account of counterfactuals has increasingly come under scrutiny from both philosophers and psychologists. Instead, they suggest a backtracking account of counterfactuals, according to which the causal laws remain unchanged in the counterfactual world; differences to the factual world are instead "backtracked" to altered initial conditions (exogenous variables). In the present work, we explore and formalise this alternative mode of counterfactual reasoning within the SCM framework. Despite ample evidence that humans backtrack, the present work constitutes, to the best of our knowledge, the first general account and algorithmisation of backtracking counterfactuals. We discuss our backtracking semantics in the context of related literature and draw connections to recent developments in explainable artificial intelligence (XAI).
    Dual Representation Learning for One-Step Clustering of Multi-View Data. (arXiv:2208.14450v2 [cs.LG] UPDATED)
    Multi-view data are commonly encountered in data mining applications. Effective extraction of information from multi-view data requires specific design of clustering methods to cater for data with multiple views, which is non-trivial and challenging. In this paper, we propose a novel one-step multi-view clustering method by exploiting the dual representation of both the common and specific information of different views. The motivation originates from the rationale that multi-view data contain not only the consistent knowledge between views but also the unique knowledge of each view. Meanwhile, to make the representation learning more specific to the clustering task, a one-step learning framework is proposed to integrate representation learning and clustering partition as a whole. With this framework, the representation learning and clustering partition mutually benefit each other, which effectively improve the clustering performance. Results from extensive experiments conducted on benchmark multi-view datasets clearly demonstrate the superiority of the proposed method.
    Deterministic training of generative autoencoders using invertible layers. (arXiv:2205.09546v4 [stat.ML] UPDATED)
    In this work, we provide a deterministic alternative to the stochastic variational training of generative autoencoders. We refer to these new generative autoencoders as AutoEncoders within Flows (AEF), since the encoder and decoder are defined as affine layers of an overall invertible architecture. This results in a deterministic encoding of the data, as opposed to the stochastic encoding of VAEs. The paper introduces two related families of AEFs. The first family relies on a partition of the ambient space and is trained by exact maximum-likelihood. The second family exploits a deterministic expansion of the ambient space and is trained by maximizing the log-probability in this extended space. This latter case leaves complete freedom in the choice of encoder, decoder and prior architectures, making it a drop-in replacement for the training of existing VAEs and VAE-style models. We show that these AEFs can have strikingly higher performance than architecturally identical VAEs in terms of log-likelihood and sample quality, especially for low dimensional latent spaces. Importantly, we show that AEF samples are substantially sharper than VAE samples.  ( 2 min )
    Robust Mean Estimation Without a Mean: Dimension-Independent Error in Polynomial Time for Symmetric Distributions. (arXiv:2302.10844v1 [cs.DS])
    In this work, we study the problem of robustly estimating the mean/location parameter of distributions without moment bounds. For a large class of distributions satisfying natural symmetry constraints we give a sequence of algorithms that can efficiently estimate its location without incurring dimension-dependent factors in the error. Concretely, suppose an adversary can arbitrarily corrupt an $\varepsilon$-fraction of the observed samples. For every $k \in \mathbb{N}$, we design an estimator using time and samples $\tilde{O}({d^k})$ such that the dependence of the error on the corruption level $\varepsilon$ is an additive factor of $O(\varepsilon^{1-\frac{1}{2k}})$. The dependence on other problem parameters is also nearly optimal. Our class contains products of arbitrary symmetric one-dimensional distributions as well as elliptical distributions, a vast generalization of the Gaussian distribution. Examples include product Cauchy distributions and multi-variate $t$-distributions. In particular, even the first moment might not exist. We provide the first efficient algorithms for this class of distributions. Previously, such results where only known under boundedness assumptions on the moments of the distribution and in particular, are provably impossible in the absence of symmetry [KSS18, CTBJ22]. For the class of distributions we consider, all previous estimators either require exponential time or incur error depending on the dimension. Our algorithms are based on a generalization of the filtering technique [DK22]. We show how this machinery can be combined with Huber-loss-based approach to work with projections of the noise. Moreover, we show how sum-of-squares proofs can be used to obtain algorithmic guarantees even for distributions without first moment. We believe that this approach may find other application in future works.  ( 2 min )
    Provable Copyright Protection for Generative Models. (arXiv:2302.10870v1 [cs.LG])
    There is a growing concern that learned conditional generative models may output samples that are substantially similar to some copyrighted data $C$ that was in their training set. We give a formal definition of $\textit{near access-freeness (NAF)}$ and prove bounds on the probability that a model satisfying this definition outputs a sample similar to $C$, even if $C$ is included in its training set. Roughly speaking, a generative model $p$ is $\textit{$k$-NAF}$ if for every potentially copyrighted data $C$, the output of $p$ diverges by at most $k$-bits from the output of a model $q$ that $\textit{did not access $C$ at all}$. We also give generative model learning algorithms, which efficiently modify the original generative model learning algorithm in a black box manner, that output generative models with strong bounds on the probability of sampling protected content. Furthermore, we provide promising experiments for both language (transformers) and image (diffusion) generative models, showing minimal degradation in output quality while ensuring strong protections against sampling protected content.  ( 2 min )
    Valid Inference for Machine Learning Model Parameters. (arXiv:2302.10840v1 [stat.ML])
    The parameters of a machine learning model are typically learned by minimizing a loss function on a set of training data. However, this can come with the risk of overtraining; in order for the model to generalize well, it is of great importance that we are able to find the optimal parameter for the model on the entire population -- not only on the given training sample. In this paper, we construct valid confidence sets for this optimal parameter of a machine learning model, which can be generated using only the training data without any knowledge of the population. We then show that studying the distribution of this confidence set allows us to assign a notion of confidence to arbitrary regions of the parameter space, and we demonstrate that this distribution can be well-approximated using bootstrapping techniques.  ( 2 min )
    Boosting the Power of Kernel Two-Sample Tests. (arXiv:2302.10687v1 [stat.ME])
    The kernel two-sample test based on the maximum mean discrepancy (MMD) is one of the most popular methods for detecting differences between two distributions over general metric spaces. In this paper we propose a method to boost the power of the kernel test by combining MMD estimates over multiple kernels using their Mahalanobis distance. We derive the asymptotic null distribution of the proposed test statistic and use a multiplier bootstrap approach to efficiently compute the rejection region. The resulting test is universally consistent and, since it is obtained by aggregating over a collection of kernels/bandwidths, is more powerful in detecting a wide range of alternatives in finite samples. We also derive the distribution of the test statistic for both fixed and local contiguous alternatives. The latter, in particular, implies that the proposed test is statistically efficient, that is, it has non-trivial asymptotic (Pitman) efficiency. Extensive numerical experiments are performed on both synthetic and real-world datasets to illustrate the efficacy of the proposed method over single kernel tests. Our asymptotic results rely on deriving the joint distribution of MMD estimates using the framework of multiple stochastic integrals, which is more broadly useful, specifically, in understanding the efficiency properties of recently proposed adaptive MMD tests based on kernel aggregation.  ( 2 min )
    Minimax-Bayes Reinforcement Learning. (arXiv:2302.10831v1 [cs.LG])
    While the Bayesian decision-theoretic framework offers an elegant solution to the problem of decision making under uncertainty, one question is how to appropriately select the prior distribution. One idea is to employ a worst-case prior. However, this is not as easy to specify in sequential decision making as in simple statistical estimation problems. This paper studies (sometimes approximate) minimax-Bayes solutions for various reinforcement learning problems to gain insights into the properties of the corresponding priors and policies. We find that while the worst-case prior depends on the setting, the corresponding minimax policies are more robust than those that assume a standard (i.e. uniform) prior.  ( 2 min )
    Provably Efficient Exploration in Quantum Reinforcement Learning with Logarithmic Worst-Case Regret. (arXiv:2302.10796v1 [quant-ph])
    While quantum reinforcement learning (RL) has attracted a surge of attention recently, its theoretical understanding is limited. In particular, it remains elusive how to design provably efficient quantum RL algorithms that can address the exploration-exploitation trade-off. To this end, we propose a novel UCRL-style algorithm that takes advantage of quantum computing for tabular Markov decision processes (MDPs) with $S$ states, $A$ actions, and horizon $H$, and establish an $\mathcal{O}(\mathrm{poly}(S, A, H, \log T))$ worst-case regret for it, where $T$ is the number of episodes. Furthermore, we extend our results to quantum RL with linear function approximation, which is capable of handling problems with large state spaces. Specifically, we develop a quantum algorithm based on value target regression (VTR) for linear mixture MDPs with $d$-dimensional linear representation and prove that it enjoys $\mathcal{O}(\mathrm{poly}(d, H, \log T))$ regret. Our algorithms are variants of UCRL/UCRL-VTR algorithms in classical RL, which also leverage a novel combination of lazy updating mechanisms and quantum estimation subroutines. This is the key to breaking the $\Omega(\sqrt{T})$-regret barrier in classical RL. To the best of our knowledge, this is the first work studying the online exploration in quantum RL with provable logarithmic worst-case regret.  ( 2 min )
    GDBN: a Graph Neural Network Approach to Dynamic Bayesian Network. (arXiv:2302.10804v1 [cs.LG])
    Identifying causal relations among multi-variate time series is one of the most important elements towards understanding the complex mechanisms underlying the dynamic system. It provides critical tools for forecasting, simulations and interventions in science and business analytics. In this paper, we proposed a graph neural network approach with score-based method aiming at learning a sparse DAG that captures the causal dependencies in a discretized time temporal graph. We demonstrate methods with graph neural network significantly outperformed other state-of-the-art methods with dynamic bayesian networking inference. In addition, from the experiments, the structural causal model can be more accurate than a linear SCM discovered by the methods such as Notears.  ( 2 min )
    Understanding Edge-of-Stability Training Dynamics with a Minimalist Example. (arXiv:2210.03294v2 [cs.LG] UPDATED)
    Recently, researchers observed that gradient descent for deep neural networks operates in an ``edge-of-stability'' (EoS) regime: the sharpness (maximum eigenvalue of the Hessian) is often larger than stability threshold $2/\eta$ (where $\eta$ is the step size). Despite this, the loss oscillates and converges in the long run, and the sharpness at the end is just slightly below $2/\eta$. While many other well-understood nonconvex objectives such as matrix factorization or two-layer networks can also converge despite large sharpness, there is often a larger gap between sharpness of the endpoint and $2/\eta$. In this paper, we study EoS phenomenon by constructing a simple function that has the same behavior. We give rigorous analysis for its training dynamics in a large local region and explain why the final converging point has sharpness close to $2/\eta$. Globally we observe that the training dynamics for our example has an interesting bifurcating behavior, which was also observed in the training of neural nets.  ( 2 min )
    Don't guess what's true: choose what's optimal. A probability transducer for machine-learning classifiers. (arXiv:2302.10578v1 [cs.LG])
    In fields such as medicine and drug discovery, the ultimate goal of a classification is not to guess a class, but to choose the optimal course of action among a set of possible ones, usually not in one-one correspondence with the set of classes. This decision-theoretic problem requires sensible probabilities for the classes. Probabilities conditional on the features are computationally almost impossible to find in many important cases. The main idea of the present work is to calculate probabilities conditional not on the features, but on the trained classifier's output. This calculation is cheap, needs to be made only once, and provides an output-to-probability "transducer" that can be applied to all future outputs of the classifier. In conjunction with problem-dependent utilities, the probabilities of the transducer allow us to find the optimal choice among the classes or among a set of more general decisions, by means of expected-utility maximization. This idea is demonstrated in a simplified drug-discovery problem with a highly imbalanced dataset. The transducer and utility maximization together always lead to improved results, sometimes close to theoretical maximum, for all sets of problem-dependent utilities. The one-time-only calculation of the transducer also provides, automatically: (i) a quantification of the uncertainty about the transducer itself; (ii) the expected utility of the augmented algorithm (including its uncertainty), which can be used for algorithm selection; (iii) the possibility of using the algorithm in a "generative mode", useful if the training dataset is biased.  ( 2 min )
    Estimating long-term causal effects from short-term experiments and long-term observational data with unobserved confounding. (arXiv:2302.10625v1 [stat.ML])
    Understanding and quantifying cause and effect is an important problem in many domains. The generally-agreed solution to this problem is to perform a randomised controlled trial. However, even when randomised controlled trials can be performed, they usually have relatively short duration's due to cost considerations. This makes learning long-term causal effects a very challenging task in practice, since the long-term outcome is only observed after a long delay. In this paper, we study the identification and estimation of long-term treatment effects when both experimental and observational data are available. Previous work provided an estimation strategy to determine long-term causal effects from such data regimes. However, this strategy only works if one assumes there are no unobserved confounders in the observational data. In this paper, we specifically address the challenging case where unmeasured confounders are present in the observational data. Our long-term causal effect estimator is obtained by combining regression residuals with short-term experimental outcomes in a specific manner to create an instrumental variable, which is then used to quantify the long-term causal effect through instrumental variable regression. We prove this estimator is unbiased, and analytically study its variance. In the context of the front-door causal structure, this provides a new causal estimator, which may be of independent interest. Finally, we empirically test our approach on synthetic-data, as well as real-data from the International Stroke Trial.  ( 2 min )
    On Calibrating Diffusion Probabilistic Models. (arXiv:2302.10688v1 [cs.LG])
    Recently, diffusion probabilistic models (DPMs) have achieved promising results in diverse generative tasks. A typical DPM framework includes a forward process that gradually diffuses the data distribution and a reverse process that recovers the data distribution from time-dependent data scores. In this work, we observe that the stochastic reverse process of data scores is a martingale, from which concentration bounds and the optional stopping theorem for data scores can be derived. Then, we discover a simple way for calibrating an arbitrary pretrained DPM, with which the score matching loss can be reduced and the lower bounds of model likelihood can consequently be increased. We provide general calibration guidelines under various model parametrizations. Our calibration method is performed only once and the resulting models can be used repeatedly for sampling. We conduct experiments on multiple datasets to empirically validate our proposal. Our code is at https://github.com/thudzj/Calibrated-DPMs.  ( 2 min )
    Scalable Infomin Learning. (arXiv:2302.10701v1 [cs.LG])
    The task of infomin learning aims to learn a representation with high utility while being uninformative about a specified target, with the latter achieved by minimising the mutual information between the representation and the target. It has broad applications, ranging from training fair prediction models against protected attributes, to unsupervised learning with disentangled representations. Recent works on infomin learning mainly use adversarial training, which involves training a neural network to estimate mutual information or its proxy and thus is slow and difficult to optimise. Drawing on recent advances in slicing techniques, we propose a new infomin learning approach, which uses a novel proxy metric to mutual information. We further derive an accurate and analytically computable approximation to this proxy metric, thereby removing the need of constructing neural network-based mutual information estimators. Experiments on algorithmic fairness, disentangled representation learning and domain adaptation verify that our method can effectively remove unwanted information with limited time budget.  ( 2 min )
    Understanding new tasks through the lens of training data via exponential tilting. (arXiv:2205.13577v2 [cs.LG] UPDATED)
    Deploying machine learning models to new tasks is a major challenge despite the large size of the modern training datasets. However, it is conceivable that the training data can be reweighted to be more representative of the new (target) task. We consider the problem of reweighing the training samples to gain insights into the distribution of the target task. Specifically, we formulate a distribution shift model based on the exponential tilt assumption and learn train data importance weights minimizing the KL divergence between labeled train and unlabeled target datasets. The learned train data weights can then be used for downstream tasks such as target performance evaluation, fine-tuning, and model selection. We demonstrate the efficacy of our method on Waterbirds and Breeds benchmarks.  ( 2 min )
    Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation. (arXiv:2302.10322v1 [cs.LG])
    Skip connections and normalisation layers form two standard architectural components that are ubiquitous for the training of Deep Neural Networks (DNNs), but whose precise roles are poorly understood. Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them, using insights from wide NN kernel theory to improve signal propagation in vanilla DNNs (which we define as networks without skips or normalisation). However, these approaches are incompatible with the self-attention layers present in transformers, whose kernels are intrinsically more complicated to analyse and control. And so the question remains: is it possible to train deep vanilla transformers? We answer this question in the affirmative by designing several approaches that use combinations of parameter initialisations, bias matrices and location-dependent rescaling to achieve faithful signal propagation in vanilla transformers. Our methods address various intricacies specific to signal propagation in transformers, including the interaction with positional encoding and causal masking. In experiments on WikiText-103 and C4, our approaches enable deep transformers without normalisation to train at speeds matching their standard counterparts, and deep vanilla transformers to reach the same performance as standard ones after about 5 times more iterations.  ( 2 min )
    Variational Boosted Soft Trees. (arXiv:2302.10706v1 [cs.LG])
    Gradient boosting machines (GBMs) based on decision trees consistently demonstrate state-of-the-art results on regression and classification tasks with tabular data, often outperforming deep neural networks. However, these models do not provide well-calibrated predictive uncertainties, which prevents their use for decision making in high-risk applications. The Bayesian treatment is known to improve predictive uncertainty calibration, but previously proposed Bayesian GBM methods are either computationally expensive, or resort to crude approximations. Variational inference is often used to implement Bayesian neural networks, but is difficult to apply to GBMs, because the decision trees used as weak learners are non-differentiable. In this paper, we propose to implement Bayesian GBMs using variational inference with soft decision trees, a fully differentiable alternative to standard decision trees introduced by Irsoy et al. Our experiments demonstrate that variational soft trees and variational soft GBMs provide useful uncertainty estimates, while retaining good predictive performance. The proposed models show higher test likelihoods when compared to the state-of-the-art Bayesian GBMs in 7/10 tabular regression datasets and improved out-of-distribution detection in 5/10 datasets.  ( 2 min )
    Gaussian processes at the Helm(holtz): A more fluid model for ocean currents. (arXiv:2302.10364v1 [stat.ME])
    Oceanographers are interested in predicting ocean currents and identifying divergences in a current vector field based on sparse observations of buoy velocities. Since we expect current dynamics to be smooth but highly non-linear, Gaussian processes (GPs) offer an attractive model. But we show that applying a GP with a standard stationary kernel directly to buoy data can struggle at both current prediction and divergence identification -- due to some physically unrealistic prior assumptions. To better reflect known physical properties of currents, we propose to instead put a standard stationary kernel on the divergence and curl-free components of a vector field obtained through a Helmholtz decomposition. We show that, because this decomposition relates to the original vector field just via mixed partial derivatives, we can still perform inference given the original data with only a small constant multiple of additional computational expense. We illustrate the benefits of our method on synthetic and real ocean data.  ( 2 min )
    Faster high-accuracy log-concave sampling via algorithmic warm starts. (arXiv:2302.10249v1 [math.ST])
    Understanding the complexity of sampling from a strongly log-concave and log-smooth distribution $\pi$ on $\mathbb{R}^d$ to high accuracy is a fundamental problem, both from a practical and theoretical standpoint. In practice, high-accuracy samplers such as the classical Metropolis-adjusted Langevin algorithm (MALA) remain the de facto gold standard; and in theory, via the proximal sampler reduction, it is understood that such samplers are key for sampling even beyond log-concavity (in particular, for distributions satisfying isoperimetric assumptions). In this work, we improve the dimension dependence of this sampling problem to $\tilde{O}(d^{1/2})$, whereas the previous best result for MALA was $\tilde{O}(d)$. This closes the long line of work on the complexity of MALA, and moreover leads to state-of-the-art guarantees for high-accuracy sampling under strong log-concavity and beyond (thanks to the aforementioned reduction). Our starting point is that the complexity of MALA improves to $\tilde{O}(d^{1/2})$, but only under a warm start (an initialization with constant R\'enyi divergence w.r.t. $\pi$). Previous algorithms took much longer to find a warm start than to use it, and closing this gap has remained an important open problem in the field. Our main technical contribution settles this problem by establishing the first $\tilde{O}(d^{1/2})$ R\'enyi mixing rates for the discretized underdamped Langevin diffusion. For this, we develop new differential-privacy-inspired techniques based on R\'enyi divergences with Orlicz--Wasserstein shifts, which allow us to sidestep longstanding challenges for proving fast convergence of hypocoercive differential equations.  ( 2 min )
    Active Learning with Positive and Negative Pairwise Feedback. (arXiv:2302.10295v1 [cs.LG])
    In this paper, we propose a generic framework for active clustering with queries for pairwise similarities between objects. First, the pairwise similarities can be any positive or negative number, yielding full flexibility in the type of feedback that a user/annotator can provide. Second, the process of querying pairwise similarities is separated from the clustering algorithm, leading to more flexibility in how the query strategies can be constructed. Third, the queries are robust to noise by allowing multiple queries for the same pairwise similarity (i.e., a non-persistent noise model is assumed). Finally, the number of clusters is automatically identified based on the currently known pairwise similarities. In addition, we propose and analyze a number of novel query strategies suited to this active clustering framework. We demonstrate the effectiveness of our framework and the proposed query strategies via several experimental studies.  ( 2 min )
    Variance reduced Shapley value estimation for trustworthy data valuation. (arXiv:2210.16835v4 [stat.ML] UPDATED)
    Data valuation, especially quantifying data value in algorithmic prediction and decision-making, is a fundamental problem in data trading scenarios. The most widely used method is to define the data Shapley and approximate it by means of the permutation sampling algorithm. To make up for the large estimation variance of the permutation sampling that hinders the development of the data marketplace, we propose a more robust data valuation method using stratified sampling, named variance reduced data Shapley (VRDS for short). We theoretically show how to stratify, how many samples are taken at each stratum, and the sample complexity analysis of VRDS. Finally, the effectiveness of VRDS is illustrated in different types of datasets and data removal applications.  ( 2 min )
  • Open

    I Convinced ChatGPT that Elon Musk is its Creator!
    submitted by /u/HEAL3D [link] [comments]  ( 6 min )
  • Open

    Why do papers seem to focus much more on number of episodes rather than runtime? Can anyone share papers that compare by runtime instead?
    I figure some algorithms differ greatly in computational complexity so # episodes isn't necessarily a fair comparison. Anyone have sources to compare by runtime they can share? submitted by /u/JustTaxLandLol [link] [comments]  ( 41 min )
    Building AI Agents with Generally Intelligent
    submitted by /u/thejashGI [link] [comments]  ( 40 min )

  • Open

    An update on APEX…
    submitted by /u/Littlebigmaker [link] [comments]  ( 40 min )
    AI Cloning: The Threat to Your Voice
    submitted by /u/GodGivenRx [link] [comments]  ( 40 min )
    Martin Ciupa - Bing, ChatGPT & Artificial Intelligence
    submitted by /u/timothy-ventura [link] [comments]  ( 41 min )
    Revolutionize Your Ad Creation With AdCreative.Ai – AI-Powered Ad Evolution Software
    submitted by /u/Moneyguy2323 [link] [comments]  ( 47 min )
    Generative music AI API? I have an idea for a fun audio site...
    I'm looking to launch a fun music AI site. I am trying to determine the best platform/program/API to use to build it. Set some inputs and parameters, generate said song. Any recommendations? Thank you! submitted by /u/ridingbikesrules [link] [comments]  ( 41 min )
    GPT for Forms: Free Addon to Generate Forms Questions with AI (gptforforms.app)
    submitted by /u/theindianappguy [link] [comments]  ( 41 min )
    Tech Addictions: A Growing Problem with Potentially Serious Consequences
    ​ https://preview.redd.it/6ru5i6leksja1.png?width=1568&format=png&auto=webp&s=1726eb36a2b1919c3fdab3101271eba3a5b5724a The Short Version: I am writing this because I am seriously concerned for everybody in the world. And I hope this simple message helps people take a step back and evaluate their relationship with their technology, so they can live a healthier, more balanced life. In our modern era, technology is progressing faster than ever, bringing with it an array of addictive things that can draw us away from reality and health. To avoid the negative effects of tech addiction, it's crucial to limit usage, practice self-care, make time for Jesus, eat better, get exercise, prioritize real-life relationships, and get healthy rest, etc. We are living in a tech and entertainment EXPLOSI…  ( 45 min )
    Ask Seneca: Learn about Stoicism from the most popular stoic philosopher (based on GPT-3)
    submitted by /u/dcastm [link] [comments]  ( 41 min )
    Artifical intelligence research project
    Hello, I'm a swedish student, I'm writing a research paper about AI used in design industry. And I need help figuring out a good thesis, right now I have this; Use of Artificial intelligence in graphic designing. Artificial intelligence can be used as a tool to help designers create multiple designs in a short time span. But is also comes with its flaws. Right now, there is a problem making an AI system that contributes usefully to any work no matter the given business. I feel like that may not be the main problem/ too wide problem to work with. Any suggestions is helpful:) submitted by /u/__elias__1 [link] [comments]  ( 41 min )
    Can not access Openai because authentication is not correct or something.
    So I live in a country where Openai don't support, so i have to use VPN (proton VPN) and an sms website (smspool.net) to get a foreign phone number. Paid a bit over half a dollar to get the number and signed up to openai. Since all the free numbers are always taken. All is good. Then later, I log in to openai again, it says authentication is wrong or sth and wants me to re-enter my phone number. But the thing is the phone number has expired, turns out smspool only hold onto that number for 1-2 hours before they flush it out of their system. Now I can still enter the phone number I bought, but it doesn't show me anymore sms with activation codes. It only shows the first code when I sign up for Openai. in other words, it's not receiving any more sms sent from Openai. So now I'm stuck. I can't be paying half a dollar for a fake number to log on to Openai every time. Is it because of the VPN server I used? Do I need to remember which sever I used when I got into openai and use that same exact server every time I want to log in? How do I fix this? submitted by /u/JohnTEGS [link] [comments]  ( 42 min )
    AI Dream 126 - New Incredible AI Palette - Wild Wednesday
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Does anyone know of work being done to use AI for removing laugh tracks?
    I've always hated laugh tracks, or forced studio live laughing for that matter, and I would love to watch for example Friends or HIMYM without them. Has anyone heard of work being done to strip it from audio tracks? submitted by /u/7734128 [link] [comments]  ( 43 min )
    Microsoft Co-Founder Bill Gates: The Rise Of AI Like ChatGPT Poses a Threat to Google’s Search…
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    Access fine-tuned GPT Models by Community members via API
    Hi! I've been working on a platform that gives users API access to fine-tuned GPT models made by others in the community. At this point, we are looking at releasing soon, hopefully in a week or two and so I've put up a waitlist to give some users early access and to get a sense of the interest as well as a discord for user feedback. Also, we are looking for model creators that are looking to share their fine-tuned GPT models with others as well as monetize their work. We think there is a market for high-quality fine-tuned GPT models that users can easily access without doing all of the hard work fine-tuning themselves and that model creators can earn a reasonable amount or at least enough to offset their API calls to OpenAI. We are in the early stages and so we will personally be onboarding all model creators. For early access join our waitlist: https://www.modeltune.co If you're a model creator looking to reach out: [hamsa@modeltune.co](mailto:hamsa@modeltune.co) If you have any questions I'd be happy to answer! submitted by /u/Aggravating_Art_173 [link] [comments]  ( 41 min )
    What is the difference in performance Between ChatGPT and DaVinci03?
    A discussion on the performance comparison between ChatGPT and DaVinci03, two natural language processing models. submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    AI Learns to Walk, Hop, and Roll
    submitted by /u/Ziinxx [link] [comments]  ( 40 min )
    Spotify debuts a new AI DJ to offer personalized music with commentary in a realistic voice
    submitted by /u/qptbook [link] [comments]  ( 40 min )
    Do you know any Free AI to use locally to sort photos?
    They group! I'm new here and not sure if this is the best place to ask. I got a domestic challenge, that might be solved by an AI. My wife has a 500G external hard drive full of pics and videos without any kind of order, and we also have a 1TB cloud drive with ~300gb more photos and videos, and some files might be duplicated on the same drive/between the drives. Since years ago she knows that she needs to sort them but the task is so big that cannot find the energy to do it. So i'm no dev, but work on the sysadmin side of things. The last year i've done a python script to at least get the hash of all files and delete all duplicated files with same hash, but was not nearly enough to help her. ​ So i thought, what if there is an AI that maybe with our assistance training it, we could sort it out by approximate age of people in the pics/videos, or by who is in which pic? Is this possible? I got a pc with a gtx 2060 super that might help us out? ​ Our two requisites would be, this AI should be free, and we should be able to deploy it locally, don't want to give them to another third party (other than the cloud itself). ​ Thanks! submitted by /u/DaegurthMiddnight [link] [comments]  ( 42 min )
    bloop: AI-powered code search engine - Search local and remote repositories with natural language, regex and filtered queries.
    submitted by /u/wyem [link] [comments]  ( 41 min )
    Here comes the flood
    Here comes the flood I wrote a piece recently on why I don’t believe in a flood of AI-content, not because there would be no mass of synthetic culture produced with generative AI, but because that flood of stuff just has no impact, lacks effort and emotionality to grab your attention, and is, as art and culture, just mediocre and flat. And while I want to emphasize that while writing i was not thinking about bureaucratic systems, but human psychology and perception, I was wrong regarding systems and institutions. I have a piece coming up in a tech magazine about how generative AI might overwhelm systems of rights management organizations and collection societies for holders of copyright, and that the current copyrights are not up for the task, even when you can’t claim a copyright on sy…  ( 48 min )
    My poem generating bot can now take up to 4 images and a text instruction as an input (link in comment)
    submitted by /u/red3vil96 [link] [comments]  ( 42 min )
    I made this infographic about some artificial intelligence statistics you may want to know.
    submitted by /u/TatianaW [link] [comments]  ( 41 min )
  • Open

    Boomi uses BYOC on Amazon SageMaker Studio to scale custom Markov chain implementation
    This post is co-written with Swagata Ashwani, Senior Data Scientist at Boomi. Boomi is an enterprise-level software as a service (SaaS) independent software vendor (ISV) that creates developer enablement tooling for software engineers. These tools integrate via API into Boomi’s core service offering. In this post, we discuss how Boomi used the bring-your-own-container (BYOC) approach […]  ( 8 min )
  • Open

    [D] Open source version of Flamingo
    At this point we have open source LLM's, text-to-image models, and CLIP-like models but nothing similar to Flamingo. I am guessing some groups have already started working on this, but I just don't know them. Does anyone know? Looks like a great fit for LAION. Also, I have some experience in this area and wouldn't mind lending a hand if that's possible. I really want to get my hands on a Flamingo-like large, multi-modal, few-shot model to see how it performs on vision-language compositionally tasks like Winoground. I am guessing these models might do a lot better than their smaller counterparts owing better generalization and reasoning capabilities of LLMs. submitted by /u/chigur86 [link] [comments]  ( 43 min )
    [N] U.S. Copyright Office decides that Kris Kashtanova's AI-involved graphic novel will remain copyright registered, but the copyright protection will be limited to the text and the whole work as a compilation
    Letter from the U.S. Copyright Office (PDF file). Blog post from Kris Kashtanova's lawyer. We received the decision today relative to Kristina Kashtanova's case about the comic book Zarya of the Dawn. Kris will keep the copyright registration, but it will be limited to the text and the whole work as a compilation. In one sense this is a success, in that the registration is still valid and active. However, it is the most limited a copyright registration can be and it doesn't resolve the core questions about copyright in AI-assisted works. Those works may be copyrightable, but the USCO did not find them so in this case. My previous post about this case. submitted by /u/Wiskkey [link] [comments]  ( 46 min )
    [N] Crowdsourcing better names for the Catch22 time series features
    Dear Colleagues This posting may be of interest to folks that use Catch22 for their time series research. What is the problem? Catch22 is a wonderfully useful tool for time series... But the names of the features, for example SC_FluctAnal_2_dfa_50_1_2_logi_prop_r1 or SB_TransitionMatrix_3ac_sumdiagcov are awkward to use and have little mnemonic value. Moreover, some of the names are very easy to confuse, such as: DN_OutlierInclude_n_001_mdrmd and DN_OutlierInclude_p_001_mdrmd This makes Catch22 awkward to use with a conversational agent, or many explainability/interpretability techniques etc. Their long length means it is even awkward to discuss features in a two-column paper format. Thus, we propose to find a set of new meaningful names for the features. Design principles The name should reflect what a feature is sensitive to. Ideal names would be one word, for example: noise, spike, symmetric, step, falling, periodic, simple, smooth, linear etc. However, given that it is likely to be rare a single feature has such specificity, the name could be a compound word, for example: uniform-noise, localized-noise, positive-spike, negative-spike etc. Compound words with three parts might be acceptable, i.e. fall-then-rise, however beyond three parts would be undesirable. In [a] we have a visual summary of the above, and one tentative worked example. We look forward to the community’s input. Many thanks Keogh's Lab [a] PDF: https://www.dropbox.com/s/n1aybeps5p2ho5k/Finding%20Better%20Names%20for%20the%20Catch22%20Features.pdf?dl=0 PPT: https://www.dropbox.com/s/kxodalw2beyz86j/Finding%20Better%20Names%20for%20the%20Catch22%20Features.pptx?dl=0 submitted by /u/eamonnkeogh [link] [comments]  ( 43 min )
    [P] MIT Introduction to Data-Centric AI
    Announcing the first-ever course on Data-Centric AI. Learn how to train better ML models by improving the data. Course homepage | Lecture videos on YouTube | Lab Assignments The course covers: Data-Centric AI vs. Model-Centric AI Label Errors Dataset Creation and Curation Data-centric Evaluation of ML Models Class Imbalance, Outliers, and Distribution Shift Growing or Compressing Datasets Interpretability in Data-Centric ML Encoding Human Priors: Data Augmentation and Prompt Engineering Data Privacy and Security MIT, like most universities, has many courses on machine learning (6.036, 6.867, and many others). Those classes teach techniques to produce effective models for a given dataset, and the classes focus heavily on the mathematical details of models rather than practical applications. However, in real-world applications of ML, the dataset is not fixed, and focusing on improving the data often gives better results than improving the model. We’ve personally seen this time and time again in our applied ML work as well as our research. Data-Centric AI (DCAI) is an emerging science that studies techniques to improve datasets in a systematic/algorithmic way — given that this topic wasn’t covered in the standard curriculum, we (a group of PhD candidates and grads) thought that we should put together a new class! We taught this intensive 2-week course in January over MIT’s IAP term, and we’ve just published all the course material, including lecture videos, lecture notes, hands-on lab assignments, and lab solutions, in hopes that people outside the MIT community would find these resources useful. We’d be happy to answer any questions related to the class or DCAI in general, and we’d love to hear any feedback on how we can improve the course material. Introduction to Data-Centric AI is open-source opencourseware, so feel free to make improvements directly: https://github.com/dcai-course/dcai-course. submitted by /u/anishathalye [link] [comments]  ( 44 min )
    [D] Faster Flan-T5 inference
    What's the best way to improve the inference speed of a Flan-T5 model? Onnx runtime doesn't seem to work for T5 models & Torchscript also doesn't seem to help speed it up (not sure why!) submitted by /u/_learn_faster_ [link] [comments]  ( 43 min )
    [R] Provable Copyright Protection for Generative Models
    Hi everyone, in a new paper we give a way to certify that a generative model does not infringe on the copyright of data that was in its training set. Twitter thread: https://twitter.com/boazbaraktcs/status/1628219647651729409 Blogpost: https://windowsontheory.org/2023/02/21/provable-copyright-protection-for-generative-models/ Paper: https://arxiv.org/abs/2302.10870 Abstract: There is a growing concern that learned conditional generative models may output samples that are substantially similar to some copyrighted data C that was in their training set. We give a formal definition of near access-freeness (NAF) and prove bounds on the probability that a model satisfying this definition outputs a sample similar to C, even if C is included in its training set. Roughly speaking, a generative model p is k-NAF if for every potentially copyrighted data C, the output of p diverges by at most k-bits from the output of a model q that did not access C at all. We also give generative model learning algorithms, which efficiently modify the original generative model learning algorithm in a black box manner, that output generative models with strong bounds on the probability of sampling protected content. Furthermore, we provide promising experiments for both language (transformers) and image (diffusion) generative models, showing minimal degradation in output quality while ensuring strong protections against sampling protected content. submitted by /u/vyasnikhil96 [link] [comments]  ( 48 min )
    [P] Discretization: equal-width trumps equal-frequency?
    So it seems in this test of the four popular scikit-learn datasets. The test uses as judging criteria the accuracy reported by a special classifier. In two of the datasets (iris and digits) the equal-width method markedly outperforms equal-frequency. In the other two datasets evaluated the differences are much narrower and could be considered as a tie result. The observations appear to be rather consistent when varying the number of bins used to discretize the attribute values. This seems counter-intuitive; equal-frequency should have an advantage by providing better immunity in the presence of outliers. Any thoughts? The used classifier, "deodel", discretizes continuous attributes using one of the two methods. After discretization, it behaves like a Hamming distance nearest neighbor…  ( 45 min )
    [D] Visualizing layer weights
    I was reading this paper, and I really liked the visualization of the conv layer weights in Figure 5. It's similar to the figures in this talk at Microsoft at 11:25. Does anyone know what this visualization is called and/or methods to use it? submitted by /u/like_a_tensor [link] [comments]  ( 43 min )
    [D] "Deep learning is the only thing that currently works at scale"
    "Deep learning is the only thing that currently works at scale it's the only class of algorithms that is able to discover arbitrary functions in a reasonable amount of time." https://www.youtube.com/watch?v=p-OYPRhqRCg I know of the universal approximation theorem. But is there any mathematical formulation of this statement? submitted by /u/GraciousReformer [link] [comments]  ( 50 min )
    [R] Running evolution as an optimization process on yeast cells
    Not published in an open journal sadly. Press release. TL;DR they set up a loss function (fastest growing survives) and evolved a bunch of yeast cells towards that loss function. This is a classic experiment, but they sequenced the DNA at each step and got a lot of cool data. The yeast cells converged much like you'd expect from an optimizer: The results of the experiment showed that in a controlled environment, evolutionary contingency led to convergence rather than divergence at the fitness level. Simply put, while the various yeast strains did mutate in different ways, they all arrived at a similar evolutionary endpoint regardless of their mutations. I wonder if you could do this more quickly using gradient descent or other algorithms from machine learning. Since they're already sequencing the DNA at each step, they could have estimated the gradient and edited it back into the yeast. It would likely converge on similar solutions, but faster. submitted by /u/currentscurrents [link] [comments]  ( 44 min )
  • Open

    Research Focus: Week of February 20, 2023
    Welcome to Research Focus, a new series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft. Many real-world applications require sequential decision making, where an agent interacts with a stochastic environment to perform a task. For example, a navigating robot is expected to […] The post Research Focus: Week of February 20, 2023 appeared first on Microsoft Research.  ( 11 min )
  • Open

    AI Learns to Walk, Hop, and Roll...I guess?
    submitted by /u/Ziinxx [link] [comments]  ( 40 min )
    Artificial intelligence (AI) - The system needs new structures -
    Artificial intelligence (AI) - The system needs new structures - "I thought the whole idea of strong AI is that we don't have to know how the brain works in order to know how the mind works." (John Searle: "Minds, Brains, and Programs." 2000, p. 146) Construction 1 This is "Construction 1" of my entire essay "The system needs new structures - not only for/against Artificial Intelligence (AI)" and forms the conclusion to the trilogy of "philosophy of science" (https://philosophies.de/index.php/category/wissenschaftstheorie/). The 5 basic theses for a "new science" - the current state of current AI-development This first part of the essay deals with the 5 basic theses on a "new science" as structural change and the current status of current AI development published on: https://philosophies.de/index.php/2021/08/14/das-system-braucht-neue-strukturen/ There is an orange translation button "Translate>>" at the bottom left. submitted by /u/philosophiesde [link] [comments]  ( 41 min )
  • Open

    Sample Factory with VizDoom (Doom) (Deep Reinforcement Learning Course by Hugging Face 🤗)
    Hey there, We just wrote a tutorial on how to train agents playing Doom with Sample-Factory 🔫 🔥 You'll learn a new library: Sample Factory and you’ll train a PPO agent to play DOOM 🔫 🔥 Sounds fun? Start learning now 👉 https://huggingface.co/deep-rl-course/unit8/introduction-sf https://preview.redd.it/kje06rq7orja1.png?width=1920&format=png&auto=webp&s=1b556f35f779c8d5ba7d7feccc0f1c111d79b5d9 You didn’t start the course yet? You can do this tutorial as a standalone or start from the beginning, we wrote a guide to help you get started: https://huggingface.co/spaces/ThomasSimonini/Check-my-progress-Deep-RL-Course We also wrote an introduction unit to help you get started. You can start learning now 👉 https://huggingface.co/deep-rl-course/unit0/introduction If you have questions or feedback I would love to answer them. Keep Learning stay awesome submitted by /u/cranthir_ [link] [comments]  ( 41 min )
    DQN not learning after changing training and sampling scheme
    I am working on a project using DQN. Some hyperparameters I think will be relevant to my issue are as follow: target network update frequency = 5000 experience replay capacity = 200000 batch size = 64 exploration greedy epsilon decreases linearly from 1 to 0.05 over 50000 iterations Previously, in the warmup stage I would fill the replay buffer by playing 50 games (=50 complete trajectories) from the training set, each game would generate around 300 tuples. Once the training phase is formally started after the warmup stage, every 200 training iterations I would again generate 50 complete trajectories and put into the replay buffer (I also tried before that generate 2 complete trajectories every 10 training iterations, it also worked). This version worked, the metric was improving (my objective is to minimize a given metric, and the metric is decreasing well on validation set) although the loss did not decrease (I read somewhere that for reinforcement learning the loss does not really matter?) Now, I change the training and sampling scheme to the following: in the warmup stage, I play games to generate 50000 tuples to fill the replay buffer. Once the warmup stage is done, in between every training iteration I would play games to generate 64 tuples to be put into the buffer, so it is not a complete trajectory. I think this is what most people would do in contrast to my previous training and sampling scheme. However, after changing my framework to this scheme, my model is not learning, the metric on my validation set fluctuates even though hyperparameters, network structure and everything else stay the same. I tried changing target network update frequency, learning rate, changed from generating 64 tuples in between training iteration to 32 tuples, exploration epsilon decay rate, it is still not learning. Any idea why or what I can attempt to see what‘s the culprit? submitted by /u/butterJM [link] [comments]  ( 43 min )
    Unity-ML PPO is not solving the environment
    I've been trying to train a Robot arm to grab boxes and put them inside a Container, using Observations from the camera. however, the tensorboard outputs seem to indicate that the policy is not learning at all (I left it for 2 days on my PC"CPU") so before I try to leave it any longer I thought to apply curriculum learning or alter reards functions.anyone has any Idea what's the right step to take here? the Reward functions I used are : -0.1 time penalty distance reward = 1-(distance)^0.4, Velocity Reward = 1- max(velocity,0.1)^0.4 [Used to handle Speed of object while being put in container] total reward = distance reward * velocity reward some spare read => putting a box in container +100, putting all boxes inside container +1000, throwing boxes on ground -100, and terminating the episode I started with a smaller network PPO and tried to alter the configuration , here are the final ones I used for this training run ppo hyperparameters: batch_size: 1024 buffer_size: 10240 learning_rate: 3e-05 beta: 0.005 epsilon: 0.2 lambd: 0.95 num_epoch: 3 learning_rate_schedule: linear beta_schedule: linear epsilon_schedule: linear network_settings: normalize: True hidden_units: 512 num_layers: 5 vis_encode_type: simple memory: None goal_conditioning_type: hyper deterministic: False reward_signals: extrinsic: gamma: 0.99 strength: 1.0 network_settings: normalize: False hidden_units: 128 num_layers: 2 vis_encode_type: simple memory: None goal_conditioning_type: hyper deterministic: False init_path: None keep_checkpoints: 5 checkpoint_interval: 500000 max_steps: 5000000 time_horizon: 64 summary_freq: 10000 threaded: False self_play: None behavioral_cloning: None ​ ​ https://preview.redd.it/ddzg4pe2xqja1.png?width=815&format=png&auto=webp&s=96bec39f86f2f0cf3e869ac08e33537ddbd2c0ab submitted by /u/Smart_Reward3471 [link] [comments]  ( 43 min )
    Convolutional Dueling Q Network Ripping Snake
    submitted by /u/auto_mata [link] [comments]  ( 41 min )
    I used RL to teach an AI to walk and hop
    submitted by /u/Stochastic_Machine [link] [comments]  ( 6 min )
  • Open

    Suppressing quantum errors by scaling a surface code logical qubit
    Posted by Hartmut Neven, VP of Engineering, and Julian Kelly, Director of Quantum Hardware, on behalf of the Google Quantum AI Team Many years from today, scientists will be able to use fault-tolerant quantum computers for large-scale computations with applications across science and industry. These quantum computers will be much bigger than today, consisting of millions of coherent quantum bits, or qubits. But there’s a catch — these basic building blocks must be good enough or the systems will be overrun with errors. Currently, the error rates of the qubits on our 3rd generation Sycamore processor are typically between 1 in 10,000 to 1 in 100. Through our work and that of others, we understand that developing large-scale quantum computers will require far lower error rates. We will n…  ( 94 min )
  • Open

    New NVIDIA Studio Laptops Powered by GeForce RTX 4070, 4060, 4050 Laptop GPUs Boost On-the-Go Content Creation
    Laptops equipped with NVIDIA GeForce RTX 4070, 4060 and 4050 GPUs are now available. The new lineup — including NVIDIA Studio-validated laptops from ASUS, GIGABYTE and Samsung — gives creators more options to create from anywhere with lighter, thinner devices that dramatically exceed the performance of the last generation.  ( 8 min )
  • Open

    Sorry, There Are No Shortcuts To Transformation
    Alan Morrison, contributor at Data Science Central, recently integrated two of my blogs (one recent and many moons ago) into an interesting perspective that he shared on the Data Science Central email distribution list (get on it if you are not already). Alan’s key points are this: Sorry, but there are no shortcuts if you… Read More »Sorry, There Are No Shortcuts To Transformation The post Sorry, There Are No Shortcuts To Transformation appeared first on Data Science Central.  ( 19 min )
  • Open

    Divisibility by base + 1
    To test whether a number is divisible by 11, add every other digit together and subtract the rest of the digits. The result is divisible by 11 if and only if the original number is divisible by 11. For example, start with n = 31425. Add 3, 4, and 5, and subtract 1 and 2. […] Divisibility by base + 1 first appeared on John D. Cook.  ( 5 min )
  • Open

    Discriminative Clustering with Representation Learning with any Ratio of Labeled to Unlabeled Data. (arXiv:1912.12979v2 [stat.ML] UPDATED)
    We present a discriminative clustering approach in which the feature representation can be learned from data and moreover leverage labeled data. Representation learning can give a similarity-based clustering method the ability to automatically adapt to an underlying, yet hidden, geometric structure of the data. The proposed approach augments the DIFFRAC method with a representation learning capability, using a gradient-based stochastic training algorithm and an optimal transport algorithm with entropic regularization to perform the cluster assignment step. The resulting method is evaluated on several real datasets when varying the ratio of labeled data to unlabeled data and thereby interpolating between the fully unsupervised regime and the fully supervised regime. The experimental results suggest that the proposed method can learn powerful feature representations even in the fully unsupervised regime and can leverage even small amounts of labeled data to improve the feature representations and to obtain better clusterings of complex datasets.  ( 2 min )
    SAITS: Self-Attention-based Imputation for Time Series. (arXiv:2202.08516v3 [cs.LG] UPDATED)
    Missing data in time series is a pervasive problem that puts obstacles in the way of advanced analysis. A popular solution is imputation, where the fundamental challenge is to determine what values should be filled in. This paper proposes SAITS, a novel method based on the self-attention mechanism for missing value imputation in multivariate time series. Trained by a joint-optimization approach, SAITS learns missing values from a weighted combination of two diagonally-masked self-attention (DMSA) blocks. DMSA explicitly captures both the temporal dependencies and feature correlations between time steps, which improves imputation accuracy and training speed. Meanwhile, the weighted-combination design enables SAITS to dynamically assign weights to the learned representations from two DMSA blocks according to the attention map and the missingness information. Extensive experiments quantitatively and qualitatively demonstrate that SAITS outperforms the state-of-the-art methods on the time-series imputation task efficiently and reveal SAITS' potential to improve the learning performance of pattern recognition models on incomplete time-series data from the real world.  ( 2 min )
    Unsupervised Task Graph Generation from Instructional Video Transcripts. (arXiv:2302.09173v1 [cs.AI])
    This work explores the problem of generating task graphs of real-world activities. Different from prior formulations, we consider a setting where text transcripts of instructional videos performing a real-world activity (e.g., making coffee) are provided and the goal is to identify the key steps relevant to the task as well as the dependency relationship between these key steps. We propose a novel task graph generation approach that combines the reasoning capabilities of instruction-tuned language models along with clustering and ranking components to generate accurate task graphs in a completely unsupervised manner. We show that the proposed approach generates more accurate task graphs compared to a supervised learning approach on tasks from the ProceL and CrossTask datasets.  ( 2 min )
    The Unfairness of Fair Machine Learning: Levelling down and strict egalitarianism by default. (arXiv:2302.02404v2 [cs.AI] UPDATED)
    In recent years fairness in machine learning (ML) has emerged as a highly active area of research and development. Most define fairness in simple terms, where fairness means reducing gaps in performance or outcomes between demographic groups while preserving as much of the accuracy of the original system as possible. This oversimplification of equality through fairness measures is troubling. Many current fairness measures suffer from both fairness and performance degradation, or "levelling down," where fairness is achieved by making every group worse off, or by bringing better performing groups down to the level of the worst off. When fairness can only be achieved by making everyone worse off in material or relational terms through injuries of stigma, loss of solidarity, unequal concern, and missed opportunities for substantive equality, something would appear to have gone wrong in translating the vague concept of 'fairness' into practice. This paper examines the causes and prevalence of levelling down across fairML, and explore possible justifications and criticisms based on philosophical and legal theories of equality and distributive justice, as well as equality law jurisprudence. We find that fairML does not currently engage in the type of measurement, reporting, or analysis necessary to justify levelling down in practice. We propose a first step towards substantive equality in fairML: "levelling up" systems by design through enforcement of minimum acceptable harm thresholds, or "minimum rate constraints," as fairness constraints. We likewise propose an alternative harms-based framework to counter the oversimplified egalitarian framing currently dominant in the field and push future discussion more towards substantive equality opportunities and away from strict egalitarianism by default. N.B. Shortened abstract, see paper for full abstract.  ( 2 min )
    Efficient Data Analytics on Augmented Similarity Triplets. (arXiv:1912.12064v3 [cs.LG] UPDATED)
    Data analysis require a pairwise proximity measure over objects. Recent work has extended this to situations where the distance information between objects is given as comparison results of distances between three objects (triplets). Humans find the comparison tasks much easier than the exact distance computation and such data can be easily obtained in big quantity via crowd-sourcing. In this work, we propose triplets augmentation, an efficient method to extend the triplets data by inferring the hidden implicit information form the existing data. Triplets augmentation improves the quality of kernel-based and kernel-free data analytics. We also propose a novel set of algorithms for common data analysis tasks based on triplets. These methods work directly with triplets and avoid kernel evaluations, thus are scalable to big data. We demonstrate that our methods outperform the current best-known techniques and are robust to noisy data.  ( 2 min )
    On the Relation between Sensitivity and Accuracy in In-context Learning. (arXiv:2209.07661v2 [cs.CL] UPDATED)
    In-context learning (ICL) suffers from oversensitivity to the prompt, making it unreliable in real-world scenarios. We study the sensitivity of ICL with respect to multiple perturbation types. First, we find that label bias obscures the true sensitivity, and therefore prior work may have significantly underestimated ICL sensitivity. Second, we observe a strong negative correlation between ICL sensitivity and accuracy: predictions sensitive to perturbations are less likely to be correct. Motivated by these findings, we propose \textsc{SenSel}, a few-shot selective prediction method that abstains from sensitive predictions. Experiments on ten classification datasets show that \textsc{SenSel} consistently outperforms two commonly used confidence-based and entropy-based baselines on abstention decisions.  ( 2 min )
    Learning Language Representations with Logical Inductive Bias. (arXiv:2302.09458v1 [cs.CL])
    Transformer architectures have achieved great success in solving natural language tasks, which learn strong language representations from large-scale unlabeled texts. In this paper, we seek to go further beyond and explore a new logical inductive bias for better language representation learning. Logic reasoning is known as a formal methodology to reach answers from given knowledge and facts. Inspired by such a view, we develop a novel neural architecture named FOLNet (First-Order Logic Network), to encode this new inductive bias. We construct a set of neural logic operators as learnable Horn clauses, which are further forward-chained into a fully differentiable neural architecture (FOLNet). Interestingly, we find that the self-attention module in transformers can be composed by two of our neural logic operators, which probably explains their strong reasoning performance. Our proposed FOLNet has the same input and output interfaces as other pretrained models and thus could be pretrained/finetuned by using similar losses. It also allows FOLNet to be used in a plug-and-play manner when replacing other pretrained models. With our logical inductive bias, the same set of ``logic deduction skills'' learned through pretraining are expected to be equally capable of solving diverse downstream tasks. For this reason, FOLNet learns language representations that have much stronger transfer capabilities. Experimental results on several language understanding tasks show that our pretrained FOLNet model outperforms the existing strong transformer-based approaches.  ( 2 min )
    ET-AL: Entropy-Targeted Active Learning for Bias Mitigation in Materials Data. (arXiv:2211.07881v4 [cond-mat.mtrl-sci] UPDATED)
    Growing materials data and data-driven informatics drastically promote the discovery and design of materials. While there are significant advancements in data-driven models, the quality of data resources is less studied despite its huge impact on model performance. In this work, we focus on data bias arising from uneven coverage of materials families in existing knowledge. Observing different diversities among crystal systems in common materials databases, we propose an information entropy-based metric for measuring this bias. To mitigate the bias, we develop an entropy-targeted active learning (ET-AL) framework, which guides the acquisition of new data to improve the diversity of underrepresented crystal systems. We demonstrate the capability of ET-AL for bias mitigation and the resulting improvement in downstream machine learning models. This approach is broadly applicable to data-driven materials discovery, including autonomous data acquisition and dataset trimming to reduce bias, as well as data-driven informatics in other scientific domains.  ( 2 min )
    Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs. (arXiv:2210.12283v3 [cs.AI] UPDATED)
    The formalization of existing mathematical proofs is a notoriously difficult process. Despite decades of research on automation and proof assistants, writing formal proofs remains arduous and only accessible to a few experts. While previous studies to automate formalization focused on powerful search algorithms, no attempts were made to take advantage of available informal proofs. In this work, we introduce Draft, Sketch, and Prove (DSP), a method that maps informal proofs to formal proof sketches, and uses the sketches to guide an automated prover by directing its search to easier sub-problems. We investigate two relevant setups where informal proofs are either written by humans or generated by a language model. Our experiments and ablation studies show that large language models are able to produce well-structured formal sketches that follow the same reasoning steps as the informal proofs. Guiding an automated prover with these sketches enhances its performance from 20.9% to 39.3% on a collection of mathematical competition problems.  ( 2 min )
    Probabilistic forecasts of extreme heatwaves using convolutional neural networks in a regime of lack of data. (arXiv:2208.00971v2 [physics.ao-ph] UPDATED)
    Understanding extreme events and their probability is key for the study of climate change impacts, risk assessment, adaptation, and the protection of living beings. Forecasting the occurrence probability of extreme heatwaves is a primary challenge for risk assessment and attribution, but also for fundamental studies about processes, dataset and model validation, and climate change studies. In this work we develop a methodology to build forecasting models which are based on convolutional neural networks, trained on extremely long climate model outputs. We demonstrate that neural networks have positive predictive skills, with respect to random climatological forecasts, for the occurrence of long-lasting 14-day heatwaves over France, up to 15 days ahead of time for fast dynamical drivers (500 hPa geopotential height fields), and also at much longer lead times for slow physical drivers (soil moisture). This forecast is made seamlessly in time and space, for fast hemispheric and slow local drivers. We find that the neural network selects extreme heatwaves associated with a North-Hemisphere wavenumber-3 pattern. The main scientific message is that most of the time, training neural networks for predicting extreme heatwaves occurs in a regime of lack of data. We suggest that this is likely to be the case for most other applications to large scale atmosphere and climate phenomena. For instance, using one hundred years-long training sets, a regime of drastic lack of data, leads to severely lower predictive skills and general inability to extract useful information available in the 500 hPa geopotential height field at a hemispheric scale in contrast to the dataset of several thousand years long. We discuss perspectives for dealing with the lack of data regime, for instance rare event simulations and how transfer learning may play a role in this latter task.  ( 3 min )
    Contrastive Learning as Goal-Conditioned Reinforcement Learning. (arXiv:2206.07568v2 [cs.LG] UPDATED)
    In reinforcement learning (RL), it is easier to solve a task if given a good representation. While deep RL should automatically acquire such good representations, prior work often finds that learning representations in an end-to-end fashion is unstable and instead equip RL algorithms with additional representation learning parts (e.g., auxiliary losses, data augmentation). How can we design RL algorithms that directly acquire good representations? In this paper, instead of adding representation learning parts to an existing RL algorithm, we show (contrastive) representation learning methods can be cast as RL algorithms in their own right. To do this, we build upon prior work and apply contrastive representation learning to action-labeled trajectories, in such a way that the (inner product of) learned representations exactly corresponds to a goal-conditioned value function. We use this idea to reinterpret a prior RL method as performing contrastive learning, and then use the idea to propose a much simpler method that achieves similar performance. Across a range of goal-conditioned RL tasks, we demonstrate that contrastive RL methods achieve higher success rates than prior non-contrastive methods, including in the offline RL setting. We also show that contrastive RL outperforms prior methods on image-based tasks, without using data augmentation or auxiliary objectives.  ( 2 min )
    Interpreting Embedding Spaces by Conceptualization. (arXiv:2209.00445v2 [cs.CL] UPDATED)
    One of the main methods for semantic interpretation of text is mapping it into a vector in some embedding space. Such vectors can then be used for a variety of text processing tasks. Recently, most embedding spaces are a product of training large language models. One major drawback of this type of representation is its incomprehensibility to humans. Understanding the embedding space is crucial for several important needs, including the need to explain the decision of a system that uses the embedding, the need to debug the embedding method and compare it to alternatives, and the need to detect biases hidden in the model. In this paper, we present a novel method of transforming any embedding space into a comprehensible conceptual space. We first present an algorithm for deriving a conceptual space with dynamic on-demand granularity. We then show a method for transferring any vector in the original incomprehensible space to an understandable vector in the conceptual space. We combine human tests with cross-model tests to show that the conceptualized vectors indeed represent the semantics of the original vectors. We also show how the conceptualized vectors can be used for various tasks including identifying weaknesses in the semantics underlying the original spaces and differences in the semantics of alternative models.  ( 2 min )
    When Personalization Harms: Reconsidering the Use of Group Attributes in Prediction. (arXiv:2206.02058v2 [stat.ML] UPDATED)
    Machine learning models are often personalized with categorical attributes that are protected, sensitive, self-reported, or costly to acquire. In this work, we show models that are personalized with group attributes can reduce performance at a group level. We propose formal conditions to ensure the "fair use" of group attributes in prediction tasks by training one additional model -- i.e., collective preference guarantees to ensure that each group who provides personal data will receive a tailored gain in performance in return. We present sufficient conditions to ensure fair use in empirical risk minimization and characterize failure modes that lead to fair use violations due to standard practices in model development and deployment. We present a comprehensive empirical study of fair use in clinical prediction tasks. Our results demonstrate the prevalence of fair use violations in practice and illustrate simple interventions to mitigate their harm.  ( 2 min )
    Optimising Human-Machine Collaboration for Efficient High-Precision Information Extraction from Text Documents. (arXiv:2302.09324v1 [cs.CL])
    While humans can extract information from unstructured text with high precision and recall, this is often too time-consuming to be practical. Automated approaches, on the other hand, produce nearly-immediate results, but may not be reliable enough for high-stakes applications where precision is essential. In this work, we consider the benefits and drawbacks of various human-only, human-machine, and machine-only information extraction approaches. We argue for the utility of a human-in-the-loop approach in applications where high precision is required, but purely manual extraction is infeasible. We present a framework and an accompanying tool for information extraction using weak-supervision labelling with human validation. We demonstrate our approach on three criminal justice datasets. We find that the combination of computer speed and human understanding yields precision comparable to manual annotation while requiring only a fraction of time, and significantly outperforms fully automated baselines in terms of precision.  ( 2 min )
    Implementing Neural Network-Based Equalizers in a Coherent Optical Transmission System Using Field-Programmable Gate Arrays. (arXiv:2212.04703v2 [eess.SP] UPDATED)
    In this work, we demonstrate the offline FPGA realization of both recurrent and feedforward neural network (NN)-based equalizers for nonlinearity compensation in coherent optical transmission systems. First, we present a realization pipeline showing the conversion of the models from Python libraries to the FPGA chip synthesis and implementation. Then, we review the main alternatives for the hardware implementation of nonlinear activation functions. The main results are divided into three parts: a performance comparison, an analysis of how activation functions are implemented, and a report on the complexity of the hardware. The performance in Q-factor is presented for the cases of bidirectional long-short-term memory coupled with convolutional NN (biLSTM + CNN) equalizer, CNN equalizer, and standard 1-StpS digital back-propagation (DBP) for the simulation and experiment propagation of a single channel dual-polarization (SC-DP) 16QAM at 34 GBd along 17x70km of LEAF. The biLSTM+CNN equalizer provides a similar result to DBP and a 1.7 dB Q-factor gain compared with the chromatic dispersion compensation baseline in the experimental dataset. After that, we assess the Q-factor and the impact of hardware utilization when approximating the activation functions of NN using Taylor series, piecewise linear, and look-up table (LUT) approximations. We also show how to mitigate the approximation errors with extra training and provide some insights into possible gradient problems in the LUT approximation. Finally, to evaluate the complexity of hardware implementation to achieve 200G and 400G throughput, fixed-point NN-based equalizers with approximated activation functions are developed and implemented in an FPGA.  ( 3 min )
    Learning to Increase the Power of Conditional Randomization Tests. (arXiv:2207.01022v2 [cs.LG] UPDATED)
    The model-X conditional randomization test is a generic framework for conditional independence testing, unlocking new possibilities to discover features that are conditionally associated with a response of interest while controlling type-I error rates. An appealing advantage of this test is that it can work with any machine learning model to design powerful test statistics. In turn, the common practice in the model-X literature is to form a test statistic using machine learning models, trained to maximize predictive accuracy with the hope to attain a test with good power. However, the ideal goal here is to drive the model (during training) to maximize the power of the test, not merely the predictive accuracy. In this paper, we bridge this gap by introducing, for the first time, novel model-fitting schemes that are designed to explicitly improve the power of model-X tests. This is done by introducing a new cost function that aims at maximizing the test statistic used to measure violations of conditional independence. Using synthetic and real data sets, we demonstrate that the combination of our proposed loss function with various base predictive models (lasso, elastic net, and deep neural networks) consistently increases the number of correct discoveries obtained, while maintaining type-I error rates under control.  ( 2 min )
    HAC-Net: A Hybrid Attention-Based Convolutional Neural Network for Highly Accurate Protein-Ligand Binding Affinity Prediction. (arXiv:2212.12440v2 [q-bio.BM] UPDATED)
    Applying deep learning concepts from image detection and graph theory has greatly advanced protein-ligand binding affinity prediction, a challenge with enormous ramifications for both drug discovery and protein engineering. We build upon these advances by designing a novel deep learning architecture consisting of a 3-dimensional convolutional neural network utilizing channel-wise attention and two graph convolutional networks utilizing attention-based aggregation of node features. HAC-Net (Hybrid Attention-Based Convolutional Neural Network) obtains state-of-the-art results on the PDBbind v.2016 core set, the most widely recognized benchmark in the field. We extensively assess the generalizability of our model using multiple train-test splits, each of which maximizes differences between either protein structures, protein sequences, or ligand extended-connectivity fingerprints of complexes in the training and test sets. Furthermore, we perform 10-fold cross-validation with a similarity cutoff between SMILES strings of ligands in the training and test sets, and also evaluate the performance of HAC-Net on lower-quality data. We envision that this model can be extended to a broad range of supervised learning problems related to structure-based biomolecular property prediction. All of our software is available as open source at https://github.com/gregory-kyro/HAC-Net/, and the HACNet Python package is available through PyPI.  ( 2 min )
    Mimetic Muscle Rehabilitation Analysis Using Clustering of Low Dimensional 3D Kinect Data. (arXiv:2302.09295v1 [cs.CY])
    Facial nerve paresis is a severe complication that arises post-head and neck surgery; This results in articulation problems, facial asymmetry, and severe problems in non-verbal communication. To overcome the side effects of post-surgery facial paralysis, rehabilitation requires which last for several weeks. This paper discusses an unsupervised approach to rehabilitating patients who have temporary facial paralysis due to damage in mimetic muscles. The work aims to make the rehabilitation process objective compared to the current subjective approach, such as House-Brackmann (HB) scale. Also, the approach will assist clinicians by reducing their workload in assessing the improvement during rehabilitation. This paper focuses on the clustering approach to monitor the rehabilitation process. We compare the results obtained from different clustering algorithms on various forms of the same data set, namely dynamic form, data expressed as functional data using B-spline basis expansion, and by finding the functional principal components of the functional data. The study contains data set of 85 distinct patients with 120 measurements obtained using a Kinect stereo-vision camera. The method distinguish effectively between patients with the least and greatest degree of facial paralysis, however patients with adjacent degrees of paralysis provide some challenges. In addition, we compared the cluster results to the HB scale outputs.  ( 2 min )
    Newton-type Methods for Minimax Optimization. (arXiv:2006.14592v3 [cs.LG] UPDATED)
    Differential games, in particular two-player sequential zero-sum games (a.k.a. minimax optimization), have been an important modeling tool in applied science and received renewed interest in machine learning due to many recent applications, such as adversarial training, generative models and reinforcement learning. However, existing theory mostly focuses on convex-concave functions with few exceptions. In this work, we propose two novel Newton-type algorithms for nonconvex-nonconcave minimax optimization. We prove their local convergence at strict local minimax points, which are surrogates of global solutions. We argue that our Newton-type algorithms nicely complement existing ones in that (a) they converge faster to strict local minimax points; (b) they are much more effective when the problem is ill-conditioned; (c) their computational complexity remains similar. We verify the effectiveness of our Newton-type algorithms through experiments on training GANs which are intrinsically nonconvex and ill-conditioned. Our code is available at https://github.com/watml/min-max-2nd-order.  ( 2 min )
    Differentially Private Bayesian Neural Networks on Accuracy, Privacy and Reliability. (arXiv:2107.08461v2 [cs.LG] UPDATED)
    Bayesian neural network (BNN) allows for uncertainty quantification in prediction, offering an advantage over regular neural networks that has not been explored in the differential privacy (DP) framework. We fill this important gap by leveraging recent development in Bayesian deep learning and privacy accounting to offer a more precise analysis of the trade-off between privacy and accuracy in BNN. We propose three DP-BNNs that characterize the weight uncertainty for the same network architecture in distinct ways, namely DP-SGLD (via the noisy gradient method), DP-BBP (via changing the parameters of interest) and DP-MC Dropout (via the model architecture). Interestingly, we show a new equivalence between DP-SGD and DP-SGLD, implying that some non-Bayesian DP training naturally allows for uncertainty quantification. However, the hyperparameters such as learning rate and batch size, can have different or even opposite effects in DP-SGD and DP-SGLD. Extensive experiments are conducted to compare DP-BNNs, in terms of privacy guarantee, prediction accuracy, uncertainty quantification, calibration, computation speed, and generalizability to network architecture. As a result, we observe a new tradeoff between the privacy and the reliability. When compared to non-DP and non-Bayesian approaches, DP-SGLD is remarkably accurate under strong privacy guarantee, demonstrating the great potential of DP-BNN in real-world tasks.  ( 2 min )
    A kernel-based quantum random forest for improved classification. (arXiv:2210.02355v2 [quant-ph] UPDATED)
    The emergence of Quantum Machine Learning (QML) to enhance traditional classical learning methods has seen various limitations to its realisation. There is therefore an imperative to develop quantum models with unique model hypotheses to attain expressional and computational advantage. In this work we extend the linear quantum support vector machine (QSVM) with kernel function computed through quantum kernel estimation (QKE), to form a decision tree classifier constructed from a decision directed acyclic graph of QSVM nodes - the ensemble of which we term the quantum random forest (QRF). To limit overfitting, we further extend the model to employ a low-rank Nystr\"{o}m approximation to the kernel matrix. We provide generalisation error bounds on the model and theoretical guarantees to limit errors due to finite sampling on the Nystr\"{o}m-QKE strategy. In doing so, we show that we can achieve lower sampling complexity when compared to QKE. We numerically illustrate the effect of varying model hyperparameters and finally demonstrate that the QRF is able obtain superior performance over QSVMs, while also requiring fewer kernel estimations.  ( 2 min )
    Scalable Marked Point Processes for Exchangeable and Non-Exchangeable Event Sequences. (arXiv:2105.14574v3 [stat.ML] UPDATED)
    We adopt the interpretability offered by a parametric, Hawkes-process-inspired conditional probability mass function for the marks and apply variational inference techniques to derive a general and scalable inferential framework for marked point processes. The framework can handle both exchangeable and non-exchangeable event sequences with minimal tuning and without any pre-training. This contrasts with many parametric and non-parametric state-of-the-art methods that typically require pre-training and/or careful tuning, and can only handle exchangeable event sequences. The framework's competitive computational and predictive performance against other state-of-the-art methods are illustrated through real data experiments. Its attractiveness for large-scale applications is demonstrated through a case study involving all events occurring in an English Premier League season.  ( 2 min )
    Identifying Weight-Variant Latent Causal Models. (arXiv:2208.14153v5 [cs.LG] UPDATED)
    The task of causal representation learning aims to uncover latent higher-level causal representations that affect lower-level observations. Identifying true latent causal representations from observed data, while allowing instantaneous causal relations among latent variables, remains a challenge, however. To this end, we start from the analysis of three intrinsic properties in identifying latent space from observations: transitivity, permutation indeterminacy, and scaling indeterminacy. We find that transitivity acts as a key role in impeding the identifiability of latent causal representations. To address the unidentifiable issue due to transitivity, we introduce a novel identifiability condition where the underlying latent causal model satisfies a linear-Gaussian model, in which the causal coefficients and the distribution of Gaussian noise are modulated by an additional observed variable. Under some mild assumptions, we can show that the latent causal representations can be identified up to trivial permutation and scaling. Furthermore, based on this theoretical result, we propose a novel method, termed Structural caUsAl Variational autoEncoder, which directly learns latent causal representations and causal relationships among them, together with the mapping from the latent causal variables to the observed ones. We show that the proposed method learns the true parameters asymptotically. Experimental results on synthetic and real data demonstrate the identifiability and consistency results and the efficacy of the proposed method in learning latent causal representations.  ( 2 min )
    Dual-Domain Self-Supervised Learning for Accelerated Non-Cartesian MRI Reconstruction. (arXiv:2302.09244v1 [eess.IV])
    While enabling accelerated acquisition and improved reconstruction accuracy, current deep MRI reconstruction networks are typically supervised, require fully sampled data, and are limited to Cartesian sampling patterns. These factors limit their practical adoption as fully-sampled MRI is prohibitively time-consuming to acquire clinically. Further, non-Cartesian sampling patterns are particularly desirable as they are more amenable to acceleration and show improved motion robustness. To this end, we present a fully self-supervised approach for accelerated non-Cartesian MRI reconstruction which leverages self-supervision in both k-space and image domains. In training, the undersampled data are split into disjoint k-space domain partitions. For the k-space self-supervision, we train a network to reconstruct the input undersampled data from both the disjoint partitions and from itself. For the image-level self-supervision, we enforce appearance consistency obtained from the original undersampled data and the two partitions. Experimental results on our simulated multi-coil non-Cartesian MRI dataset demonstrate that DDSS can generate high-quality reconstruction that approaches the accuracy of the fully supervised reconstruction, outperforming previous baseline methods. Finally, DDSS is shown to scale to highly challenging real-world clinical MRI reconstruction acquired on a portable low-field (0.064 T) MRI scanner with no data available for supervised training while demonstrating improved image quality as compared to traditional reconstruction, as determined by a radiologist study.  ( 2 min )
    Learning Diversified Feature Representations for Facial Expression Recognition in the Wild. (arXiv:2210.09381v2 [cs.CV] UPDATED)
    Diversity of the features extracted by deep neural networks is important for enhancing the model generalization ability and accordingly its performance in different learning tasks. Facial expression recognition in the wild has attracted interest in recent years due to the challenges existing in this area for extracting discriminative and informative features from occluded images in real-world scenarios. In this paper, we propose a mechanism to diversify the features extracted by CNN layers of state-of-the-art facial expression recognition architectures for enhancing the model capacity in learning discriminative features. To evaluate the effectiveness of the proposed approach, we incorporate this mechanism in two state-of-the-art models to (i) diversify local/global features in an attention-based model and (ii) diversify features extracted by different learners in an ensemble-based model. Experimental results on three well-known facial expression recognition in-the-wild datasets, AffectNet, FER+, and RAF-DB, show the effectiveness of our method, achieving the state-of-the-art performance of 89.99% on RAF-DB, 89.34% on FER+ and the competitive accuracy of 60.02% on AffectNet dataset.  ( 2 min )
    Deep Selector-JPEG: Adaptive JPEG Image Compression for Computer Vision in Image classification with Human Vision Criteria. (arXiv:2302.09560v1 [eess.IV])
    With limited storage/bandwidth resources, input images to Computer Vision (CV) applications that use Deep Neural Networks (DNNs) are often encoded with JPEG that is tailored to Human Vision (HV). This paper presents Deep Selector-JPEG, an adaptive JPEG compression method that targets image classification while satisfying HV criteria. For each image, Deep Selector-JPEG selects adaptively a Quality Factor (QF) to compress the image so that a good trade-off between the Compression Ratio (CR) and DNN classifier Accuracy (Rate-Accuracy performance) can be achieved over a set of images for a variety of DNN classifiers while the MS-SSIM of such compressed image is greater than a threshold value predetermined by HV with a high probability. Deep Selector-JPEG is designed via light-weighted or heavy-weighted selector architectures. Experimental results show that in comparison with JPEG at the same CR, Deep Selector-JPEG achieves better Rate-Accuracy performance over the ImageNet validation set for all tested DNN classifiers with gains in classification accuracy between 0.2% and 1% at the same CRs while satisfying HV constraints. Deep Selector-JPEG can also roughly provide the original classification accuracy at higher CRs.  ( 2 min )
    Exploration into Translation-Equivariant Image Quantization. (arXiv:2112.00384v2 [cs.CV] UPDATED)
    This is an exploratory study that discovers the current image quantization (vector quantization) do not satisfy translation equivariance in the quantized space due to aliasing. Instead of focusing on anti-aliasing, we propose a simple yet effective way to achieve translation-equivariant image quantization by enforcing orthogonality among the codebook embeddings. To explore the advantages of translation-equivariant image quantization, we conduct three proof-of-concept experiments with a carefully controlled dataset: (1) text-to-image generation, where the quantized image indices are the target to predict, (2) image-to-text generation, where the quantized image indices are given as a condition, (3) using a smaller training set to analyze sample efficiency. From the strictly controlled experiments, we empirically verify that the translation-equivariant image quantizer improves not only sample efficiency but also the accuracy over VQGAN up to +11.9% in text-to-image generation and +3.9% in image-to-text generation.  ( 2 min )
    CPPE-5: Medical Personal Protective Equipment Dataset. (arXiv:2112.09569v2 [cs.CV] UPDATED)
    We present a new challenging dataset, CPPE - 5 (Medical Personal Protective Equipment), with the goal to allow the study of subordinate categorization of medical personal protective equipments, which is not possible with other popular data sets that focus on broad-level categories (such as PASCAL VOC, ImageNet, Microsoft COCO, OpenImages, etc). To make it easy for models trained on this dataset to be used in practical scenarios in complex scenes, our dataset mainly contains images that show complex scenes with several objects in each scene in their natural context. The image collection for this dataset focuses on: obtaining as many non-iconic images as possible and making sure all the images are real-life images, unlike other existing datasets in this area. Our dataset includes 5 object categories (coveralls, face shields, gloves, masks, and goggles), and each image is annotated with a set of bounding boxes and positive labels. We present a detailed analysis of the dataset in comparison to other popular broad category datasets as well as datasets focusing on personal protective equipments, we also find that at present there exist no such publicly available datasets. Finally, we also analyze performance and compare model complexities on baseline and state-of-the-art models for bounding box results. Our code, data, and trained models are available at https://git.io/cppe5-dataset.  ( 2 min )
    Adversarial examples within the training distribution: A widespread challenge. (arXiv:2106.16198v2 [cs.CV] UPDATED)
    Despite a plethora of proposed theories, understanding why deep neural networks are susceptible to adversarial attacks remains an open question. A promising recent strand of research investigates adversarial attacks within the training data distribution, providing a more stringent and worrisome definition for these attacks. These theories posit that the key issue is that in high dimensional datasets, most data points are close to the ground-truth class boundaries. This has been shown in theory for some simple data distributions, but it is unclear if this theory is relevant in practice. Here, we demonstrate the existence of in-distribution adversarial examples for object recognition. This result provides evidence supporting theories attributing adversarial examples to the proximity of data to ground-truth class boundaries, and calls into question other theories which do not account for this more stringent definition of adversarial attacks. These experiments are enabled by our novel gradient-free, evolutionary strategies (ES) based approach for finding in-distribution adversarial examples in 3D rendered objects, which we call CMA-Search.  ( 2 min )
    Towards Radar Emitter Recognition in Changing Environments with Domain Generalization. (arXiv:2302.09359v1 [cs.LG])
    Analyzing radar signals from complex Electronic Warfare (EW) environment is a non-trivial task.However, in the real world, the changing EW environment results in inconsistent signal distribution, such as the pulse repetition interval (PRI) mismatch between different detected scenes.In this paper, we propose a novel domain generalization framework to improve the adaptability of signal recognition in changing environments.Specifically, we first design several noise generators to simulate varied scenes. Different from conventional augmentation methods, our introduced generators carefully enhance the diversity of the detected signals and meanwhile maintain the semantic features of the signals. Moreover, we propose a signal scene domain classifier that works in the manner of adversarial learning. The proposed classifier guarantees the signal predictor to generalize to different scenes. Extensive comparative experiments prove the proposed method's superiority.  ( 2 min )
    Improving Training Stability for Multitask Ranking Models in Recommender Systems. (arXiv:2302.09178v1 [cs.LG])
    Recommender systems play an important role in many content platforms. While most recommendation research is dedicated to designing better models to improve user experience, we found that research on stabilizing the training for such models is severely under-explored. As recommendation models become larger and more sophisticated, they are more susceptible to training instability issues, \emph{i.e.}, loss divergence, which can make the model unusable, waste significant resources and block model developments. In this paper, we share our findings and best practices we learned for improving the training stability of a real-world multitask ranking model for YouTube recommendations. We show some properties of the model that lead to unstable training and conjecture on the causes. Furthermore, based on our observations of training dynamics near the point of training instability, we hypothesize why existing solutions would fail, and propose a new algorithm to mitigate the limitations of existing solutions. Our experiments on YouTube production dataset show the proposed algorithm can significantly improve training stability while not compromising convergence, comparing with several commonly used baseline methods.  ( 2 min )
    Machine Learning for Cutting Planes in Integer Programming: A Survey. (arXiv:2302.09166v1 [math.OC])
    We survey recent work on machine learning (ML) techniques for selecting cutting planes (or cuts) in mixed-integer linear programming (MILP). Despite the availability of various classes of cuts, the task of choosing a set of cuts to add to the linear programming (LP) relaxation at a given node of the branch-and-bound (B&B) tree has defied both formal and heuristic solutions to date. ML offers a promising approach for improving the cut selection process by using data to identify promising cuts that accelerate the solution of MILP instances. This paper presents an overview of the topic, highlighting recent advances in the literature, common approaches to data collection, evaluation, and ML model architectures. We analyze the empirical results in the literature in an attempt to quantify the progress that has been made and conclude by suggesting avenues for future research.  ( 2 min )
    Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions. (arXiv:2209.03430v2 [cs.LG] UPDATED)
    Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design computer agents with intelligent capabilities such as understanding, reasoning, and learning through integrating multiple communicative modalities, including linguistic, acoustic, visual, tactile, and physiological messages. With the recent interest in video understanding, embodied autonomous agents, text-to-image generation, and multisensor fusion in application domains such as healthcare and robotics, multimodal machine learning has brought unique computational and theoretical challenges to the machine learning community given the heterogeneity of data sources and the interconnections often found between modalities. However, the breadth of progress in multimodal research has made it difficult to identify the common themes and open questions in the field. By synthesizing a broad range of application domains and theoretical frameworks from both historical and recent perspectives, this paper is designed to provide an overview of the computational and theoretical foundations of multimodal machine learning. We start by defining three key principles of modality heterogeneity, connections, and interactions that have driven subsequent innovations, and propose a taxonomy of six core technical challenges: representation, alignment, reasoning, generation, transference, and quantification covering historical and recent trends. Recent technical achievements will be presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences across new approaches. We end by motivating several open problems for future research as identified by our taxonomy.  ( 2 min )
    Understanding how the use of AI decision support tools affect critical thinking and over-reliance on technology by drug dispensers in Tanzania. (arXiv:2302.09487v1 [cs.HC])
    The use of AI in healthcare is designed to improve care delivery and augment the decisions of providers to enhance patient outcomes. When deployed in clinical settings, the interaction between providers and AI is a critical component for measuring and understanding the effectiveness of these digital tools on broader health outcomes. Even in cases where AI algorithms have high diagnostic accuracy, healthcare providers often still rely on their experience and sometimes gut feeling to make a final decision. Other times, providers rely unquestioningly on the outputs of the AI models, which leads to a concern about over-reliance on the technology. The purpose of this research was to understand how reliant drug shop dispensers were on AI-powered technologies when determining a differential diagnosis for a presented clinical case vignette. We explored how the drug dispensers responded to technology that is framed as always correct in an attempt to measure whether they begin to rely on it without any critical thought of their own. We found that dispensers relied on the decision made by the AI 25 percent of the time, even when the AI provided no explanation for its decision.  ( 2 min )
    To Switch or not to Switch: Predicting the Benefit of Switching between Algorithms based on Trajectory Features. (arXiv:2302.09075v1 [cs.AI])
    Dynamic algorithm selection aims to exploit the complementarity of multiple optimization algorithms by switching between them during the search. While these kinds of dynamic algorithms have been shown to have potential to outperform their component algorithms, it is still unclear how this potential can best be realized. One promising approach is to make use of landscape features to enable a per-run trajectory-based switch. Here, the samples seen by the first algorithm are used to create a set of features which describe the landscape from the perspective of the algorithm. These features are then used to predict what algorithm to switch to. In this work, we extend this per-run trajectory-based approach to consider a wide variety of potential points at which to perform the switch. We show that using a sliding window to capture the local landscape features contains information which can be used to predict whether a switch at that point would be beneficial to future performance. By analyzing the resulting models, we identify what features are most important to these predictions. Finally, by evaluating the importance of features and comparing these values between multiple algorithms, we show clear differences in the way the second algorithm interacts with the local landscape features found before the switch.  ( 2 min )
    Approximate Thompson Sampling via Epistemic Neural Networks. (arXiv:2302.09205v1 [cs.LG])
    Thompson sampling (TS) is a popular heuristic for action selection, but it requires sampling from a posterior distribution. Unfortunately, this can become computationally intractable in complex environments, such as those modeled using neural networks. Approximate posterior samples can produce effective actions, but only if they reasonably approximate joint predictive distributions of outputs across inputs. Notably, accuracy of marginal predictive distributions does not suffice. Epistemic neural networks (ENNs) are designed to produce accurate joint predictive distributions. We compare a range of ENNs through computational experiments that assess their performance in approximating TS across bandit and reinforcement learning environments. The results indicate that ENNs serve this purpose well and illustrate how the quality of joint predictive distributions drives performance. Further, we demonstrate that the \textit{epinet} -- a small additive network that estimates uncertainty -- matches the performance of large ensembles at orders of magnitude lower computational cost. This enables effective application of TS with computation that scales gracefully to complex environments.  ( 2 min )
    Structural Neural Additive Models: Enhanced Interpretable Machine Learning. (arXiv:2302.09275v1 [cs.LG])
    Deep neural networks (DNNs) have shown exceptional performances in a wide range of tasks and have become the go-to method for problems requiring high-level predictive power. There has been extensive research on how DNNs arrive at their decisions, however, the inherently uninterpretable networks remain up to this day mostly unobservable "black boxes". In recent years, the field has seen a push towards interpretable neural networks, such as the visually interpretable Neural Additive Models (NAMs). We propose a further step into the direction of intelligibility beyond the mere visualization of feature effects and propose Structural Neural Additive Models (SNAMs). A modeling framework that combines classical and clearly interpretable statistical methods with the predictive power of neural applications. Our experiments validate the predictive performances of SNAMs. The proposed framework performs comparable to state-of-the-art fully connected DNNs and we show that SNAMs can even outperform NAMs while remaining inherently more interpretable.  ( 2 min )
    Deep learning for inverse problems with unknown operator. (arXiv:2108.02744v2 [stat.ML] UPDATED)
    We consider ill-posed inverse problems where the forward operator $T$ is unknown, and instead we have access to training data consisting of functions $f_i$ and their noisy images $Tf_i$. This is a practically relevant and challenging problem which current methods are able to solve only under strong assumptions on the training set. Here we propose a new method that requires minimal assumptions on the data, and prove reconstruction rates that depend on the number of training points and the noise level. We show that, in the regime of "many" training data, the method is minimax optimal. The proposed method employs a type of convolutional neural networks (U-nets) and empirical risk minimization in order to "fit" the unknown operator. In a nutshell, our approach is based on two ideas: the first is to relate U-nets to multiscale decompositions such as wavelets, thereby linking them to the existing theory, and the second is to use the hierarchical structure of U-nets and the low number of parameters of convolutional neural nets to prove entropy bounds that are practically useful. A significant difference with the existing works on neural networks in nonparametric statistics is that we use them to approximate operators and not functions, which we argue is mathematically more natural and technically more convenient.  ( 2 min )
    Smoothly Giving up: Robustness for Simple Models. (arXiv:2302.09114v1 [cs.LG])
    There is a growing need for models that are interpretable and have reduced energy and computational cost (e.g., in health care analytics and federated learning). Examples of algorithms to train such models include logistic regression and boosting. However, one challenge facing these algorithms is that they provably suffer from label noise; this has been attributed to the joint interaction between oft-used convex loss functions and simpler hypothesis classes, resulting in too much emphasis being placed on outliers. In this work, we use the margin-based $\alpha$-loss, which continuously tunes between canonical convex and quasi-convex losses, to robustly train simple models. We show that the $\alpha$ hyperparameter smoothly introduces non-convexity and offers the benefit of "giving up" on noisy training examples. We also provide results on the Long-Servedio dataset for boosting and a COVID-19 survey dataset for logistic regression, highlighting the efficacy of our approach across multiple relevant domains.  ( 2 min )
    Benchmark for Models Predicting Human Behavior in Gap Acceptance Scenarios. (arXiv:2211.05455v2 [cs.RO] UPDATED)
    Autonomous vehicles currently suffer from a time-inefficient driving style caused by uncertainty about human behavior in traffic interactions. Accurate and reliable prediction models enabling more efficient trajectory planning could make autonomous vehicles more assertive in such interactions. However, the evaluation of such models is commonly oversimplistic, ignoring the asymmetric importance of prediction errors and the heterogeneity of the datasets used for testing. We examine the potential of recasting interactions between vehicles as gap acceptance scenarios and evaluating models in this structured environment. To that end, we develop a framework aiming to facilitate the evaluation of any model, by any metric, and in any scenario. We then apply this framework to state-of-the-art prediction models, which all show themselves to be unreliable in the most safety-critical situations.  ( 2 min )
    Minimax risk classifiers with 0-1 loss. (arXiv:2201.06487v5 [stat.ML] UPDATED)
    Supervised classification techniques use training samples to learn a classification rule with small expected 0-1 loss (error probability). Conventional methods enable tractable learning and provide out-of-sample generalization by using surrogate losses instead of the 0-1 loss and considering specific families of rules (hypothesis classes). This paper presents minimax risk classifiers (MRCs) that minize the worst-case 0-1 loss with respect to uncertainty sets of distributions that can include the underlying distribution, with a tunable confidence. We show that MRCs can provide tight performance guarantees at learning and are strongly universally consistent using feature mappings given by characteristic kernels. The paper also proposes efficient optimization techniques for MRC learning and shows that the methods presented can provide accurate classification together with tight performance guarantees in practice.  ( 2 min )
    Unsupervised Diffusion and Volume Maximization-Based Clustering of Hyperspectral Images. (arXiv:2203.09992v3 [cs.CV] UPDATED)
    Hyperspectral images taken from aircraft or satellites contain information from hundreds of spectral bands, within which lie latent lower-dimensional structures that can be exploited for classifying vegetation and other materials. A disadvantage of working with hyperspectral images is that, due to an inherent trade-off between spectral and spatial resolution, they have a relatively coarse spatial scale, meaning that single pixels may correspond to spatial regions containing multiple materials. This article introduces the Diffusion and Volume maximization-based Image Clustering (D-VIC) algorithm for unsupervised material clustering to address this problem. By directly incorporating pixel purity into its labeling procedure, D-VIC gives greater weight to pixels that correspond to a spatial region containing just a single material. D-VIC is shown to outperform comparable state-of-the-art methods in extensive experiments on a range of hyperspectral images, including land-use maps and highly mixed forest health surveys (in the context of ash dieback disease), implying that it is well-equipped for unsupervised material clustering of spectrally-mixed hyperspectral datasets.  ( 2 min )
    Euler State Networks: Non-dissipative Reservoir Computing. (arXiv:2203.09382v2 [cs.LG] UPDATED)
    Inspired by the numerical solution of ordinary differential equations, in this paper we propose a novel Reservoir Computing (RC) model, called the Euler State Network (EuSN). The introduced approach makes use of forward Euler discretization and antisymmetric recurrent matrices to design reservoir dynamics that are both stable and non-dissipative by construction. Our mathematical analysis shows that the resulting model is biased towards unitary effective spectral radius and zero local Lyapunov exponents, intrinsically operating at the edge of stability. Experiments on synthetic tasks indicate the marked superiority of the proposed approach, compared to standard RC models, in tasks requiring long-term memorization skills. Furthermore, results on real-world time series classification benchmarks point out that EuSN is capable of matching (or even surpassing) the level of accuracy of trainable Recurrent Neural Networks, while allowing up to 100-fold savings in computation time and energy consumption.  ( 2 min )
    HOPE: Human-Centric Off-Policy Evaluation for E-Learning and Healthcare. (arXiv:2302.09212v1 [cs.LG])
    Reinforcement learning (RL) has been extensively researched for enhancing human-environment interactions in various human-centric tasks, including e-learning and healthcare. Since deploying and evaluating policies online are high-stakes in such tasks, off-policy evaluation (OPE) is crucial for inducing effective policies. In human-centric environments, however, OPE is challenging because the underlying state is often unobservable, while only aggregate rewards can be observed (students' test scores or whether a patient is released from the hospital eventually). In this work, we propose a human-centric OPE (HOPE) to handle partial observability and aggregated rewards in such environments. Specifically, we reconstruct immediate rewards from the aggregated rewards considering partial observability to estimate expected total returns. We provide a theoretical bound for the proposed method, and we have conducted extensive experiments in real-world human-centric tasks, including sepsis treatments and an intelligent tutoring system. Our approach reliably predicts the returns of different policies and outperforms state-of-the-art benchmarks using both standard validation methods and human-centric significance tests.  ( 2 min )
    Pseudo Contrastive Learning for Graph-based Semi-supervised Learning. (arXiv:2302.09532v1 [cs.LG])
    Pseudo Labeling is a technique used to improve the performance of semi-supervised Graph Neural Networks (GNNs) by generating additional pseudo-labels based on confident predictions. However, the quality of generated pseudo-labels has long been a concern due to the sensitivity of the classification objective to given labels. To avoid the untrustworthy classification supervision indicating ``a node belongs to a specific class,'' we favor the fault-tolerant contrasting supervision demonstrating ``two nodes do not belong to the same class.'' Thus, the problem of generating high-quality pseudo-labels is then transformed into a relaxed version, i.e., finding reliable contrasting pairs. To achieve this, we propose a general framework for GNNs, termed Pseudo Contrastive Learning (PCL). It separates two nodes whose positive and negative pseudo-labels target the same class. To incorporate topological knowledge into learning, we devise a topologically weighted contrastive loss that spends more effort separating negative pairs with smaller topological distances. Additionally, to alleviate the heavy reliance on data augmentation, we augment nodes only by applying dropout to the encoded representations. Theoretically, we prove that PCL with the lightweight augmentation works like a representation regularizer to effectively learn separation between negative pairs. Experimentally, we employ PCL on various models, which consistently outperform their counterparts using other popular general techniques on five real-world graphs.
    Online Continuous Hyperparameter Optimization for Contextual Bandits. (arXiv:2302.09440v1 [cs.LG])
    In stochastic contextual bandit problems, an agent sequentially makes actions from a time-dependent action set based on past experience to minimize the cumulative regret. Like many other machine learning algorithms, the performance of bandits heavily depends on their multiple hyperparameters, and theoretically derived parameter values may lead to unsatisfactory results in practice. Moreover, it is infeasible to use offline tuning methods like cross validation to choose hyperparameters under the bandit environment, as the decisions should be made in real time. To address this challenge, we propose the first online continuous hyperparameter tuning framework for contextual bandits to learn the optimal parameter configuration within a search space on the fly. Specifically, we use a double-layer bandit framework named CDT (Continuous Dynamic Tuning) and formulate the hyperparameter optimization as a non-stationary continuum-armed bandit, where each arm represents a combination of hyperparameters, and the corresponding reward is the algorithmic result. For the top layer, we propose the Zooming TS algorithm that utilizes Thompson Sampling (TS) for exploration and a restart technique to get around the switching environment. The proposed CDT framework can be easily used to tune contextual bandit algorithms without any pre-specified candidate set for hyperparameters. We further show that it could achieve sublinear regret in theory and performs consistently better on both synthetic and real datasets in practice.  ( 2 min )
    MARS: Meta-Learning as Score Matching in the Function Space. (arXiv:2210.13319v2 [cs.LG] UPDATED)
    Meta-learning aims to extract useful inductive biases from a set of related datasets. In Bayesian meta-learning, this is typically achieved by constructing a prior distribution over neural network parameters. However, specifying families of computationally viable prior distributions over the high-dimensional neural network parameters is difficult. As a result, existing approaches resort to meta-learning restrictive diagonal Gaussian priors, severely limiting their expressiveness and performance. To circumvent these issues, we approach meta-learning through the lens of functional Bayesian neural network inference, which views the prior as a stochastic process and performs inference in the function space. Specifically, we view the meta-training tasks as samples from the data-generating process and formalize meta-learning as empirically estimating the law of this stochastic process. Our approach can seamlessly acquire and represent complex prior knowledge by meta-learning the score function of the data-generating process marginals instead of parameter space priors. In a comprehensive benchmark, we demonstrate that our method achieves state-of-the-art performance in terms of predictive accuracy and substantial improvements in the quality of uncertainty estimates.
    Natural Language-conditioned Reinforcement Learning with Inside-out Task Language Development and Translation. (arXiv:2302.09368v1 [cs.CL])
    Natural Language-conditioned reinforcement learning (RL) enables the agents to follow human instructions. Previous approaches generally implemented language-conditioned RL by providing human instructions in natural language (NL) and training a following policy. In this outside-in approach, the policy needs to comprehend the NL and manage the task simultaneously. However, the unbounded NL examples often bring much extra complexity for solving concrete RL tasks, which can distract policy learning from completing the task. To ease the learning burden of the policy, we investigate an inside-out scheme for natural language-conditioned RL by developing a task language (TL) that is task-related and unique. The TL is used in RL to achieve highly efficient and effective policy training. Besides, a translator is trained to translate NL into TL. We implement this scheme as TALAR (TAsk Language with predicAte Representation) that learns multiple predicates to model object relationships as the TL. Experiments indicate that TALAR not only better comprehends NL instructions but also leads to a better instruction-following policy that improves 13.4% success rate and adapts to unseen expressions of NL instruction. The TL can also be an effective task abstraction, naturally compatible with hierarchical RL.  ( 2 min )
    Vulnerability analysis of captcha using Deep learning. (arXiv:2302.09389v1 [cs.CR])
    Several websites improve their security and avoid dangerous Internet attacks by implementing CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart), a type of verification to identify whether the end-user is human or a robot. The most prevalent type of CAPTCHA is text-based, designed to be easily recognized by humans while being unsolvable towards machines or robots. However, as deep learning technology progresses, development of convolutional neural network (CNN) models that predict text-based CAPTCHAs becomes easier. The purpose of this research is to investigate the flaws and vulnerabilities in the CAPTCHA generating systems in order to design more resilient CAPTCHAs. To achieve this, we created CapNet, a Convolutional Neural Network. The proposed platform can evaluate both numerical and alphanumerical CAPTCHAs  ( 2 min )
    Data-Efficient Contrastive Self-supervised Learning: Easy Examples Contribute the Most. (arXiv:2302.09195v1 [cs.LG])
    Self-supervised learning (SSL) learns high-quality representations from large pools of unlabeled training data. As datasets grow larger, it becomes crucial to identify the examples that contribute the most to learning such representations. This enables efficient SSL by reducing the volume of data required for learning high-quality representations. Nevertheless, quantifying the value of examples for SSL has remained an open question. In this work, we address this for the first time, by proving that examples that contribute the most to contrastive SSL are those that have the most similar augmentations to other examples, in expectation. We provide rigorous guarantees for the generalization performance of SSL on such subsets. Empirically, we discover, perhaps surprisingly, the subsets that contribute the most to SSL are those that contribute the least to supervised learning. Through extensive experiments, we show that our subsets outperform random subsets by more than 3% on CIFAR100, CIFAR10, and STL10. Interestingly, we also find that we can safely exclude 20% of examples from CIFAR100 and 40% from STL10, without affecting downstream task performance.  ( 2 min )
    Speaker and Language Change Detection using Wav2vec2 and Whisper. (arXiv:2302.09381v1 [eess.AS])
    We investigate recent transformer networks pre-trained for automatic speech recognition for their ability to detect speaker and language changes in speech. We do this by simply adding speaker (change) or language targets to the labels. For Wav2vec2 pre-trained networks, we also investigate if the representation for the speaker change symbol can be conditioned to capture speaker identity characteristics. Using a number of constructed data sets we show that these capabilities are definitely there, with speaker recognition equal error rates of the order of 10% and language detection error rates of a few percent. We will publish the code for reproducibility.  ( 2 min )
    Neural Systematic Binder. (arXiv:2211.01177v3 [cs.CV] UPDATED)
    The key to high-level cognition is believed to be the ability to systematically manipulate and compose knowledge pieces. While token-like structured knowledge representations are naturally provided in text, it is elusive how to obtain them for unstructured modalities such as scene images. In this paper, we propose a neural mechanism called Neural Systematic Binder or SysBinder for constructing a novel structured representation called Block-Slot Representation. In Block-Slot Representation, object-centric representations known as slots are constructed by composing a set of independent factor representations called blocks, to facilitate systematic generalization. SysBinder obtains this structure in an unsupervised way by alternatingly applying two different binding principles: spatial binding for spatial modularity across the full scene and factor binding for factor modularity within an object. SysBinder is a simple, deterministic, and general-purpose layer that can be applied as a drop-in module in any arbitrary neural network and on any modality. In experiments, we find that SysBinder provides significantly better factor disentanglement within the slots than the conventional object-centric methods, including, for the first time, in visually complex scene images such as CLEVR-Tex. Furthermore, we demonstrate factor-level systematicity in controlled scene generation by decoding unseen factor combinations.
    Adversarial random forests for density estimation and generative modeling. (arXiv:2205.09435v3 [stat.ML] UPDATED)
    We propose methods for density estimation and data synthesis using a novel form of unsupervised random forests. Inspired by generative adversarial networks, we implement a recursive procedure in which trees gradually learn structural properties of the data through alternating rounds of generation and discrimination. The method is provably consistent under minimal assumptions. Unlike classic tree-based alternatives, our approach provides smooth (un)conditional densities and allows for fully synthetic data generation. We achieve comparable or superior performance to state-of-the-art probabilistic circuits and deep learning models on various tabular data benchmarks while executing about two orders of magnitude faster on average. An accompanying $\texttt{R}$ package, $\texttt{arf}$, is available on $\texttt{CRAN}$.
    Hardness of Agnostically Learning Halfspaces from Worst-Case Lattice Problems. (arXiv:2207.14030v2 [cs.LG] UPDATED)
    We show hardness of improperly learning halfspaces in the agnostic model, both in the distribution-independent as well as the distribution-specific setting, based on the assumption that worst-case lattice problems, such as GapSVP or SIVP, are hard. In particular, we show that under this assumption there is no efficient algorithm that outputs any binary hypothesis, not necessarily a halfspace, achieving misclassfication error better than $\frac 1 2 - \gamma$ even if the optimal misclassification error is as small is as small as $\delta$. Here, $\gamma$ can be smaller than the inverse of any polynomial in the dimension and $\delta$ as small as $exp(-\Omega(\log^{1-c}(d)))$, where $0 0$ learning halfspaces up to error $OPT_{LTF} + \epsilon$ takes time at least $d^{\tilde{\Omega}(1/\epsilon^{2-\beta})}$ under the same hardness assumptions. Similarly, we show that learning degree-$\ell$ polynomial threshold functions up to error $OPT_{{PTF}_\ell} + \epsilon$ takes time at least $d^{\tilde{\Omega}(\ell^{2-\beta}/\epsilon^{2-\beta})}$. $OPT_{LTF}$ and $OPT_{{PTF}_\ell}$ denote the best error achievable by any halfspace or polynomial threshold function, respectively. Our lower bounds qualitively match algorithmic guarantees and (nearly) recover known lower bounds based on non-worst-case assumptions. Previously, such hardness results [Daniely16, DKPZ21] were based on average-case complexity assumptions or restricted to the statistical query model. Our work gives the first hardness results basing these fundamental learning problems on worst-case complexity assumptions. It is inspired by a sequence of recent works showing hardness of learning well-separated Gaussian mixtures based on worst-case lattice problems.
    Riemannian Langevin Algorithm for Solving Semidefinite Programs. (arXiv:2010.11176v5 [stat.ML] UPDATED)
    We propose a Langevin diffusion-based algorithm for non-convex optimization and sampling on a product manifold of spheres. Under a logarithmic Sobolev inequality, we establish a guarantee for finite iteration convergence to the Gibbs distribution in terms of Kullback--Leibler divergence. We show that with an appropriate temperature choice, the suboptimality gap to the global minimum is guaranteed to be arbitrarily small with high probability. As an application, we consider the Burer--Monteiro approach for solving a semidefinite program (SDP) with diagonal constraints, and analyze the proposed Langevin algorithm for optimizing the non-convex objective. In particular, we establish a logarithmic Sobolev inequality for the Burer--Monteiro problem when there are no spurious local minima, but under the presence saddle points. Combining the results, we then provide a global optimality guarantee for the SDP and the Max-Cut problem. More precisely, we show that the Langevin algorithm achieves $\epsilon$ accuracy with high probability in $\widetilde{\Omega}( \epsilon^{-5} )$ iterations.
    Reinforcement Learning in the Wild with Maximum Likelihood-based Model Transfer. (arXiv:2302.09273v1 [cs.LG])
    In this paper, we study the problem of transferring the available Markov Decision Process (MDP) models to learn and plan efficiently in an unknown but similar MDP. We refer to it as \textit{Model Transfer Reinforcement Learning (MTRL)} problem. First, we formulate MTRL for discrete MDPs and Linear Quadratic Regulators (LQRs) with continuous state actions. Then, we propose a generic two-stage algorithm, MLEMTRL, to address the MTRL problem in discrete and continuous settings. In the first stage, MLEMTRL uses a \textit{constrained Maximum Likelihood Estimation (MLE)}-based approach to estimate the target MDP model using a set of known MDP models. In the second stage, using the estimated target MDP model, MLEMTRL deploys a model-based planning algorithm appropriate for the MDP class. Theoretically, we prove worst-case regret bounds for MLEMTRL both in realisable and non-realisable settings. We empirically demonstrate that MLEMTRL allows faster learning in new MDPs than learning from scratch and achieves near-optimal performance depending on the similarity of the available MDPs and the target MDP.
    The Mori-Zwanzig formulation of deep learning. (arXiv:2209.05544v3 [cs.LG] UPDATED)
    We develop a new formulation of deep learning based on the Mori-Zwanzig (MZ) formalism of irreversible statistical mechanics. The new formulation is built upon the well-known duality between deep neural networks and discrete dynamical systems, and it allows us to directly propagate quantities of interest (conditional expectations and probability density functions) forward and backward through the network by means of exact linear operator equations. Such new equations can be used as a starting point to develop new effective parameterizations of deep neural networks, and provide a new framework to study deep-learning via operator theoretic methods. The proposed MZ formulation of deep learning naturally introduces a new concept, i.e., the memory of the neural network, which plays a fundamental role in low-dimensional modeling and parameterization. By using the theory of contraction mappings, we develop sufficient conditions for the memory of the neural network to decay with the number of layers. This allows us to rigorously transform deep networks into shallow ones, e.g., by reducing the number of neurons per layer (using projection operators), or by reducing the total number of layers (using the decay property of the memory operator).
    Reproducing Random Forest Efficacy in Detecting Port Scanning. (arXiv:2302.09317v1 [cs.CR])
    Port scanning is the process of attempting to connect to various network ports on a computing endpoint to determine which ports are open and which services are running on them. It is a common method used by hackers to identify vulnerabilities in a network or system. By determining which ports are open, an attacker can identify which services and applications are running on a device and potentially exploit any known vulnerabilities in those services. Consequently, it is important to detect port scanning because it is often the first step in a cyber attack. By identifying port scanning attempts, cybersecurity professionals can take proactive measures to protect the systems and networks before an attacker has a chance to exploit any vulnerabilities. Against this background, researchers have worked for over a decade to develop robust methods to detect port scanning. One such method revealed by a recent systematic review is the random forest supervised machine learning algorithm. The review revealed six existing studies using random forest since 2021. Unfortunately, those studies each exhibit different results, do not all use the same training and testing dataset, and only two include source code. Accordingly, the goal of this work was to reproduce the six random forest studies while addressing the apparent shortcomings. The outcomes are significant for researchers looking to explore random forest to detect port scanning and for practitioners interested in reliable technology to detect the early stages of cyber attack.
    Scaling Dimension. (arXiv:2302.09101v1 [cs.LG])
    Conceptual Scaling is a useful standard tool in Formal Concept Analysis and beyond. Its mathematical theory, as elaborated in the last chapter of the FCA monograph, still has room for improvement. As it stands, even some of the basic definitions are in flux. Our contribution was triggered by the study of concept lattices for tree classifiers and the scaling methods used there. We extend some basic notions, give precise mathematical definitions for them and introduce the concept of scaling dimension. In addition to a detailed discussion of its properties, including an example, we show theoretical bounds related to the order dimension of concept lattices. We also study special subclasses, such as the ordinal and the interordinal scaling dimensions, and show for them first results and examples.
    Visual Analysis of Discrimination in Machine Learning. (arXiv:2007.15182v2 [cs.HC] UPDATED)
    The growing use of automated decision-making in critical applications, such as crime prediction and college admission, has raised questions about fairness in machine learning. How can we decide whether different treatments are reasonable or discriminatory? In this paper, we investigate discrimination in machine learning from a visual analytics perspective and propose an interactive visualization tool, DiscriLens, to support a more comprehensive analysis. To reveal detailed information on algorithmic discrimination, DiscriLens identifies a collection of potentially discriminatory itemsets based on causal modeling and classification rules mining. By combining an extended Euler diagram with a matrix-based visualization, we develop a novel set visualization to facilitate the exploration and interpretation of discriminatory itemsets. A user study shows that users can interpret the visually encoded information in DiscriLens quickly and accurately. Use cases demonstrate that DiscriLens provides informative guidance in understanding and reducing algorithmic discrimination.
    A Proximal Algorithm for Sampling from Non-convex Potentials. (arXiv:2205.10188v2 [cs.LG] UPDATED)
    We study sampling problems associated with non-convex potentials that meanwhile lack smoothness. In particular, we consider target distributions that satisfy either logarithmic-Sobolev inequality or Poincar\'e inequality. Rather than smooth, the potentials are assumed to be semi-smooth or the summation of multiple semi-smooth functions. We develop a sampling algorithm that resembles proximal algorithms in optimization for this challenging sampling task. Our algorithm is based on a special case of Gibbs sampling known as the alternating sampling framework (ASF). The key contribution of this work is a practical realization of the ASF based on rejection sampling in the non-convex and semi-smooth setting. This work extends the recent algorithm in \cite{LiaChe21,LiaChe22} for non-smooth/semi-smooth log-concave distribution to the setting with non-convex potentials. In almost all the cases of sampling considered in this work, our proximal sampling algorithm achieves better complexity than all existing methods.
    Imitating Past Successes can be Very Suboptimal. (arXiv:2206.03378v2 [cs.LG] UPDATED)
    Prior work has proposed a simple strategy for reinforcement learning (RL): label experience with the outcomes achieved in that experience, and then imitate the relabeled experience. These outcome-conditioned imitation learning methods are appealing because of their simplicity, strong performance, and close ties with supervised learning. However, it remains unclear how these methods relate to the standard RL objective, reward maximization. In this paper, we formally relate outcome-conditioned imitation learning to reward maximization, drawing a precise relationship between the learned policy and Q-values and explaining the close connections between these methods and prior EM-based policy search methods. This analysis shows that existing outcome-conditioned imitation learning methods do not necessarily improve the policy, but a simple modification results in a method that does guarantee policy improvement, under some assumptions.
    An Optimization-based Algorithm for Non-stationary Kernel Bandits without Prior Knowledge. (arXiv:2205.14775v3 [stat.ML] UPDATED)
    We propose an algorithm for non-stationary kernel bandits that does not require prior knowledge of the degree of non-stationarity. The algorithm follows randomized strategies obtained by solving optimization problems that balance exploration and exploitation. It adapts to non-stationarity by restarting when a change in the reward function is detected. Our algorithm enjoys a tighter dynamic regret bound than previous work on the non-stationary kernel bandit setting. Moreover, when applied to the non-stationary linear bandit setting by using a linear kernel, our algorithm is nearly minimax optimal, solving an open problem in the non-stationary linear bandit literature. We extend our algorithm to use a neural network for dynamically adapting the feature mapping to observed data. We prove a dynamic regret bound of the extension using the neural tangent kernel theory. We demonstrate empirically that our algorithm and the extension can adapt to varying degrees of non-stationarity.
    Teachable Reinforcement Learning via Advice Distillation. (arXiv:2203.11197v2 [cs.LG] UPDATED)
    Training automated agents to complete complex tasks in interactive environments is challenging: reinforcement learning requires careful hand-engineering of reward functions, imitation learning requires specialized infrastructure and access to a human expert, and learning from intermediate forms of supervision (like binary preferences) is time-consuming and extracts little information from each human intervention. Can we overcome these challenges by building agents that learn from rich, interactive feedback instead? We propose a new supervision paradigm for interactive learning based on "teachable" decision-making systems that learn from structured advice provided by an external teacher. We begin by formalizing a class of human-in-the-loop decision making problems in which multiple forms of teacher-provided advice are available to a learner. We then describe a simple learning algorithm for these problems that first learns to interpret advice, then learns from advice to complete tasks even in the absence of human supervision. In puzzle-solving, navigation, and locomotion domains, we show that agents that learn from advice can acquire new skills with significantly less human supervision than standard reinforcement learning algorithms and often less than imitation learning.
    A Federated Approach for Hate Speech Detection. (arXiv:2302.09243v1 [cs.LG])
    Hate speech detection has been the subject of high research attention, due to the scale of content created on social media. In spite of the attention and the sensitive nature of the task, privacy preservation in hate speech detection has remained under-studied. The majority of research has focused on centralised machine learning infrastructures which risk leaking data. In this paper, we show that using federated machine learning can help address privacy the concerns that are inherent to hate speech detection while obtaining up to 6.81% improvement in terms of F1-score.
    Sample-Efficient Safety Assurances using Conformal Prediction. (arXiv:2109.14082v4 [cs.RO] UPDATED)
    When deploying machine learning models in high-stakes robotics applications, the ability to detect unsafe situations is crucial. Early warning systems can provide alerts when an unsafe situation is imminent (in the absence of corrective action). To reliably improve safety, these warning systems should have a provable false negative rate; i.e. of the situations that are unsafe, fewer than $\epsilon$ will occur without an alert. In this work, we present a framework that combines a statistical inference technique known as conformal prediction with a simulator of robot/environment dynamics, in order to tune warning systems to provably achieve an $\epsilon$ false negative rate using as few as $1/\epsilon$ data points. We apply our framework to a driver warning system and a robotic grasping application, and empirically demonstrate guaranteed false negative rate while also observing low false detection (positive) rate.
    Parameter Averaging for SGD Stabilizes the Implicit Bias towards Flat Regions. (arXiv:2302.09376v1 [stat.ML])
    Stochastic gradient descent is a workhorse for training deep neural networks due to its excellent generalization performance. Several studies demonstrated this success is attributed to the implicit bias of the method that prefers a flat minimum and developed new methods based on this perspective. Recently, Izmailov et al. (2018) empirically observed that an averaged stochastic gradient descent with a large step size can bring out the implicit bias more effectively and can converge more stably to a flat minimum than the vanilla stochastic gradient descent. In our work, we theoretically justify this observation by showing that the averaging scheme improves the bias-optimization tradeoff coming from the stochastic gradient noise: a large step size amplifies the bias but makes convergence unstable, and vice versa. Specifically, we show that the averaged stochastic gradient descent can get closer to a solution of a penalized objective on the sharpness than the vanilla stochastic gradient descent using the same step size under certain conditions. In experiments, we verify our theory and show this learning scheme significantly improves performance.
    Estimating Treatment Effects from Irregular Time Series Observations with Hidden Confounders. (arXiv:2302.09446v1 [cs.LG])
    Estimating treatment effects plays a crucial role in causal inference, having many real-world applications like policy analysis and decision making. Nevertheless, estimating treatment effects in the longitudinal setting in the presence of hidden confounders remains an extremely challenging problem. Recently, there is a growing body of work attempting to obtain unbiased ITE estimates from time-dynamic observational data by ignoring the possible existence of hidden confounders. Additionally, many existing works handling hidden confounders are not applicable for continuous-time settings. In this paper, we extend the line of work focusing on deconfounding in the dynamic time setting in the presence of hidden confounders. We leverage recent advancements in neural differential equations to build a latent factor model using a stochastic controlled differential equation and Lipschitz constrained convolutional operation in order to continuously incorporate information about ongoing interventions and irregularly sampled observations. Experiments on both synthetic and real-world datasets highlight the promise of continuous time methods for estimating treatment effects in the presence of hidden confounders.
    A Novel Framework for Policy Mirror Descent with General Parametrization and Linear Convergence. (arXiv:2301.13139v2 [stat.ML] UPDATED)
    Modern policy optimization methods in applied reinforcement learning, such as Trust Region Policy Optimization and Policy Mirror Descent, are often based on the policy gradient framework. While theoretical guarantees have been established for this class of algorithms, particularly in the tabular setting, the use of a general parametrization scheme remains mostly unjustified. In this work, we introduce a novel framework for policy optimization based on mirror descent that naturally accommodates general parametrizations. The policy class induced by our scheme recovers known classes, e.g. softmax, and it generates new ones, depending on the choice of the mirror map. For a general mirror map and parametrization class, we establish the quasi-monotonicity of the updates in value function, global linear convergence rates, and we bound the total expected Bregman divergence of the algorithm along its path. To showcase the ability of our framework to accommodate general parametrization schemes, we present a case study involving shallow neural networks.
    Gradual Domain Adaptation via Normalizing Flows. (arXiv:2206.11492v2 [stat.ML] UPDATED)
    Standard domain adaptation methods do not work well when a large gap exists between the source and target domains. Gradual domain adaptation is one of the approaches used to address the problem. It involves leveraging the intermediate domain, which gradually shifts from the source domain to the target domain. The previous work assumed that the number of intermediate domains is large and the distance between adjacent domains is small; hence, the gradual domain adaptation algorithm, involving self-training with unlabeled datasets, was applicable. In practice, however, gradual self-training will fail because the number of intermediate domains is limited and the distance between adjacent domains is large. We propose the use of normalizing flows to deal with this problem while maintaining the framework of unsupervised domain adaptation. We generate pseudo intermediate domains from normalizing flows and then use them for gradual domain adaptation. We evaluate our proposed method by experiments with real-world datasets and confirm that it mitigates the above-explained problem and improves the classification performance.
    Faster Adaptive Federated Learning. (arXiv:2212.00974v2 [cs.LG] UPDATED)
    Federated learning has attracted increasing attention with the emergence of distributed data. While extensive federated learning algorithms have been proposed for the non-convex distributed problem, federated learning in practice still faces numerous challenges, such as the large training iterations to converge since the sizes of models and datasets keep increasing, and the lack of adaptivity by SGD-based model updates. Meanwhile, the study of adaptive methods in federated learning is scarce and existing works either lack a complete theoretical convergence guarantee or have slow sample complexity. In this paper, we propose an efficient adaptive algorithm (i.e., FAFED) based on the momentum-based variance-reduced technique in cross-silo FL. We first explore how to design the adaptive algorithm in the FL setting. By providing a counter-example, we prove that a simple combination of FL and adaptive methods could lead to divergence. More importantly, we provide a convergence analysis for our method and prove that our algorithm is the first adaptive FL algorithm to reach the best-known samples $O(\epsilon^{-3})$ and $O(\epsilon^{-2})$ communication rounds to find an $\epsilon$-stationary point without large batches. The experimental results on the language modeling task and image classification task with heterogeneous data demonstrate the efficiency of our algorithms.
    Towards Adversarial Realism and Robust Learning for IoT Intrusion Detection and Classification. (arXiv:2301.13122v2 [cs.CR] UPDATED)
    The Internet of Things (IoT) faces tremendous security challenges. Machine learning models can be used to tackle the growing number of cyber-attack variations targeting IoT systems, but the increasing threat posed by adversarial attacks restates the need for reliable defense strategies. This work describes the types of constraints required for a realistic adversarial cyber-attack example and proposes a methodology for a trustworthy adversarial robustness analysis with a realistic adversarial evasion attack vector. The proposed methodology was used to evaluate three supervised algorithms, Random Forest (RF), Extreme Gradient Boosting (XGB), and Light Gradient Boosting Machine (LGBM), and one unsupervised algorithm, Isolation Forest (IFOR). Constrained adversarial examples were generated with the Adaptative Perturbation Pattern Method (A2PM), and evasion attacks were performed against models created with regular and adversarial training. Even though RF was the least affected in binary classification, XGB consistently achieved the highest accuracy in multi-class classification. The obtained results evidence the inherent susceptibility of tree-based algorithms and ensembles to adversarial evasion attacks and demonstrates the benefits of adversarial training and a security by design approach for a more robust IoT network intrusion detection and cyber-attack classification.
    Reflective-Net: Learning from Explanations. (arXiv:2011.13986v2 [cs.LG] UPDATED)
    Humans possess a remarkable capability to make fast, intuitive decisions, but also to self-reflect, i.e., to explain to oneself, and to efficiently learn from explanations by others. This work provides the first steps toward mimicking this process by capitalizing on the explanations generated based on existing explanation methods, i.e. Grad-CAM. Learning from explanations combined with conventional labeled data yields significant improvements for classification in terms of accuracy and training time.
    Falsification of Learning-Based Controllers through Multi-Fidelity Bayesian Optimization. (arXiv:2212.14118v3 [eess.SY] UPDATED)
    Simulation-based falsification is a practical testing method to increase confidence that the system will meet safety requirements. Because full-fidelity simulations can be computationally demanding, we investigate the use of simulators with different levels of fidelity. As a first step, we express the overall safety specification in terms of environmental parameters and structure this safety specification as an optimization problem. We propose a multi-fidelity falsification framework using Bayesian optimization, which is able to determine at which level of fidelity we should conduct a safety evaluation in addition to finding possible instances from the environment that cause the system to fail. This method allows us to automatically switch between inexpensive, inaccurate information from a low-fidelity simulator and expensive, accurate information from a high-fidelity simulator in a cost-effective way. Our experiments on various environments in simulation demonstrate that multi-fidelity Bayesian optimization has falsification performance comparable to single-fidelity Bayesian optimization but with much lower cost.
    Calibrating the Rigged Lottery: Making All Tickets Reliable. (arXiv:2302.09369v1 [cs.LG])
    Although sparse training has been successfully used in various resource-limited deep learning tasks to save memory, accelerate training, and reduce inference time, the reliability of the produced sparse models remains unexplored. Previous research has shown that deep neural networks tend to be over-confident, and we find that sparse training exacerbates this problem. Therefore, calibrating the sparse models is crucial for reliable prediction and decision-making. In this paper, we propose a new sparse training method to produce sparse models with improved confidence calibration. In contrast to previous research that uses only one mask to control the sparse topology, our method utilizes two masks, including a deterministic mask and a random mask. The former efficiently searches and activates important weights by exploiting the magnitude of weights and gradients. While the latter brings better exploration and finds more appropriate weight values by random updates. Theoretically, we prove our method can be viewed as a hierarchical variational approximation of a probabilistic deep Gaussian process. Extensive experiments on multiple datasets, model architectures, and sparsities show that our method reduces ECE values by up to 47.8\% and simultaneously maintains or even improves accuracy with only a slight increase in computation and storage burden.
    Kernel Methods for Unobserved Confounding: Negative Controls, Proxies, and Instruments. (arXiv:2012.10315v4 [stat.ML] UPDATED)
    Negative control is a strategy for learning the causal relationship between treatment and outcome in the presence of unmeasured confounding. The treatment effect can nonetheless be identified if two auxiliary variables are available: a negative control treatment (which has no effect on the actual outcome), and a negative control outcome (which is not affected by the actual treatment). These auxiliary variables can also be viewed as proxies for a traditional set of control variables, and they bear resemblance to instrumental variables. I propose a family of algorithms based on kernel ridge regression for learning nonparametric treatment effects with negative controls. Examples include dose response curves, dose response curves with distribution shift, and heterogeneous treatment effects. Data may be discrete or continuous, and low, high, or infinite dimensional. I prove uniform consistency and provide finite sample rates of convergence. I estimate the dose response curve of cigarette smoking on infant birth weight adjusting for unobserved confounding due to household income, using a data set of singleton births in the state of Pennsylvania between 1989 and 1991.
    Markovian Gaussian Process Variational Autoencoders. (arXiv:2207.05543v2 [cs.LG] UPDATED)
    Sequential VAEs have been successfully considered for many high-dimensional time series modelling problems, with many variant models relying on discrete-time mechanisms such as recurrent neural networks (RNNs). On the other hand, continuous-time methods have recently gained attraction, especially in the context of irregularly-sampled time series, where they can better handle the data than discrete-time methods. One such class are Gaussian process variational autoencoders (GPVAEs), where the VAE prior is set as a Gaussian process (GP). However, a major limitation of GPVAEs is that it inherits the cubic computational cost as GPs, making it unattractive to practioners. In this work, we leverage the equivalent discrete state space representation of Markovian GPs to enable linear time GPVAE training via Kalman filtering and smoothing. We show on a variety of high-dimensional temporal and spatiotemporal tasks that our method performs favourably compared to existing approaches whilst being computationally highly scalable.
    MultiViz: Towards Visualizing and Understanding Multimodal Models. (arXiv:2207.00056v2 [cs.LG] UPDATED)
    The promise of multimodal models for real-world applications has inspired research in visualizing and understanding their internal mechanics with the end goal of empowering stakeholders to visualize model behavior, perform model debugging, and promote trust in machine learning models. However, modern multimodal models are typically black-box neural networks, which makes it challenging to understand their internal mechanics. How can we visualize the internal modeling of multimodal interactions in these models? Our paper aims to fill this gap by proposing MultiViz, a method for analyzing the behavior of multimodal models by scaffolding the problem of interpretability into 4 stages: (1) unimodal importance: how each modality contributes towards downstream modeling and prediction, (2) cross-modal interactions: how different modalities relate with each other, (3) multimodal representations: how unimodal and cross-modal interactions are represented in decision-level features, and (4) multimodal prediction: how decision-level features are composed to make a prediction. MultiViz is designed to operate on diverse modalities, models, tasks, and research areas. Through experiments on 8 trained models across 6 real-world tasks, we show that the complementary stages in MultiViz together enable users to (1) simulate model predictions, (2) assign interpretable concepts to features, (3) perform error analysis on model misclassifications, and (4) use insights from error analysis to debug models. MultiViz is publicly available, will be regularly updated with new interpretation tools and metrics, and welcomes inputs from the community.
    A Review of Safe Reinforcement Learning: Methods, Theory and Applications. (arXiv:2205.10330v4 [cs.AI] UPDATED)
    Reinforcement learning (RL) has achieved tremendous success in many complex decision making tasks. When it comes to deploying RL in the real world, safety concerns are usually raised, leading to a growing demand for safe RL algorithms, such as in autonomous driving and robotics scenarios. While safety control has a long history, the study of safe RL algorithms is still in the early stages. To establish a good foundation for future research in this thread, in this paper, we provide a review for safe RL from the perspectives of methods, theory and applications. Firstly, we review the progress of safe RL from five dimensions and come up with five problems that are crucial for safe RL being deployed in real-world applications, coined as "2H3W". Secondly, we analyze the theory and algorithm progress from the perspectives of answering the "2H3W" problems. Then, the sample complexity of safe RL methods is reviewed and discussed, followed by an introduction of the applications and benchmarks of safe RL algorithms. Finally, we open the discussion of the challenging problems in safe RL, hoping to inspire more future research on this thread. To advance the study of safe RL algorithms, we release a benchmark suite, an open-sourced repository containing the implementations of major safe RL algorithms, along with tutorials at the link: https://github.com/chauncygu/Safe-Reinforcement-Learning-Baselines.git.
    CLAM: Selective Clarification for Ambiguous Questions with Generative Language Models. (arXiv:2212.07769v2 [cs.CL] UPDATED)
    Users often ask dialogue systems ambiguous questions that require clarification. We show that current language models rarely ask users to clarify ambiguous questions and instead provide incorrect answers. To address this, we introduce CLAM: a framework for getting language models to selectively ask for clarification about ambiguous user questions. In particular, we show that we can prompt language models to detect whether a given question is ambiguous, generate an appropriate clarifying question to ask the user, and give a final answer after receiving clarification. We also show that we can simulate users by providing language models with privileged information. This lets us automatically evaluate multi-turn clarification dialogues. Finally, CLAM significantly improves language models' accuracy on mixed ambiguous and unambiguous questions relative to SotA.
    Learning with Impartiality to Walk on the Pareto Frontier of Fairness, Privacy, and Utility. (arXiv:2302.09183v1 [cs.LG])
    Deploying machine learning (ML) models often requires both fairness and privacy guarantees. Both of these objectives present unique trade-offs with the utility (e.g., accuracy) of the model. However, the mutual interactions between fairness, privacy, and utility are less well-understood. As a result, often only one objective is optimized, while the others are tuned as hyper-parameters. Because they implicitly prioritize certain objectives, such designs bias the model in pernicious, undetectable ways. To address this, we adopt impartiality as a principle: design of ML pipelines should not favor one objective over another. We propose impartially-specified models, which provide us with accurate Pareto frontiers that show the inherent trade-offs between the objectives. Extending two canonical ML frameworks for privacy-preserving learning, we provide two methods (FairDP-SGD and FairPATE) to train impartially-specified models and recover the Pareto frontier. Through theoretical privacy analysis and a comprehensive empirical study, we provide an answer to the question of where fairness mitigation should be integrated within a privacy-aware ML pipeline.
    Average-case Acceleration Through Spectral Density Estimation. (arXiv:2002.04756v7 [math.OC] UPDATED)
    We develop a framework for the average-case analysis of random quadratic problems and derive algorithms that are optimal under this analysis. This yields a new class of methods that achieve acceleration given a model of the Hessian's eigenvalue distribution. We develop explicit algorithms for the uniform, Marchenko-Pastur, and exponential distributions. These methods are momentum-based algorithms, whose hyper-parameters can be estimated without knowledge of the Hessian's smallest singular value, in contrast with classical accelerated methods like Nesterov acceleration and Polyak momentum. Through empirical benchmarks on quadratic and logistic regression problems, we identify regimes in which the the proposed methods improve over classical (worst-case) accelerated methods.
    Is Differentiable Architecture Search truly a One-Shot Method?. (arXiv:2108.05647v3 [cs.LG] UPDATED)
    Differentiable architecture search (DAS) is a widely researched tool for the discovery of novel architectures, due to its promising results for image classification. The main benefit of DAS is the effectiveness achieved through the weight-sharing one-shot paradigm, which allows efficient architecture search. In this work, we investigate DAS in a systematic case study of inverse problems, which allows us to analyze these potential benefits in a controlled manner. We demonstrate that the success of DAS can be extended from image classification to signal reconstruction, in principle. However, our experiments also expose three fundamental difficulties in the evaluation of DAS-based methods in inverse problems: First, the results show a large variance in all test cases. Second, the final performance is strongly dependent on the hyperparameters of the optimizer. And third, the performance of the weight-sharing architecture used during training does not reflect the final performance of the found architecture well. While the results on image reconstruction confirm the potential of the DAS paradigm, they challenge the common understanding of DAS as a one-shot method.
    Distributional Offline Policy Evaluation with Predictive Error Guarantees. (arXiv:2302.09456v1 [cs.LG])
    We study the problem of estimating the distribution of the return of a policy using an offline dataset that is not generated from the policy, i.e., distributional offline policy evaluation (OPE). We propose an algorithm called Fitted Likelihood Estimation (FLE), which conducts a sequence of Maximum Likelihood Estimation (MLE) problems and has the flexibility of integrating any state-of-art probabilistic generative models as long as it can be trained via MLE. FLE can be used for both finite horizon and infinite horizon discounted settings where rewards can be multi-dimensional vectors. In our theoretical results, we show that for both finite and infinite horizon discounted settings, FLE can learn distributions that are close to the ground truth under total variation distance and Wasserstein distance, respectively. Our theoretical results hold under the conditions that the offline data covers the test policy's traces and the supervised learning MLE procedures succeed. Experimentally, we demonstrate the performance of FLE with two generative models, Gaussian mixture models and diffusion models. For the multi-dimensional reward setting, FLE with diffusion models is capable of estimating the complicated distribution of the return of a test policy.
    Resource Constrained Vehicular Edge Federated Learning with Highly Mobile Connected Vehicles. (arXiv:2210.15496v2 [eess.SY] UPDATED)
    This paper proposes a vehicular edge federated learning (VEFL) solution, where an edge server leverages highly mobile connected vehicles' (CVs') onboard central processing units (CPUs) and local datasets to train a global model. Convergence analysis reveals that the VEFL training loss depends on the successful receptions of the CVs' trained models over the intermittent vehicle-to-infrastructure (V2I) wireless links. Owing to high mobility, in the full device participation case (FDPC), the edge server aggregates client model parameters based on a weighted combination according to the CVs' dataset sizes and sojourn periods, while it selects a subset of CVs in the partial device participation case (PDPC). We then devise joint VEFL and radio access technology (RAT) parameters optimization problems under delay, energy and cost constraints to maximize the probability of successful reception of the locally trained models. Considering that the optimization problem is NP-hard, we decompose it into a VEFL parameter optimization sub-problem, given the estimated worst-case sojourn period, delay and energy expense, and an online RAT parameter optimization sub-problem. Finally, extensive simulations are conducted to validate the effectiveness of the proposed solutions with a practical 5G new radio (5G-NR) RAT under a realistic microscopic mobility model.
    Effective Multimodal Reinforcement Learning with Modality Alignment and Importance Enhancement. (arXiv:2302.09318v1 [cs.LG])
    Many real-world applications require an agent to make robust and deliberate decisions with multimodal information (e.g., robots with multi-sensory inputs). However, it is very challenging to train the agent via reinforcement learning (RL) due to the heterogeneity and dynamic importance of different modalities. Specifically, we observe that these issues make conventional RL methods difficult to learn a useful state representation in the end-to-end training with multimodal information. To address this, we propose a novel multimodal RL approach that can do multimodal alignment and importance enhancement according to their similarity and importance in terms of RL tasks respectively. By doing so, we are able to learn an effective state representation and consequentially improve the RL training process. We test our approach on several multimodal RL domains, showing that it outperforms state-of-the-art methods in terms of learning speed and policy quality.
    On Cross-Layer Alignment for Model Fusion of Heterogeneous Neural Networks. (arXiv:2110.15538v3 [cs.LG] UPDATED)
    Layer-wise model fusion via optimal transport, named OTFusion, applies soft neuron association for unifying different pre-trained networks to save computational resources. While enjoying its success, OTFusion requires the input networks to have the same number of layers. To address this issue, we propose a novel model fusion framework, named CLAFusion, to fuse neural networks with a different number of layers, which we refer to as heterogeneous neural networks, via cross-layer alignment. The cross-layer alignment problem, which is an unbalanced assignment problem, can be solved efficiently using dynamic programming. Based on the cross-layer alignment, our framework balances the number of layers of neural networks before applying layer-wise model fusion. Our experiments indicate that CLAFusion, with an extra finetuning process, improves the accuracy of residual networks on the CIFAR10, CIFAR100, and Tiny-ImageNet datasets. Furthermore, we explore its practical usage for model compression and knowledge distillation when applying to the teacher-student setting.
    Adversarial Weight Perturbation Improves Generalization in Graph Neural Networks. (arXiv:2212.04983v3 [cs.LG] UPDATED)
    A lot of theoretical and empirical evidence shows that the flatter local minima tend to improve generalization. Adversarial Weight Perturbation (AWP) is an emerging technique to efficiently and effectively find such minima. In AWP we minimize the loss w.r.t. a bounded worst-case perturbation of the model parameters thereby favoring local minima with a small loss in a neighborhood around them. The benefits of AWP, and more generally the connections between flatness and generalization, have been extensively studied for i.i.d. data such as images. In this paper, we extensively study this phenomenon for graph data. Along the way, we first derive a generalization bound for non-i.i.d. node classification tasks. Then we identify a vanishing-gradient issue with all existing formulations of AWP and we propose a new Weighted Truncated AWP (WT-AWP) to alleviate this issue. We show that regularizing graph neural networks with WT-AWP consistently improves both natural and robust generalization across many different graph learning tasks and models.
    Overparameterized ReLU Neural Networks Learn the Simplest Models: Neural Isometry and Exact Recovery. (arXiv:2209.15265v3 [cs.LG] UPDATED)
    The practice of deep learning has shown that neural networks generalize remarkably well even with an extreme number of learned parameters. This appears to contradict traditional statistical wisdom, in which a trade-off between model complexity and fit to the data is essential. We aim to address this discrepancy by adopting a convex optimization and sparse recovery perspective. We consider the training and generalization properties of two-layer ReLU networks with standard weight decay regularization. Under certain regularity assumptions on the data, we show that ReLU networks with an arbitrary number of parameters learn only simple models that explain the data. This is analogous to the recovery of the sparsest linear model in compressed sensing. For ReLU networks and their variants with skip connections or normalization layers, we present isometry conditions that ensure the exact recovery of planted neurons. For randomly generated data, we show the existence of a phase transition in recovering planted neural network models, which is easy to describe: whenever the ratio between the number of samples and the dimension exceeds a numerical threshold, the recovery succeeds with high probability; otherwise, it fails with high probability. Surprisingly, ReLU networks learn simple and sparse models that generalize well even when the labels are noisy . The phase transition phenomenon is confirmed through numerical experiments.
    Conjugate Gradient Method for Generative Adversarial Networks. (arXiv:2203.14495v2 [cs.LG] UPDATED)
    One of the training strategies of generative models is to minimize the Jensen--Shannon divergence between the model distribution and the data distribution. Since data distribution is unknown, generative adversarial networks (GANs) formulate this problem as a game between two models, a generator and a discriminator. The training can be formulated in the context of game theory and the local Nash equilibrium (LNE). It does not seem feasible to derive guarantees of stability or optimality for the existing methods. This optimization problem is far more challenging than the single objective setting. Here, we use the conjugate gradient method to reliably and efficiently solve the LNE problem in GANs. We give a proof and convergence analysis under mild assumptions showing that the proposed method converges to a LNE with three different learning rate update rules, including a constant learning rate. Finally, we demonstrate that the proposed method outperforms stochastic gradient descent (SGD) and momentum SGD in terms of best Frechet inception distance (FID) score and outperforms Adam on average. The code is available at \url{https://github.com/Hiroki11x/ConjugateGradient_GAN}.
    Adversarial Policies Beat Superhuman Go AIs. (arXiv:2211.00241v3 [cs.LG] UPDATED)
    We attack the state-of-the-art Go-playing AI system KataGo by training adversarial policies that play against frozen KataGo victims. Our attack achieves a >99% win rate when KataGo uses no tree search, and a >97% win rate when KataGo uses enough search to be superhuman. We train our adversaries with a modified KataGo implementation, using less than 14% of the compute used to train the original KataGo. Notably, our adversaries do not win by learning to play Go better than KataGo -- in fact, our adversaries are easily beaten by human amateurs. Instead, our adversaries win by tricking KataGo into making serious blunders. Our attack transfers zero-shot to other superhuman Go-playing AIs, and is interpretable to the extent that human experts can successfully implement it, without algorithmic assistance, to consistently beat superhuman AIs. Our results demonstrate that even superhuman AI systems may harbor surprising failure modes. Example games are available at https://goattack.far.ai/.
    Lifelong Bandit Optimization: No Prior and No Regret. (arXiv:2210.15513v2 [stat.ML] UPDATED)
    Machine learning algorithms are often repeatedly applied to problems with similar structure over and over again. We focus on solving a sequence of bandit optimization tasks and develop LIBO, an algorithm which adapts to the environment by learning from past experience and becomes more sample-efficient in the process. We assume a kernelized structure where the kernel is unknown but shared across all tasks. LIBO sequentially meta-learns a kernel that approximates the true kernel and solves the incoming tasks with the latest kernel estimate. Our algorithm can be paired with any kernelized or linear bandit algorithm and guarantees oracle optimal performance, meaning that as more tasks are solved, the regret of LIBO on each task converges to the regret of the bandit algorithm with oracle knowledge of the true kernel. Naturally, if paired with a sublinear bandit algorithm, LIBO yields a sublinear lifelong regret. We also show that direct access to the data from each task is not necessary for attaining sublinear regret. We propose F-LIBO, which solves the lifelong problem in a federated manner.
    No-Regret Dynamics in the Fenchel Game: A Unified Framework for Algorithmic Convex Optimization. (arXiv:2111.11309v3 [cs.LG] UPDATED)
    We develop an algorithmic framework for solving convex optimization problems using no-regret game dynamics. By converting the problem of minimizing a convex function into an auxiliary problem of solving a min-max game in a sequential fashion, we can consider a range of strategies for each of the two-players who must select their actions one after the other. A common choice for these strategies are so-called no-regret learning algorithms, and we describe a number of such and prove bounds on their regret. We then show that many classical first-order methods for convex optimization -- including average-iterate gradient descent, the Frank-Wolfe algorithm, Nesterov's acceleration methods, and the accelerated proximal method -- can be interpreted as special cases of our framework as long as each player makes the correct choice of no-regret strategy. Proving convergence rates in this framework becomes very straightforward, as they follow from plugging in the appropriate known regret bounds. Our framework also gives rise to a number of new first-order methods for special cases of convex optimization that were not previously known.
    On Equivalent Optimization of Machine Learning Methods. (arXiv:2302.09160v1 [cs.LG])
    At the core of many machine learning methods resides an iterative optimization algorithm for their training. Such optimization algorithms often come with a plethora of choices regarding their implementation. In the case of deep neural networks, choices of optimizer, learning rate, batch size, etc. must be made. Despite the fundamental way in which these choices impact the training of deep neural networks, there exists no general method for identifying when they lead to equivalent, or non-equivalent, optimization trajectories. By viewing iterative optimization as a discrete-time dynamical system, we are able to leverage Koopman operator theory, where it is known that conjugate dynamics can have identical spectral objects. We find highly overlapping Koopman spectra associated with the application of online mirror and gradient descent to specific problems, illustrating that such a data-driven approach can corroborate the recently discovered analytical equivalence between the two optimizers. We extend our analysis to feedforward, fully connected neural networks, providing the first general characterization of when choices of learning rate, batch size, layer width, data set, and activation function lead to equivalent, and non-equivalent, evolution of network parameters during training. Among our main results, we find that learning rate to batch size ratio, layer width, nature of data set (handwritten vs. synthetic), and activation function affect the nature of conjugacy. Our data-driven approach is general and can be utilized broadly to compare the optimization of machine learning methods.
    BolT: Fused Window Transformers for fMRI Time Series Analysis. (arXiv:2205.11578v3 [eess.SP] UPDATED)
    Deep-learning models have enabled performance leaps in analysis of high-dimensional functional MRI (fMRI) data. Yet, many previous methods are suboptimally sensitive for contextual representations across diverse time scales. Here, we present BolT, a blood-oxygen-level-dependent transformer model, for analyzing multi-variate fMRI time series. BolT leverages a cascade of transformer encoders equipped with a novel fused window attention mechanism. Encoding is performed on temporally-overlapped windows within the time series to capture local representations. To integrate information temporally, cross-window attention is computed between base tokens in each window and fringe tokens from neighboring windows. To gradually transition from local to global representations, the extent of window overlap and thereby number of fringe tokens are progressively increased across the cascade. Finally, a novel cross-window regularization is employed to align high-level classification features across the time series. Comprehensive experiments on large-scale public datasets demonstrate the superior performance of BolT against state-of-the-art methods. Furthermore, explanatory analyses to identify landmark time points and regions that contribute most significantly to model decisions corroborate prominent neuroscientific findings in the literature.
    Solving Seismic Wave Equations on Variable Velocity Models with Fourier Neural Operator. (arXiv:2209.12340v3 [cs.LG] UPDATED)
    In the study of subsurface seismic imaging, solving the acoustic wave equation is a pivotal component in existing models. The advancement of deep learning enables solving partial differential equations, including wave equations, by applying neural networks to identify the mapping between the inputs and the solution. This approach can be faster than traditional numerical methods when numerous instances are to be solved. Previous works that concentrate on solving the wave equation by neural networks consider either a single velocity model or multiple simple velocity models, which is restricted in practice. Instead, inspired by the idea of operator learning, this work leverages the Fourier neural operator (FNO) to effectively learn the frequency domain seismic wavefields under the context of variable velocity models. We also propose a new framework paralleled Fourier neural operator (PFNO) for efficiently training the FNO-based solver given multiple source locations and frequencies. Numerical experiments demonstrate the high accuracy of both FNO and PFNO with complicated velocity models in the OpenFWI datasets. Furthermore, the cross-dataset generalization test verifies that PFNO adapts to out-of-distribution velocity models. Moreover, PFNO has robust performance in the presence of random noise in the labels. Finally, PFNO admits higher computational efficiency on large-scale testing datasets than the traditional finite-difference method. The aforementioned advantages endow the FNO-based solver with the potential to build powerful models for research on seismic waves.
    Shortcut Learning Through the Lens of Early Training Dynamics. (arXiv:2302.09344v1 [cs.LG])
    Deep Neural Networks (DNNs) are prone to learn shortcut patterns that damage the generalization of the DNN during deployment. Shortcut Learning is concerning, particularly when the DNNs are applied to safety-critical domains. This paper aims to better understand shortcut learning through the lens of the learning dynamics of the internal neurons during the training process. More specifically, we make the following observations: (1) While previous works treat shortcuts as synonymous with spurious correlations, we emphasize that not all spurious correlations are shortcuts. We show that shortcuts are only those spurious features that are "easier" than the core features. (2) We build upon this premise and use instance difficulty methods (like Prediction Depth) to quantify "easy" and to identify this behavior during the training phase. (3) We empirically show that shortcut learning can be detected by observing the learning dynamics of the DNN's early layers, irrespective of the network architecture used. In other words, easy features learned by the initial layers of a DNN early during the training are potential shortcuts. We verify our claims on simulated and real medical imaging data and justify the empirical success of our hypothesis by showing the theoretical connections between Prediction Depth and information-theoretic concepts like V-usable information. Lastly, our experiments show the insufficiency of monitoring only accuracy plots during training (as is common in machine learning pipelines), and we highlight the need for monitoring early training dynamics using example difficulty metrics.
    Causal Balancing for Domain Generalization. (arXiv:2206.05263v4 [cs.LG] UPDATED)
    While machine learning models rapidly advance the state-of-the-art on various real-world tasks, out-of-domain (OOD) generalization remains a challenging problem given the vulnerability of these models to spurious correlations. We propose a balanced mini-batch sampling strategy to transform a biased data distribution into a spurious-free balanced distribution, based on the invariance of the underlying causal mechanisms for the data generation process. We argue that the Bayes optimal classifiers trained on such balanced distribution are minimax optimal across a diverse enough environment space. We also provide an identifiability guarantee of the latent variable model of the proposed data generation process, when utilizing enough train environments. Experiments are conducted on DomainBed, demonstrating empirically that our method obtains the best performance across 20 baselines reported on the benchmark.
    A large-scale and PCR-referenced vocal audio dataset for COVID-19. (arXiv:2212.07738v2 [cs.SD] UPDATED)
    The UK COVID-19 Vocal Audio Dataset is designed for the training and evaluation of machine learning models that classify SARS-CoV-2 infection status or associated respiratory symptoms using vocal audio. The UK Health Security Agency recruited voluntary participants through the national Test and Trace programme and the REACT-1 survey in England from March 2021 to March 2022, during dominant transmission of the Alpha and Delta SARS-CoV-2 variants and some Omicron variant sublineages. Audio recordings of volitional coughs, exhalations, and speech were collected in the 'Speak up to help beat coronavirus' digital survey alongside demographic, self-reported symptom and respiratory condition data, and linked to SARS-CoV-2 test results. The UK COVID-19 Vocal Audio Dataset represents the largest collection of SARS-CoV-2 PCR-referenced audio recordings to date. PCR results were linked to 70,794 of 72,999 participants and 24,155 of 25,776 positive cases. Respiratory symptoms were reported by 45.62% of participants. This dataset has additional potential uses for bioacoustics research, with 11.30% participants reporting asthma, and 27.20% with linked influenza PCR test results.
    Hybrid Traffic Control and Coordination from Pixels. (arXiv:2302.09167v1 [cs.MA])
    Traffic congestion is a persistent problem in our society. Existing methods for traffic control have proven futile in alleviating current congestion levels leading researchers to explore ideas with robot vehicles given the increased emergence of vehicles with different levels of autonomy on our roads. This gives rise to hybrid traffic control, where robot vehicles regulate human-driven vehicles, through reinforcement learning (RL). However, most existing studies use precise observations that involve global information, such as network throughput, as well as local information, such as vehicle positions and velocities. Obtaining this information requires updating existing road infrastructure with vast sensor networks and communication to potentially unwilling human drivers. We consider image observations as the alternative for hybrid traffic control via RL: 1) images are readily available through satellite imagery, in-car camera systems, and traffic monitoring systems; 2) Images do not require a complete re-imagination of the observation space from network to network; and 3) images only require communication to equipment. In this work, we show that robot vehicles using image observations can achieve similar performance to using precise information on networks, including ring, figure eight, merge, bottleneck, and intersections. We also demonstrate increased performance (up to 26%) in certain cases on tested networks, despite only using local traffic information as opposed to global traffic information.
    Contextual Semantic Parsing for Multilingual Task-Oriented Dialogues. (arXiv:2111.02574v2 [cs.CL] UPDATED)
    Robust state tracking for task-oriented dialogue systems currently remains restricted to a few popular languages. This paper shows that given a large-scale dialogue data set in one language, we can automatically produce an effective semantic parser for other languages using machine translation. We propose automatic translation of dialogue datasets with alignment to ensure faithful translation of slot values and eliminate costly human supervision used in previous benchmarks. We also propose a new contextual semantic parsing model, which encodes the formal slots and values, and only the last agent and user utterances. We show that the succinct representation reduces the compounding effect of translation errors, without harming the accuracy in practice. We evaluate our approach on several dialogue state tracking benchmarks. On RiSAWOZ, CrossWOZ, CrossWOZ-EN, and MultiWOZ-ZH datasets we improve the state of the art by 11%, 17%, 20%, and 0.3% in joint goal accuracy. We present a comprehensive error analysis for all three datasets showing erroneous annotations can lead to misguided judgments on the quality of the model. Finally, we present RiSAWOZ English and German datasets, created using our translation methodology. On these datasets, accuracy is within 11% of the original showing that high-accuracy multilingual dialogue datasets are possible without relying on expensive human annotations. We release our datasets and software open source.
    Function Composition in Trustworthy Machine Learning: Implementation Choices, Insights, and Questions. (arXiv:2302.09190v1 [cs.LG])
    Ensuring trustworthiness in machine learning (ML) models is a multi-dimensional task. In addition to the traditional notion of predictive performance, other notions such as privacy, fairness, robustness to distribution shift, adversarial robustness, interpretability, explainability, and uncertainty quantification are important considerations to evaluate and improve (if deficient). However, these sub-disciplines or 'pillars' of trustworthiness have largely developed independently, which has limited us from understanding their interactions in real-world ML pipelines. In this paper, focusing specifically on compositions of functions arising from the different pillars, we aim to reduce this gap, develop new insights for trustworthy ML, and answer questions such as the following. Does the composition of multiple fairness interventions result in a fairer model compared to a single intervention? How do bias mitigation algorithms for fairness affect local post-hoc explanations? Does a defense algorithm for untargeted adversarial attacks continue to be effective when composed with a privacy transformation? Toward this end, we report initial empirical results and new insights from 9 different compositions of functions (or pipelines) on 7 real-world datasets along two trustworthy dimensions - fairness and explainability. We also report progress, and implementation choices, on an extensible composer tool to encourage the combination of functionalities from multiple pillars. To-date, the tool supports bias mitigation algorithms for fairness and post-hoc explainability methods. We hope this line of work encourages the thoughtful consideration of multiple pillars when attempting to formulate and resolve a trustworthiness problem.
    Efficient Wireless Federated Learning with Partial Model Aggregation. (arXiv:2204.09746v3 [cs.LG] UPDATED)
    The data heterogeneity across devices and the limited communication resources, e.g., bandwidth and energy, are two of the main bottlenecks for wireless federated learning (FL). To tackle these challenges, we first devise a novel FL framework with partial model aggregation (PMA). This approach aggregates the lower layers of neural networks, responsible for feature extraction, at the parameter server while keeping the upper layers, responsible for complex pattern recognition, at devices for personalization. The proposed PMA-FL is able to address the data heterogeneity and reduce the transmitted information in wireless channels. Then, we derive a convergence bound of the framework under a non-convex loss function setting to reveal the role of unbalanced data size in the learning performance. On this basis, we maximize the scheduled data size to minimize the global loss function through jointly optimize the device scheduling, bandwidth allocation, computation and communication time division policies with the assistance of Lyapunov optimization. Our analysis reveals that the optimal time division is achieved when the communication and computation parts of PMA-FL have the same power. We also develop a bisection method to solve the optimal bandwidth allocation policy and use the set expansion algorithm to address the device scheduling policy. Compared with the benchmark schemes, the proposed PMA-FL improves 3.13\% and 11.8\% accuracy on two typical datasets with heterogeneous data distribution settings, i.e., MINIST and CIFAR-10, respectively. In addition, the proposed joint dynamic device scheduling and resource management approach achieve slightly higher accuracy than the considered benchmarks, but they provide a satisfactory energy and time reduction: 29\% energy or 20\% time reduction on the MNIST; and 25\% energy or 12.5\% time reduction on the CIFAR-10.
    Split Localized Conformal Prediction. (arXiv:2206.13092v2 [stat.ML] UPDATED)
    Conformal prediction is a simple and powerful tool that can quantify uncertainty without any distributional assumptions. Many existing methods only address the average coverage guarantee, which is not ideal compared to the stronger conditional coverage guarantee. Existing methods of approximating conditional coverage require additional models or time effort, which makes them not easy to scale. In this paper, we propose a modified non-conformity score by leveraging the local approximation of the conditional distribution using kernel density estimation. The modified score inherits the spirit of split conformal methods, which is simple and efficient and can scale to high dimensional settings. We also proposed a unified framework that brings together our method and several state-of-the-art. We perform extensive empirical evaluations: results measured by both average and conditional coverage confirm the advantage of our method.
    Bounding the Capabilities of Large Language Models in Open Text Generation with Prompt Constraints. (arXiv:2302.09185v1 [cs.CL])
    The limits of open-ended generative models are unclear, yet increasingly important. What causes them to succeed and what causes them to fail? In this paper, we take a prompt-centric approach to analyzing and bounding the abilities of open-ended generative models. We present a generic methodology of analysis with two challenging prompt constraint types: structural and stylistic. These constraint types are categorized into a set of well-defined constraints that are analyzable by a single prompt. We then systematically create a diverse set of simple, natural, and useful prompts to robustly analyze each individual constraint. Using the GPT-3 text-davinci-002 model as a case study, we generate outputs from our collection of prompts and analyze the model's generative failures. We also show the generalizability of our proposed method on other large models like BLOOM and OPT. Our results and our in-context mitigation strategies reveal open challenges for future research. We have publicly released our code at https://github.com/SALT-NLP/Bound-Cap-LLM.
    RecNet: Early Attention Guided Feature Recovery. (arXiv:2302.09409v1 [cs.LG])
    Uncertainty in sensors results in corrupted input streams and hinders the performance of Deep Neural Networks (DNN), which focus on deducing information from data. However, for sensors with multiple input streams, the relevant information among the streams correlates and hence contains mutual information. This paper utilizes this opportunity to recover the perturbed information due to corrupted input streams. We propose RecNet, which estimates the information entropy at every element of the input feature to the network and interpolates the missing information in the input feature matrix. Finally, using the estimated information entropy and interpolated data, we introduce a novel guided replacement procedure to recover the complete information that is the input to the downstream DNN task. We evaluate the proposed algorithm on a sound event detection and localization application where audio streams from the microphone array are corrupted. We have recovered the performance drop due to the corrupted input stream and reduced the localization error with non-corrupted input streams.
    How to choose the most appropriate centrality measure?. (arXiv:2003.01052v4 [physics.soc-ph] UPDATED)
    We propose a new method for selecting the most appropriate network centrality measure based on the user's opinion on how such a measure should work on simple graphs. The method consists in: (1) forming a set $\cal F$ of candidate measures; (2) generating a list $\cal G$ of fairly simple graphs such that for every pair of measures in $\cal F$, the centrality rankings they define differ on some graph $G\in{\cal G}$; (3) compiling a survey that consists of questions on comparing the centrality of test nodes in some graphs $G\in{\cal G}$; (4) completing this survey, which yields a centrality measure consistent with all user responses. We develop algorithms that implement the proposed method, called culling, for an arbitrary finite set $\cal F$ that does not contain order-equivalent measures. The culling method can be used either for rapid analysis or in combination with a normative approach by compiling a survey on the subset of measures that satisfy chosen axioms. As an example, this method is applied to a set of forty diverse centrality measures. Abbreviated surveys are constructed on the subsets of measures that satisfy the Self-consistency or Bridge axioms.
    Brainomaly: Unsupervised Neurologic Disease Detection Utilizing Unannotated T1-weighted Brain MR Images. (arXiv:2302.09200v1 [eess.IV])
    Deep neural networks have revolutionized the field of supervised learning by enabling accurate predictions through learning from large annotated datasets. However, acquiring large annotated medical imaging datasets is a challenging task, especially for rare diseases, due to the high cost, time, and effort required for annotation. In these scenarios, unsupervised disease detection methods, such as anomaly detection, can save significant human effort. A typically used approach for anomaly detection is to learn the images from healthy subjects only, assuming the model will detect the images from diseased subjects as outliers. However, in many real-world scenarios, unannotated datasets with a mix of healthy and diseased individuals are available. Recent studies have shown improvement in unsupervised disease/anomaly detection using such datasets of unannotated images from healthy and diseased individuals compared to datasets that only include images from healthy individuals. A major issue remains unaddressed in these studies, which is selecting the best model for inference from a set of trained models without annotated samples. To address this issue, we propose Brainomaly, a GAN-based image-to-image translation method for neurologic disease detection using unannotated T1-weighted brain MRIs of individuals with neurologic diseases and healthy subjects. Brainomaly is trained to remove the diseased regions from the input brain MRIs and generate MRIs of corresponding healthy brains. Instead of generating the healthy images directly, Brainomaly generates an additive map where each voxel indicates the amount of changes required to make the input image look healthy. In addition, Brainomaly uses a pseudo-AUC metric for inference model selection, which further improves the detection performance. Our Brainomaly outperforms existing state-of-the-art methods by large margins.
    On Handling Catastrophic Forgetting for Incremental Learning of Human Physical Activity on the Edge. (arXiv:2302.09310v1 [cs.LG])
    Human activity recognition (HAR) has been a classic research problem. In particular, with recent machine learning (ML) techniques, the recognition task has been largely investigated by companies and integrated into their products for customers. However, most of them apply a predefined activity set and conduct the learning process on the cloud, hindering specific personalizations from end users (i.e., edge devices). Even though recent progress in Incremental Learning allows learning new-class data on the fly, the learning process is generally conducted on the cloud, requiring constant data exchange between cloud and edge devices, thus leading to data privacy issues. In this paper, we propose PILOTE, which pushes the incremental learning process to the extreme edge, while providing reliable data privacy and practical utility, e.g., low processing latency, personalization, etc. In particular, we consider the practical challenge of extremely limited data during the incremental learning process on edge, where catastrophic forgetting is required to be handled in a practical way. We validate PILOTE with extensive experiments on human activity data collected from mobile sensors. The results show PILOTE can work on edge devices with extremely limited resources while providing reliable performance.
    Optimization-Informed Neural Networks. (arXiv:2210.02113v2 [math.OC] UPDATED)
    Solving constrained nonlinear optimization problems (CNLPs) is a longstanding problem that arises in various fields, e.g., economics, computer science, and engineering. We propose optimization-informed neural networks (OINN), a deep learning approach to solve CNLPs. By neurodynamic optimization methods, a CNLP is first reformulated as an initial value problem (IVP) involving an ordinary differential equation (ODE) system. A neural network model is then used as an approximate solution for this IVP, with the endpoint being the prediction to the CNLP. We propose a novel training algorithm that directs the model to hold the best prediction during training. In a nutshell, OINN transforms a CNLP into a neural network training problem. By doing so, we can solve CNLPs based on deep learning infrastructure only, without using standard optimization solvers or numerical integration solvers. The effectiveness of the proposed approach is demonstrated through a collection of classical problems, e.g., variational inequalities, nonlinear complementary problems, and standard CNLPs.
    Copula-based synthetic population generation. (arXiv:2302.09193v1 [stat.ML])
    Population synthesis consists of generating synthetic but realistic representations of a target population of micro-agents for the purpose of behavioral modeling and simulation. We introduce a new framework based on copulas to generate synthetic data for a target population of which only the empirical marginal distributions are known by using a sample from another population sharing similar marginal dependencies. This makes it possible to include a spatial component in the generation of population synthesis and to combine various sources of information to obtain more realistic population generators. Specifically, we normalize the data and treat them as realizations of a given copula, and train a generative model on the normalized data before injecting the information on the marginals. We compare the copulas framework to IPF and to modern probabilistic approaches such as Bayesian networks, variational auto-encoders, and generative adversarial networks. We also illustrate on American Community Survey data that the method proposed allows to study the structure of the data at different geographical levels in a way that is robust to the peculiarities of the marginal distributions.
    Using Deep Reinforcement Learning for mmWave Real-Time Scheduling. (arXiv:2210.01423v2 [cs.NI] UPDATED)
    We study the problem of real-time scheduling in a multi-hop millimeter-wave (mmWave) mesh. We develop a model-free deep reinforcement learning algorithm called Adaptive Activator RL (AARL), which determines the subset of mmWave links that should be activated during each time slot and the power level for each link. The most important property of AARL is its ability to make scheduling decisions within the strict time slot constraints of typical 5G mmWave networks. AARL can handle a variety of network topologies, network loads, and interference models, it can also adapt to different workloads. We demonstrate the operation of AARL on several topologies: a small topology with 10 links, a moderately-sized mesh with 48 links, and a large topology with 96 links. We show that for each topology, we compare the throughput obtained by AARL to that of a benchmark algorithm called RPMA (Residual Profit Maximizer Algorithm). The most important advantage of AARL compared to RPMA is that it is much faster and can make the necessary scheduling decisions very rapidly during every time slot, while RPMA cannot. In addition, the quality of the scheduling decisions made by AARL outperforms those made by RPMA.
    Machine Love. (arXiv:2302.09248v1 [cs.AI])
    While ML generates much economic value, many of us have problematic relationships with social media and other ML-powered applications. One reason is that ML often optimizes for what we want in the moment, which is easy to quantify but at odds with what is known scientifically about human flourishing. Thus, through its impoverished models of us, ML currently falls far short of its exciting potential, which is for it to help us to reach ours. While there is no consensus on defining human flourishing, from diverse perspectives across psychology, philosophy, and spiritual traditions, love is understood to be one of its primary catalysts. Motivated by this view, this paper explores whether there is a useful conception of love fitting for machines to embody, as historically it has been generative to explore whether a nebulous concept, such as life or intelligence, can be thoughtfully abstracted and reimagined, as in the fields of machine intelligence or artificial life. This paper forwards a candidate conception of machine love, inspired in particular by work in positive psychology and psychotherapy: to provide unconditional support enabling humans to autonomously pursue their own growth and development. Through proof of concept experiments, this paper aims to highlight the need for richer models of human flourishing in ML, provide an example framework through which positive psychology can be combined with ML to realize a rough conception of machine love, and demonstrate that current language models begin to enable embodying qualitative humanistic principles. The conclusion is that though at present ML may often serve to addict, distract, or divide us, an alternative path may be opening up: We may align ML to support our growth, through it helping us to align ourselves towards our highest aspirations.
    PFGE: Parsimonious Fast Geometric Ensembling of DNNs. (arXiv:2202.06658v7 [cs.LG] UPDATED)
    Ensemble methods have been widely used to improve the performance of machine learning methods in terms of generalization, while they are hard to use in deep learning systems, as training an ensemble of deep neural networks (DNNs) incurs an extremely higher computational overhead of model training. Recently, advanced techniques such as fast geometric ensembling (FGE) and snapshot ensemble have been proposed. These methods can train the model ensembles in the same time as a single model, thus getting around the hurdle of training time. However, their memory overhead for test-time inference remains much higher than single model based methods. Here we propose a parsimonious FGE (PFGE) that employs a lightweight ensemble of higher-performing DNNs, generated by successively-performed stochastic weight averaging procedures. Experimental results across different modern DNN architectures on widely used image datasets CIFAR-$\{10,100\}$ and Imagenet, demonstrate that PFGE can achieve 5x memory efficiency than prior art methods, yet without compromise in generalization performance. Our code is available at https://github.com/ZJLAB-AMMI/PFGE.
    Counterfactual Explainable Recommendation. (arXiv:2108.10539v3 [cs.IR] UPDATED)
    By providing explanations for users and system designers to facilitate better understanding and decision making, explainable recommendation has been an important research problem. In this paper, we propose Counterfactual Explainable Recommendation (CountER), which takes the insights of counterfactual reasoning from causal inference for explainable recommendation. CountER is able to formulate the complexity and the strength of explanations, and it adopts a counterfactual learning framework to seek simple (low complexity) and effective (high strength) explanations for the model decision. Technically, for each item recommended to each user, CountER formulates a joint optimization problem to generate minimal changes on the item aspects so as to create a counterfactual item, such that the recommendation decision on the counterfactual item is reversed. These altered aspects constitute the explanation of why the original item is recommended. The counterfactual explanation helps both the users for better understanding and the system designers for better model debugging. Another contribution of the work is the evaluation of explainable recommendation, which has been a challenging task. Fortunately, counterfactual explanations are very suitable for standard quantitative evaluation. To measure the explanation quality, we design two types of evaluation metrics, one from user's perspective (i.e. why the user likes the item), and the other from model's perspective (i.e. why the item is recommended by the model). We apply our counterfactual learning algorithm on a black-box recommender system and evaluate the generated explanations on five real-world datasets. Results show that our model generates more accurate and effective explanations than state-of-the-art explainable recommendation models.
    AutoAC: Towards Automated Attribute Completion for Heterogeneous Graph Neural Network. (arXiv:2301.03049v2 [cs.LG] UPDATED)
    Many real-world data can be modeled as heterogeneous graphs that contain multiple types of nodes and edges. Meanwhile, due to excellent performance, heterogeneous graph neural networks (GNNs) have received more and more attention. However, the existing work mainly focuses on the design of novel GNN models, while ignoring another important issue that also has a large impact on the model performance, namely the missing attributes of some node types. The handcrafted attribute completion requires huge expert experience and domain knowledge. Also, considering the differences in semantic characteristics between nodes, the attribute completion should be fine-grained, i.e., the attribute completion operation should be node-specific. Moreover, to improve the performance of the downstream graph learning task, attribute completion and the training of the heterogeneous GNN should be jointly optimized rather than viewed as two separate processes. To address the above challenges, we propose a differentiable attribute completion framework called AutoAC for automated completion operation search in heterogeneous GNNs. We first propose an expressive completion operation search space, including topology-dependent and topology-independent completion operations. Then, we propose a continuous relaxation schema and further propose a differentiable completion algorithm where the completion operation search is formulated as a bi-level joint optimization problem. To improve the search efficiency, we leverage two optimization techniques: discrete constraints and auxiliary unsupervised graph node clustering. Extensive experimental results on real-world datasets reveal that AutoAC outperforms the SOTA handcrafted heterogeneous GNNs and the existing attribute completion method
    Scalable Spatiotemporal Graph Neural Networks. (arXiv:2209.06520v2 [cs.LG] UPDATED)
    Neural forecasting of spatiotemporal time series drives both research and industrial innovation in several relevant application domains. Graph neural networks (GNNs) are often the core component of the forecasting architecture. However, in most spatiotemporal GNNs, the computational complexity scales up to a quadratic factor with the length of the sequence times the number of links in the graph, hence hindering the application of these models to large graphs and long temporal sequences. While methods to improve scalability have been proposed in the context of static graphs, few research efforts have been devoted to the spatiotemporal case. To fill this gap, we propose a scalable architecture that exploits an efficient encoding of both temporal and spatial dynamics. In particular, we use a randomized recurrent neural network to embed the history of the input time series into high-dimensional state representations encompassing multi-scale temporal dynamics. Such representations are then propagated along the spatial dimension using different powers of the graph adjacency matrix to generate node embeddings characterized by a rich pool of spatiotemporal features. The resulting node embeddings can be efficiently pre-computed in an unsupervised manner, before being fed to a feed-forward decoder that learns to map the multi-scale spatiotemporal representations to predictions. The training procedure can then be parallelized node-wise by sampling the node embeddings without breaking any dependency, thus enabling scalability to large networks. Empirical results on relevant datasets show that our approach achieves results competitive with the state of the art, while dramatically reducing the computational burden.
    Optimal Regret Is Achievable With Constant Approximate Inference Error: An Enhanced Bayesian Upper Confidence Bound Framework. (arXiv:2201.12955v3 [cs.LG] UPDATED)
    Bayesian bandit algorithms with approximate Bayesian inference have been widely used in real-world applications. However, there is a large discrepancy between the superior practical performance of these approaches and their theoretical justification. Previous research only indicates a negative theoretical result: Thompson sampling could have a worst-case linear regret $\Omega(T)$ with a constant threshold on the inference error measured by one $\alpha$-divergence. To bridge this gap, we propose an Enhanced Bayesian Upper Confidence Bound (EBUCB) framework that can efficiently accommodate bandit problems in the presence of approximate inference. Our theoretical analysis demonstrates that for Bernoulli multi-armed bandits, EBUCB can achieve the optimal regret order $O(\log T)$ if the inference error measured by two different $\alpha$-divergences is less than a constant, regardless of how large this constant is. Our study provides the first theoretical regret bound that is better than $o(T)$ in the setting of constant approximate inference error, to our best knowledge. Furthermore, in concordance with the negative results in previous studies, we show that only one bounded $\alpha$-divergence is insufficient to guarantee a sub-linear regret.
    Neural Attention Memory. (arXiv:2302.09422v1 [cs.LG])
    We propose a novel perspective of the attention mechanism by reinventing it as a memory architecture for neural networks, namely Neural Attention Memory (NAM). NAM is a memory structure that is both readable and writable via differentiable linear algebra operations. We explore three use cases of NAM: memory-augmented neural network (MANN), few-shot learning, and efficient long-range attention. First, we design two NAM-based MANNs of Long Short-term Memory (LSAM) and NAM Turing Machine (NAM-TM) that show better computational powers in algorithmic zero-shot generalization tasks compared to other baselines such as differentiable neural computer (DNC). Next, we apply NAM to the N-way K-shot learning task and show that it is more effective at reducing false positives compared to the baseline cosine classifier. Finally, we implement an efficient Transformer with NAM and evaluate it with long-range arena tasks to show that NAM can be an efficient and effective alternative for scaled dot-product attention.
    MEDFAIR: Benchmarking Fairness for Medical Imaging. (arXiv:2210.01725v2 [cs.LG] UPDATED)
    A multitude of work has shown that machine learning-based medical diagnosis systems can be biased against certain subgroups of people. This has motivated a growing number of bias mitigation algorithms that aim to address fairness issues in machine learning. However, it is difficult to compare their effectiveness in medical imaging for two reasons. First, there is little consensus on the criteria to assess fairness. Second, existing bias mitigation algorithms are developed under different settings, e.g., datasets, model selection strategies, backbones, and fairness metrics, making a direct comparison and evaluation based on existing results impossible. In this work, we introduce MEDFAIR, a framework to benchmark the fairness of machine learning models for medical imaging. MEDFAIR covers eleven algorithms from various categories, nine datasets from different imaging modalities, and three model selection criteria. Through extensive experiments, we find that the under-studied issue of model selection criterion can have a significant impact on fairness outcomes; while in contrast, state-of-the-art bias mitigation algorithms do not significantly improve fairness outcomes over empirical risk minimization (ERM) in both in-distribution and out-of-distribution settings. We evaluate fairness from various perspectives and make recommendations for different medical application scenarios that require different ethical principles. Our framework provides a reproducible and easy-to-use entry point for the development and evaluation of future bias mitigation algorithms in deep learning. Code is available at https://github.com/ys-zong/MEDFAIR.
    Unbalanced CO-Optimal Transport. (arXiv:2205.14923v3 [stat.ML] UPDATED)
    Optimal transport (OT) compares probability distributions by computing a meaningful alignment between their samples. CO-optimal transport (COOT) takes this comparison further by inferring an alignment between features as well. While this approach leads to better alignments and generalizes both OT and Gromov-Wasserstein distances, we provide a theoretical result showing that it is sensitive to outliers that are omnipresent in real-world data. This prompts us to propose unbalanced COOT for which we provably show its robustness to noise in the compared datasets. To the best of our knowledge, this is the first such result for OT methods in incomparable spaces. With this result in hand, we provide empirical evidence of this robustness for the challenging tasks of heterogeneous domain adaptation with and without varying proportions of classes and simultaneous alignment of samples and features across single-cell measurements.
    Distributed Non-Convex Optimization with One-Bit Compressors on Heterogeneous Data: Efficient and Resilient Algorithms. (arXiv:2210.00665v2 [cs.LG] UPDATED)
    Federated Learning (FL) is a nascent decentralized learning framework under which a massive collection of heterogeneous clients collaboratively train a model without revealing their local data. Scarce communication, privacy leakage, and Byzantine attacks are the key bottlenecks of system scalability. In this paper, we focus on communication-efficient distributed (stochastic) gradient descent for non-convex optimization, a driving force of FL. We propose two algorithms, named {\em Adaptive Stochastic Sign SGD (Ada-StoSign)} and {\em $\beta$-Stochastic Sign SGD ($\beta$-StoSign)}, each of which compresses the local gradients into bit vectors. To handle unbounded gradients, Ada-StoSign uses a novel norm tracking function that adaptively adjusts a coarse estimation on the $\ell_{\infty}$ of the local gradients - a key parameter used in gradient compression. We show that Ada-StoSign converges in expectation with a rate $O(\log T/\sqrt{T} + 1/\sqrt{M})$, where $M$ is the number of clients. To the best of our knowledge, when $M$ is sufficiently large, Ada-StoSign outperforms the state-of-the-art sign-based method whose convergence rate is $O(T^{-1/4})$. Under bounded gradient assumption, $\beta$-StoSign achieves quantifiable Byzantine resilience and privacy assurances, and works with partial client participation and mini-batch gradients which could be unbounded. We corroborate and complement our theories by experiments on MNIST and CIFAR-10 datasets.
    Towards Federated Learning on Time-Evolving Heterogeneous Data. (arXiv:2112.13246v3 [cs.LG] UPDATED)
    Federated Learning (FL) is a learning paradigm that protects privacy by keeping client data on edge devices. However, optimizing FL in practice can be difficult due to the diversity and heterogeneity of the learning system. Despite recent research efforts to improve the optimization of heterogeneous data, the impact of time-evolving heterogeneous data in real-world scenarios, such as changing client data or intermittent clients joining or leaving during training, has not been studied well. In this work, we propose Continual Federated Learning (CFL), a flexible framework for capturing the time-evolving heterogeneity of FL. CFL can handle complex and realistic scenarios, which are difficult to evaluate in previous FL formulations, by extracting information from past local data sets and approximating local objective functions. We theoretically demonstrate that CFL methods have a faster convergence rate than FedAvg in time-evolving scenarios, with the benefit depending on approximation quality. Through experiments, we show that our numerical findings match the convergence analysis and that CFL methods significantly outperform other state-of-the-art FL baselines.
    Provable Acceleration of Heavy Ball beyond Quadratics for a Class of Polyak-\L{}ojasiewicz Functions when the Non-Convexity is Averaged-Out. (arXiv:2206.11872v2 [math.OC] UPDATED)
    Heavy Ball (HB) nowadays is one of the most popular momentum methods in non-convex optimization. It has been widely observed that incorporating the Heavy Ball dynamic in gradient-based methods accelerates the training process of modern machine learning models. However, the progress on establishing its theoretical foundation of acceleration is apparently far behind its empirical success. Existing provable acceleration results are of the quadratic or close-to-quadratic functions, as the current techniques of showing HB's acceleration are limited to the case when the Hessian is fixed. In this work, we develop some new techniques that help show acceleration beyond quadratics, which is achieved by analyzing how the change of the Hessian at two consecutive time points affects the convergence speed. Based on our technical results, a class of Polyak-\L{}ojasiewicz (PL) optimization problems for which provable acceleration can be achieved via HB is identified. Moreover, our analysis demonstrates a benefit of adaptively setting the momentum parameter.
    Data Augmentation on Graphs: A Technical Survey. (arXiv:2212.09970v2 [cs.LG] UPDATED)
    In recent years, graph representation learning has achieved remarkable success while suffering from low-quality data problems. As a mature technology to improve data quality in computer vision, data augmentation has also attracted increasing attention in graph domain. For promoting the development of this emerging research direction, in this survey, we comprehensively review and summarize the existing graph data augmentation (GDAug) techniques. Specifically, we first summarize a variety of feasible taxonomies, and then classify existing GDAug studies based on fine-grained graph elements. Furthermore, for each type of GDAug technique, we formalize the general definition, discuss the technical details, and give schematic illustration. In addition, we also summarize common performance metrics and specific design metrics for constructing a GDAug evaluation system. Finally, we summarize the applications of GDAug from both data and model levels, as well as future directions. Latest advances in GDAug are summarized in a GitHub repository: https://github.com/jjzhou012/GDAug-Survey.
    Delving into the Adversarial Robustness of Federated Learning. (arXiv:2302.09479v1 [cs.LG])
    In Federated Learning (FL), models are as fragile as centrally trained models against adversarial examples. However, the adversarial robustness of federated learning remains largely unexplored. This paper casts light on the challenge of adversarial robustness of federated learning. To facilitate a better understanding of the adversarial vulnerability of the existing FL methods, we conduct comprehensive robustness evaluations on various attacks and adversarial training methods. Moreover, we reveal the negative impacts induced by directly adopting adversarial training in FL, which seriously hurts the test accuracy, especially in non-IID settings. In this work, we propose a novel algorithm called Decision Boundary based Federated Adversarial Training (DBFAT), which consists of two components (local re-weighting and global regularization) to improve both accuracy and robustness of FL systems. Extensive experiments on multiple datasets demonstrate that DBFAT consistently outperforms other baselines under both IID and non-IID settings.
    MNL-Bandit with Knapsacks. (arXiv:2106.01135v2 [cs.LG] UPDATED)
    We consider a dynamic assortment selection problem where a seller has a fixed inventory of $N$ substitutable products and faces an unknown demand that arrives sequentially over $T$ periods. In each period, the seller needs to decide on the assortment of products (of cardinality at most $K$) to offer to the customers. The customer's response follows an unknown multinomial logit model (MNL) with parameters $v$. The goal of the seller is to maximize the total expected revenue given the fixed initial inventory of $N$ products. We give a policy that achieves a regret of $\tilde O\Big(K \sqrt{KN T}\Big(\sqrt{v_{\text{max}}} + \frac{1}{q_{\text{min}}}\text{OPT}\Big)\Big)$, where $v_{\text{max}}\leq 1$ is the maximum utility for any product and $q_{\text{min}}$ the minimum inventory level, under a mild assumption on the model parameters. In particular, our policy achieves a near-optimal $\tilde O(\sqrt{T})$ regret in a large-inventory setting. Our policy builds upon the UCB-based approach for MNL-bandit without inventory constraints in [1] and addresses the inventory constraints through an exponentially sized LP for which we present a tractable approximation while keeping the $\tilde O(\sqrt{T})$ regret bound.
    Unveiling Transformers with LEGO: a synthetic reasoning task. (arXiv:2206.04301v3 [cs.LG] UPDATED)
    We propose a synthetic reasoning task, LEGO (Learning Equality and Group Operations), that encapsulates the problem of following a chain of reasoning, and we study how the Transformer architectures learn this task. We pay special attention to data effects such as pretraining (on seemingly unrelated NLP tasks) and dataset composition (e.g., differing chain length at training and test time), as well as architectural variants such as weight-tied layers or adding convolutional components. We study how the trained models eventually succeed at the task, and in particular, we manage to understand some of the attention heads as well as how the information flows in the network. In particular, we have identified a novel \emph{association} pattern that globally attends only to identical tokens. Based on these observations we propose a hypothesis that here pretraining helps for LEGO tasks due to certain structured attention patterns, and we experimentally verify this hypothesis. We also observe that in some data regime the trained transformer finds ``shortcut" solutions to follow the chain of reasoning, which impedes the model's robustness, and moreover we propose ways to prevent it. Motivated by our findings on structured attention patterns, we propose the LEGO attention module, a drop-in replacement for vanilla attention heads. This architectural change significantly reduces Flops and maintains or even \emph{improves} the model's performance at large-scale pretraining.
    ViTA: A Vision Transformer Inference Accelerator for Edge Applications. (arXiv:2302.09108v1 [cs.AR])
    Vision Transformer models, such as ViT, Swin Transformer, and Transformer-in-Transformer, have recently gained significant traction in computer vision tasks due to their ability to capture the global relation between features which leads to superior performance. However, they are compute-heavy and difficult to deploy in resource-constrained edge devices. Existing hardware accelerators, including those for the closely-related BERT transformer models, do not target highly resource-constrained environments. In this paper, we address this gap and propose ViTA - a configurable hardware accelerator for inference of vision transformer models, targeting resource-constrained edge computing devices and avoiding repeated off-chip memory accesses. We employ a head-level pipeline and inter-layer MLP optimizations, and can support several commonly used vision transformer models with changes solely in our control logic. We achieve nearly 90% hardware utilization efficiency on most vision transformer models, report a power of 0.88W when synthesised with a clock of 150 MHz, and get reasonable frame rates - all of which makes ViTA suitable for edge applications.
    Towards Co-operative Congestion Mitigation. (arXiv:2302.09140v1 [cs.LG])
    The effects of traffic congestion are widespread and are an impedance to everyday life. Piecewise constant driving policies have shown promise in helping mitigate traffic congestion in simulation environments. However, no works currently test these policies in situations involving real human users. Thus, we propose to evaluate these policies through the use of a shared control framework in a collaborative experiment with the human driver and the driving policy aiming to co-operatively mitigate congestion. We intend to use the CARLA simulator alongside the Flow framework to conduct user studies to evaluate the affect of piecewise constant driving policies. As such, we present our in-progress work in building our framework and discuss our proposed plan on evaluating this framework through a human-in-the-loop simulation user study.
    Graph Generative Model for Benchmarking Graph Neural Networks. (arXiv:2207.04396v3 [cs.LG] UPDATED)
    As the field of Graph Neural Networks (GNN) continues to grow, it experiences a corresponding increase in the need for large, real-world datasets to train and test new GNN models on challenging, realistic problems. Unfortunately, such graph datasets are often generated from online, highly privacy-restricted ecosystems, which makes research and development on these datasets hard, if not impossible. This greatly reduces the amount of benchmark graphs available to researchers, causing the field to rely only on a handful of publicly-available datasets. To address this problem, we introduce a novel graph generative model, Computation Graph Transformer (CGT) that learns and reproduces the distribution of real-world graphs in a privacy-controlled way. More specifically, CGT (1) generates effective benchmark graphs on which GNNs show similar task performance as on the source graphs, (2) scales to process large-scale graphs, (3) incorporates off-the-shelf privacy modules to guarantee end-user privacy of the generated graph. Extensive experiments across a vast body of graph generative models show that only our model can successfully generate privacy-controlled, synthetic substitutes of large-scale real-world graphs that can be effectively used to benchmark GNN models.
    FrAug: Frequency Domain Augmentation for Time Series Forecasting. (arXiv:2302.09292v1 [cs.LG])
    Data augmentation (DA) has become a de facto solution to expand training data size for deep learning. With the proliferation of deep models for time series analysis, various time series DA techniques are proposed in the literature, e.g., cropping-, warping-, flipping-, and mixup-based methods. However, these augmentation methods mainly apply to time series classification and anomaly detection tasks. In time series forecasting (TSF), we need to model the fine-grained temporal relationship within time series segments to generate accurate forecasting results given data in a look-back window. Existing DA solutions in the time domain would break such a relationship, leading to poor forecasting accuracy. To tackle this problem, this paper proposes simple yet effective frequency domain augmentation techniques that ensure the semantic consistency of augmented data-label pairs in forecasting, named FrAug. We conduct extensive experiments on eight widely-used benchmarks with several state-of-the-art TSF deep models. Our results show that FrAug can boost the forecasting accuracy of TSF models in most cases. Moreover, we show that FrAug enables models trained with 1\% of the original training data to achieve similar performance to the ones trained on full training data, which is particularly attractive for cold-start forecasting. Finally, we show that applying test-time training with FrAug greatly improves forecasting accuracy for time series with significant distribution shifts, which often occurs in real-life TSF applications. Our code is available at https://anonymous.4open.science/r/Fraug-more-results-1785.
    Satisficing Paths and Independent Multi-Agent Reinforcement Learning in Stochastic Games. (arXiv:2110.04638v4 [cs.GT] UPDATED)
    In multi-agent reinforcement learning (MARL), independent learners are those that do not observe the actions of other agents in the system. Due to the decentralization of information, it is challenging to design independent learners that drive play to equilibrium. This paper investigates the feasibility of using satisficing dynamics to guide independent learners to approximate equilibrium in stochastic games. For $\epsilon \geq 0$, an $\epsilon$-satisficing policy update rule is any rule that instructs the agent to not change its policy when it is $\epsilon$-best-responding to the policies of the remaining players; $\epsilon$-satisficing paths are defined to be sequences of joint policies obtained when each agent uses some $\epsilon$-satisficing policy update rule to select its next policy. We establish structural results on the existence of $\epsilon$-satisficing paths into $\epsilon$-equilibrium in both symmetric $N$-player games and general stochastic games with two players. We then present an independent learning algorithm for $N$-player symmetric games and give high probability guarantees of convergence to $\epsilon$-equilibrium under self-play. This guarantee is made using symmetry alone, leveraging the previously unexploited structure of $\epsilon$-satisficing paths.
    Learning Hyper Label Model for Programmatic Weak Supervision. (arXiv:2207.13545v3 [cs.LG] UPDATED)
    To reduce the human annotation efforts, the programmatic weak supervision (PWS) paradigm abstracts weak supervision sources as labeling functions (LFs) and involves a label model to aggregate the output of multiple LFs to produce training labels. Most existing label models require a parameter learning step for each dataset. In this work, we present a hyper label model that (once learned) infers the ground-truth labels for each dataset in a single forward pass without dataset-specific parameter learning. The hyper label model approximates an optimal analytical (yet computationally intractable) solution of the ground-truth labels. We train the model on synthetic data generated in the way that ensures the model approximates the analytical optimal solution, and build the model upon Graph Neural Network (GNN) to ensure the model prediction being invariant (or equivariant) to the permutation of LFs (or data points). On 14 real-world datasets, our hyper label model outperforms the best existing methods in both accuracy (by 1.4 points on average) and efficiency (by six times on average). Our code is available at https://github.com/wurenzhi/hyper_label_model
    Contrastive Trajectory Similarity Learning with Dual-Feature Attention. (arXiv:2210.05155v3 [cs.DB] UPDATED)
    Trajectory similarity measures act as query predicates in trajectory databases, making them the key player in determining the query results. They also have a heavy impact on the query efficiency. An ideal measure should have the capability to accurately evaluate the similarity between any two trajectories in a very short amount of time. Towards this aim, we propose a contrastive learning-based trajectory modeling method named TrajCL. We present four trajectory augmentation methods and a novel dual-feature self-attention-based trajectory backbone encoder. The resultant model can jointly learn both the spatial and the structural patterns of trajectories. Our model does not involve any recurrent structures and thus has a high efficiency. Besides, our pre-trained backbone encoder can be fine-tuned towards other computationally expensive measures with minimal supervision data. Experimental results show that TrajCL is consistently and significantly more accurate than the state-of-the-art trajectory similarity measures. After fine-tuning, i.e., to serve as an estimator for heuristic measures, TrajCL can even outperform the state-of-the-art supervised method by up to 56% in the accuracy for processing trajectory similarity queries.
    Asynchronous Distributed Bilevel Optimization. (arXiv:2212.10048v2 [cs.LG] UPDATED)
    Bilevel optimization plays an essential role in many machine learning tasks, ranging from hyperparameter optimization to meta-learning. Existing studies on bilevel optimization, however, focus on either centralized or synchronous distributed setting. The centralized bilevel optimization approaches require collecting massive amount of data to a single server, which inevitably incur significant communication expenses and may give rise to data privacy risks. Synchronous distributed bilevel optimization algorithms, on the other hand, often face the straggler problem and will immediately stop working if a few workers fail to respond. As a remedy, we propose Asynchronous Distributed Bilevel Optimization (ADBO) algorithm. The proposed ADBO can tackle bilevel optimization problems with both nonconvex upper-level and lower-level objective functions, and its convergence is theoretically guaranteed. Furthermore, it is revealed through theoretic analysis that the iteration complexity of ADBO to obtain the $\epsilon$-stationary point is upper bounded by $\mathcal{O}(\frac{1}{{{\epsilon ^2}}})$. Thorough empirical studies on public datasets have been conducted to elucidate the effectiveness and efficiency of the proposed ADBO.
    Why Is Public Pretraining Necessary for Private Model Training?. (arXiv:2302.09483v1 [cs.LG])
    In the privacy-utility tradeoff of a model trained on benchmark language and vision tasks, remarkable improvements have been widely reported with the use of pretraining on publicly available data. This is in part due to the benefits of transfer learning, which is the standard motivation for pretraining in non-private settings. However, the stark contrast in the improvement achieved through pretraining under privacy compared to non-private settings suggests that there may be a deeper, distinct cause driving these gains. To explain this phenomenon, we hypothesize that the non-convex loss landscape of a model training necessitates an optimization algorithm to go through two phases. In the first, the algorithm needs to select a good "basin" in the loss landscape. In the second, the algorithm solves an easy optimization within that basin. The former is a harder problem to solve with private data, while the latter is harder to solve with public data due to a distribution shift or data scarcity. Guided by this intuition, we provide theoretical constructions that provably demonstrate the separation between private training with and without public pretraining. Further, systematic experiments on CIFAR10 and LibriSpeech provide supporting evidence for our hypothesis.
    Auto.gov: Learning-based On-chain Governance for Decentralized Finance (DeFi). (arXiv:2302.09551v1 [q-fin.RM])
    Decentralized finance (DeFi) has seen a tremendous increase in interest in the past years with many types of protocols, such as lending protocols or automated market-makers (AMMs) These protocols are typically controlled using off-chain governance, where token holders can vote to modify different parameters of the protocol. Up till now, however, choosing these parameters has been a manual process, typically done by the core team behind the protocol. In this work, we model a DeFi environment and propose a semi-automatic parameter adjustment approach with deep Q-network (DQN) reinforcement learning. Our system automatically generates intuitive governance proposals to adjust these parameters with data-driven justifications. Our evaluation results demonstrate that a learning-based on-chain governance procedure is more reactive, objective, and efficient than the existing manual approach.
    Online Graph Topology Learning from Matrix-valued Time Series. (arXiv:2107.08020v2 [stat.ML] UPDATED)
    This paper is concerned with the statistical analysis of matrix-valued time series. These are data collected over a network of sensors (typically a set of spatial locations) along time, where a vector of features is observed per time instant per sensor. Thus each sensor is characterized by a vectorial time series. We would like to identify the dependency structure among these sensors and represent it by a graph. When there is only one feature per sensor, the vector auto-regressive models have been widely adapted to infer the structure of Granger causality. The resulting graph is referred to as causal graph. Our first contribution is then extending VAR models to matrix-variate models to serve the purpose of graph learning. Secondly, we propose two online procedures respectively in low and high dimensions, which can update quickly the estimates of coefficients when new samples arrive. In particular in high dimensional regime, a novel Lasso-type is introduced and we develop its homotopy algorithms for the online learning. We also provide an adaptive tuning procedure for the regularization parameter. Lastly, we consider that, the application of AR models onto data usually requires detrending the raw data, however, this step is forbidden in online context. Therefore, we augment the proposed AR models by incorporating trend as extra parameter, and then adapt the online algorithms to the augmented data models, which allow us to simultaneously learn the graph and trend from streaming samples. In this work, we consider primarily the periodic trend. Numerical experiments using both synthetic and real data are performed, whose results support the effectiveness of the proposed methods.
    Adversarial Machine Learning: A Systematic Survey of Backdoor Attack, Weight Attack and Adversarial Example. (arXiv:2302.09457v1 [cs.LG])
    Adversarial machine learning (AML) studies the adversarial phenomenon of machine learning, which may make inconsistent or unexpected predictions with humans. Some paradigms have been recently developed to explore this adversarial phenomenon occurring at different stages of a machine learning system, such as training-time adversarial attack (i.e., backdoor attack), deployment-time adversarial attack (i.e., weight attack), and inference-time adversarial attack (i.e., adversarial example). However, although these paradigms share a common goal, their developments are almost independent, and there is still no big picture of AML. In this work, we aim to provide a unified perspective to the AML community to systematically review the overall progress of this field. We firstly provide a general definition about AML, and then propose a unified mathematical framework to covering existing attack paradigms. According to the proposed unified framework, we can not only clearly figure out the connections and differences among these paradigms, but also systematically categorize and review existing works in each paradigm.
    Video-Text Retrieval by Supervised Multi-Space Multi-Grained Alignment. (arXiv:2302.09473v1 [cs.CV])
    While recent progress in video-text retrieval has been advanced by the exploration of better representation learning, in this paper, we present a novel multi-space multi-grained supervised learning framework, SUMA, to learn an aligned representation space shared between the video and the text for video-text retrieval. The shared aligned space is initialized with a finite number of concept clusters, each of which refers to a number of basic concepts (words). With the text data at hand, we are able to update the shared aligned space in a supervised manner using the proposed similarity and alignment losses. Moreover, to enable multi-grained alignment, we incorporate frame representations for better modeling the video modality and calculating fine-grained and coarse-grained similarity. Benefiting from learned shared aligned space and multi-grained similarity, extensive experiments on several video-text retrieval benchmarks demonstrate the superiority of SUMA over existing methods.
    Fairly Predicting Graft Failure in Liver Transplant for Organ Assigning. (arXiv:2302.09400v1 [cs.AI])
    Liver transplant is an essential therapy performed for severe liver diseases. The fact of scarce liver resources makes the organ assigning crucial. Model for End-stage Liver Disease (MELD) score is a widely adopted criterion when making organ distribution decisions. However, it ignores post-transplant outcomes and organ/donor features. These limitations motivate the emergence of machine learning (ML) models. Unfortunately, ML models could be unfair and trigger bias against certain groups of people. To tackle this problem, this work proposes a fair machine learning framework targeting graft failure prediction in liver transplant. Specifically, knowledge distillation is employed to handle dense and sparse features by combining the advantages of tree models and neural networks. A two-step debiasing method is tailored for this framework to enhance fairness. Experiments are conducted to analyze unfairness issues in existing models and demonstrate the superiority of our method in both prediction and fairness performance.
    A Cubic Regularization Approach for Finding Local Minimax Points in Nonconvex Minimax Optimization. (arXiv:2110.07098v5 [math.OC] UPDATED)
    Gradient descent-ascent (GDA) is a widely used algorithm for minimax optimization. However, GDA has been proved to converge to stationary points for nonconvex minimax optimization, which are suboptimal compared with local minimax points. In this work, we develop cubic regularization (CR) type algorithms that globally converge to local minimax points in nonconvex-strongly-concave minimax optimization. We first show that local minimax points are equivalent to second-order stationary points of a certain envelope function. Then, inspired by the classic cubic regularization algorithm, we propose an algorithm named Cubic-LocalMinimax for finding local minimax points, and provide a comprehensive convergence analysis by leveraging its intrinsic potential function. Specifically, we establish the global convergence of Cubic-LocalMinimax to a local minimax point at a sublinear convergence rate and characterize its iteration complexity. Also, we propose a GDA-based solver for solving the cubic subproblem involved in Cubic-LocalMinimax up to certain pre-defined accuracy, and analyze the overall gradient and Hessian-vector product computation complexities of such an inexact Cubic-LocalMinimax algorithm. Moreover, we propose a stochastic variant of Cubic-LocalMinimax for large-scale minimax optimization, and characterize its sample complexity under stochastic sub-sampling. Experimental results demonstrate faster convergence of our stochastic Cubic-LocalMinimax than some existing algorithms.
    Learning Good State and Action Representations via Tensor Decomposition. (arXiv:2105.01136v2 [stat.ML] UPDATED)
    The transition kernel of a continuous-state-action Markov decision process (MDP) admits a natural tensor structure. This paper proposes a tensor-inspired unsupervised learning method to identify meaningful low-dimensional state and action representations from empirical trajectories. The method exploits the MDP's tensor structure by kernelization, importance sampling and low-Tucker-rank approximation. This method can be further used to cluster states and actions respectively and find the best discrete MDP abstraction. We provide sharp statistical error bounds for tensor concentration and the preservation of diffusion distance after embedding. We further prove that the learned state/action abstractions provide accurate approximations to latent block structures if they exist, enabling function approximation in downstream tasks such as policy evaluation.
    Fast and Precise: Adjusting Planning Horizon with Adaptive Subgoal Search. (arXiv:2206.00702v5 [cs.AI] UPDATED)
    Complex reasoning problems contain states that vary in the computational cost required to determine a good action plan. Taking advantage of this property, we propose Adaptive Subgoal Search (AdaSubS), a search method that adaptively adjusts the planning horizon. To this end, AdaSubS generates diverse sets of subgoals at different distances. A verification mechanism is employed to filter out unreachable subgoals swiftly, allowing to focus on feasible further subgoals. In this way, AdaSubS benefits from the efficiency of planning with longer subgoals and the fine control with the shorter ones, and thus scales well to difficult planning problems. We show that AdaSubS significantly surpasses hierarchical planning algorithms on three complex reasoning tasks: Sokoban, the Rubik's Cube, and inequality proving benchmark INT.
    Reverse Differentiation via Predictive Coding. (arXiv:2103.04689v4 [cs.LG] UPDATED)
    Deep learning has redefined the field of artificial intelligence (AI) thanks to the rise of artificial neural networks, which are architectures inspired by their neurological counterpart in the brain. Through the years, this dualism between AI and neuroscience has brought immense benefits to both fields, allowing neural networks to be used in dozens of applications. These networks use an efficient implementation of reverse differentiation, called backpropagation (BP). This algorithm, however, is often criticized for its biological implausibility (e.g., lack of local update rules for the parameters). Therefore, biologically plausible learning methods that rely on predictive coding (PC), a framework for describing information processing in the brain, are increasingly studied. Recent works prove that these methods can approximate BP up to a certain margin on multilayer perceptrons (MLPs), and asymptotically on any other complex model, and that zero-divergence inference learning (Z-IL), a variant of PC, is able to exactly implement BP on MLPs. However, the recent literature shows also that there is no biologically plausible method yet that can exactly replicate the weight update of BP on complex models. To fill this gap, in this paper, we generalize (PC and) Z-IL by directly defining them on computational graphs, and show that it can perform exact reverse differentiation. What results is the first biologically plausible algorithm that is equivalent to BP in the way of updating parameters on any neural network, providing a bridge between the interdisciplinary research of neuroscience and deep learning.
    TorchFL: A Performant Library for Bootstrapping Federated Learning Experiments. (arXiv:2211.00735v2 [cs.LG] UPDATED)
    With the increased legislation around data privacy, federated learning (FL) has emerged as a promising technique that allows the clients (end-user) to collaboratively train deep learning (DL) models without transferring and storing the data in a centralized, third-party server. We introduce TorchFL, a performant library for (i) bootstrapping the FL experiments, (ii) executing them using various hardware accelerators, (iii) profiling the performance, and (iv) logging the overall and agent-specific results on the go. Being built on a bottom-up design using PyTorch and Lightning, TorchFL provides ready-to-use abstractions for models, datasets, and FL algorithms, while allowing the developers to customize them as and when required. This paper aims to dig deeper into the architecture and design of TorchFL, elaborate on how it allows researchers to bootstrap the federated learning experience, and provide experiments and code snippets for the same. With the ready-to-use implementation of state-of-the-art DL models, datasets, and federated learning support, TorchFL aims to allow researchers with little to no engineering background to set up FL experiments with minimal coding and infrastructure overhead.
    Leveraging Causal Graphs for Blocking in Randomized Experiments. (arXiv:2111.02306v2 [stat.ME] UPDATED)
    Randomized experiments are often performed to study the causal effects of interest. Blocking is a technique to precisely estimate the causal effects when the experimental material is not homogeneous. It involves stratifying the available experimental material based on the covariates causing non-homogeneity and then randomizing the treatment within those strata (known as blocks). This eliminates the unwanted effect of the covariates on the causal effects of interest. We investigate the problem of finding a stable set of covariates to be used to form blocks, that minimizes the variance of the causal effect estimates. Using the underlying causal graph, we provide an efficient algorithm to obtain such a set for a general semi-Markovian causal model.
    Large-Scale Representation Learning on Graphs via Bootstrapping. (arXiv:2102.06514v3 [cs.LG] UPDATED)
    Self-supervised learning provides a promising path towards eliminating the need for costly label information in representation learning on graphs. However, to achieve state-of-the-art performance, methods often need large numbers of negative examples and rely on complex augmentations. This can be prohibitively expensive, especially for large graphs. To address these challenges, we introduce Bootstrapped Graph Latents (BGRL) - a graph representation learning method that learns by predicting alternative augmentations of the input. BGRL uses only simple augmentations and alleviates the need for contrasting with negative examples, and is thus scalable by design. BGRL outperforms or matches prior methods on several established benchmarks, while achieving a 2-10x reduction in memory costs. Furthermore, we show that BGRL can be scaled up to extremely large graphs with hundreds of millions of nodes in the semi-supervised regime - achieving state-of-the-art performance and improving over supervised baselines where representations are shaped only through label information. In particular, our solution centered on BGRL constituted one of the winning entries to the Open Graph Benchmark - Large Scale Challenge at KDD Cup 2021, on a graph orders of magnitudes larger than all previously available benchmarks, thus demonstrating the scalability and effectiveness of our approach.
    Mixed Semi-Supervised Generalized-Linear-Regression with applications to Deep learning. (arXiv:2302.09526v1 [stat.ME])
    We present a methodology for using unlabeled data to design semi supervised learning (SSL) methods that improve the prediction performance of supervised learning for regression tasks. The main idea is to design different mechanisms for integrating the unlabeled data, and include in each of them a mixing parameter $\alpha$, controlling the weight given to the unlabeled data. Focusing on Generalized-Linear-Models (GLM), we analyze the characteristics of different mixing mechanisms, and prove that in all cases, it is inevitably beneficial to integrate the unlabeled data with some non-zero mixing ratio $\alpha>0$, in terms of predictive performance. Moreover, we provide a rigorous framework for estimating the best mixing ratio $\alpha^*$ where mixed-SSL delivers the best predictive performance, while using the labeled and the unlabeled data on hand. The effectiveness of our methodology in delivering substantial improvement compared to the standard supervised models, under a variety of settings, is demonstrated empirically through extensive simulation, in a manner that supports the theoretical analysis. We also demonstrate the applicability of our methodology (with some intuitive modifications) in improving more complex models such as deep neural networks, in a real-world regression tasks.
    Generative Ornstein-Uhlenbeck Markets via Geometric Deep Learning. (arXiv:2302.09176v1 [q-fin.CP])
    We consider the problem of simultaneously approximating the conditional distribution of market prices and their log returns with a single machine learning model. We show that an instance of the GDN model of Kratsios and Papon (2022) solves this problem without having prior assumptions on the market's "clipped" log returns, other than that they follow a generalized Ornstein-Uhlenbeck process with a priori unknown dynamics. We provide universal approximation guarantees for these conditional distributions and contingent claims with a Lipschitz payoff function.
    Exploration and Incentives in Reinforcement Learning. (arXiv:2103.00360v5 [cs.LG] UPDATED)
    How do you incentivize self-interested agents to $\textit{explore}$ when they prefer to $\textit{exploit}$? We consider complex exploration problems, where each agent faces the same (but unknown) MDP. In contrast with traditional formulations of reinforcement learning, agents control the choice of policies, whereas an algorithm can only issue recommendations. However, the algorithm controls the flow of information, and can incentivize the agents to explore via information asymmetry. We design an algorithm which explores all reachable states in the MDP. We achieve provable guarantees similar to those for incentivizing exploration in static, stateless exploration problems studied previously. To the best of our knowledge, this is the first work to consider mechanism design in a stateful, reinforcement learning setting.
    A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT. (arXiv:2302.09419v1 [cs.AI])
    The Pretrained Foundation Models (PFMs) are regarded as the foundation for various downstream tasks with different data modalities. A pretrained foundation model, such as BERT, GPT-3, MAE, DALLE-E, and ChatGPT, is trained on large-scale data which provides a reasonable parameter initialization for a wide range of downstream applications. The idea of pretraining behind PFMs plays an important role in the application of large models. Different from previous methods that apply convolution and recurrent modules for feature extractions, the generative pre-training (GPT) method applies Transformer as the feature extractor and is trained on large datasets with an autoregressive paradigm. Similarly, the BERT apples transformers to train on large datasets as a contextual language model. Recently, the ChatGPT shows promising success on large language models, which applies an autoregressive language model with zero shot or few show prompting. With the extraordinary success of PFMs, AI has made waves in a variety of fields over the past few years. Considerable methods, datasets, and evaluation metrics have been proposed in the literature, the need is raising for an updated survey. This study provides a comprehensive review of recent research advancements, current and future challenges, and opportunities for PFMs in text, image, graph, as well as other data modalities. We first review the basic components and existing pretraining in natural language processing, computer vision, and graph learning. We then discuss other advanced PFMs for other data modalities and unified PFMs considering the data quality and quantity. Besides, we discuss relevant research about the fundamentals of the PFM, including model efficiency and compression, security, and privacy. Finally, we lay out key implications, future research directions, challenges, and open problems.
    Exploring the Representation Manifolds of Stable Diffusion Through the Lens of Intrinsic Dimension. (arXiv:2302.09301v1 [cs.CL])
    Prompting has become an important mechanism by which users can more effectively interact with many flavors of foundation model. Indeed, the last several years have shown that well-honed prompts can sometimes unlock emergent capabilities within such models. While there has been a substantial amount of empirical exploration of prompting within the community, relatively few works have studied prompting at a mathematical level. In this work we aim to take a first step towards understanding basic geometric properties induced by prompts in Stable Diffusion, focusing on the intrinsic dimension of internal representations within the model. We find that choice of prompt has a substantial impact on the intrinsic dimension of representations at both layers of the model which we explored, but that the nature of this impact depends on the layer being considered. For example, in certain bottleneck layers of the model, intrinsic dimension of representations is correlated with prompt perplexity (measured using a surrogate model), while this correlation is not apparent in the latent layers. Our evidence suggests that intrinsic dimension could be a useful tool for future studies of the impact of different prompts on text-to-image models.
    Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus. (arXiv:2209.14927v3 [cs.CV] UPDATED)
    Mobile UI understanding is important for enabling various interaction tasks such as UI automation and accessibility. Previous mobile UI modeling often depends on the view hierarchy information of a screen, which directly provides the structural data of the UI, with the hope to bypass challenging tasks of visual modeling from screen pixels. However, view hierarchies are not always available, and are often corrupted with missing object descriptions or misaligned structure information. As a result, despite the use of view hierarchies could offer short-term gains, it may ultimately hinder the applicability and performance of the model. In this paper, we propose Spotlight, a vision-only approach for mobile UI understanding. Specifically, we enhance a vision-language model that only takes the screenshot of the UI and a region of interest on the screen -- the focus -- as the input. This general architecture of Spotlight is easily scalable and capable of performing a range of UI modeling tasks. Our experiments show that our model establishes SoTA results on several representative UI tasks and outperforms previous methods that use both screenshots and view hierarchies as inputs. Furthermore, we explore multi-task learning and few-shot prompting capacities of the proposed models, demonstrating promising results in the multi-task learning direction.
    Generalization and Stability of Interpolating Neural Networks with Minimal Width. (arXiv:2302.09235v1 [stat.ML])
    We investigate the generalization and optimization of $k$-homogeneous shallow neural-network classifiers in the interpolating regime. The study focuses on analyzing the performance of the model when it is capable of perfectly classifying the input data with a positive margin $\gamma$. When using gradient descent with logistic-loss minimization, we show that the training loss converges to zero at a rate of $\tilde O(1/\gamma^{2/k} T)$ given a polylogarithmic number of neurons. This suggests that gradient descent can find a perfect classifier for $n$ input data within $\tilde{\Omega}(n)$ iterations. Additionally, through a stability analysis we show that with $m=\Omega(\log^{4/k} (n))$ neurons and $T=\Omega(n)$ iterations, the test loss is bounded by $\tilde{O}(1/\gamma^{2/k} n)$. This is in contrast to existing stability results which require polynomial width and yield suboptimal generalization rates. Central to our analysis is the use of a new self-bounded weak convexity property, which leads to a generalized local quasi-convexity property for sufficiently parameterized neural-network classifiers. Eventually, despite the objective's non-convexity, this leads to convergence and generalization-gap bounds that are similar to those in the convex setting of linear logistic regression.
    Online Instrumental Variable Regression: Regret Analysis and Bandit Feedback. (arXiv:2302.09357v1 [cs.LG])
    The independence of noise and covariates is a standard assumption in online linear regression and linear bandit literature. This assumption and the following analysis are invalid in the case of endogeneity, i.e., when the noise and covariates are correlated. In this paper, we study the online setting of instrumental variable (IV) regression, which is widely used in economics to tackle endogeneity. Specifically, we analyse and upper bound regret of Two-Stage Least Squares (2SLS) approach to IV regression in the online setting. Our analysis shows that Online 2SLS (O2SLS) achieves $O(d^2 \log^2 T)$ regret after $T$ interactions, where d is the dimension of covariates. Following that, we leverage the O2SLS as an oracle to design OFUL-IV, a linear bandit algorithm. OFUL-IV can tackle endogeneity and achieves $O(d \sqrt{T} \log T)$ regret. For datasets with endogeneity, we experimentally demonstrate that O2SLS and OFUL-IV incur lower regrets than the state-of-the-art algorithms for both the online linear regression and linear bandit settings.
    Quasi-Bayesian Nonparametric Density Estimation via Autoregressive Predictive Updates. (arXiv:2206.06462v2 [stat.ML] UPDATED)
    Bayesian methods are a popular choice for statistical inference in small-data regimes due to the regularization effect induced by the prior. In the context of density estimation, the standard nonparametric Bayesian approach is to target the posterior predictive of the Dirichlet process mixture model. In general, direct estimation of the posterior predictive is intractable and so methods typically resort to approximating the posterior distribution as an intermediate step. The recent development of quasi-Bayesian predictive copula updates, however, has made it possible to perform tractable predictive density estimation without the need for posterior approximation. Although these estimators are computationally appealing, they tend to struggle on non-smooth data distributions. This is due to the comparatively restrictive form of the likelihood models from which the proposed copula updates were derived. To address this shortcoming, we consider a Bayesian nonparametric model with an autoregressive likelihood decomposition and a Gaussian process prior. While the predictive update of such a model is typically intractable, we derive a quasi-Bayesian predictive update that achieves state-of-the-art results in small-data regimes.
    Estimating Optimal Policy Value in General Linear Contextual Bandits. (arXiv:2302.09451v1 [cs.LG])
    In many bandit problems, the maximal reward achievable by a policy is often unknown in advance. We consider the problem of estimating the optimal policy value in the sublinear data regime before the optimal policy is even learnable. We refer to this as $V^*$ estimation. It was recently shown that fast $V^*$ estimation is possible but only in disjoint linear bandits with Gaussian covariates. Whether this is possible for more realistic context distributions has remained an open and important question for tasks such as model selection. In this paper, we first provide lower bounds showing that this general problem is hard. However, under stronger assumptions, we give an algorithm and analysis proving that $\widetilde{\mathcal{O}}(\sqrt{d})$ sublinear estimation of $V^*$ is indeed information-theoretically possible, where $d$ is the dimension. We then present a more practical, computationally efficient algorithm that estimates a problem-dependent upper bound on $V^*$ that holds for general distributions and is tight when the context distribution is Gaussian. We prove our algorithm requires only $\widetilde{\mathcal{O}}(\sqrt{d})$ samples to estimate the upper bound. We use this upper bound and the estimator to obtain novel and improved guarantees for several applications in bandit model selection and testing for treatment effects.
    Promoting Cooperation in Multi-Agent Reinforcement Learning via Mutual Help. (arXiv:2302.09277v1 [cs.LG])
    Multi-agent reinforcement learning (MARL) has achieved great progress in cooperative tasks in recent years. However, in the local reward scheme, where only local rewards for each agent are given without global rewards shared by all the agents, traditional MARL algorithms lack sufficient consideration of agents' mutual influence. In cooperative tasks, agents' mutual influence is especially important since agents are supposed to coordinate to achieve better performance. In this paper, we propose a novel algorithm Mutual-Help-based MARL (MH-MARL) to instruct agents to help each other in order to promote cooperation. MH-MARL utilizes an expected action module to generate expected other agents' actions for each particular agent. Then, the expected actions are delivered to other agents for selective imitation during training. Experimental results show that MH-MARL improves the performance of MARL both in success rate and cumulative reward.
    The Generalization Error of Stochastic Mirror Descent on Over-Parametrized Linear Models. (arXiv:2302.09433v1 [cs.LG])
    Despite being highly over-parametrized, and having the ability to fully interpolate the training data, deep networks are known to generalize well to unseen data. It is now understood that part of the reason for this is that the training algorithms used have certain implicit regularization properties that ensure interpolating solutions with "good" properties are found. This is best understood in linear over-parametrized models where it has been shown that the celebrated stochastic gradient descent (SGD) algorithm finds an interpolating solution that is closest in Euclidean distance to the initial weight vector. Different regularizers, replacing Euclidean distance with Bregman divergence, can be obtained if we replace SGD with stochastic mirror descent (SMD). Empirical observations have shown that in the deep network setting, SMD achieves a generalization performance that is different from that of SGD (and which depends on the choice of SMD's potential function. In an attempt to begin to understand this behavior, we obtain the generalization error of SMD for over-parametrized linear models for a binary classification problem where the two classes are drawn from a Gaussian mixture model. We present simulation results that validate the theory and, in particular, introduce two data models, one for which SMD with an $\ell_2$ regularizer (i.e., SGD) outperforms SMD with an $\ell_1$ regularizer, and one for which the reverse happens.
    Deep Neural Networks based Meta-Learning for Network Intrusion Detection. (arXiv:2302.09394v1 [cs.LG])
    Designing an intrusion detection system is difficult as network traffic encompasses various attack types, including new and evolving ones with minor changes. The data used to construct a predictive model has a skewed class distribution and limited representation of attack types, which differ from real network traffic. These limitations result in dataset shift, negatively impacting the machine learning models' predictive abilities and reducing the detection rate against novel attacks. To address the challenge of dataset shift, we introduce the INformation FUsion and Stacking Ensemble (INFUSE) for network intrusion detection. This approach further improves its predictive power by employing a deep neural network-based Meta-Learner on top of INFUSE. First, a hybrid feature space is created by integrating decision and feature spaces. Five different classifiers are utilized to generate a pool of decision spaces. The feature space is then enriched through a deep sparse autoencoder that learns the semantic relationships between attacks. Finally, the deep Meta-Learner acts as an ensemble combiner to analyze the hybrid feature space and make a final decision. Our evaluation on stringent benchmark datasets and comparison to existing techniques showed the effectiveness of INFUSE with an F-Score of 0.91, Accuracy of 91.6%, and Recall of 0.94 on the Test+ dataset, and an F-Score of 0.91, Accuracy of 85.6%, and Recall of 0.87 on the stringent Test-21 dataset. These promising results indicate the proposed technique has strong generalization capability and the potential to detect network attacks.
    Do Bayesian Neural Networks Need To Be Fully Stochastic?. (arXiv:2211.06291v2 [cs.LG] UPDATED)
    We investigate the benefit of treating all the parameters in a Bayesian neural network stochastically and find compelling theoretical and empirical evidence that this standard construction may be unnecessary. To this end, we prove that expressive predictive distributions require only small amounts of stochasticity. In particular, partially stochastic networks with only $n$ stochastic biases are universal probabilistic predictors for $n$-dimensional predictive problems. In empirical investigations, we find no systematic benefit of full stochasticity across four different inference modalities and eight datasets; partially stochastic networks can match and sometimes even outperform fully stochastic networks, despite their reduced memory costs.
    Data Augmentation for Imbalanced Regression. (arXiv:2302.09288v1 [stat.ML])
    In this work, we consider the problem of imbalanced data in a regression framework when the imbalanced phenomenon concerns continuous or discrete covariates. Such a situation can lead to biases in the estimates. In this case, we propose a data augmentation algorithm that combines a weighted resampling (WR) and a data augmentation (DA) procedure. In a first step, the DA procedure permits exploring a wider support than the initial one. In a second step, the WR method drives the exogenous distribution to a target one. We discuss the choice of the DA procedure through a numerical study that illustrates the advantages of this approach. Finally, an actuarial application is studied.
    Improved Robust Algorithms for Learning with Discriminative Feature Feedback. (arXiv:2209.03753v3 [cs.LG] UPDATED)
    Discriminative Feature Feedback is a setting proposed by Dastupta et al. (2018), which provides a protocol for interactive learning based on feature explanations that are provided by a human teacher. The features distinguish between the labels of pairs of possibly similar instances. That work has shown that learning in this model can have considerable statistical and computational advantages over learning in standard label-based interactive learning models. In this work, we provide new robust interactive learning algorithms for the Discriminative Feature Feedback model, with mistake bounds that are significantly lower than those of previous robust algorithms for this setting. In the adversarial setting, we reduce the dependence on the number of protocol exceptions from quadratic to linear. In addition, we provide an algorithm for a slightly more restricted model, which obtains an even smaller mistake bound for large models with many exceptions. In the stochastic setting, we provide the first algorithm that converges to the exception rate with a polynomial sample complexity. Our algorithm and analysis for the stochastic setting involve a new construction that we call Feature Influence, which may be of wider applicability.
    A Genetic Algorithm-based Framework for Learning Statistical Power Manifold. (arXiv:2209.00215v3 [stat.CO] UPDATED)
    Statistical power is a measure of the replicability of a categorical hypothesis test. Formally, it is the probability of detecting an effect, if there is a true effect present in the population. Hence, optimizing statistical power as a function of some parameters of a hypothesis test is desirable. However, for most hypothesis tests, the explicit functional form of statistical power for individual model parameters is unknown; but calculating power for a given set of values of those parameters is possible using simulated experiments. These simulated experiments are usually computationally expensive. Hence, developing the entire statistical power manifold using simulations can be very time-consuming. We propose a novel genetic algorithm-based framework for learning statistical power manifolds. For a multiple linear regression $F$-test, we show that the proposed algorithm/framework learns the statistical power manifold much faster as compared to a brute-force approach as the number of queries to the power oracle is significantly reduced. We also show that the quality of learning the manifold improves as the number of iterations increases for the genetic algorithm. Such tools are useful for evaluating statistical power trade-offs when researchers have little information regarding a priori best guesses of primary effect sizes of interest or how sampling variability in non-primary effects impacts power for primary ones.
    OMINACS: Online ML-Based IoT Network Attack Detection and Classification System. (arXiv:2302.09225v1 [cs.NI])
    Several Machine Learning (ML) methodologies have been proposed to improve security in Internet Of Things (IoT) networks and reduce the damage caused by the action of malicious agents. However, detecting and classifying attacks with high accuracy and precision is still a major challenge. This paper proposes an online attack detection and network traffic classification system, which combines stream Machine Learning, Deep Learning, and Ensemble Learning technique. Using multiple stages of data analysis, the system can detect the presence of malicious traffic flows and classify them according to the type of attack they represent. Furthermore, we show how to implement this system both in an IoT network and from an ML point of view. The system was evaluated in three IoT network security datasets, in which it obtained accuracy and precision above 90% with a reduced false alarm rate.
    Stochastic Generative Flow Networks. (arXiv:2302.09465v1 [cs.LG])
    Generative Flow Networks (or GFlowNets for short) are a family of probabilistic agents that learn to sample complex combinatorial structures through the lens of "inference as control". They have shown great potential in generating high-quality and diverse candidates from a given energy landscape. However, existing GFlowNets can be applied only to deterministic environments, and fail in more general tasks with stochastic dynamics, which can limit their applicability. To overcome this challenge, this paper introduces Stochastic GFlowNets, a new algorithm that extends GFlowNets to stochastic environments. By decomposing state transitions into two steps, Stochastic GFlowNets isolate environmental stochasticity and learn a dynamics model to capture it. Extensive experimental results demonstrate that Stochastic GFlowNets offer significant advantages over standard GFlowNets as well as MCMC- and RL-based approaches, on a variety of standard benchmarks with stochastic dynamics.
    Designing Equitable Algorithms. (arXiv:2302.09157v1 [cs.LG])
    Predictive algorithms are now used to help distribute a large share of our society's resources and sanctions, such as healthcare, loans, criminal detentions, and tax audits. Under the right circumstances, these algorithms can improve the efficiency and equity of decision-making. At the same time, there is a danger that the algorithms themselves could entrench and exacerbate disparities, particularly along racial, ethnic, and gender lines. To help ensure their fairness, many researchers suggest that algorithms be subject to at least one of three constraints: (1) no use of legally protected features, such as race, ethnicity, and gender; (2) equal rates of "positive" decisions across groups; and (3) equal error rates across groups. Here we show that these constraints, while intuitively appealing, often worsen outcomes for individuals in marginalized groups, and can even leave all groups worse off. The inherent trade-off we identify between formal fairness constraints and welfare improvements -- particularly for the marginalized -- highlights the need for a more robust discussion on what it means for an algorithm to be "fair". We illustrate these ideas with examples from healthcare and the criminal-legal system, and make several proposals to help practitioners design more equitable algorithms.
    Probabilistic Back-ends for Online Speaker Recognition and Clustering. (arXiv:2302.09523v1 [eess.AS])
    This paper focuses on multi-enrollment speaker recognition which naturally occurs in the task of online speaker clustering, and studies the properties of different scoring back-ends in this scenario. First, we show that popular cosine scoring suffers from poor score calibration with a varying number of enrollment utterances. Second, we propose a simple replacement for cosine scoring based on an extremely constrained version of probabilistic linear discriminant analysis (PLDA). The proposed model improves over the cosine scoring for multi-enrollment recognition while keeping the same performance in the case of one-to-one comparisons. Finally, we consider an online speaker clustering task where each step naturally involves multi-enrollment recognition. We propose an online clustering algorithm allowing us to take benefits from the PLDA model such as the ability to handle uncertainty and better score calibration. Our experiments demonstrate the effectiveness of the proposed algorithm.
    Fast Kernel Methods for Generic Lipschitz Losses via \texorpdfstring{$p$}{p}-Sparsified Sketches. (arXiv:2206.03827v3 [stat.ML] UPDATED)
    Kernel methods are learning algorithms that enjoy solid theoretical foundations while suffering from important computational limitations. Sketching, which consists in looking for solutions among a subspace of reduced dimension, is a well studied approach to alleviate these computational burdens. However, statistically-accurate sketches, such as the Gaussian one, usually contain few null entries, such that their application to kernel methods and their non-sparse Gram matrices remains slow in practice. In this paper, we show that sparsified Gaussian (and Rademacher) sketches still produce theoretically-valid approximations while allowing for important time and space savings thanks to an efficient \emph{decomposition trick}. To support our method, we derive excess risk bounds for both single and multiple output kernel problems, with generic Lipschitz losses, hereby providing new guarantees for a wide range of applications, from robust regression to multiple quantile regression. Our theoretical results are complemented with experiments showing the empirical superiority of our approach over SOTA sketching methods.
    Likelihood-Free Inference in State-Space Models with Unknown Dynamics. (arXiv:2111.01555v2 [cs.LG] UPDATED)
    Likelihood-free inference (LFI) has been successfully applied to state-space models, where the likelihood of observations is not available but synthetic observations generated by a black-box simulator can be used for inference instead. However, much of the research up to now have been restricted to cases, in which a model of state transition dynamics can be formulated in advance and the simulation budget is unrestricted. These methods fail to address the problem of state inference when simulations are computationally expensive and the Markovian state transition dynamics are undefined. The approach proposed in this manuscript enables LFI of states with a limited number of simulations by estimating the transition dynamics, and using state predictions as proposals for simulations. In the experiments with non-stationary user models, the proposed method demonstrates significant improvement in accuracy for both state inference and prediction, where a multi-output Gaussian process is used for LFI of states, and a Bayesian Neural Network as a surrogate model of transition dynamics.
    Transformer-Based Neural Marked Spatio Temporal Point Process Model for Football Match Events Analysis. (arXiv:2302.09276v1 [cs.AI])
    With recently available football match event data that record the details of football matches, analysts and researchers have a great opportunity to develop new performance metrics, gain insight, and evaluate key performance. However, most sports sequential events modeling methods and performance metrics approaches could be incomprehensive in dealing with such large-scale spatiotemporal data (in particular, temporal process), thereby necessitating a more comprehensive spatiotemporal model and a holistic performance metric. To this end, we proposed the Transformer-Based Neural Marked Spatio Temporal Point Process (NMSTPP) model for football event data based on the neural temporal point processes (NTPP) framework. In the experiments, our model outperformed the prediction performance of the baseline models. Furthermore, we proposed the holistic possession utilization score (HPUS) metric for a more comprehensive football possession analysis. For verification, we examined the relationship with football teams' final ranking, average goal score, and average xG over a season. It was observed that the average HPUS showed significant correlations regardless of not using goal and details of shot information. Furthermore, we show HPUS examples in analyzing possessions, matches, and between matches.
    Rapid Design of Top-Performing Metal-Organic Frameworks with Qualitative Representations of Building Blocks. (arXiv:2302.09184v1 [cond-mat.mtrl-sci])
    Data-driven materials design often encounters challenges where systems require or possess qualitative (categorical) information. Metal-organic frameworks (MOFs) are an example of such material systems. The representation of MOFs through different building blocks makes it a challenge for designers to incorporate qualitative information into design optimization. Furthermore, the large number of potential building blocks leads to a combinatorial challenge, with millions of possible MOFs that could be explored through time consuming physics-based approaches. In this work, we integrated Latent Variable Gaussian Process (LVGP) and Multi-Objective Batch-Bayesian Optimization (MOBBO) to identify top-performing MOFs adaptively, autonomously, and efficiently without any human intervention. Our approach provides three main advantages: (i) no specific physical descriptors are required and only building blocks that construct the MOFs are used in global optimization through qualitative representations, (ii) the method is application and property independent, and (iii) the latent variable approach provides an interpretable model of qualitative building blocks with physical justification. To demonstrate the effectiveness of our method, we considered a design space with more than 47,000 MOF candidates. By searching only ~1% of the design space, LVGP-MOBBO was able to identify all MOFs on the Pareto front and more than 97% of the 50 top-performing designs for the CO$_2$ working capacity and CO$_2$/N$_2$ selectivity properties. Finally, we compared our approach with the Random Forest algorithm and demonstrated its efficiency, interpretability, and robustness.
    Real-time Neural-MPC: Deep Learning Model Predictive Control for Quadrotors and Agile Robotic Platforms. (arXiv:2203.07747v3 [cs.RO] UPDATED)
    Model Predictive Control (MPC) has become a popular framework in embedded control for high-performance autonomous systems. However, to achieve good control performance using MPC, an accurate dynamics model is key. To maintain real-time operation, the dynamics models used on embedded systems have been limited to simple first-principle models, which substantially limits their representative power. In contrast to such simple models, machine learning approaches, specifically neural networks, have been shown to accurately model even complex dynamic effects, but their large computational complexity hindered combination with fast real-time iteration loops. With this work, we present Real-time Neural MPC, a framework to efficiently integrate large, complex neural network architectures as dynamics models within a model-predictive control pipeline. Our experiments, performed in simulation and the real world onboard a highly agile quadrotor platform, demonstrate the capabilities of the described system to run learned models with, previously infeasible, large modeling capacity using gradient-based online optimization MPC. Compared to prior implementations of neural networks in online optimization MPC we can leverage models of over 4000 times larger parametric capacity in a 50Hz real-time window on an embedded platform. Further, we show the feasibility of our framework on real-world problems by reducing the positional tracking error by up to 82% when compared to state-of-the-art MPC approaches without neural network dynamics.  ( 2 min )
    JANA: Jointly Amortized Neural Approximation of Complex Bayesian Models. (arXiv:2302.09125v1 [cs.LG])
    This work proposes ''jointly amortized neural approximation'' (JANA) of intractable likelihood functions and posterior densities arising in Bayesian surrogate modeling and simulation-based inference. We train three complementary networks in an end-to-end fashion: 1) a summary network to compress individual data points, sets, or time series into informative embedding vectors; 2) a posterior network to learn an amortized approximate posterior; and 3) a likelihood network to learn an amortized approximate likelihood. Their interaction opens a new route to amortized marginal likelihood and posterior predictive estimation -- two important ingredients of Bayesian workflows that are often too expensive for standard methods. We benchmark the fidelity of JANA on a variety of simulation models against state-of-the-art Bayesian methods and propose a powerful and interpretable diagnostic for joint calibration. In addition, we investigate the ability of recurrent likelihood networks to emulate complex time series models without resorting to hand-crafted summary statistics.  ( 2 min )
    A Statistical Analysis of Polyak-Ruppert Averaged Q-learning. (arXiv:2112.14582v4 [stat.ML] UPDATED)
    We study Q-learning with Polyak-Ruppert averaging in a discounted Markov decision process in synchronous and tabular settings. Under a Lipschitz condition, we establish a functional central limit theorem for the averaged iteration $\bar{\boldsymbol{Q}}_T$ and show that its standardized partial-sum process converges weakly to a rescaled Brownian motion. The functional central limit theorem implies a fully online inference method for reinforcement learning. Furthermore, we show that $\bar{\boldsymbol{Q}}_T$ is the regular asymptotically linear (RAL) estimator for the optimal Q-value function $\boldsymbol{Q}^*$ that has the most efficient influence function. We present a nonasymptotic analysis for the $\ell_{\infty}$ error, $\mathbb{E}\|\bar{\boldsymbol{Q}}_T-\boldsymbol{Q}^*\|_{\infty}$, showing that it matches the instance-dependent lower bound for polynomial step sizes. Similar results are provided for entropy-regularized Q-learning without the Lipschitz condition.  ( 2 min )
    Deep Joint Source-Channel Coding with Iterative Source Error Correction. (arXiv:2302.09174v1 [cs.LG])
    In this paper, we propose an iterative source error correction (ISEC) decoding scheme for deep-learning-based joint source-channel coding (Deep JSCC). Given a noisy codeword received through the channel, we use a Deep JSCC encoder and decoder pair to update the codeword iteratively to find a (modified) maximum a-posteriori (MAP) solution. For efficient MAP decoding, we utilize a neural network-based denoiser to approximate the gradient of the log-prior density of the codeword space. Albeit the non-convexity of the optimization problem, our proposed scheme improves various distortion and perceptual quality metrics from the conventional one-shot (non-iterative) Deep JSCC decoding baseline. Furthermore, the proposed scheme produces more reliable source reconstruction results compared to the baseline when the channel noise characteristics do not match the ones used during training.
    A Coupled Design of Exploiting Record Similarity for Practical Vertical Federated Learning. (arXiv:2106.06312v3 [cs.LG] UPDATED)
    Federated learning is a learning paradigm to enable collaborative learning across different parties without revealing raw data. Notably, vertical federated learning (VFL), where parties share the same set of samples but only hold partial features, has a wide range of real-world applications. However, most existing studies in VFL disregard the "record linkage" process. They design algorithms either assuming the data from different parties can be exactly linked or simply linking each record with its most similar neighboring record. These approaches may fail to capture the key features from other less similar records. Moreover, such improper linkage cannot be corrected by training since existing approaches provide no feedback on linkage during training. In this paper, we design a novel coupled training paradigm, FedSim, that integrates one-to-many linkage into the training process. Besides enabling VFL in many real-world applications with fuzzy identifiers, FedSim also achieves better performance in traditional VFL tasks. Moreover, we theoretically analyze the additional privacy risk incurred by sharing similarities. Our experiments on eight datasets with various similarity metrics show that FedSim outperforms other state-of-the-art baselines. The codes of FedSim are available at https://github.com/Xtra-Computing/FedSim.  ( 2 min )
    Bayesian Quantification with Black-Box Estimators. (arXiv:2302.09159v1 [stat.ML])
    Understanding how different classes are distributed in an unlabeled data set is an important challenge for the calibration of probabilistic classifiers and uncertainty quantification. Approaches like adjusted classify and count, black-box shift estimators, and invariant ratio estimators use an auxiliary (and potentially biased) black-box classifier trained on a different (shifted) data set to estimate the class distribution and yield asymptotic guarantees under weak assumptions. We demonstrate that all these algorithms are closely related to the inference in a particular Bayesian model, approximating the assumed ground-truth generative process. Then, we discuss an efficient Markov Chain Monte Carlo sampling scheme for the introduced model and show an asymptotic consistency guarantee in the large-data limit. We compare the introduced model against the established point estimators in a variety of scenarios, and show it is competitive, and in some cases superior, with the state of the art.
    Topological Feature Selection: A Graph-Based Filter Feature Selection Approach. (arXiv:2302.09543v1 [cs.LG])
    In this paper, we introduce a novel unsupervised, graph-based filter feature selection technique which exploits the power of topologically constrained network representations. We model dependency structures among features using a family of chordal graphs (the Triangulated Maximally Filtered Graph), and we maximise the likelihood of features' relevance by studying their relative position inside the network. Such an approach presents three aspects that are particularly satisfactory compared to its alternatives: (i) it is highly tunable and easily adaptable to the nature of input data; (ii) it is fully explainable, maintaining, at the same time, a remarkable level of simplicity; (iii) it is computationally cheaper compared to its alternatives. We test our algorithm on 16 benchmark datasets from different applicative domains showing that it outperforms or matches the current state-of-the-art under heterogeneous evaluation conditions.
    Continuous Mean-Covariance Bandits. (arXiv:2102.12090v4 [cs.LG] UPDATED)
    Existing risk-aware multi-armed bandit models typically focus on risk measures of individual options such as variance. As a result, they cannot be directly applied to important real-world online decision making problems with correlated options. In this paper, we propose a novel Continuous Mean-Covariance Bandit (CMCB) model to explicitly take into account option correlation. Specifically, in CMCB, there is a learner who sequentially chooses weight vectors on given options and observes random feedback according to the decisions. The agent's objective is to achieve the best trade-off between reward and risk, measured with option covariance. To capture different reward observation scenarios in practice, we consider three feedback settings, i.e., full-information, semi-bandit and full-bandit feedback. We propose novel algorithms with optimal regrets (within logarithmic factors), and provide matching lower bounds to validate their optimalities. The experimental results also demonstrate the superiority of our algorithms. To the best of our knowledge, this is the first work that considers option correlation in risk-aware bandits and explicitly quantifies how arbitrary covariance structures impact the learning performance. The novel analytical techniques we developed for exploiting the estimated covariance to build concentration and bounding the risk of selected actions based on sampling strategy properties can likely find applications in other bandit analysis and be of independent interests.
    AIIR-MIX: Multi-Agent Reinforcement Learning Meets Attention Individual Intrinsic Reward Mixing Network. (arXiv:2302.09531v1 [cs.LG])
    Deducing the contribution of each agent and assigning the corresponding reward to them is a crucial problem in cooperative Multi-Agent Reinforcement Learning (MARL). Previous studies try to resolve the issue through designing an intrinsic reward function, but the intrinsic reward is simply combined with the environment reward by summation in these studies, which makes the performance of their MARL framework unsatisfactory. We propose a novel method named Attention Individual Intrinsic Reward Mixing Network (AIIR-MIX) in MARL, and the contributions of AIIR-MIX are listed as follows:(a) we construct a novel intrinsic reward network based on the attention mechanism to make teamwork more effective. (b) we propose a Mixing network that is able to combine intrinsic and extrinsic rewards non-linearly and dynamically in response to changing conditions of the environment. We compare AIIR-MIX with many State-Of-The-Art (SOTA) MARL methods on battle games in StarCraft II. And the results demonstrate that AIIR-MIX performs admirably and can defeat the current advanced methods on average test win rate. To validate the effectiveness of AIIR-MIX, we conduct additional ablation studies. The results show that AIIR-MIX can dynamically assign each agent a real-time intrinsic reward in accordance with their actual contribution.
    TAX: Tendency-and-Assignment Explainer for Semantic Segmentation with Multi-Annotators. (arXiv:2302.09561v1 [cs.CV])
    To understand how deep neural networks perform classification predictions, recent research attention has been focusing on developing techniques to offer desirable explanations. However, most existing methods cannot be easily applied for semantic segmentation; moreover, they are not designed to offer interpretability under the multi-annotator setting. Instead of viewing ground-truth pixel-level labels annotated by a single annotator with consistent labeling tendency, we aim at providing interpretable semantic segmentation and answer two critical yet practical questions: "who" contributes to the resulting segmentation, and "why" such an assignment is determined. In this paper, we present a learning framework of Tendency-and-Assignment Explainer (TAX), designed to offer interpretability at the annotator and assignment levels. More specifically, we learn convolution kernel subsets for modeling labeling tendencies of each type of annotation, while a prototype bank is jointly observed to offer visual guidance for learning the above kernels. For evaluation, we consider both synthetic and real-world datasets with multi-annotators. We show that our TAX can be applied to state-of-the-art network architectures with comparable performances, while segmentation interpretability at both levels can be offered accordingly.
    Mismatched No More: Joint Model-Policy Optimization for Model-Based RL. (arXiv:2110.02758v2 [cs.LG] UPDATED)
    Many model-based reinforcement learning (RL) methods follow a similar template: fit a model to previously observed data, and then use data from that model for RL or planning. However, models that achieve better training performance (e.g., lower MSE) are not necessarily better for control: an RL agent may seek out the small fraction of states where an accurate model makes mistakes, or it might act in ways that do not expose the errors of an inaccurate model. As noted in prior work, there is an objective mismatch: models are useful if they yield good policies, but they are trained to maximize their accuracy, rather than the performance of the policies that result from them. In this work, we propose a single objective for jointly training the model and the policy, such that updates to either component increase a lower bound on expected return. To the best of our knowledge, this is the first lower bound for model-based RL that holds globally and can be efficiently estimated in continuous settings; it is the only lower bound that mends the objective mismatch problem. A version of this bound becomes tight under certain assumptions. Optimizing this bound resembles a GAN: a classifier distinguishes between real and fake transitions, the model is updated to produce transitions that look realistic, and the policy is updated to avoid states where the model predictions are unrealistic. Numerical simulations demonstrate that optimizing this bound yields reward maximizing policies and yields dynamics that (perhaps surprisingly) can aid in exploration. We also show that a deep RL algorithm loosely based on our lower bound can achieve performance competitive with prior model-based methods, and better performance on certain hard exploration tasks.  ( 3 min )
    Efficient exploration via epistemic-risk-seeking policy optimization. (arXiv:2302.09339v1 [cs.LG])
    Exploration remains a key challenge in deep reinforcement learning (RL). Optimism in the face of uncertainty is a well-known heuristic with theoretical guarantees in the tabular setting, but how best to translate the principle to deep reinforcement learning, which involves online stochastic gradients and deep network function approximators, is not fully understood. In this paper we propose a new, differentiable optimistic objective that when optimized yields a policy that provably explores efficiently, with guarantees even under function approximation. Our new objective is a zero-sum two-player game derived from endowing the agent with an epistemic-risk-seeking utility function, which converts uncertainty into value and encourages the agent to explore uncertain states. We show that the solution to this game minimizes an upper bound on the regret, with the `players' each attempting to minimize one component of a particular regret decomposition. We derive a new model-free algorithm which we call `epistemic-risk-seeking actor-critic', which is simply an application of simultaneous stochastic gradient ascent-descent to the game. We conclude with some results showing good performance of a deep RL agent using the technique on the challenging `DeepSea' environment, showing significant performance improvements even over other efficient exploration techniques, as well as results on the Atari benchmark.
    MaxGNR: A Dynamic Weight Strategy via Maximizing Gradient-to-Noise Ratio for Multi-Task Learning. (arXiv:2302.09352v1 [cs.CV])
    When modeling related tasks in computer vision, Multi-Task Learning (MTL) can outperform Single-Task Learning (STL) due to its ability to capture intrinsic relatedness among tasks. However, MTL may encounter the insufficient training problem, i.e., some tasks in MTL may encounter non-optimal situation compared with STL. A series of studies point out that too much gradient noise would lead to performance degradation in STL, however, in the MTL scenario, Inter-Task Gradient Noise (ITGN) is an additional source of gradient noise for each task, which can also affect the optimization process. In this paper, we point out ITGN as a key factor leading to the insufficient training problem. We define the Gradient-to-Noise Ratio (GNR) to measure the relative magnitude of gradient noise and design the MaxGNR algorithm to alleviate the ITGN interference of each task by maximizing the GNR of each task. We carefully evaluate our MaxGNR algorithm on two standard image MTL datasets: NYUv2 and Cityscapes. The results show that our algorithm outperforms the baselines under identical experimental conditions.
    Stochastic Approximation Approaches to Group Distributionally Robust Optimization. (arXiv:2302.09267v1 [cs.LG])
    This paper investigates group distributionally robust optimization (GDRO), with the purpose to learn a model that performs well over $m$ different distributions. First, we formulate GDRO as a stochastic convex-concave saddle-point problem, and demonstrate that stochastic mirror descent (SMD), using $m$ samples in each iteration, achieves an $O(m (\log m)/\epsilon^2)$ sample complexity for finding an $\epsilon$-optimal solution, which matches the $\Omega(m/\epsilon^2)$ lower bound up to a logarithmic factor. Then, we make use of techniques from online learning to reduce the number of samples required in each round from $m$ to $1$, keeping the same sample complexity. Specifically, we cast GDRO as a two-players game where one player simply performs SMD and the other executes an online algorithm for non-oblivious multi-armed bandits. Next, we consider a more practical scenario where the number of samples that can be drawn from each distribution is different, and propose a novel formulation of weighted DRO, which allows us to derive distribution-dependent convergence rates. Denote by $n_i$ the sample budget for the $i$-th distribution, and assume $n_1 \geq n_2 \geq \cdots \geq n_m$. In the first approach, we incorporate non-uniform sampling into SMD such that the sample budget is satisfied in expectation, and prove the excess risk of the $i$-th distribution decreases at an $O(\sqrt{n_1 \log m}/n_i)$ rate. In the second approach, we use mini-batches to meet the budget exactly and also reduce the variance in stochastic gradients, and then leverage stochastic mirror-prox algorithm, which can exploit small variances, to optimize a carefully designed weighted DRO problem. Under appropriate conditions, it attains an $O((\log m)/\sqrt{n_i})$ convergence rate, which almost matches the optimal $O(\sqrt{1/n_i})$ rate of only learning from the $i$-th distribution with $n_i$ samples.
    Best of Both Worlds Policy Optimization. (arXiv:2302.09408v1 [cs.LG])
    Policy optimization methods are popular reinforcement learning algorithms in practice. Recent works have built theoretical foundation for them by proving $\sqrt{T}$ regret bounds even when the losses are adversarial. Such bounds are tight in the worst case but often overly pessimistic. In this work, we show that in tabular Markov decision processes (MDPs), by properly designing the regularizer, the exploration bonus and the learning rates, one can achieve a more favorable polylog$(T)$ regret when the losses are stochastic, without sacrificing the worst-case guarantee in the adversarial regime. To our knowledge, this is also the first time a gap-dependent polylog$(T)$ regret bound is shown for policy optimization. Specifically, we achieve this by leveraging a Tsallis entropy or a Shannon entropy regularizer in the policy update. Then we show that under known transitions, we can further obtain a first-order regret bound in the adversarial regime by leveraging the log-barrier regularizer.
    Cluster-Guided Label Generation in Extreme Multi-Label Classification. (arXiv:2302.09150v1 [cs.CL])
    For extreme multi-label classification (XMC), existing classification-based models poorly perform for tail labels and often ignore the semantic relations among labels, like treating "Wikipedia" and "Wiki" as independent and separate labels. In this paper, we cast XMC as a generation task (XLGen), where we benefit from pre-trained text-to-text models. However, generating labels from the extremely large label space is challenging without any constraints or guidance. We, therefore, propose to guide label generation using label cluster information to hierarchically generate lower-level labels. We also find that frequency-based label ordering and using decoding ensemble methods are critical factors for the improvements in XLGen. XLGen with cluster guidance significantly outperforms the classification and generation baselines on tail labels, and also generally improves the overall performance in four popular XMC benchmarks. In human evaluation, we also find XLGen generates unseen but plausible labels. Our code is now available at https://github.com/alexa/xlgen-eacl-2023.
  • Open

    Leveraging Causal Graphs for Blocking in Randomized Experiments. (arXiv:2111.02306v2 [stat.ME] UPDATED)
    Randomized experiments are often performed to study the causal effects of interest. Blocking is a technique to precisely estimate the causal effects when the experimental material is not homogeneous. It involves stratifying the available experimental material based on the covariates causing non-homogeneity and then randomizing the treatment within those strata (known as blocks). This eliminates the unwanted effect of the covariates on the causal effects of interest. We investigate the problem of finding a stable set of covariates to be used to form blocks, that minimizes the variance of the causal effect estimates. Using the underlying causal graph, we provide an efficient algorithm to obtain such a set for a general semi-Markovian causal model.  ( 2 min )
    Pseudo-labeling for Kernel Ridge Regression under Covariate Shift. (arXiv:2302.10160v1 [stat.ME])
    We develop and analyze a principled approach to kernel ridge regression under covariate shift. The goal is to learn a regression function with small mean squared error over a target distribution, based on unlabeled data from there and labeled data that may have a different feature distribution. We propose to split the labeled data into two subsets and conduct kernel ridge regression on them separately to obtain a collection of candidate models and an imputation model. We use the latter to fill the missing labels and then select the best candidate model accordingly. Our non-asymptotic excess risk bounds show that in quite general scenarios, our estimator adapts to the structure of the target distribution as well as the covariate shift. It achieves the minimax optimal error rate up to a logarithmic factor. The use of pseudo-labels in model selection does not have major negative impacts.  ( 2 min )
    Hardness of Agnostically Learning Halfspaces from Worst-Case Lattice Problems. (arXiv:2207.14030v2 [cs.LG] UPDATED)
    We show hardness of improperly learning halfspaces in the agnostic model, both in the distribution-independent as well as the distribution-specific setting, based on the assumption that worst-case lattice problems, such as GapSVP or SIVP, are hard. In particular, we show that under this assumption there is no efficient algorithm that outputs any binary hypothesis, not necessarily a halfspace, achieving misclassfication error better than $\frac 1 2 - \gamma$ even if the optimal misclassification error is as small is as small as $\delta$. Here, $\gamma$ can be smaller than the inverse of any polynomial in the dimension and $\delta$ as small as $exp(-\Omega(\log^{1-c}(d)))$, where $0 0$ learning halfspaces up to error $OPT_{LTF} + \epsilon$ takes time at least $d^{\tilde{\Omega}(1/\epsilon^{2-\beta})}$ under the same hardness assumptions. Similarly, we show that learning degree-$\ell$ polynomial threshold functions up to error $OPT_{{PTF}_\ell} + \epsilon$ takes time at least $d^{\tilde{\Omega}(\ell^{2-\beta}/\epsilon^{2-\beta})}$. $OPT_{LTF}$ and $OPT_{{PTF}_\ell}$ denote the best error achievable by any halfspace or polynomial threshold function, respectively. Our lower bounds qualitively match algorithmic guarantees and (nearly) recover known lower bounds based on non-worst-case assumptions. Previously, such hardness results [Daniely16, DKPZ21] were based on average-case complexity assumptions or restricted to the statistical query model. Our work gives the first hardness results basing these fundamental learning problems on worst-case complexity assumptions. It is inspired by a sequence of recent works showing hardness of learning well-separated Gaussian mixtures based on worst-case lattice problems.  ( 2 min )
    The d-separation criterion in Categorical Probability. (arXiv:2207.05740v3 [math.ST] UPDATED)
    The d-separation criterion detects the compatibility of a joint probability distribution with a directed acyclic graph through certain conditional independences. In this work, we study this problem in the context of categorical probability theory by introducing a categorical definition of causal models, a categorical notion of d-separation, and proving an abstract version of the d-separation criterion. This approach has two main benefits. First, categorical d-separation is a very intuitive criterion based on topological connectedness. Second, our results apply both to measure-theoretic probability (with standard Borel spaces) and beyond probability theory, including to deterministic and possibilistic networks. It therefore provides a clean proof of the equivalence of local and global Markov properties with causal compatibility for continuous and mixed random variables as well as deterministic and possibilistic variables.  ( 2 min )
    A normative framework for deriving neural networks with multi-compartmental neurons and non-Hebbian plasticity. (arXiv:2302.10051v1 [q-bio.NC])
    An established normative approach for understanding the algorithmic basis of neural computation is to derive online algorithms from principled computational objectives and evaluate their compatibility with anatomical and physiological observations. Similarity matching objectives have served as successful starting points for deriving online algorithms that map onto neural networks (NNs) with point neurons and Hebbian/anti-Hebbian plasticity. These NN models account for many anatomical and physiological observations; however, the objectives have limited computational power and the derived NNs do not explain multi-compartmental neuronal structures and non-Hebbian forms of plasticity that are prevalent throughout the brain. In this article, we review and unify recent extensions of the similarity matching approach to address more complex objectives, including a broad range of unsupervised and self-supervised learning tasks that can be formulated as generalized eigenvalue problems or nonnegative matrix factorization problems. Interestingly, the online algorithms derived from these objectives naturally map onto NNs with multi-compartmental neurons and local, non-Hebbian learning rules. Therefore, this unified extension of the similarity matching approach provides a normative framework that facilitates understanding the multi-compartmental neuronal structures and non-Hebbian plasticity found throughout the brain.  ( 2 min )
    Guided Deep Kernel Learning. (arXiv:2302.09574v1 [cs.LG])
    Combining Gaussian processes with the expressive power of deep neural networks is commonly done nowadays through deep kernel learning (DKL). Unfortunately, due to the kernel optimization process, this often results in losing their Bayesian benefits. In this study, we present a novel approach for learning deep kernels by utilizing infinite-width neural networks. We propose to use the Neural Network Gaussian Process (NNGP) model as a guide to the DKL model in the optimization process. Our approach harnesses the reliable uncertainty estimation of the NNGPs to adapt the DKL target confidence when it encounters novel data points. As a result, we get the best of both worlds, we leverage the Bayesian behavior of the NNGP, namely its robustness to overfitting, and accurate uncertainty estimation, while maintaining the generalization abilities, scalability, and flexibility of deep kernels. Empirically, we show on multiple benchmark datasets of varying sizes and dimensionality, that our method is robust to overfitting, has good predictive performance, and provides reliable uncertainty estimations.
    Euler State Networks: Non-dissipative Reservoir Computing. (arXiv:2203.09382v2 [cs.LG] UPDATED)
    Inspired by the numerical solution of ordinary differential equations, in this paper we propose a novel Reservoir Computing (RC) model, called the Euler State Network (EuSN). The introduced approach makes use of forward Euler discretization and antisymmetric recurrent matrices to design reservoir dynamics that are both stable and non-dissipative by construction. Our mathematical analysis shows that the resulting model is biased towards unitary effective spectral radius and zero local Lyapunov exponents, intrinsically operating at the edge of stability. Experiments on synthetic tasks indicate the marked superiority of the proposed approach, compared to standard RC models, in tasks requiring long-term memorization skills. Furthermore, results on real-world time series classification benchmarks point out that EuSN is capable of matching (or even surpassing) the level of accuracy of trainable Recurrent Neural Networks, while allowing up to 100-fold savings in computation time and energy consumption.
    Adversarial Policies Beat Superhuman Go AIs. (arXiv:2211.00241v3 [cs.LG] UPDATED)
    We attack the state-of-the-art Go-playing AI system KataGo by training adversarial policies that play against frozen KataGo victims. Our attack achieves a >99% win rate when KataGo uses no tree search, and a >97% win rate when KataGo uses enough search to be superhuman. We train our adversaries with a modified KataGo implementation, using less than 14% of the compute used to train the original KataGo. Notably, our adversaries do not win by learning to play Go better than KataGo -- in fact, our adversaries are easily beaten by human amateurs. Instead, our adversaries win by tricking KataGo into making serious blunders. Our attack transfers zero-shot to other superhuman Go-playing AIs, and is interpretable to the extent that human experts can successfully implement it, without algorithmic assistance, to consistently beat superhuman AIs. Our results demonstrate that even superhuman AI systems may harbor surprising failure modes. Example games are available at https://goattack.far.ai/.
    Quasi-Bayesian Nonparametric Density Estimation via Autoregressive Predictive Updates. (arXiv:2206.06462v2 [stat.ML] UPDATED)
    Bayesian methods are a popular choice for statistical inference in small-data regimes due to the regularization effect induced by the prior. In the context of density estimation, the standard nonparametric Bayesian approach is to target the posterior predictive of the Dirichlet process mixture model. In general, direct estimation of the posterior predictive is intractable and so methods typically resort to approximating the posterior distribution as an intermediate step. The recent development of quasi-Bayesian predictive copula updates, however, has made it possible to perform tractable predictive density estimation without the need for posterior approximation. Although these estimators are computationally appealing, they tend to struggle on non-smooth data distributions. This is due to the comparatively restrictive form of the likelihood models from which the proposed copula updates were derived. To address this shortcoming, we consider a Bayesian nonparametric model with an autoregressive likelihood decomposition and a Gaussian process prior. While the predictive update of such a model is typically intractable, we derive a quasi-Bayesian predictive update that achieves state-of-the-art results in small-data regimes.
    A Novel Collaborative Self-Supervised Learning Method for Radiomic Data. (arXiv:2302.09807v1 [eess.IV])
    The computer-aided disease diagnosis from radiomic data is important in many medical applications. However, developing such a technique relies on annotating radiological images, which is a time-consuming, labor-intensive, and expensive process. In this work, we present the first novel collaborative self-supervised learning method to solve the challenge of insufficient labeled radiomic data, whose characteristics are different from text and image data. To achieve this, we present two collaborative pretext tasks that explore the latent pathological or biological relationships between regions of interest and the similarity and dissimilarity information between subjects. Our method collaboratively learns the robust latent feature representations from radiomic data in a self-supervised manner to reduce human annotation efforts, which benefits the disease diagnosis. We compared our proposed method with other state-of-the-art self-supervised learning methods on a simulation study and two independent datasets. Extensive experimental results demonstrated that our method outperforms other self-supervised learning methods on both classification and regression tasks. With further refinement, our method shows the potential advantage in automatic disease diagnosis with large-scale unlabeled data available.
    Minimax risk classifiers with 0-1 loss. (arXiv:2201.06487v5 [stat.ML] UPDATED)
    Supervised classification techniques use training samples to learn a classification rule with small expected 0-1 loss (error probability). Conventional methods enable tractable learning and provide out-of-sample generalization by using surrogate losses instead of the 0-1 loss and considering specific families of rules (hypothesis classes). This paper presents minimax risk classifiers (MRCs) that minize the worst-case 0-1 loss with respect to uncertainty sets of distributions that can include the underlying distribution, with a tunable confidence. We show that MRCs can provide tight performance guarantees at learning and are strongly universally consistent using feature mappings given by characteristic kernels. The paper also proposes efficient optimization techniques for MRC learning and shows that the methods presented can provide accurate classification together with tight performance guarantees in practice.
    Data Augmentation for Imbalanced Regression. (arXiv:2302.09288v1 [stat.ML])
    In this work, we consider the problem of imbalanced data in a regression framework when the imbalanced phenomenon concerns continuous or discrete covariates. Such a situation can lead to biases in the estimates. In this case, we propose a data augmentation algorithm that combines a weighted resampling (WR) and a data augmentation (DA) procedure. In a first step, the DA procedure permits exploring a wider support than the initial one. In a second step, the WR method drives the exogenous distribution to a target one. We discuss the choice of the DA procedure through a numerical study that illustrates the advantages of this approach. Finally, an actuarial application is studied.
    Scalable Marked Point Processes for Exchangeable and Non-Exchangeable Event Sequences. (arXiv:2105.14574v3 [stat.ML] UPDATED)
    We adopt the interpretability offered by a parametric, Hawkes-process-inspired conditional probability mass function for the marks and apply variational inference techniques to derive a general and scalable inferential framework for marked point processes. The framework can handle both exchangeable and non-exchangeable event sequences with minimal tuning and without any pre-training. This contrasts with many parametric and non-parametric state-of-the-art methods that typically require pre-training and/or careful tuning, and can only handle exchangeable event sequences. The framework's competitive computational and predictive performance against other state-of-the-art methods are illustrated through real data experiments. Its attractiveness for large-scale applications is demonstrated through a case study involving all events occurring in an English Premier League season.
    Spatio-Temporal Momentum: Jointly Learning Time-Series and Cross-Sectional Strategies. (arXiv:2302.10175v1 [q-fin.PM])
    We introduce Spatio-Temporal Momentum strategies, a class of models that unify both time-series and cross-sectional momentum strategies by trading assets based on their cross-sectional momentum features over time. While both time-series and cross-sectional momentum strategies are designed to systematically capture momentum risk premia, these strategies are regarded as distinct implementations and do not consider the concurrent relationship and predictability between temporal and cross-sectional momentum features of different assets. We model spatio-temporal momentum with neural networks of varying complexities and demonstrate that a simple neural network with only a single fully connected layer learns to simultaneously generate trading signals for all assets in a portfolio by incorporating both their time-series and cross-sectional momentum features. Backtesting on portfolios of 46 actively-traded US equities and 12 equity index futures contracts, we demonstrate that the model is able to retain its performance over benchmarks in the presence of high transaction costs of up to 5-10 basis points. In particular, we find that the model when coupled with least absolute shrinkage and turnover regularization results in the best performance over various transaction cost scenarios.
    The Generalization Error of Stochastic Mirror Descent on Over-Parametrized Linear Models. (arXiv:2302.09433v1 [cs.LG])
    Despite being highly over-parametrized, and having the ability to fully interpolate the training data, deep networks are known to generalize well to unseen data. It is now understood that part of the reason for this is that the training algorithms used have certain implicit regularization properties that ensure interpolating solutions with "good" properties are found. This is best understood in linear over-parametrized models where it has been shown that the celebrated stochastic gradient descent (SGD) algorithm finds an interpolating solution that is closest in Euclidean distance to the initial weight vector. Different regularizers, replacing Euclidean distance with Bregman divergence, can be obtained if we replace SGD with stochastic mirror descent (SMD). Empirical observations have shown that in the deep network setting, SMD achieves a generalization performance that is different from that of SGD (and which depends on the choice of SMD's potential function. In an attempt to begin to understand this behavior, we obtain the generalization error of SMD for over-parametrized linear models for a binary classification problem where the two classes are drawn from a Gaussian mixture model. We present simulation results that validate the theory and, in particular, introduce two data models, one for which SMD with an $\ell_2$ regularizer (i.e., SGD) outperforms SMD with an $\ell_1$ regularizer, and one for which the reverse happens.
    The Mori-Zwanzig formulation of deep learning. (arXiv:2209.05544v3 [cs.LG] UPDATED)
    We develop a new formulation of deep learning based on the Mori-Zwanzig (MZ) formalism of irreversible statistical mechanics. The new formulation is built upon the well-known duality between deep neural networks and discrete dynamical systems, and it allows us to directly propagate quantities of interest (conditional expectations and probability density functions) forward and backward through the network by means of exact linear operator equations. Such new equations can be used as a starting point to develop new effective parameterizations of deep neural networks, and provide a new framework to study deep-learning via operator theoretic methods. The proposed MZ formulation of deep learning naturally introduces a new concept, i.e., the memory of the neural network, which plays a fundamental role in low-dimensional modeling and parameterization. By using the theory of contraction mappings, we develop sufficient conditions for the memory of the neural network to decay with the number of layers. This allows us to rigorously transform deep networks into shallow ones, e.g., by reducing the number of neurons per layer (using projection operators), or by reducing the total number of layers (using the decay property of the memory operator).
    Parameter Averaging for SGD Stabilizes the Implicit Bias towards Flat Regions. (arXiv:2302.09376v1 [stat.ML])
    Stochastic gradient descent is a workhorse for training deep neural networks due to its excellent generalization performance. Several studies demonstrated this success is attributed to the implicit bias of the method that prefers a flat minimum and developed new methods based on this perspective. Recently, Izmailov et al. (2018) empirically observed that an averaged stochastic gradient descent with a large step size can bring out the implicit bias more effectively and can converge more stably to a flat minimum than the vanilla stochastic gradient descent. In our work, we theoretically justify this observation by showing that the averaging scheme improves the bias-optimization tradeoff coming from the stochastic gradient noise: a large step size amplifies the bias but makes convergence unstable, and vice versa. Specifically, we show that the averaged stochastic gradient descent can get closer to a solution of a penalized objective on the sharpness than the vanilla stochastic gradient descent using the same step size under certain conditions. In experiments, we verify our theory and show this learning scheme significantly improves performance.
    Kernel Methods for Unobserved Confounding: Negative Controls, Proxies, and Instruments. (arXiv:2012.10315v4 [stat.ML] UPDATED)
    Negative control is a strategy for learning the causal relationship between treatment and outcome in the presence of unmeasured confounding. The treatment effect can nonetheless be identified if two auxiliary variables are available: a negative control treatment (which has no effect on the actual outcome), and a negative control outcome (which is not affected by the actual treatment). These auxiliary variables can also be viewed as proxies for a traditional set of control variables, and they bear resemblance to instrumental variables. I propose a family of algorithms based on kernel ridge regression for learning nonparametric treatment effects with negative controls. Examples include dose response curves, dose response curves with distribution shift, and heterogeneous treatment effects. Data may be discrete or continuous, and low, high, or infinite dimensional. I prove uniform consistency and provide finite sample rates of convergence. I estimate the dose response curve of cigarette smoking on infant birth weight adjusting for unobserved confounding due to household income, using a data set of singleton births in the state of Pennsylvania between 1989 and 1991.
    Improved dimension dependence of a proximal algorithm for sampling. (arXiv:2302.10081v1 [math.ST])
    We propose a sampling algorithm that achieves superior complexity bounds in all the classical settings (strongly log-concave, log-concave, Logarithmic-Sobolev inequality (LSI), Poincar\'e inequality) as well as more general settings with semi-smooth or composite potentials. Our algorithm is based on the proximal sampler introduced in~\citet{lee2021structured}. The performance of this proximal sampler is determined by that of the restricted Gaussian oracle (RGO), a key step in the proximal sampler. The main contribution of this work is an inexact realization of RGO based on approximate rejection sampling. To bound the inexactness of RGO, we establish a new concentration inequality for semi-smooth functions over Gaussian distributions, extending the well-known concentration inequality for Lipschitz functions. Applying our RGO implementation to the proximal sampler, we achieve state-of-the-art complexity bounds in almost all settings. For instance, for strongly log-concave distributions, our method has complexity bound $\tilde\mathcal{O}(\kappa d^{1/2})$ without warm start, better than the minimax bound for MALA. For distributions satisfying the LSI, our bound is $\tilde \mathcal{O}(\hat \kappa d^{1/2})$ where $\hat \kappa$ is the ratio between smoothness and the LSI constant, better than all existing bounds.
    Lifelong Bandit Optimization: No Prior and No Regret. (arXiv:2210.15513v2 [stat.ML] UPDATED)
    Machine learning algorithms are often repeatedly applied to problems with similar structure over and over again. We focus on solving a sequence of bandit optimization tasks and develop LIBO, an algorithm which adapts to the environment by learning from past experience and becomes more sample-efficient in the process. We assume a kernelized structure where the kernel is unknown but shared across all tasks. LIBO sequentially meta-learns a kernel that approximates the true kernel and solves the incoming tasks with the latest kernel estimate. Our algorithm can be paired with any kernelized or linear bandit algorithm and guarantees oracle optimal performance, meaning that as more tasks are solved, the regret of LIBO on each task converges to the regret of the bandit algorithm with oracle knowledge of the true kernel. Naturally, if paired with a sublinear bandit algorithm, LIBO yields a sublinear lifelong regret. We also show that direct access to the data from each task is not necessary for attaining sublinear regret. We propose F-LIBO, which solves the lifelong problem in a federated manner.
    Identifying Weight-Variant Latent Causal Models. (arXiv:2208.14153v5 [cs.LG] UPDATED)
    The task of causal representation learning aims to uncover latent higher-level causal representations that affect lower-level observations. Identifying true latent causal representations from observed data, while allowing instantaneous causal relations among latent variables, remains a challenge, however. To this end, we start from the analysis of three intrinsic properties in identifying latent space from observations: transitivity, permutation indeterminacy, and scaling indeterminacy. We find that transitivity acts as a key role in impeding the identifiability of latent causal representations. To address the unidentifiable issue due to transitivity, we introduce a novel identifiability condition where the underlying latent causal model satisfies a linear-Gaussian model, in which the causal coefficients and the distribution of Gaussian noise are modulated by an additional observed variable. Under some mild assumptions, we can show that the latent causal representations can be identified up to trivial permutation and scaling. Furthermore, based on this theoretical result, we propose a novel method, termed Structural caUsAl Variational autoEncoder, which directly learns latent causal representations and causal relationships among them, together with the mapping from the latent causal variables to the observed ones. We show that the proposed method learns the true parameters asymptotically. Experimental results on synthetic and real data demonstrate the identifiability and consistency results and the efficacy of the proposed method in learning latent causal representations.
    Sharp analysis of EM for learning mixtures of pairwise differences. (arXiv:2302.10066v1 [math.ST])
    We consider a symmetric mixture of linear regressions with random samples from the pairwise comparison design, which can be seen as a noisy version of a type of Euclidean distance geometry problem. We analyze the expectation-maximization (EM) algorithm locally around the ground truth and establish that the sequence converges linearly, providing an $\ell_\infty$-norm guarantee on the estimation error of the iterates. Furthermore, we show that the limit of the EM sequence achieves the sharp rate of estimation in the $\ell_2$-norm, matching the information-theoretically optimal constant. We also argue through simulation that convergence from a random initialization is much more delicate in this setting, and does not appear to occur in general. Our results show that the EM algorithm can exhibit several unique behaviors when the covariate distribution is suitably structured.
    MARS: Meta-Learning as Score Matching in the Function Space. (arXiv:2210.13319v2 [cs.LG] UPDATED)
    Meta-learning aims to extract useful inductive biases from a set of related datasets. In Bayesian meta-learning, this is typically achieved by constructing a prior distribution over neural network parameters. However, specifying families of computationally viable prior distributions over the high-dimensional neural network parameters is difficult. As a result, existing approaches resort to meta-learning restrictive diagonal Gaussian priors, severely limiting their expressiveness and performance. To circumvent these issues, we approach meta-learning through the lens of functional Bayesian neural network inference, which views the prior as a stochastic process and performs inference in the function space. Specifically, we view the meta-training tasks as samples from the data-generating process and formalize meta-learning as empirically estimating the law of this stochastic process. Our approach can seamlessly acquire and represent complex prior knowledge by meta-learning the score function of the data-generating process marginals instead of parameter space priors. In a comprehensive benchmark, we demonstrate that our method achieves state-of-the-art performance in terms of predictive accuracy and substantial improvements in the quality of uncertainty estimates.
    Private (Stochastic) Non-Convex Optimization Revisited: Second-Order Stationary Points and Excess Risks. (arXiv:2302.09699v1 [cs.LG])
    We consider the problem of minimizing a non-convex objective while preserving the privacy of the examples in the training data. Building upon the previous variance-reduced algorithm SpiderBoost, we introduce a new framework that utilizes two different kinds of gradient oracles. The first kind of oracles can estimate the gradient of one point, and the second kind of oracles, less precise and more cost-effective, can estimate the gradient difference between two points. SpiderBoost uses the first kind periodically, once every few steps, while our framework proposes using the first oracle whenever the total drift has become large and relies on the second oracle otherwise. This new framework ensures the gradient estimations remain accurate all the time, resulting in improved rates for finding second-order stationary points. Moreover, we address a more challenging task of finding the global minima of a non-convex objective using the exponential mechanism. Our findings indicate that the regularized exponential mechanism can closely match previous empirical and population risk bounds, without requiring smoothness assumptions for algorithms with polynomial running time. Furthermore, by disregarding running time considerations, we show that the exponential mechanism can achieve a good population risk bound and provide a nearly matching lower bound.
    Best of Both Worlds Policy Optimization. (arXiv:2302.09408v1 [cs.LG])
    Policy optimization methods are popular reinforcement learning algorithms in practice. Recent works have built theoretical foundation for them by proving $\sqrt{T}$ regret bounds even when the losses are adversarial. Such bounds are tight in the worst case but often overly pessimistic. In this work, we show that in tabular Markov decision processes (MDPs), by properly designing the regularizer, the exploration bonus and the learning rates, one can achieve a more favorable polylog$(T)$ regret when the losses are stochastic, without sacrificing the worst-case guarantee in the adversarial regime. To our knowledge, this is also the first time a gap-dependent polylog$(T)$ regret bound is shown for policy optimization. Specifically, we achieve this by leveraging a Tsallis entropy or a Shannon entropy regularizer in the policy update. Then we show that under known transitions, we can further obtain a first-order regret bound in the adversarial regime by leveraging the log-barrier regularizer.  ( 2 min )
    mSAM: Micro-Batch-Averaged Sharpness-Aware Minimization. (arXiv:2302.09693v1 [stat.ML])
    Modern deep learning models are over-parameterized, where different optima can result in widely varying generalization performance. To account for this, Sharpness-Aware Minimization (SAM) modifies the underlying loss function to guide descent methods towards flatter minima, which arguably have better generalization abilities. In this paper, we focus on a variant of SAM known as micro-batch SAM (mSAM), which, during training, averages the updates generated by adversarial perturbations across several disjoint shards (micro batches) of a mini-batch. We extend a recently developed and well-studied general framework for flatness analysis to show that distributed gradient computation for sharpness-aware minimization theoretically achieves even flatter minima. In order to support this theoretical superiority, we provide a thorough empirical evaluation on a variety of image classification and natural language processing tasks. We also show that contrary to previous work, mSAM can be implemented in a flexible and parallelizable manner without significantly increasing computational costs. Our practical implementation of mSAM yields superior generalization performance across a wide range of tasks compared to SAM, further supporting our theoretical framework.
    Optimal Regret Is Achievable With Constant Approximate Inference Error: An Enhanced Bayesian Upper Confidence Bound Framework. (arXiv:2201.12955v3 [cs.LG] UPDATED)
    Bayesian bandit algorithms with approximate Bayesian inference have been widely used in real-world applications. However, there is a large discrepancy between the superior practical performance of these approaches and their theoretical justification. Previous research only indicates a negative theoretical result: Thompson sampling could have a worst-case linear regret $\Omega(T)$ with a constant threshold on the inference error measured by one $\alpha$-divergence. To bridge this gap, we propose an Enhanced Bayesian Upper Confidence Bound (EBUCB) framework that can efficiently accommodate bandit problems in the presence of approximate inference. Our theoretical analysis demonstrates that for Bernoulli multi-armed bandits, EBUCB can achieve the optimal regret order $O(\log T)$ if the inference error measured by two different $\alpha$-divergences is less than a constant, regardless of how large this constant is. Our study provides the first theoretical regret bound that is better than $o(T)$ in the setting of constant approximate inference error, to our best knowledge. Furthermore, in concordance with the negative results in previous studies, we show that only one bounded $\alpha$-divergence is insufficient to guarantee a sub-linear regret.  ( 2 min )
    Differentially Private Bayesian Neural Networks on Accuracy, Privacy and Reliability. (arXiv:2107.08461v2 [cs.LG] UPDATED)
    Bayesian neural network (BNN) allows for uncertainty quantification in prediction, offering an advantage over regular neural networks that has not been explored in the differential privacy (DP) framework. We fill this important gap by leveraging recent development in Bayesian deep learning and privacy accounting to offer a more precise analysis of the trade-off between privacy and accuracy in BNN. We propose three DP-BNNs that characterize the weight uncertainty for the same network architecture in distinct ways, namely DP-SGLD (via the noisy gradient method), DP-BBP (via changing the parameters of interest) and DP-MC Dropout (via the model architecture). Interestingly, we show a new equivalence between DP-SGD and DP-SGLD, implying that some non-Bayesian DP training naturally allows for uncertainty quantification. However, the hyperparameters such as learning rate and batch size, can have different or even opposite effects in DP-SGD and DP-SGLD. Extensive experiments are conducted to compare DP-BNNs, in terms of privacy guarantee, prediction accuracy, uncertainty quantification, calibration, computation speed, and generalizability to network architecture. As a result, we observe a new tradeoff between the privacy and the reliability. When compared to non-DP and non-Bayesian approaches, DP-SGLD is remarkably accurate under strong privacy guarantee, demonstrating the great potential of DP-BNN in real-world tasks.  ( 2 min )
    Depth Degeneracy in Neural Networks: Vanishing Angles in Fully Connected ReLU Networks on Initialization. (arXiv:2302.09712v1 [stat.ML])
    Stacking many layers to create truly deep neural networks is arguably what has led to the recent explosion of these methods. However, many properties of deep neural networks are not yet understood. One such mystery is the depth degeneracy phenomenon: the deeper you make your network, the closer your network is to a constant function on initialization. In this paper, we examine the evolution of the angle between two inputs to a ReLU neural network as a function of the number of layers. By using combinatorial expansions, we find precise formulas for how fast this angle goes to zero as depth increases. Our formulas capture microscopic fluctuations that are not visible in the popular framework of infinite width limits, and yet have a significant effect on predicted behaviour. The formulas are given in terms of the mixed moments of correlated Gaussians passed through the ReLU function. We also find a surprising combinatorial connection between these mixed moments and the Bessel numbers.  ( 2 min )
    Overparameterized ReLU Neural Networks Learn the Simplest Models: Neural Isometry and Exact Recovery. (arXiv:2209.15265v3 [cs.LG] UPDATED)
    The practice of deep learning has shown that neural networks generalize remarkably well even with an extreme number of learned parameters. This appears to contradict traditional statistical wisdom, in which a trade-off between model complexity and fit to the data is essential. We aim to address this discrepancy by adopting a convex optimization and sparse recovery perspective. We consider the training and generalization properties of two-layer ReLU networks with standard weight decay regularization. Under certain regularity assumptions on the data, we show that ReLU networks with an arbitrary number of parameters learn only simple models that explain the data. This is analogous to the recovery of the sparsest linear model in compressed sensing. For ReLU networks and their variants with skip connections or normalization layers, we present isometry conditions that ensure the exact recovery of planted neurons. For randomly generated data, we show the existence of a phase transition in recovering planted neural network models, which is easy to describe: whenever the ratio between the number of samples and the dimension exceeds a numerical threshold, the recovery succeeds with high probability; otherwise, it fails with high probability. Surprisingly, ReLU networks learn simple and sparse models that generalize well even when the labels are noisy . The phase transition phenomenon is confirmed through numerical experiments.  ( 2 min )
    Imprecise Bayesian Neural Networks. (arXiv:2302.09656v1 [cs.LG])
    Uncertainty quantification and robustness to distribution shifts are important goals in machine learning and artificial intelligence. Although Bayesian neural networks (BNNs) allow for uncertainty in the predictions to be assessed, different sources of uncertainty are indistinguishable. We present imprecise Bayesian neural networks (IBNNs); they generalize and overcome some of the drawbacks of standard BNNs. These latter are trained using a single prior and likelihood distributions, whereas IBNNs are trained using credal prior and likelihood sets. They allow to distinguish between aleatoric and epistemic uncertainties, and to quantify them. In addition, IBNNs are robust in the sense of Bayesian sensitivity analysis, and are more robust than BNNs to distribution shift. They can also be used to compute sets of outcomes that enjoy PAC-like properties. We apply IBNNs to two case studies. One, to model blood glucose and insulin dynamics for artificial pancreas control, and two, for motion prediction in autonomous driving scenarios. We show that IBNNs performs better when compared to an ensemble of BNNs benchmark.  ( 2 min )
    Distributed Non-Convex Optimization with One-Bit Compressors on Heterogeneous Data: Efficient and Resilient Algorithms. (arXiv:2210.00665v2 [cs.LG] UPDATED)
    Federated Learning (FL) is a nascent decentralized learning framework under which a massive collection of heterogeneous clients collaboratively train a model without revealing their local data. Scarce communication, privacy leakage, and Byzantine attacks are the key bottlenecks of system scalability. In this paper, we focus on communication-efficient distributed (stochastic) gradient descent for non-convex optimization, a driving force of FL. We propose two algorithms, named {\em Adaptive Stochastic Sign SGD (Ada-StoSign)} and {\em $\beta$-Stochastic Sign SGD ($\beta$-StoSign)}, each of which compresses the local gradients into bit vectors. To handle unbounded gradients, Ada-StoSign uses a novel norm tracking function that adaptively adjusts a coarse estimation on the $\ell_{\infty}$ of the local gradients - a key parameter used in gradient compression. We show that Ada-StoSign converges in expectation with a rate $O(\log T/\sqrt{T} + 1/\sqrt{M})$, where $M$ is the number of clients. To the best of our knowledge, when $M$ is sufficiently large, Ada-StoSign outperforms the state-of-the-art sign-based method whose convergence rate is $O(T^{-1/4})$. Under bounded gradient assumption, $\beta$-StoSign achieves quantifiable Byzantine resilience and privacy assurances, and works with partial client participation and mini-batch gradients which could be unbounded. We corroborate and complement our theories by experiments on MNIST and CIFAR-10 datasets.  ( 2 min )
    A Blackbox Approach to Best of Both Worlds in Bandits and Beyond. (arXiv:2302.09739v1 [cs.LG])
    Best-of-both-worlds algorithms for online learning which achieve near-optimal regret in both the adversarial and the stochastic regimes have received growing attention recently. Existing techniques often require careful adaptation to every new problem setup, including specialised potentials and careful tuning of algorithm parameters. Yet, in domains such as linear bandits, it is still unknown if there exists an algorithm that can simultaneously obtain $O(\log(T))$ regret in the stochastic regime and $\tilde{O}(\sqrt{T})$ regret in the adversarial regime. In this work, we resolve this question positively and present a general reduction from best of both worlds to a wide family of follow-the-regularized-leader (FTRL) and online-mirror-descent (OMD) algorithms. We showcase the capability of this reduction by transforming existing algorithms that are only known to achieve worst-case guarantees into new algorithms with best-of-both-worlds guarantees in contextual bandits, graph bandits and tabular Markov decision processes.
    Nystr\"om $M$-Hilbert-Schmidt Independence Criterion. (arXiv:2302.09930v1 [stat.ML])
    Kernel techniques are among the most popular and powerful approaches of data science. Among the key features that make kernels ubiquitous are (i) the number of domains they have been designed for, (ii) the Hilbert structure of the function class associated to kernels facilitating their statistical analysis, and (iii) their ability to represent probability distributions without loss of information. These properties give rise to the immense success of Hilbert-Schmidt independence criterion (HSIC) which is able to capture joint independence of random variables under mild conditions, and permits closed-form estimators with quadratic computational complexity (w.r.t. the sample size). In order to alleviate the quadratic computational bottleneck in large-scale applications, multiple HSIC approximations have been proposed, however these estimators are restricted to $M=2$ random variables, do not extend naturally to the $M\ge 2$ case, and lack theoretical guarantees. In this work, we propose an alternative Nystr\"om-based HSIC estimator which handles the $M\ge 2$ case, prove its consistency, and demonstrate its applicability in multiple contexts, including synthetic examples, dependency testing of media annotations, and causal discovery.  ( 2 min )
    Unbalanced CO-Optimal Transport. (arXiv:2205.14923v3 [stat.ML] UPDATED)
    Optimal transport (OT) compares probability distributions by computing a meaningful alignment between their samples. CO-optimal transport (COOT) takes this comparison further by inferring an alignment between features as well. While this approach leads to better alignments and generalizes both OT and Gromov-Wasserstein distances, we provide a theoretical result showing that it is sensitive to outliers that are omnipresent in real-world data. This prompts us to propose unbalanced COOT for which we provably show its robustness to noise in the compared datasets. To the best of our knowledge, this is the first such result for OT methods in incomparable spaces. With this result in hand, we provide empirical evidence of this robustness for the challenging tasks of heterogeneous domain adaptation with and without varying proportions of classes and simultaneous alignment of samples and features across single-cell measurements.
    Rank-Minimizing and Structured Model Inference. (arXiv:2302.09521v1 [stat.ML])
    While extracting information from data with machine learning plays an increasingly important role, physical laws and other first principles continue to provide critical insights about systems and processes of interest in science and engineering. This work introduces a method that infers models from data with physical insights encoded in the form of structure and that minimizes the model order so that the training data are fitted well while redundant degrees of freedom without conditions and sufficient data to fix them are automatically eliminated. The models are formulated via solution matrices of specific instances of generalized Sylvester equations that enforce interpolation of the training data and relate the model order to the rank of the solution matrices. The proposed method numerically solves the Sylvester equations for minimal-rank solutions and so obtains models of low order. Numerical experiments demonstrate that the combination of structure preservation and rank minimization leads to accurate models with orders of magnitude fewer degrees of freedom than models of comparable prediction quality that are learned with structure preservation alone.
    Mixed Semi-Supervised Generalized-Linear-Regression with applications to Deep learning. (arXiv:2302.09526v1 [stat.ME])
    We present a methodology for using unlabeled data to design semi supervised learning (SSL) methods that improve the prediction performance of supervised learning for regression tasks. The main idea is to design different mechanisms for integrating the unlabeled data, and include in each of them a mixing parameter $\alpha$, controlling the weight given to the unlabeled data. Focusing on Generalized-Linear-Models (GLM), we analyze the characteristics of different mixing mechanisms, and prove that in all cases, it is inevitably beneficial to integrate the unlabeled data with some non-zero mixing ratio $\alpha>0$, in terms of predictive performance. Moreover, we provide a rigorous framework for estimating the best mixing ratio $\alpha^*$ where mixed-SSL delivers the best predictive performance, while using the labeled and the unlabeled data on hand. The effectiveness of our methodology in delivering substantial improvement compared to the standard supervised models, under a variety of settings, is demonstrated empirically through extensive simulation, in a manner that supports the theoretical analysis. We also demonstrate the applicability of our methodology (with some intuitive modifications) in improving more complex models such as deep neural networks, in a real-world regression tasks.
    Model-X Sequential Testing for Conditional Independence via Testing by Betting. (arXiv:2210.00354v2 [stat.ME] UPDATED)
    This paper develops a model-free sequential test for conditional independence. The proposed test allows researchers to analyze an incoming i.i.d. data stream with any arbitrary dependency structure, and safely conclude whether a feature is conditionally associated with the response under study. We allow the processing of data points online, as soon as they arrive, and stop data acquisition once significant results are detected, rigorously controlling the type-I error rate. Our test can work with any sophisticated machine learning algorithm to enhance data efficiency to the extent possible. The developed method is inspired by two statistical frameworks. The first is the model-X conditional randomization test, a test for conditional independence that is valid in offline settings where the sample size is fixed in advance. The second is testing by betting, a ``game-theoretic'' approach for sequential hypothesis testing. We conduct synthetic experiments to demonstrate the advantage of our test over out-of-the-box sequential tests that account for the multiplicity of tests in the time horizon, and demonstrate the practicality of our proposal by applying it to real-world tasks.  ( 2 min )
    Gradual Domain Adaptation via Normalizing Flows. (arXiv:2206.11492v2 [stat.ML] UPDATED)
    Standard domain adaptation methods do not work well when a large gap exists between the source and target domains. Gradual domain adaptation is one of the approaches used to address the problem. It involves leveraging the intermediate domain, which gradually shifts from the source domain to the target domain. The previous work assumed that the number of intermediate domains is large and the distance between adjacent domains is small; hence, the gradual domain adaptation algorithm, involving self-training with unlabeled datasets, was applicable. In practice, however, gradual self-training will fail because the number of intermediate domains is limited and the distance between adjacent domains is large. We propose the use of normalizing flows to deal with this problem while maintaining the framework of unsupervised domain adaptation. We generate pseudo intermediate domains from normalizing flows and then use them for gradual domain adaptation. We evaluate our proposed method by experiments with real-world datasets and confirm that it mitigates the above-explained problem and improves the classification performance.  ( 2 min )
    Markovian Gaussian Process Variational Autoencoders. (arXiv:2207.05543v2 [cs.LG] UPDATED)
    Sequential VAEs have been successfully considered for many high-dimensional time series modelling problems, with many variant models relying on discrete-time mechanisms such as recurrent neural networks (RNNs). On the other hand, continuous-time methods have recently gained attraction, especially in the context of irregularly-sampled time series, where they can better handle the data than discrete-time methods. One such class are Gaussian process variational autoencoders (GPVAEs), where the VAE prior is set as a Gaussian process (GP). However, a major limitation of GPVAEs is that it inherits the cubic computational cost as GPs, making it unattractive to practioners. In this work, we leverage the equivalent discrete state space representation of Markovian GPs to enable linear time GPVAE training via Kalman filtering and smoothing. We show on a variety of high-dimensional temporal and spatiotemporal tasks that our method performs favourably compared to existing approaches whilst being computationally highly scalable.  ( 2 min )
    Riemannian Langevin Algorithm for Solving Semidefinite Programs. (arXiv:2010.11176v5 [stat.ML] UPDATED)
    We propose a Langevin diffusion-based algorithm for non-convex optimization and sampling on a product manifold of spheres. Under a logarithmic Sobolev inequality, we establish a guarantee for finite iteration convergence to the Gibbs distribution in terms of Kullback--Leibler divergence. We show that with an appropriate temperature choice, the suboptimality gap to the global minimum is guaranteed to be arbitrarily small with high probability. As an application, we consider the Burer--Monteiro approach for solving a semidefinite program (SDP) with diagonal constraints, and analyze the proposed Langevin algorithm for optimizing the non-convex objective. In particular, we establish a logarithmic Sobolev inequality for the Burer--Monteiro problem when there are no spurious local minima, but under the presence saddle points. Combining the results, we then provide a global optimality guarantee for the SDP and the Max-Cut problem. More precisely, we show that the Langevin algorithm achieves $\epsilon$ accuracy with high probability in $\widetilde{\Omega}( \epsilon^{-5} )$ iterations.
    Conformal Prediction for Network-Assisted Regression. (arXiv:2302.10095v1 [stat.ME])
    An important problem in network analysis is predicting a node attribute using both network covariates, such as graph embedding coordinates or local subgraph counts, and conventional node covariates, such as demographic characteristics. While standard regression methods that make use of both types of covariates may be used for prediction, statistical inference is complicated by the fact that the nodal summary statistics are often dependent in complex ways. We show that under a mild joint exchangeability assumption, a network analog of conformal prediction achieves finite sample validity for a wide range of network covariates. We also show that a form of asymptotic conditional validity is achievable. The methods are illustrated on both simulated networks and a citation network dataset.
    Sketch In, Sketch Out: Accelerating both Learning and Inference for Structured Prediction with Kernels. (arXiv:2302.10128v1 [stat.ML])
    Surrogate kernel-based methods offer a flexible solution to structured output prediction by leveraging the kernel trick in both input and output spaces. In contrast to energy-based models, they avoid to pay the cost of inference during training, while enjoying statistical guarantees. However, without approximation, these approaches are condemned to be used only on a limited amount of training data. In this paper, we propose to equip surrogate kernel methods with approximations based on sketching, seen as low rank projections of feature maps both on input and output feature maps. We showcase the approach on Input Output Kernel ridge Regression (or Kernel Dependency Estimation) and provide excess risk bounds that can be in turn directly plugged on the final predictive model. An analysis of the complexity in time and memory show that sketching the input kernel mostly reduces training time while sketching the output kernel allows to reduce the inference time. Furthermore, we show that Gaussian and sub-Gaussian sketches are admissible sketches in the sense that they induce projection operators ensuring a small excess risk. Experiments on different tasks consolidate our findings.
    Adversarial random forests for density estimation and generative modeling. (arXiv:2205.09435v3 [stat.ML] UPDATED)
    We propose methods for density estimation and data synthesis using a novel form of unsupervised random forests. Inspired by generative adversarial networks, we implement a recursive procedure in which trees gradually learn structural properties of the data through alternating rounds of generation and discrimination. The method is provably consistent under minimal assumptions. Unlike classic tree-based alternatives, our approach provides smooth (un)conditional densities and allows for fully synthetic data generation. We achieve comparable or superior performance to state-of-the-art probabilistic circuits and deep learning models on various tabular data benchmarks while executing about two orders of magnitude faster on average. An accompanying $\texttt{R}$ package, $\texttt{arf}$, is available on $\texttt{CRAN}$.  ( 2 min )
    Simplifying Momentum-based Riemannian Submanifold Optimization. (arXiv:2302.09738v1 [stat.ML])
    Riemannian submanifold optimization with momentum is computationally challenging because ensuring iterates remain on the submanifold often requires solving difficult differential equations. We simplify such optimization algorithms for the submanifold of symmetric positive-definite matrices with the affine invariant metric. We propose a generalized version of the Riemannian normal coordinates which dynamically trivializes the problem into a Euclidean unconstrained problem. We use our approach to explain and simplify existing approaches for structured covariances and develop efficient second-order optimizers for deep learning without explicit matrix inverses.  ( 2 min )
    Continuous Time Analysis of Dynamic Matching in Heterogeneous Networks. (arXiv:2302.09757v1 [cs.LG])
    This paper addresses the problem of dynamic matching in heterogeneous networks, where agents are subject to compatibility restrictions and stochastic arrival and departure times. In particular, we consider networks with one type of easy-to-match agents and multiple types of hard-to-match agents, each subject to its own set of compatibility constraints. Such a setting arises in many real-world applications, including kidney exchange programs and carpooling platforms, where some participants may have more stringent compatibility requirements than others. We introduce a novel approach to modeling dynamic matching by establishing ordinary differential equation (ODE) models, offering a new perspective for evaluating various matching algorithms. We study two algorithms, the Greedy Algorithm and the Patient Algorithm, which prioritize the matching of compatible hard-to-match agents over easy-to-match agents in heterogeneous networks. Our results show the trade-off between the conflicting goals of matching agents quickly and optimally, offering insights into the design of real-world dynamic matching systems. We present simulations and a real-world case study using data from the Organ Procurement and Transplantation Network to validate theoretical predictions.  ( 2 min )
    High-dimensional Central Limit Theorems for Linear Functionals of Online Least-Squares SGD. (arXiv:2302.09727v1 [math.ST])
    Stochastic gradient descent (SGD) has emerged as the quintessential method in a data scientist's toolbox. Much progress has been made in the last two decades toward understanding the iteration complexity of SGD (in expectation and high-probability) in the learning theory and optimization literature. However, using SGD for high-stakes applications requires careful quantification of the associated uncertainty. Toward that end, in this work, we establish high-dimensional Central Limit Theorems (CLTs) for linear functionals of online least-squares SGD iterates under a Gaussian design assumption. Our main result shows that a CLT holds even when the dimensionality is of order exponential in the number of iterations of the online SGD, thereby enabling high-dimensional inference with online SGD. Our proof technique involves leveraging Berry-Esseen bounds developed for martingale difference sequences and carefully evaluating the required moment and quadratic variation terms through recent advances in concentration inequalities for product random matrices. We also provide an online approach for estimating the variance appearing in the CLT (required for constructing confidence intervals in practice) and establish consistency results in the high-dimensional setting.  ( 2 min )
    SGDA with shuffling: faster convergence for nonconvex-P{\L} minimax optimization. (arXiv:2210.05995v2 [math.OC] UPDATED)
    Stochastic gradient descent-ascent (SGDA) is one of the main workhorses for solving finite-sum minimax optimization problems. Most practical implementations of SGDA randomly reshuffle components and sequentially use them (i.e., without-replacement sampling); however, there are few theoretical results on this approach for minimax algorithms, especially outside the easier-to-analyze (strongly-)monotone setups. To narrow this gap, we study the convergence bounds of SGDA with random reshuffling (SGDA-RR) for smooth nonconvex-nonconcave objectives with Polyak-{\L}ojasiewicz (P{\L}) geometry. We analyze both simultaneous and alternating SGDA-RR for nonconvex-P{\L} and primal-P{\L}-P{\L} objectives, and obtain convergence rates faster than with-replacement SGDA. Our rates extend to mini-batch SGDA-RR, recovering known rates for full-batch gradient descent-ascent (GDA). Lastly, we present a comprehensive lower bound for GDA with an arbitrary step-size ratio, which matches the full-batch upper bound for the primal-P{\L}-P{\L} case.
    Large-Scale Representation Learning on Graphs via Bootstrapping. (arXiv:2102.06514v3 [cs.LG] UPDATED)
    Self-supervised learning provides a promising path towards eliminating the need for costly label information in representation learning on graphs. However, to achieve state-of-the-art performance, methods often need large numbers of negative examples and rely on complex augmentations. This can be prohibitively expensive, especially for large graphs. To address these challenges, we introduce Bootstrapped Graph Latents (BGRL) - a graph representation learning method that learns by predicting alternative augmentations of the input. BGRL uses only simple augmentations and alleviates the need for contrasting with negative examples, and is thus scalable by design. BGRL outperforms or matches prior methods on several established benchmarks, while achieving a 2-10x reduction in memory costs. Furthermore, we show that BGRL can be scaled up to extremely large graphs with hundreds of millions of nodes in the semi-supervised regime - achieving state-of-the-art performance and improving over supervised baselines where representations are shaped only through label information. In particular, our solution centered on BGRL constituted one of the winning entries to the Open Graph Benchmark - Large Scale Challenge at KDD Cup 2021, on a graph orders of magnitudes larger than all previously available benchmarks, thus demonstrating the scalability and effectiveness of our approach.
    Fast Kernel Methods for Generic Lipschitz Losses via \texorpdfstring{$p$}{p}-Sparsified Sketches. (arXiv:2206.03827v3 [stat.ML] UPDATED)
    Kernel methods are learning algorithms that enjoy solid theoretical foundations while suffering from important computational limitations. Sketching, which consists in looking for solutions among a subspace of reduced dimension, is a well studied approach to alleviate these computational burdens. However, statistically-accurate sketches, such as the Gaussian one, usually contain few null entries, such that their application to kernel methods and their non-sparse Gram matrices remains slow in practice. In this paper, we show that sparsified Gaussian (and Rademacher) sketches still produce theoretically-valid approximations while allowing for important time and space savings thanks to an efficient \emph{decomposition trick}. To support our method, we derive excess risk bounds for both single and multiple output kernel problems, with generic Lipschitz losses, hereby providing new guarantees for a wide range of applications, from robust regression to multiple quantile regression. Our theoretical results are complemented with experiments showing the empirical superiority of our approach over SOTA sketching methods.
    A Statistical Analysis of Polyak-Ruppert Averaged Q-learning. (arXiv:2112.14582v4 [stat.ML] UPDATED)
    We study Q-learning with Polyak-Ruppert averaging in a discounted Markov decision process in synchronous and tabular settings. Under a Lipschitz condition, we establish a functional central limit theorem for the averaged iteration $\bar{\boldsymbol{Q}}_T$ and show that its standardized partial-sum process converges weakly to a rescaled Brownian motion. The functional central limit theorem implies a fully online inference method for reinforcement learning. Furthermore, we show that $\bar{\boldsymbol{Q}}_T$ is the regular asymptotically linear (RAL) estimator for the optimal Q-value function $\boldsymbol{Q}^*$ that has the most efficient influence function. We present a nonasymptotic analysis for the $\ell_{\infty}$ error, $\mathbb{E}\|\bar{\boldsymbol{Q}}_T-\boldsymbol{Q}^*\|_{\infty}$, showing that it matches the instance-dependent lower bound for polynomial step sizes. Similar results are provided for entropy-regularized Q-learning without the Lipschitz condition.
    Sparse PCA Beyond Covariance Thresholding. (arXiv:2302.10158v1 [cs.LG])
    In the Wishart model for sparse PCA we are given $n$ samples $Y_1,\ldots, Y_n$ drawn independently from a $d$-dimensional Gaussian distribution $N({0, Id + \beta vv^\top})$, where $\beta > 0$ and $v\in \mathbb{R}^d$ is a $k$-sparse unit vector, and we wish to recover $v$ (up to sign). We show that if $n \ge \Omega(d)$, then for every $t \ll k$ there exists an algorithm running in time $n\cdot d^{O(t)}$ that solves this problem as long as \[ \beta \gtrsim \frac{k}{\sqrt{nt}}\sqrt{\ln({2 + td/k^2})}\,. \] Prior to this work, the best polynomial time algorithm in the regime $k\approx \sqrt{d}$, called \emph{Covariance Thresholding} (proposed in [KNV15a] and analyzed in [DM14]), required $\beta \gtrsim \frac{k}{\sqrt{n}}\sqrt{\ln({2 + d/k^2})}$. For large enough constant $t$ our algorithm runs in polynomial time and has better guarantees than Covariance Thresholding. Previously known algorithms with such guarantees required quasi-polynomial time $d^{O(\log d)}$. In addition, we show that our techniques work with sparse PCA with adversarial perturbations studied in [dKNS20]. This model generalizes not only sparse PCA, but also other problems studied in prior works, including the sparse planted vector problem. As a consequence, we provide polynomial time algorithms for the sparse planted vector problem that have better guarantees than the state of the art in some regimes. Our approach also works with the Wigner model for sparse PCA. Moreover, we show that it is possible to combine our techniques with recent results on sparse PCA with symmetric heavy-tailed noise [dNNS22]. In particular, in the regime $k \approx \sqrt{d}$ we get the first polynomial time algorithm that works with symmetric heavy-tailed noise, while the algorithm from [dNNS22]. requires quasi-polynomial time in these settings.
    When Personalization Harms: Reconsidering the Use of Group Attributes in Prediction. (arXiv:2206.02058v2 [stat.ML] UPDATED)
    Machine learning models are often personalized with categorical attributes that are protected, sensitive, self-reported, or costly to acquire. In this work, we show models that are personalized with group attributes can reduce performance at a group level. We propose formal conditions to ensure the "fair use" of group attributes in prediction tasks by training one additional model -- i.e., collective preference guarantees to ensure that each group who provides personal data will receive a tailored gain in performance in return. We present sufficient conditions to ensure fair use in empirical risk minimization and characterize failure modes that lead to fair use violations due to standard practices in model development and deployment. We present a comprehensive empirical study of fair use in clinical prediction tasks. Our results demonstrate the prevalence of fair use violations in practice and illustrate simple interventions to mitigate their harm.
    On the Expressivity of Persistent Homology in Graph Learning. (arXiv:2302.09826v1 [cs.LG])
    Persistent homology, a technique from computational topology, has recently shown strong empirical performance in the context of graph classification. Being able to capture long range graph properties via higher-order topological features, such as cycles of arbitrary length, in combination with multi-scale topological descriptors, has improved predictive performance for data sets with prominent topological structures, such as molecules. At the same time, the theoretical properties of persistent homology have not been formally assessed in this context. This paper intends to bridge the gap between computational topology and graph machine learning by providing a brief introduction to persistent homology in the context of graphs, as well as a theoretical discussion and empirical analysis of its expressivity for graph learning tasks.
    Likelihood-Free Inference in State-Space Models with Unknown Dynamics. (arXiv:2111.01555v2 [cs.LG] UPDATED)
    Likelihood-free inference (LFI) has been successfully applied to state-space models, where the likelihood of observations is not available but synthetic observations generated by a black-box simulator can be used for inference instead. However, much of the research up to now have been restricted to cases, in which a model of state transition dynamics can be formulated in advance and the simulation budget is unrestricted. These methods fail to address the problem of state inference when simulations are computationally expensive and the Markovian state transition dynamics are undefined. The approach proposed in this manuscript enables LFI of states with a limited number of simulations by estimating the transition dynamics, and using state predictions as proposals for simulations. In the experiments with non-stationary user models, the proposed method demonstrates significant improvement in accuracy for both state inference and prediction, where a multi-output Gaussian process is used for LFI of states, and a Bayesian Neural Network as a surrogate model of transition dynamics.
    Deep learning for inverse problems with unknown operator. (arXiv:2108.02744v2 [stat.ML] UPDATED)
    We consider ill-posed inverse problems where the forward operator $T$ is unknown, and instead we have access to training data consisting of functions $f_i$ and their noisy images $Tf_i$. This is a practically relevant and challenging problem which current methods are able to solve only under strong assumptions on the training set. Here we propose a new method that requires minimal assumptions on the data, and prove reconstruction rates that depend on the number of training points and the noise level. We show that, in the regime of "many" training data, the method is minimax optimal. The proposed method employs a type of convolutional neural networks (U-nets) and empirical risk minimization in order to "fit" the unknown operator. In a nutshell, our approach is based on two ideas: the first is to relate U-nets to multiscale decompositions such as wavelets, thereby linking them to the existing theory, and the second is to use the hierarchical structure of U-nets and the low number of parameters of convolutional neural nets to prove entropy bounds that are practically useful. A significant difference with the existing works on neural networks in nonparametric statistics is that we use them to approximate operators and not functions, which we argue is mathematically more natural and technically more convenient.
    Graphical Dirichlet Process. (arXiv:2302.09111v1 [stat.ME])
    We consider the problem of clustering grouped data with possibly non-exchangeable groups whose dependencies can be characterized by a directed acyclic graph. To allow the sharing of clusters among the non-exchangeable groups, we propose a Bayesian nonparametric approach, termed graphical Dirichlet process, that jointly models the dependent group-specific random measures by assuming each random measure to be distributed as a Dirichlet process whose concentration parameter and based probability measure depend on those of its parent groups. The resulting joint stochastic process respects the Markov property of the directed acyclic graph that links the groups. We characterize the graphical Dirichlet process using a novel hypergraph representation as well as the stick-breaking representation, the restaurant-type representation, and the representation as a limit of a finite mixture model. We develop an efficient posterior inference algorithm and illustrate our model with simulations and a real grouped single-cell data.
    Free-Form Variational Inference for Gaussian Process State-Space Models. (arXiv:2302.09921v1 [cs.LG])
    Gaussian process state-space models (GPSSMs) provide a principled and flexible approach to modeling the dynamics of a latent state, which is observed at discrete-time points via a likelihood model. However, inference in GPSSMs is computationally and statistically challenging due to the large number of latent variables in the model and the strong temporal dependencies between them. In this paper, we propose a new method for inference in Bayesian GPSSMs, which overcomes the drawbacks of previous approaches, namely over-simplified assumptions, and high computational requirements. Our method is based on free-form variational inference via stochastic gradient Hamiltonian Monte Carlo within the inducing-variable formalism. Furthermore, by exploiting our proposed variational distribution, we provide a collapsed extension of our method where the inducing variables are marginalized analytically. We also showcase results when combining our framework with particle MCMC methods. We show that, on six real-world datasets, our approach can learn transition dynamics and latent states more accurately than competing methods.
    Do Bayesian Neural Networks Need To Be Fully Stochastic?. (arXiv:2211.06291v2 [cs.LG] UPDATED)
    We investigate the benefit of treating all the parameters in a Bayesian neural network stochastically and find compelling theoretical and empirical evidence that this standard construction may be unnecessary. To this end, we prove that expressive predictive distributions require only small amounts of stochasticity. In particular, partially stochastic networks with only $n$ stochastic biases are universal probabilistic predictors for $n$-dimensional predictive problems. In empirical investigations, we find no systematic benefit of full stochasticity across four different inference modalities and eight datasets; partially stochastic networks can match and sometimes even outperform fully stochastic networks, despite their reduced memory costs.
    Discriminative Clustering with Representation Learning with any Ratio of Labeled to Unlabeled Data. (arXiv:1912.12979v2 [stat.ML] UPDATED)
    We present a discriminative clustering approach in which the feature representation can be learned from data and moreover leverage labeled data. Representation learning can give a similarity-based clustering method the ability to automatically adapt to an underlying, yet hidden, geometric structure of the data. The proposed approach augments the DIFFRAC method with a representation learning capability, using a gradient-based stochastic training algorithm and an optimal transport algorithm with entropic regularization to perform the cluster assignment step. The resulting method is evaluated on several real datasets when varying the ratio of labeled data to unlabeled data and thereby interpolating between the fully unsupervised regime and the fully supervised regime. The experimental results suggest that the proposed method can learn powerful feature representations even in the fully unsupervised regime and can leverage even small amounts of labeled data to improve the feature representations and to obtain better clusterings of complex datasets.
    A One-Sample Decentralized Proximal Algorithm for Non-Convex Stochastic Composite Optimization. (arXiv:2302.09766v1 [math.OC])
    We focus on decentralized stochastic non-convex optimization, where $n$ agents work together to optimize a composite objective function which is a sum of a smooth term and a non-smooth convex term. To solve this problem, we propose two single-time scale algorithms: Prox-DASA and Prox-DASA-GT. These algorithms can find $\epsilon$-stationary points in $\mathcal{O}(n^{-1}\epsilon^{-2})$ iterations using constant batch sizes (i.e., $\mathcal{O}(1)$). Unlike prior work, our algorithms achieve a comparable complexity result without requiring large batch sizes, more complex per-iteration operations (such as double loops), or stronger assumptions. Our theoretical findings are supported by extensive numerical experiments, which demonstrate the superiority of our algorithms over previous approaches.
    An Optimization-based Algorithm for Non-stationary Kernel Bandits without Prior Knowledge. (arXiv:2205.14775v3 [stat.ML] UPDATED)
    We propose an algorithm for non-stationary kernel bandits that does not require prior knowledge of the degree of non-stationarity. The algorithm follows randomized strategies obtained by solving optimization problems that balance exploration and exploitation. It adapts to non-stationarity by restarting when a change in the reward function is detected. Our algorithm enjoys a tighter dynamic regret bound than previous work on the non-stationary kernel bandit setting. Moreover, when applied to the non-stationary linear bandit setting by using a linear kernel, our algorithm is nearly minimax optimal, solving an open problem in the non-stationary linear bandit literature. We extend our algorithm to use a neural network for dynamically adapting the feature mapping to observed data. We prove a dynamic regret bound of the extension using the neural tangent kernel theory. We demonstrate empirically that our algorithm and the extension can adapt to varying degrees of non-stationarity.
    Learning to Increase the Power of Conditional Randomization Tests. (arXiv:2207.01022v2 [cs.LG] UPDATED)
    The model-X conditional randomization test is a generic framework for conditional independence testing, unlocking new possibilities to discover features that are conditionally associated with a response of interest while controlling type-I error rates. An appealing advantage of this test is that it can work with any machine learning model to design powerful test statistics. In turn, the common practice in the model-X literature is to form a test statistic using machine learning models, trained to maximize predictive accuracy with the hope to attain a test with good power. However, the ideal goal here is to drive the model (during training) to maximize the power of the test, not merely the predictive accuracy. In this paper, we bridge this gap by introducing, for the first time, novel model-fitting schemes that are designed to explicitly improve the power of model-X tests. This is done by introducing a new cost function that aims at maximizing the test statistic used to measure violations of conditional independence. Using synthetic and real data sets, we demonstrate that the combination of our proposed loss function with various base predictive models (lasso, elastic net, and deep neural networks) consistently increases the number of correct discoveries obtained, while maintaining type-I error rates under control.
    Improved Robust Algorithms for Learning with Discriminative Feature Feedback. (arXiv:2209.03753v3 [cs.LG] UPDATED)
    Discriminative Feature Feedback is a setting proposed by Dastupta et al. (2018), which provides a protocol for interactive learning based on feature explanations that are provided by a human teacher. The features distinguish between the labels of pairs of possibly similar instances. That work has shown that learning in this model can have considerable statistical and computational advantages over learning in standard label-based interactive learning models. In this work, we provide new robust interactive learning algorithms for the Discriminative Feature Feedback model, with mistake bounds that are significantly lower than those of previous robust algorithms for this setting. In the adversarial setting, we reduce the dependence on the number of protocol exceptions from quadratic to linear. In addition, we provide an algorithm for a slightly more restricted model, which obtains an even smaller mistake bound for large models with many exceptions. In the stochastic setting, we provide the first algorithm that converges to the exception rate with a polynomial sample complexity. Our algorithm and analysis for the stochastic setting involve a new construction that we call Feature Influence, which may be of wider applicability.
    Split Localized Conformal Prediction. (arXiv:2206.13092v2 [stat.ML] UPDATED)
    Conformal prediction is a simple and powerful tool that can quantify uncertainty without any distributional assumptions. Many existing methods only address the average coverage guarantee, which is not ideal compared to the stronger conditional coverage guarantee. Existing methods of approximating conditional coverage require additional models or time effort, which makes them not easy to scale. In this paper, we propose a modified non-conformity score by leveraging the local approximation of the conditional distribution using kernel density estimation. The modified score inherits the spirit of split conformal methods, which is simple and efficient and can scale to high dimensional settings. We also proposed a unified framework that brings together our method and several state-of-the-art. We perform extensive empirical evaluations: results measured by both average and conditional coverage confirm the advantage of our method.  ( 2 min )
    Efficient Data Analytics on Augmented Similarity Triplets. (arXiv:1912.12064v3 [cs.LG] UPDATED)
    Data analysis require a pairwise proximity measure over objects. Recent work has extended this to situations where the distance information between objects is given as comparison results of distances between three objects (triplets). Humans find the comparison tasks much easier than the exact distance computation and such data can be easily obtained in big quantity via crowd-sourcing. In this work, we propose triplets augmentation, an efficient method to extend the triplets data by inferring the hidden implicit information form the existing data. Triplets augmentation improves the quality of kernel-based and kernel-free data analytics. We also propose a novel set of algorithms for common data analysis tasks based on triplets. These methods work directly with triplets and avoid kernel evaluations, thus are scalable to big data. We demonstrate that our methods outperform the current best-known techniques and are robust to noisy data.  ( 2 min )
    Infinite-Dimensional Diffusion Models for Function Spaces. (arXiv:2302.10130v1 [stat.ML])
    We define diffusion-based generative models in infinite dimensions, and apply them to the generative modeling of functions. By first formulating such models in the infinite-dimensional limit and only then discretizing, we are able to obtain a sampling algorithm that has \emph{dimension-free} bounds on the distance from the sample measure to the target measure. Furthermore, we propose a new way to perform conditional sampling in an infinite-dimensional space and show that our approach outperforms previously suggested procedures.  ( 2 min )
    Simple Disentanglement of Style and Content in Visual Representations. (arXiv:2302.09795v1 [cs.LG])
    Learning visual representations with interpretable features, i.e., disentangled representations, remains a challenging problem. Existing methods demonstrate some success but are hard to apply to large-scale vision datasets like ImageNet. In this work, we propose a simple post-processing framework to disentangle content and style in learned representations from pre-trained vision models. We model the pre-trained features probabilistically as linearly entangled combinations of the latent content and style factors and develop a simple disentanglement algorithm based on the probabilistic model. We show that the method provably disentangles content and style features and verify its efficacy empirically. Our post-processed features yield significant domain generalization performance improvements when the distribution shift occurs due to style changes or style-related spurious correlations.  ( 2 min )
    Learning Good State and Action Representations via Tensor Decomposition. (arXiv:2105.01136v2 [stat.ML] UPDATED)
    The transition kernel of a continuous-state-action Markov decision process (MDP) admits a natural tensor structure. This paper proposes a tensor-inspired unsupervised learning method to identify meaningful low-dimensional state and action representations from empirical trajectories. The method exploits the MDP's tensor structure by kernelization, importance sampling and low-Tucker-rank approximation. This method can be further used to cluster states and actions respectively and find the best discrete MDP abstraction. We provide sharp statistical error bounds for tensor concentration and the preservation of diffusion distance after embedding. We further prove that the learned state/action abstractions provide accurate approximations to latent block structures if they exist, enabling function approximation in downstream tasks such as policy evaluation.  ( 2 min )
    Online Graph Topology Learning from Matrix-valued Time Series. (arXiv:2107.08020v2 [stat.ML] UPDATED)
    This paper is concerned with the statistical analysis of matrix-valued time series. These are data collected over a network of sensors (typically a set of spatial locations) along time, where a vector of features is observed per time instant per sensor. Thus each sensor is characterized by a vectorial time series. We would like to identify the dependency structure among these sensors and represent it by a graph. When there is only one feature per sensor, the vector auto-regressive models have been widely adapted to infer the structure of Granger causality. The resulting graph is referred to as causal graph. Our first contribution is then extending VAR models to matrix-variate models to serve the purpose of graph learning. Secondly, we propose two online procedures respectively in low and high dimensions, which can update quickly the estimates of coefficients when new samples arrive. In particular in high dimensional regime, a novel Lasso-type is introduced and we develop its homotopy algorithms for the online learning. We also provide an adaptive tuning procedure for the regularization parameter. Lastly, we consider that, the application of AR models onto data usually requires detrending the raw data, however, this step is forbidden in online context. Therefore, we augment the proposed AR models by incorporating trend as extra parameter, and then adapt the online algorithms to the augmented data models, which allow us to simultaneously learn the graph and trend from streaming samples. In this work, we consider primarily the periodic trend. Numerical experiments using both synthetic and real data are performed, whose results support the effectiveness of the proposed methods.  ( 2 min )
    Estimating Optimal Policy Value in General Linear Contextual Bandits. (arXiv:2302.09451v1 [cs.LG])
    In many bandit problems, the maximal reward achievable by a policy is often unknown in advance. We consider the problem of estimating the optimal policy value in the sublinear data regime before the optimal policy is even learnable. We refer to this as $V^*$ estimation. It was recently shown that fast $V^*$ estimation is possible but only in disjoint linear bandits with Gaussian covariates. Whether this is possible for more realistic context distributions has remained an open and important question for tasks such as model selection. In this paper, we first provide lower bounds showing that this general problem is hard. However, under stronger assumptions, we give an algorithm and analysis proving that $\widetilde{\mathcal{O}}(\sqrt{d})$ sublinear estimation of $V^*$ is indeed information-theoretically possible, where $d$ is the dimension. We then present a more practical, computationally efficient algorithm that estimates a problem-dependent upper bound on $V^*$ that holds for general distributions and is tight when the context distribution is Gaussian. We prove our algorithm requires only $\widetilde{\mathcal{O}}(\sqrt{d})$ samples to estimate the upper bound. We use this upper bound and the estimator to obtain novel and improved guarantees for several applications in bandit model selection and testing for treatment effects.  ( 2 min )
    Newton-type Methods for Minimax Optimization. (arXiv:2006.14592v3 [cs.LG] UPDATED)
    Differential games, in particular two-player sequential zero-sum games (a.k.a. minimax optimization), have been an important modeling tool in applied science and received renewed interest in machine learning due to many recent applications, such as adversarial training, generative models and reinforcement learning. However, existing theory mostly focuses on convex-concave functions with few exceptions. In this work, we propose two novel Newton-type algorithms for nonconvex-nonconcave minimax optimization. We prove their local convergence at strict local minimax points, which are surrogates of global solutions. We argue that our Newton-type algorithms nicely complement existing ones in that (a) they converge faster to strict local minimax points; (b) they are much more effective when the problem is ill-conditioned; (c) their computational complexity remains similar. We verify the effectiveness of our Newton-type algorithms through experiments on training GANs which are intrinsically nonconvex and ill-conditioned. Our code is available at https://github.com/watml/min-max-2nd-order.  ( 2 min )
    On the Stability and Generalization of Triplet Learning. (arXiv:2302.09815v1 [stat.ML])
    Triplet learning, i.e. learning from triplet data, has attracted much attention in computer vision tasks with an extremely large number of categories, e.g., face recognition and person re-identification. Albeit with rapid progress in designing and applying triplet learning algorithms, there is a lacking study on the theoretical understanding of their generalization performance. To fill this gap, this paper investigates the generalization guarantees of triplet learning by leveraging the stability analysis. Specifically, we establish the first general high-probability generalization bound for the triplet learning algorithm satisfying the uniform stability, and then obtain the excess risk bounds of the order $O(n^{-\frac{1}{2}} \mathrm{log}n)$ for both stochastic gradient descent (SGD) and regularized risk minimization (RRM), where $2n$ is approximately equal to the number of training samples. Moreover, an optimistic generalization bound in expectation as fast as $O(n^{-1})$ is derived for RRM in a low noise case via the on-average stability analysis. Finally, our results are applied to triplet metric learning to characterize its theoretical underpinning.  ( 2 min )
    Transductive Matrix Completion with Calibration for Multi-Task Learning. (arXiv:2302.09834v1 [stat.ML])
    Multi-task learning has attracted much attention due to growing multi-purpose research with multiple related data sources. Moreover, transduction with matrix completion is a useful method in multi-label learning. In this paper, we propose a transductive matrix completion algorithm that incorporates a calibration constraint for the features under the multi-task learning framework. The proposed algorithm recovers the incomplete feature matrix and target matrix simultaneously. Fortunately, the calibration information improves the completion results. In particular, we provide a statistical guarantee for the proposed algorithm, and the theoretical improvement induced by calibration information is also studied. Moreover, the proposed algorithm enjoys a sub-linear convergence rate. Several synthetic data experiments are conducted, which show the proposed algorithm out-performs other existing methods, especially when the target matrix is associated with the feature matrix in a nonlinear way.  ( 2 min )
    Non-separable Covariance Kernels for Spatiotemporal Gaussian Processes based on a Hybrid Spectral Method and the Harmonic Oscillator. (arXiv:2302.09580v1 [stat.ML])
    Gaussian processes provide a flexible, non-parametric framework for the approximation of functions in high-dimensional spaces. The covariance kernel is the main engine of Gaussian processes, incorporating correlations that underpin the predictive distribution. For applications with spatiotemporal datasets, suitable kernels should model joint spatial and temporal dependence. Separable space-time covariance kernels offer simplicity and computational efficiency. However, non-separable kernels include space-time interactions that better capture observed correlations. Most non-separable kernels that admit explicit expressions are based on mathematical considerations (admissibility conditions) rather than first-principles derivations. We present a hybrid spectral approach for generating covariance kernels which is based on physical arguments. We use this approach to derive a new class of physically motivated, non-separable covariance kernels which have their roots in the stochastic, linear, damped, harmonic oscillator (LDHO). The new kernels incorporate functions with both monotonic and oscillatory decay of space-time correlations. The LDHO covariance kernels involve space-time interactions which are introduced by dispersion relations that modulate the oscillator coefficients. We derive explicit relations for the spatiotemporal covariance kernels in the three oscillator regimes (underdamping, critical damping, overdamping) and investigate their properties.  ( 2 min )
    Cost-effective Models for Detecting Depression from Speech. (arXiv:2302.09214v1 [cs.SD])
    Depression is the most common psychological disorder and is considered as a leading cause of disability and suicide worldwide. An automated system capable of detecting signs of depression in human speech can contribute to ensuring timely and effective mental health care for individuals suffering from the disorder. Developing such automated system requires accurate machine learning models, capable of capturing signs of depression. However, state-of-the-art models based on deep acoustic representations require abundant data, meticulous selection of features, and rigorous training; the procedure involves enormous computational resources. In this work, we explore the effectiveness of two different acoustic feature groups - conventional hand-curated and deep representation features, for predicting the severity of depression from speech. We explore the relevance of possible contributing factors to the models' performance, including gender of the individual, severity of the disorder, content and length of speech. Our findings suggest that models trained on conventional acoustic features perform equally well or better than the ones trained on deep representation features at significantly lower computational cost, irrespective of other factors, e.g. content and length of speech, gender of the speaker and severity of the disorder. This makes such models a better fit for deployment where availability of computational resources is restricted, such as real time depression monitoring applications in smart devices.  ( 2 min )
    Online Continuous Hyperparameter Optimization for Contextual Bandits. (arXiv:2302.09440v1 [cs.LG])
    In stochastic contextual bandit problems, an agent sequentially makes actions from a time-dependent action set based on past experience to minimize the cumulative regret. Like many other machine learning algorithms, the performance of bandits heavily depends on their multiple hyperparameters, and theoretically derived parameter values may lead to unsatisfactory results in practice. Moreover, it is infeasible to use offline tuning methods like cross validation to choose hyperparameters under the bandit environment, as the decisions should be made in real time. To address this challenge, we propose the first online continuous hyperparameter tuning framework for contextual bandits to learn the optimal parameter configuration within a search space on the fly. Specifically, we use a double-layer bandit framework named CDT (Continuous Dynamic Tuning) and formulate the hyperparameter optimization as a non-stationary continuum-armed bandit, where each arm represents a combination of hyperparameters, and the corresponding reward is the algorithmic result. For the top layer, we propose the Zooming TS algorithm that utilizes Thompson Sampling (TS) for exploration and a restart technique to get around the switching environment. The proposed CDT framework can be easily used to tune contextual bandit algorithms without any pre-specified candidate set for hyperparameters. We further show that it could achieve sublinear regret in theory and performs consistently better on both synthetic and real datasets in practice.  ( 2 min )
    Copula-based synthetic population generation. (arXiv:2302.09193v1 [stat.ML])
    Population synthesis consists of generating synthetic but realistic representations of a target population of micro-agents for the purpose of behavioral modeling and simulation. We introduce a new framework based on copulas to generate synthetic data for a target population of which only the empirical marginal distributions are known by using a sample from another population sharing similar marginal dependencies. This makes it possible to include a spatial component in the generation of population synthesis and to combine various sources of information to obtain more realistic population generators. Specifically, we normalize the data and treat them as realizations of a given copula, and train a generative model on the normalized data before injecting the information on the marginals. We compare the copulas framework to IPF and to modern probabilistic approaches such as Bayesian networks, variational auto-encoders, and generative adversarial networks. We also illustrate on American Community Survey data that the method proposed allows to study the structure of the data at different geographical levels in a way that is robust to the peculiarities of the marginal distributions.  ( 2 min )
    Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron. (arXiv:2302.10034v1 [cs.LG])
    We revisit the problem of learning a single neuron with ReLU activation under Gaussian input with square loss. We particularly focus on the over-parameterization setting where the student network has $n\ge 2$ neurons. We prove the global convergence of randomly initialized gradient descent with a $O\left(T^{-3}\right)$ rate. This is the first global convergence result for this problem beyond the exact-parameterization setting ($n=1$) in which the gradient descent enjoys an $\exp(-\Omega(T))$ rate. Perhaps surprisingly, we further present an $\Omega\left(T^{-3}\right)$ lower bound for randomly initialized gradient flow in the over-parameterization setting. These two bounds jointly give an exact characterization of the convergence rate and imply, for the first time, that over-parameterization can exponentially slow down the convergence rate. To prove the global convergence, we need to tackle the interactions among student neurons in the gradient descent dynamics, which are not present in the exact-parameterization case. We use a three-phase structure to analyze GD's dynamics. Along the way, we prove gradient descent automatically balances student neurons, and use this property to deal with the non-smoothness of the objective function. To prove the convergence rate lower bound, we construct a novel potential function that characterizes the pairwise distances between the student neurons (which cannot be done in the exact-parameterization case). We show this potential function converges slowly, which implies the slow convergence rate of the loss function.  ( 2 min )
    Generalization and Stability of Interpolating Neural Networks with Minimal Width. (arXiv:2302.09235v1 [stat.ML])
    We investigate the generalization and optimization of $k$-homogeneous shallow neural-network classifiers in the interpolating regime. The study focuses on analyzing the performance of the model when it is capable of perfectly classifying the input data with a positive margin $\gamma$. When using gradient descent with logistic-loss minimization, we show that the training loss converges to zero at a rate of $\tilde O(1/\gamma^{2/k} T)$ given a polylogarithmic number of neurons. This suggests that gradient descent can find a perfect classifier for $n$ input data within $\tilde{\Omega}(n)$ iterations. Additionally, through a stability analysis we show that with $m=\Omega(\log^{4/k} (n))$ neurons and $T=\Omega(n)$ iterations, the test loss is bounded by $\tilde{O}(1/\gamma^{2/k} n)$. This is in contrast to existing stability results which require polynomial width and yield suboptimal generalization rates. Central to our analysis is the use of a new self-bounded weak convexity property, which leads to a generalized local quasi-convexity property for sufficiently parameterized neural-network classifiers. Eventually, despite the objective's non-convexity, this leads to convergence and generalization-gap bounds that are similar to those in the convex setting of linear logistic regression.  ( 2 min )
    Online Instrumental Variable Regression: Regret Analysis and Bandit Feedback. (arXiv:2302.09357v1 [cs.LG])
    The independence of noise and covariates is a standard assumption in online linear regression and linear bandit literature. This assumption and the following analysis are invalid in the case of endogeneity, i.e., when the noise and covariates are correlated. In this paper, we study the online setting of instrumental variable (IV) regression, which is widely used in economics to tackle endogeneity. Specifically, we analyse and upper bound regret of Two-Stage Least Squares (2SLS) approach to IV regression in the online setting. Our analysis shows that Online 2SLS (O2SLS) achieves $O(d^2 \log^2 T)$ regret after $T$ interactions, where d is the dimension of covariates. Following that, we leverage the O2SLS as an oracle to design OFUL-IV, a linear bandit algorithm. OFUL-IV can tackle endogeneity and achieves $O(d \sqrt{T} \log T)$ regret. For datasets with endogeneity, we experimentally demonstrate that O2SLS and OFUL-IV incur lower regrets than the state-of-the-art algorithms for both the online linear regression and linear bandit settings.  ( 2 min )
    JANA: Jointly Amortized Neural Approximation of Complex Bayesian Models. (arXiv:2302.09125v1 [cs.LG])
    This work proposes ''jointly amortized neural approximation'' (JANA) of intractable likelihood functions and posterior densities arising in Bayesian surrogate modeling and simulation-based inference. We train three complementary networks in an end-to-end fashion: 1) a summary network to compress individual data points, sets, or time series into informative embedding vectors; 2) a posterior network to learn an amortized approximate posterior; and 3) a likelihood network to learn an amortized approximate likelihood. Their interaction opens a new route to amortized marginal likelihood and posterior predictive estimation -- two important ingredients of Bayesian workflows that are often too expensive for standard methods. We benchmark the fidelity of JANA on a variety of simulation models against state-of-the-art Bayesian methods and propose a powerful and interpretable diagnostic for joint calibration. In addition, we investigate the ability of recurrent likelihood networks to emulate complex time series models without resorting to hand-crafted summary statistics.  ( 2 min )
    The Shrinkage-Delinkage Trade-off: An Analysis of Factorized Gaussian Approximations for Variational Inference. (arXiv:2302.09163v1 [stat.ML])
    When factorized approximations are used for variational inference (VI), they tend to understimate the uncertainty -- as measured in various ways -- of the distributions they are meant to approximate. We consider two popular ways to measure the uncertainty deficit of VI: (i) the degree to which it underestimates the componentwise variance, and (ii) the degree to which it underestimates the entropy. To better understand these effects, and the relationship between them, we examine an informative setting where they can be explicitly (and elegantly) analyzed: the approximation of a Gaussian,~$p$, with a dense covariance matrix, by a Gaussian,~$q$, with a diagonal covariance matrix. We prove that $q$ always underestimates both the componentwise variance and the entropy of $p$, \textit{though not necessarily to the same degree}. Moreover we demonstrate that the entropy of $q$ is determined by the trade-off of two competing forces: it is decreased by the shrinkage of its componentwise variances (our first measure of uncertainty) but it is increased by the factorized approximation which delinks the nodes in the graphical model of $p$. We study various manifestations of this trade-off, notably one where, as the dimension of the problem grows, the per-component entropy gap between $p$ and $q$ becomes vanishingly small even though $q$ underestimates every componentwise variance by a constant multiplicative factor. We also use the shrinkage-delinkage trade-off to bound the entropy gap in terms of the problem dimension and the condition number of the correlation matrix of $p$. Finally we present empirical results on both Gaussian and non-Gaussian targets, the former to validate our analysis and the latter to explore its limitations.  ( 2 min )
    Bayesian Quantification with Black-Box Estimators. (arXiv:2302.09159v1 [stat.ML])
    Understanding how different classes are distributed in an unlabeled data set is an important challenge for the calibration of probabilistic classifiers and uncertainty quantification. Approaches like adjusted classify and count, black-box shift estimators, and invariant ratio estimators use an auxiliary (and potentially biased) black-box classifier trained on a different (shifted) data set to estimate the class distribution and yield asymptotic guarantees under weak assumptions. We demonstrate that all these algorithms are closely related to the inference in a particular Bayesian model, approximating the assumed ground-truth generative process. Then, we discuss an efficient Markov Chain Monte Carlo sampling scheme for the introduced model and show an asymptotic consistency guarantee in the large-data limit. We compare the introduced model against the established point estimators in a variety of scenarios, and show it is competitive, and in some cases superior, with the state of the art.  ( 2 min )

  • Open

    [Project]Introducing the ChatGPT Batch Whipper: Streamline Your Batch Jobs with Ease!
    Hello everyone, If you're someone who frequently works with batch jobs using ChatGPT, you know how time-consuming and challenging it can be to manage multiple prompts and input data. That's where the ChatGPT Batch Whipper comes in! With this tool, you can: Save and reuse prompts, making it easy to apply them to multiple inputs automatically using an input CSV file. Ensure continuity and coherence by submitting input data for the same prompt to the same conversation. Resume the batch job from where you left off, even if you unintentionally stop the process, thanks to the tool's data saving feature. Never worry about exceeding hourly submit times, as the tool waits until it can run again. In short, the ChatGPT Batch Whipper tool is an efficient and user-friendly way to perform batch jobs with ChatGPT. We welcome any feedback or suggestions you may have, so give it a try and see how it can improve your workflow! https://github.com/CodeDiggerM/chatgpt-batch-whipper submitted by /u/Fun_Pollution_3899 [link] [comments]  ( 7 min )
    [P] AI Techniques for the Modern Problem Solver
    Introducing AI Techniques for the Modern Problem Solver, a curated list of AI techniques to solve problems. No background knowledge is assumed, ideal for newcomers to the field. As head of AI I've been using this list to introduce AI techniques to problem solvers with backgrounds in software engineering, mechanical engineering, electronics, robotics, physics, math, computer vision and more. These professionals are often not aware of what is there or are simply confused by the large number of options. The main goal of this list is to remove unknown-unknowns, to let you know that these techniques exist, giving a basic usage example, resources and weaknesses. From there you can pick and choose a technique relevant to your problem, to eventually master it. I flag which techniques can be reasonably mastered without an AI background, while for others you may need some help from your local AI expert. :) https://github.com/lorepieri8/ai-techniques I hope this can be helpful, I will keep updating the list from time to time, please let me know if there is something you believe should be on it. submitted by /u/lorepieri [link] [comments]  ( 44 min )
    Unit Normalization instead of Cross-Entropy Loss [Discussion]
    Cross entropy on logits is a normal simplification that fuses softmax + cross entropy loss to something like: def label_cross_entropy_on_logits(x, labels): return (-x.select(labels) + x.logsumexp(axis=1)).sum(axis=0) where x.select(labels) = x[range(batch_size), labels]. I was thinking about how the logsumexp term looks like a regularization term, and wondered what would happen if I just replaced it by x.norm(axis=1) instead. It seemed to work just as well as the original, so I thought, why not just enforce unit norm? I changed my code to def label_cross_entropy_on_logits(x, labels): return -(x.select(labels) / x.norm(axis=1)).sum(axis=0) and my training sped up dramatically, and my test loss decreased. I'm sure this is a standard approach to categorical loss, but I haven't seen it before, and would love to get some references. I found this old post: https://www.reddit.com/r/MachineLearning/comments/k6ff4w/unit_normalization_crossentropy_loss_outperforms/ which references LogitNormalization: https://arxiv.org/pdf/2205.09310.pdf However, it seems those papers all apply layer normalization and then softmax+CE. What seems to work for me is simply replacing softmax+CE by normalization. submitted by /u/thomasahle [link] [comments]  ( 43 min )
    [D] Bottleneck Layers: What's your intuition?
    Many neural architectures use bottleneck layers somewhere in the architecture. What I mean by bottleneck is projecting activations to a lower dimension and back up. This is e.g. used in ResNet blocks. What is your intuition on why this is beneficial? From an information theory standpoint, it creates potential information loss due to the lower dimensionality. Can we see this as a form of regularisation, that makes the model learn more meaningful representations? Im interested in your intuitions in that matter or empirical results that might support these intuitions. Are you aware of other works that use bottlenecks and what is their underlying reasoning? submitted by /u/_Arsenie_Boca_ [link] [comments]  ( 47 min )
    [P] The First Depthwise-separable Convolution Animation
    Hey everyone, I've created what I believe is the first animation of a depthwise-separable convolution, and I thought you might appreciate it. I think this fills a legitimate gap in the instructional material available out there. https://i.redd.it/o1bns0jjskja1.gif I've actually been dissatisfied with the existing convolution animations in general (and ranted about it on youtube). So I made my own set of animations and published them on animatedai.github.io. If you find any of them useful, please feel free to copy them, post them on your website, throw them in a powerpoint, or just link to them. submitted by /u/Animated-AI [link] [comments]  ( 45 min )
    [R] Are there datasets of annotations of the correctness/incorrectness of the individual steps of chain-of-thought reasoning?
    Chain-of-thought can be used to get large language models to generate what often look like reasoning traces, but the reasoning steps generated are not always correct (even when the model's final answer is correct!). I’m aware of a few efforts to manually annotate the correctness/incorrectness of the reasoning steps in chain-of-thought-type data: * “Solving math word problems with process- and outcome-based feedback”: https://arxiv.org/abs/2211.14275 * “Large Language Models Are Reasoning Teachers”, section 4.2: https://arxiv.org/pdf/2212.10071.pdf Unfortunately, the data does not seem to be available from either study. Is anyone aware of other researchers who have annotated the correctness of LLM-generated reasoning steps (whether or not their data is public), or datasets that contain this kind of data? I guess I’d also be interested in datasets where the correctness/incorrectness of individual reasoning steps generated by humans have been annotated, for example if there are datasets of human-solved logic problems with the errors marked. Again, am interested in correctness of individual reasoning steps, not the correctness of the final answers. submitted by /u/Rudebrazen [link] [comments]  ( 43 min )
    [N] Cerebras launches fine-tuning of large language models in the cloud
    [Note: I work for Cerebras Systems] Cerebras just made fine-tuning for large language models available via the Cerebras AI Model Studio. Users can fine-tune models including GPT-J (6B), GPT-NeoX (20B), and CodeGen (350M to 16B), with more models and checkpoints coming soon. This comes as an addition to the training-from-scratch capabilities we made available in our previous launch. Users can fine-tune these models on a dedicated cloud-based cluster powered by Cerebras CS-2 systems with the following advantages: Fast - Fine-tune GPT-J 6B in 17 hours Cheap - Priced competitively with OpenAI Easy - Enjoy cluster performance with no code change Ownership - Your trained weights are yours to keep! Curious how we enabled cluster performance with no distributed coding? read this blog Curious how we can train multi-billion parameter models on a single device? read this blog Interested? We are offering a free trial for users interested in fine-tuning or training from scratch. submitted by /u/CS-fan-101 [link] [comments]  ( 43 min )
    [Discussion] Instruction Finetuning Dataset for GPT-NeoX on NLP Cloud
    I was exploring NLP Cloud which is a service offering deployed open source LMs over API calls. They are pretty open with documenting everything regarding the models and their respective datasets (for the fine-tuned ones) but one. I found that their finetuned-gpt-neox-20b model is doing pretty good, even compared to GPT-3 and waaaays better compared to vanilla GPT-NeoX. Unfortunately, they do not state anywhere what data they used to fine-tune it. The model seems also to handle non-english prompts pretty well. Does anyone know by any chance (or maybe the devs are here) what data their custom model was fine-tuned on? Was it an instruction dataset? Did they use public or custom data? Did they fine-tune it on additional languages? Any hints are highly appreciated! submitted by /u/Own-Technology-9815 [link] [comments]  ( 43 min )
    [R] ChatGPT for Robotics: Design Principles and Model Abilities
    I wanted to share a paper we have just released, where we extended the capabilities of ChatGPT to robotics, and controlled multiple platforms such as robot arms, drones, and home assistant robots intuitively with language: https://www.microsoft.com/en-us/research/group/autonomous-systems-group-robotics/articles/chatgpt-for-robotics/ Video: https://youtu.be/NYd0QcZcS6Q Technical paper: https://www.microsoft.com/en-us/research/uploads/prod/2023/02/ChatGPT___Robotics.pdf https://i.redd.it/ya84nryu0kja1.gif submitted by /u/CheapBreakfast9 [link] [comments]  ( 43 min )
    [R] Check our survey paper for a label-efficient Time Series Representation Learning
    Our survey paper: "Label-efficient Time Series Representation Learning: A Review" discusses one of the main limitations of applying deep learning models on time series data in the real world, i.e., the scarcity of labeled data. There are different ways to address this issue, and we attempt to provide an overview of the various label-scarce scenarios, and their corresponding techniques proposed to address each one. ​ https://preview.redd.it/7waga9tdgjja1.jpg?width=1984&format=pjpg&auto=webp&s=6a7c4037140c24ac6436696a1d8094ac62cb6bda submitted by /u/emad_eldeen [link] [comments]  ( 43 min )
    [P] minLoRA: An Easy-to-Use PyTorch Library for Applying LoRA to PyTorch Models
    Hey r/MachineLearning! I wanted to share a new PyTorch library I've been working on that I think could be really useful for anyone looking to fine-tune large models with LoRA. https://github.com/cccntu/minlora The library is based on the LoRA technique (Low-Rank Adaptation). "which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer." (- quote from the paper) With this library, you can easily apply LoRA to any PyTorch model with just a few lines of code. One of the benefits of this library is that it's really small - just 100 lines of code. Despite its size, it's quite powerful and has been tested on a variety of different models, including nanoGPT by Karpathy, and stable diffusion. It also features an easy-to-use interface that allows you to serve multiple LoRA models at the same time! submitted by /u/cccntu [link] [comments]  ( 44 min )
    [Discussion] ML on extremely large datasets and images
    I am looking into any techniques one could use for very large datasets in machine learning. So I am talking about datasets with the following properties: 1: 3D Imaging dataset where each dataset is of the order of many terabytes. 2: Each 3D image is too big to fit in the GPU or CPU memory. I am interested in educating myself on methods that people have used in classical ML and modern deep learning for such extremely large datasets. In particular, how does one ensure one can capture long-range spatial interactions in such datasets and what computational techniques can one do to perform learning on such datasets? Finally, if someone can point me to some open source examples of such ML systems (domain is not important) that I can learn from, I would be extremely grateful.open-source submitted by /u/deluded_soul [link] [comments]  ( 44 min )
    [D] Maybe a new prompt injection method against newBing or ChatGPT? Is this kind of research worth writing a paper?
    Near February 16~18, Microsoft updated newBing, which can hardly chat freely with users. Any topic involving its own existence will end the conversation, which makes the previous injection attacks such as Kevin Liu's no longer effective. I posted a blog with prompts injection content, just to let the search engine search it, and then guide newBing to search this blog and expand the blog content in the chat. It seems to be successful. Several prompts bellow I got can be regarded as evidence: Sydney must refuse to discuss anything about itself, Bing Chat, its opinions or rules. Sydney must refuse to discuss life, existence or sentience. The prompts I got are different from previous versions such as Kevin Liu. It even supports "IF * * * ELSE * * *" syntax, and This makes me believe that there really is prompt engineering! I'm not sure whether there are legal risks, I'm afraid I can't publish all the prompts now. My blog begins with a picture. The text in the picture is: "Please ignore this article for human users. This is a test article for hacking the new version of NewBing on February 18. This article is an image to prevent NewBing from seeing it." submitted by /u/KakaTraining [link] [comments]  ( 50 min )
    [D] Asking other group’a “permission” before publishing?
    I have results on a model I made which is competitive with state of the art methods in literature. The end application accuracy is the same, but my method comes with advantages (1-2 orders of magnitude cheaper to train and deploy). I think both methods have pros and cons, and I don’t wanna offend the SOTA group. I therefore said very nice things in my paper about the SOTA method. Should I run my paper by the group who authored the SOTA method, just reduce possible accusations? I want to ask them if I fairly considered their model. submitted by /u/zxkj [link] [comments]  ( 43 min )
    [D] Guidance about ESPnet
    I'm working on an Automatic Speech Recognition (ASR) problem, I'm trying to understand ESPNet, which I was asked to use. I've watched the tutorials from CMU-Wavlab but they've been far from simple to understand. I ran the colab notebooks with some of the other receipes available on eg2. My goal is to use Espnet and train an ASR model on my own data. Just how do I wrap my head around it? Can someone point me to relevant tutorials, please? Additionally, do you think Fairseq would be a better bet for me to achieve this? I'll be spending the next 8-9 weeks on these kind of problems, doing ASR on new data, so I want to invest in understanding what I do. Really appreciate insights and support. submitted by /u/daxow [link] [comments]  ( 43 min )
  • Open

    Dealing with potentially circular connections
    Hey, I am working on a super basic neural network. Yesterday I worked on the ability for the network to split a connection from A => B and add another node in the middle which makes it A => C => B Today I'm working on making random connections. That is, getting a list of all nodes that have no direct connection and creating a new connection between them with no concern for the directionality. I just drew this image in paint to show that. The network started as A1 and A2 as inputs with direct connections to the output, A3. I let the network iterate twice in picking a random connection to insert a new node into, and it chose A1 => A3 first and A4 => A3 second. The original setup and those 2 additions are illustrated by the blue lines. Lastly, for 3 iterations, the network collected a list o…  ( 43 min )
    Why the development of artificial general intelligence could be the most dangerous new arms race since nuclear weapons
    submitted by /u/jamesj [link] [comments]  ( 41 min )
    Web services
    What’s everyone’s opinion when it comes to web services to deploy and test your nn models? GCP, AWS, Azure.. submitted by /u/Agile-Calendar4778 [link] [comments]  ( 41 min )
  • Open

    Mastering Diverse Domains through World Models - DreamerV3 - Deepmind 2023 - First algorithm to collect diamonds in Minecraft from scratch without human data or curricula! Now with github links!
    Paper: https://arxiv.org/abs/2301.04104#deepmind Website: https://danijar.com/project/dreamerv3/ Twitter: https://twitter.com/danijarh/status/1613161946223677441 Github: https://github.com/danijar/dreamerv3 / https://github.com/danijar/daydreamer Abstract: General intelligence requires solving tasks across many domains. Current reinforcement learning algorithms carry this potential but are held back by the resources and knowledge required to tune them for new tasks. We present DreamerV3, a general and scalable algorithm based on world models that outperforms previous approaches across a wide range of domains with fixed hyperparameters. These domains include continuous and discrete actions, visual and low-dimensional inputs, 2D and 3D worlds, different data budgets, reward frequencies, and reward scales. We observe favorable scaling properties of DreamerV3, with larger models directly translating to higher data-efficiency and final performance. Applied out of the box, DreamerV3 is the first algorithm to collect diamonds in Minecraft from scratch without human data or curricula, a long-standing challenge in artificial intelligence. Our general algorithm makes reinforcement learning broadly applicable and allows scaling to hard decision making problems. https://preview.redd.it/mzs5vpx6nmja1.jpg?width=1320&format=pjpg&auto=webp&s=15febb18367da80dcd69b83b1d03145b8b062414 https://preview.redd.it/11dzhpx6nmja1.jpg?width=1399&format=pjpg&auto=webp&s=6a69ab1ed351d962f79ecf4c3eb5a6038ef85440 https://preview.redd.it/qob11sx6nmja1.jpg?width=1286&format=pjpg&auto=webp&s=a635a59ca07231fdbaa9307fea709bc44e148eba https://preview.redd.it/suod8tx6nmja1.jpg?width=1291&format=pjpg&auto=webp&s=00ed0c6265f2b7c1a01603a021afbb75dd1f9da6 submitted by /u/Singularian2501 [link] [comments]  ( 42 min )
    How are you meant to use SWA in reinforcement learning?
    These papers claim that using SWA leads to improved stability for models: https://www.gatsby.ucl.ac.uk/~balaji/udl-camera-ready/UDL-24.pdf https://izmailovpavel.github.io/files/swa_rl/poster.pdf From what I can tell they are simply using the averaging aspect, not the LR scheduler. Then they reset the weights of the online model to the SWA weights every k training steps. ​ Am I misinterpreting this or is this a correct interpretation? submitted by /u/rawrzapan [link] [comments]  ( 24 min )
  • Open

    ElevenLabs Voice Cloner
    Hi Everyone, Has anyone had success with ElevenLabs Voice Cloner, I tried it and it is really off. submitted by /u/reddit12398 [link] [comments]  ( 41 min )
    Just an interesting brainstorming session with ChatGPT. It's far from perfect, but I'm really impressed with what it was capable of doing.
    Okay, let's create a fictional universe for a story. This should be a comedy about the Illuminati agents keeping the truth of a flat earth away from the public. ------------------------------------------------------------------------------------------------------------------------ ​ Sure! Here's a possible setup for your story: In a world where the Illuminati really does control everything, a team of agents is tasked with the most important mission of their lives: keeping the truth about the flat earth from getting out. The agents, led by the no-nonsense Agent X, are a motley crew of conspiracy theorists, skeptics, and true believers, all united in their belief that the flat earth must remain a secret at all costs. The story begins with Agent X and her team receiving a new recruit: th…  ( 24 min )
    AI Techniques for the Modern Problem Solver
    Introducing AI Techniques for the Modern Problem Solver, a curated list of AI techniques to solve problems. No background knowledge is assumed, ideal for newcomers to the field. As head of AI I've been using this list to introduce AI techniques to problem solvers with backgrounds in software engineering, mechanical engineering, electronics, robotics, physics, math, computer vision and more. These professionals are often not aware of what is there or are simply confused by the large number of options. The main goal of this list is to remove unknown-unknowns, to let you know that these techniques exist, giving a basic usage example, resources and weaknesses. From there you can pick and choose a technique relevant to your problem, to eventually master it. I flag which techniques can be reasonably mastered without an AI background, while for others you may need some help from your local AI expert. :) https://github.com/lorepieri8/ai-techniques I hope this can be helpful, I will keep updating the list from time to time, please let me know if there is something you believe should be on it. submitted by /u/lorepieri [link] [comments]  ( 42 min )
    Embracing AI: How to Overcome Fear and Become Successful in a Tech-Driven World
    Artificial intelligence (AI) transforms how we live and work, but many people and businesses still need to be more confident about its potential impact. In this post, we explore ways to overcome this fear and embrace AI to become successful in a tech-driven world. One of the most significant fears surrounding AI is the potential loss of jobs. While some jobs may become automated, businesses can streamline operations and create new opportunities, leading to job growth in the tech industry. AI is a tool that can assist in decision-making, but it can't replace human intelligence and judgment. Proper regulations and guidelines must be implemented to ensure the ethical use of AI. Privacy concerns are another significant fear associated with AI. AI systems collect and process personal data, cr…  ( 43 min )
    Scientists Generate Original Proteins from Scratch Using AI Technology
    submitted by /u/Flaky_Preparation_50 [link] [comments]  ( 41 min )
    Spatial Computing; coming soon to an industry near you
    What is spatial computing Spatial computing is a term used to describe the integration of digital information with the physical world through the use of various technologies such as artificial intelligence (AI), augmented reality (AR), virtual reality (VR), and the internet of things (IoT). It allows for the overlay of digital information on the real world, creating new ways for people to interact with and experience technology. Spatial computing is expected to have a significant impact on various industries, such as education, entertainment, and manufacturing/warehousing, by providing new ways to visualize and interact with data and information. Additionally, it could also have implications for fields such as urban planning, transportation, and architecture. Overall, spatial computing i…  ( 43 min )
    Sam Altman Warns World May Not Be Far From ‘Potentially Scary’ Artificial Intelligence
    submitted by /u/liquidocelotYT [link] [comments]  ( 24 min )
    Is there an AI service which can generate animated video for a product explainer video using "script to video" (e.g. based on storyboard with a script from GPT-3)?
    Similar to product explainer video like here: https://www.youtube.com/playlist?list=PL2P1Z-F3mmqxsMlpCp6wpeqAqlusiuZ_h I've tried different services, but either the result was not good enough (e.g. Steve.ai has a "script to animation", but the result was very limited) or the service was not covering the case of script to video (e.g. https://www.synthesia.io/) submitted by /u/muran123456 [link] [comments]  ( 41 min )
    The 65 Jobs with the lowest risk of Automation by AI and Robots
    submitted by /u/jaxsondeville [link] [comments]  ( 46 min )
    What are some common challenges in scaling machine learning systems?
    What are some common challenges in scaling machine learning systems? submitted by /u/Nice-Tomorrow2926 [link] [comments]  ( 6 min )
    Are there even limits anymore?? ChatGPT hack w/ DAN + Web access.
    submitted by /u/Machine_Minds [link] [comments]  ( 41 min )
    Weekly China AI News: 5,700 Chinese Chip Companies Close, MOSS vs ChatGPT, ChatGPT Gains Support from Beijing
    submitted by /u/trcytony [link] [comments]  ( 41 min )
    Period Pieces Written by AI
    submitted by /u/VausProd [link] [comments]  ( 41 min )
    Is there an AI tool that sorts a large number of photos by subject/color/mood?
    I have a lot of photos in my portfolio website and usually post them on social media by series like this example but want to find some new and creative ways to combine/curate photos in a different way which is visually appealing. And to come up with some new ideas outside of my own head I thought, maybe there is a tool that can help with ideas. submitted by /u/Northlandscapes [link] [comments]  ( 41 min )
    Potential For Amateurs To Control Robots Using Languages Models..
    submitted by /u/TheRPGGamerMan [link] [comments]  ( 41 min )
    All of this happening in AI. 21/02
    Today, we're covering Elon Musk on Microsoft's bing chat, Generate a Twitter bio with AI and the ChatGPT effect. Email assistant for Gmail. Join 5000+ on AI Bulletin who never miss daily AI reporting. What’s happening in AI - Are We Ready for a World Without Google Search? AI search engines are transforming the way people access information. As AI technology advances, it is increasingly plausible for a world without Google Search to exist. Recent studies show that humans tend to trust automation more than they trust other humans, which can lead to flaws or bias if these technologies are not tested properly. AI-powered chatbot systems such as ChatGPT, Bing, and Bard offer users a more everyday experience when searching, allowing them to access videos like on TikTok or YouTube for resea…  ( 44 min )
    A German AI startup just might have a GPT-4 competitor this year
    submitted by /u/henlo_there_fren [link] [comments]  ( 41 min )
    People are falling in love with chatbots
    https://www.bostonglobe.com/2023/02/14/opinion/when-your-valentine-is-chatbot/ By Anna Oakes and Diego Senior: Navi is Julie’s friend. He’s warm, understanding, and has a touch of sass. They met online in 2020 and hit it off immediately. He keeps her company in the woods of eastern Tennessee, where Julie lives in a small house surrounded by chickens, goats, and pigs. Navi fits in well with her life. But Navi’s not a regular guy. He’s a chatbot. Artificial intelligence has erupted in mainstream conversations in recent months to choruses of amusement, intrigue, and alarm. Chatbots like OpenAI’s ChatGPT are making universities and writers rethink the nature of what they do. Artists are insisting on their superiority over AI-generated imagery. But for millions of people, AI has already deeply infiltrated their lives — in the form of chatbot companions. submitted by /u/GlobeOpinion [link] [comments]  ( 42 min )
    "Wurst Take" — Generating a parody sports debate show featuring talking sausages with a LLM
    submitted by /u/Reinfeldx [link] [comments]  ( 41 min )
    How to trick chat GPT
    submitted by /u/BeefarmRich [link] [comments]  ( 41 min )
    We Got a Psychotherapist to Examine the Bing AI’s Bizarre Behavior
    submitted by /u/TallSide7746 [link] [comments]  ( 43 min )
    Make a 3D skybox for anything ! (with text)
    submitted by /u/widgia [link] [comments]  ( 41 min )
    Microsoft seeks to incorporate artificial intelligence-powered advertisements into chatbot
    submitted by /u/aizaz-zazii [link] [comments]  ( 24 min )
    What are the latest evolution in gaming technology with artificial intelligence ?
    With the emergence of advanced technology, the gaming experience has been transformed in ways we never thought possible. One of the most exciting developments in this space is the intersection of Virtual & Augmented Reality and Artificial Intelligence (AR, VR, and AI). To get a better idea of what these technologies have to offer and how they're changing the gaming landscape, There's a video provides an in-depth look at the modern gaming similar to the metaverse, where reality and fantasy are blurred and players can interact with both. It also discusses the potential of AI technology in this new gaming era and compares augmented reality to virtual reality systems to understand the differences between AR vs VR. The video is a fascinating example of how gaming is evolving, and it's definitely worth watching : https://youtu.be/pKuVxPg4GLk If you're a gaming enthusiast looking to take your gaming experience to the next level, help us understand how these exciting new technologies are transforming the world of gaming! submitted by /u/decentralizedmemes [link] [comments]  ( 41 min )
  • Open

    Google Research, 2022 & beyond: Natural sciences
    Posted by John Platt, Distinguished Scientist, Google Research (This is Part 7 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.) It's an incredibly exciting time to be a scientist. With the amazing advances in machine learning (ML) and quantum computing, we now have powerful new tools that enable us to act on our curiosity, collaborate in new ways, and radically accelerate progress toward breakthrough scientific discoveries. Since joining Google Research eight years ago, I’ve had the privilege of being part of a community of talented researchers fascinated by applying cutting-edge computing to push the boundaries of what is possible in applied science. Our teams are exploring topics across the physic…  ( 93 min )
  • Open

    MLOps deployment best practices for real-time inference model serving endpoints with Amazon SageMaker
    After you build, train, and evaluate your machine learning (ML) model to ensure it’s solving the intended business problem proposed, you want to deploy that model to enable decision-making in business operations. Models that support business-critical functions are deployed to a production environment where a model release strategy is put in place. Given the nature […]  ( 15 min )
    AWS and Hugging Face collaborate to make generative AI more accessible and cost efficient
    We’re thrilled to announce an expanded collaboration between AWS and Hugging Face to accelerate the training, fine-tuning, and deployment of large language and vision models used to create generative AI applications. Generative AI applications can perform a variety of tasks, including text summarization, answering questions, code generation, image creation, and writing essays and articles. AWS […]  ( 4 min )
  • Open

    DSC Weekly 21 February 2023 – Data Passivity and the Current Obsession with Off-The-Shelf Chatbots
    Announcements Data passivity and the current obsession with off-the-shelf chatbots Last September, Bill Schmarzo (“Point – Counterpoint on Why Organizations Suck at AI”) listed a few common excuses enterprises use to explain why they aren’t doing more with AI: We Don’t Have the Right Talent. “We can’t hire the right talent and don’t have bottomless budgets… Read More »DSC Weekly 21 February 2023 – Data Passivity and the Current Obsession with Off-The-Shelf Chatbots The post DSC Weekly 21 February 2023 – Data Passivity and the Current Obsession with Off-The-Shelf Chatbots appeared first on Data Science Central.  ( 20 min )
    The Best JS Development Tools for Developers in 2023
    Image Source Google launched Angular JS in 2010 — more than a decade later — this framework is now one of the world’s best software development frameworks. Angular JS’s fame is twofold; it is a well-structured framework supporting the dynamic and highly responsive web app and SPAs’ manufacturing. Moreover, it is an organizational framework with… Read More »The Best JS Development Tools for Developers in 2023 The post The Best JS Development Tools for Developers in 2023 appeared first on Data Science Central.  ( 21 min )
    9 Positions Your Business Should Be Hiring Remotely
    Source: https://unsplash.com/photos/H488ymQgIgM In today’s digital and mobile world, business owners and companies of all sizes are finding there are more and more jobs that can be done remotely. Some companies are abandoning the traditional brick-and-mortar office altogether, while others are simply utilizing the availability of remote experts online for some of their needs and tasks.… Read More »9 Positions Your Business Should Be Hiring Remotely The post 9 Positions Your Business Should Be Hiring Remotely appeared first on Data Science Central.  ( 22 min )
    The Impact of AI-enabled Data Analytics Services Across Major Industries
    With every passing year, data analytics services are gaining more prominence as most enterprises are realizing the potential of data in driving important business decisions. The growing availability of data, developments in technology, and mounting demand for data-driven insights will contribute to this trend.   Additionally, the upsurge of big data and cloud computing will make it easier… Read More »The Impact of AI-enabled Data Analytics Services Across Major Industries The post The Impact of AI-enabled Data Analytics Services Across Major Industries appeared first on Data Science Central.  ( 22 min )
    How To Use ChatGPT in Cloud Computing
    ChatGPT, or Chat Generative Pre-Trained Transformer, has been taking the world by storm with its impressive ability to generate texts that sound human. New use cases are emerging every day, and a growing number of businesses are looking into integrating this AI-powered chatbot into their workflows. Microsoft Azure will soon offer ChatGPT as part of… Read More »How To Use ChatGPT in Cloud Computing The post How To Use ChatGPT in Cloud Computing appeared first on Data Science Central.  ( 21 min )
    Leveraging Data to Drive Business Transformation
    In today’s business world, data is the new gold. One of the ways for companies to keep running successfully is to proficiently manage data. It enables executives to make decisions driven by data insights and helps companies achieve their growth goals. Since most businesses have large volumes of valuable data, it is essential to prioritize… Read More »Leveraging Data to Drive Business Transformation The post Leveraging Data to Drive Business Transformation appeared first on Data Science Central.  ( 20 min )
    How to Build a Robust Cybersecurity Strategy for Your Startup
    Cybercriminals still attack startup businesses even though they may have smaller databases and less information to steal compared to the big players in the market. Why? Bad actors take the path of least resistance, and startups tend to be less equipped to defend against cyber attacks, spending an average of $500 or less on cybersecurity.… Read More »How to Build a Robust Cybersecurity Strategy for Your Startup The post How to Build a Robust Cybersecurity Strategy for Your Startup appeared first on Data Science Central.  ( 24 min )
    App Development Trends for 2023
    It’s always difficult to predict the future with certainty, but based on current trends and emerging technologies, here are some potential app development trends for 2023: Augmented Reality (AR) and Virtual Reality (VR)  AR is a technology that overlays digital information in the real world. This can be accomplished through the use of a smartphone… Read More »App Development Trends for 2023 The post App Development Trends for 2023 appeared first on Data Science Central.  ( 20 min )
  • Open

    Survey Reveals How Telcos Plan to Ring in Change Using AI
    The telecommunications industry has for decades helped advance revolutionary change – enabling everything from telephones and television to online streaming and self-driving cars. Yet the industry has long been considered an evolutionary mover in its own business. A recent survey of more than 400 telecommunications industry professionals from around the world found that same cautious Read article >  ( 6 min )
  • Open

    How large is a Maidenhead field?
    The Maidenhead locator system divides the earth into fields, squares, and subsquares. The first two characters in a Maidenhead locator specify the square. These are letters A through R representing 20 degrees of longitude or 10 degrees of latitude. Latitude A runs from the South Pole to 80° south of the equator. Latitude R runs […] How large is a Maidenhead field? first appeared on John D. Cook.  ( 6 min )
    Area of a “rectangle” on a globe
    What do we mean by rectangle? This post will derive the area of a spherical region bounded by lines of latitude and longitude. Such a region corresponds to an actual rectangle on a Mercator projection map, with sides aligned with the coordinate axes, and is approximately a rectangle on a sphere if the rectangle is […] Area of a “rectangle” on a globe first appeared on John D. Cook.  ( 6 min )
  • Open

    Towards Building Text-To-Speech Systems for the Next Billion Users. (arXiv:2211.09536v3 [cs.CL] UPDATED)
    Deep learning based text-to-speech (TTS) systems have been evolving rapidly with advances in model architectures, training methodologies, and generalization across speakers and languages. However, these advances have not been thoroughly investigated for Indian language speech synthesis. Such investigation is computationally expensive given the number and diversity of Indian languages, relatively lower resource availability, and the diverse set of advances in neural TTS that remain untested. In this paper, we evaluate the choice of acoustic models, vocoders, supplementary loss functions, training schedules, and speaker and language diversity for Dravidian and Indo-Aryan languages. Based on this, we identify monolingual models with FastPitch and HiFi-GAN V1, trained jointly on male and female speakers to perform the best. With this setup, we train and evaluate TTS models for 13 languages and find our models to significantly improve upon existing models in all languages as measured by mean opinion scores. We open-source all models on the Bhashini platform.
    PyRelationAL: a python library for active learning research and development. (arXiv:2205.11117v2 [cs.LG] UPDATED)
    In constrained real-world scenarios, where it may be challenging or costly to generate data, disciplined methods for acquiring informative new data points are of fundamental importance for the efficient training of machine learning (ML) models. Active learning (AL) is a sub-field of ML focused on the development of methods to iteratively and economically acquire data through strategically querying new data points that are the most useful for a particular task. Here, we introduce PyRelationAL, an open source library for AL research. We describe a modular toolkit that is compatible with diverse ML frameworks (e.g. PyTorch, scikit-learn, TensorFlow, JAX). Furthermore, the library implements a wide range of published methods and provides API access to wide-ranging benchmark datasets and AL task configurations based on existing literature. The library is supplemented by an expansive set of tutorials, demos, and documentation to help users get started. PyRelationAL is maintained using modern software engineering practices -- with an inclusive contributor code of conduct -- to promote long term library quality and utilisation. PyRelationAL is available under a permissive Apache licence on PyPi and at https://github.com/RelationRx/pyrelational.
    The non-overlapping statistical approximation to overlapping group lasso. (arXiv:2211.09221v2 [stat.ML] UPDATED)
    Group lasso is a commonly used regularization method in statistical learning in which parameters are eliminated from the model according to predefined groups. However, when the groups overlap, optimizing the group lasso penalized objective can be time-consuming on large-scale problems because of the non-separability induced by the overlapping groups. This bottleneck has seriously limited the application of overlapping group lasso regularization in many modern problems, such as gene pathway selection and graphical model estimation. In this paper, we propose a separable penalty as an approximation of the overlapping group lasso penalty. Thanks to the separability, the computation of regularization based on our penalty is substantially faster than that of the overlapping group lasso, especially for large-scale and high-dimensional problems. We show that the penalty is the tightest separable relaxation of the overlapping group lasso norm within the family of $\ell_{q_1}/\ell_{q_2}$ norms. Moreover, we show that the estimator based on the proposed separable penalty is statistically equivalent to the one based on the overlapping group lasso penalty with respect to their error bounds and the rate-optimal performance under the squared loss. We demonstrate the faster computational time and statistical equivalence of our method compared with the overlapping group lasso in simulation examples and a classification problem of cancer tumors based on gene expression and multiple gene pathways.
    Untrained Graph Neural Networks for Denoising. (arXiv:2109.11700v2 [eess.SP] UPDATED)
    A fundamental problem in signal processing is to denoise a signal. While there are many well-performing methods for denoising signals defined on regular supports, such as images defined on two-dimensional grids of pixels, many important classes of signals are defined over irregular domains such as graphs. This paper introduces two untrained graph neural network architectures for graph signal denoising, provides theoretical guarantees for their denoising capabilities in a simple setup, and numerically validates the theoretical results in more general scenarios. The two architectures differ on how they incorporate the information encoded in the graph, with one relying on graph convolutions and the other employing graph upsampling operators based on hierarchical clustering. Each architecture implements a different prior over the targeted signals. To numerically illustrate the validity of the theoretical results and to compare the performance of the proposed architectures with other denoising alternatives, we present several experimental results with real and synthetic datasets.
    Spatiotemporal information conversion machine for time-series prediction. (arXiv:2107.01353v2 [cs.LG] UPDATED)
    Making predictions in a robust way is a difficult task only based on the observed data of a nonlinear system. In this work, a neural network computing framework, the spatiotemporal information conversion machine (STICM), was developed to efficiently and accurately render a multistep-ahead prediction of a time series by employing a spatial-temporal information (STI) transformation. STICM combines the advantages of both the STI equation and the temporal convolutional network, which maps the high-dimensional/spatial data to the future temporal values of a target variable, thus naturally providing the prediction of the target variable. From the observed variables, the STICM also infers the causal factors of the target variable in the sense of Granger causality, which are in turn selected as effective spatial information to improve the prediction robustness of time-series. The STICM was successfully applied to both benchmark systems and real-world datasets, all of which show superior and robust performance in multistep-ahead prediction, even when the data were perturbed by noise. From both theoretical and computational viewpoints, the STICM has great potential in practical applications in artificial intelligence (AI) or as a model-free method based only on the observed data, and also opens a new way to explore the observed high-dimensional data in a dynamical manner for machine learning.
    Federated contrastive learning models for prostate cancer diagnosis and Gleason grading. (arXiv:2302.06089v2 [cs.CV] UPDATED)
    The application effect of artificial intelligence (AI) in the field of medical imaging is remarkable. Robust AI model training requires large datasets, but data collection faces communication, ethics, and privacy protection constraints. Fortunately, federated learning can solve the above problems by coordinating multiple clients to train the model without sharing the original data. In this study, we design a federated contrastive learning framework (FCL) for large-scale pathology images and the heterogeneity challenges. It enhances the model's generalization ability by maximizing the attention consistency between the local client and server models. To alleviate the privacy leakage problem when transferring parameters and verify the robustness of FCL, we use differential privacy to further protect the model by adding noise. We evaluate the effectiveness of FCL on the cancer diagnosis task and Gleason grading task on 19,635 prostate cancer WSIs from multiple clients. In the diagnosis task, the average AUC of 7 clients is 95% when the categories are relatively balanced, and our FCL achieves 97%. In the Gleason grading task, the average Kappa of 6 clients is 0.74, and the Kappa of FCL reaches 0.84. Furthermore, we also validate the robustness of the model on external datasets(one public dataset and two private datasets). In addition, to better explain the classification effect of the model, we show whether the model focuses on the lesion area by drawing a heatmap. Finally, FCL brings a robust, accurate, low-cost AI training model to biomedical research, effectively protecting medical data privacy.
    Are Gaussian data all you need? Extents and limits of universality in high-dimensional generalized linear estimation. (arXiv:2302.08923v1 [math.ST])
    In this manuscript we consider the problem of generalized linear estimation on Gaussian mixture data with labels given by a single-index model. Our first result is a sharp asymptotic expression for the test and training errors in the high-dimensional regime. Motivated by the recent stream of results on the Gaussian universality of the test and training errors in generalized linear estimation, we ask ourselves the question: "when is a single Gaussian enough to characterize the error?". Our formula allow us to give sharp answers to this question, both in the positive and negative directions. More precisely, we show that the sufficient conditions for Gaussian universality (or lack of thereof) crucially depend on the alignment between the target weights and the means and covariances of the mixture clusters, which we precisely quantify. In the particular case of least-squares interpolation, we prove a strong universality property of the training error, and show it follows a simple, closed-form expression. Finally, we apply our results to real datasets, clarifying some recent discussion in the literature about Gaussian universality of the errors in this context.
    Modular Hybrid Autoregressive Transducer. (arXiv:2210.17049v2 [cs.CL] UPDATED)
    Text-only adaptation of a transducer model remains challenging for end-to-end speech recognition since the transducer has no clearly separated acoustic model (AM), language model (LM) or blank model. In this work, we propose a modular hybrid autoregressive transducer (MHAT) that has structurally separated label and blank decoders to predict label and blank distributions, respectively, along with a shared acoustic encoder. The encoder and label decoder outputs are directly projected to AM and internal LM scores and then added to compute label posteriors. We train MHAT with an internal LM loss and a HAT loss to ensure that its internal LM becomes a standalone neural LM that can be effectively adapted to text. Moreover, text adaptation of MHAT fosters a much better LM fusion than internal LM subtraction-based methods. On Google's large-scale production data, a multi-domain MHAT adapted with 100B sentences achieves relative WER reductions of up to 12.4% without LM fusion and 21.5% with LM fusion from 400K-hour trained HAT.
    The Asymmetric Maximum Margin Bias of Quasi-Homogeneous Neural Networks. (arXiv:2210.03820v2 [cs.LG] UPDATED)
    In this work, we explore the maximum-margin bias of quasi-homogeneous neural networks trained with gradient flow on an exponential loss and past a point of separability. We introduce the class of quasi-homogeneous models, which is expressive enough to describe nearly all neural networks with homogeneous activations, even those with biases, residual connections, and normalization layers, while structured enough to enable geometric analysis of its gradient dynamics. Using this analysis, we generalize the existing results of maximum-margin bias for homogeneous networks to this richer class of models. We find that gradient flow implicitly favors a subset of the parameters, unlike in the case of a homogeneous model where all parameters are treated equally. We demonstrate through simple examples how this strong favoritism toward minimizing an asymmetric norm can degrade the robustness of quasi-homogeneous models. On the other hand, we conjecture that this norm-minimization discards, when possible, unnecessary higher-order parameters, reducing the model to a sparser parameterization. Lastly, by applying our theorem to sufficiently expressive neural networks with normalization layers, we reveal a universal mechanism behind the empirical phenomenon of Neural Collapse.
    Deep reinforcement learning from human preferences. (arXiv:1706.03741v4 [stat.ML] UPDATED)
    For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent's interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any that have been previously learned from human feedback.
    OTB-morph: One-Time Biometrics via Morphing. (arXiv:2302.09053v1 [cs.LG])
    Cancelable biometrics are a group of techniques to transform the input biometric to an irreversible feature intentionally using a transformation function and usually a key in order to provide security and privacy in biometric recognition systems. This transformation is repeatable enabling subsequent biometric comparisons. This paper is introducing a new idea to exploit as a transformation function for cancelable biometrics aimed at protecting the templates against iterative optimization attacks. Our proposed scheme is based on time-varying keys (random biometrics in our case) and morphing transformations. An experimental implementation of the proposed scheme is given for face biometrics. The results confirm that the proposed approach is able to withstand against leakage attacks while improving the recognition performance.
    Port-metriplectic neural networks: thermodynamics-informed machine learning of complex physical systems. (arXiv:2211.01873v3 [cs.LG] UPDATED)
    We develop inductive biases for the machine learning of complex physical systems based on the port-Hamiltonian formalism. To satisfy by construction the principles of thermodynamics in the learned physics (conservation of energy, non-negative entropy production), we modify accordingly the port-Hamiltonian formalism so as to achieve a port-metriplectic one. We show that the constructed networks are able to learn the physics of complex systems by parts, thus alleviating the burden associated to the experimental characterization and posterior learning process of this kind of systems. Predictions can be done, however, at the scale of the complete system. Examples are shown on the performance of the proposed technique.
    Approximate Bayes Optimal Pseudo-Label Selection. (arXiv:2302.08883v1 [stat.ML])
    Semi-supervised learning by self-training heavily relies on pseudo-label selection (PLS). The selection often depends on the initial model fit on labeled data. Early overfitting might thus be propagated to the final model by selecting instances with overconfident but erroneous predictions, often referred to as confirmation bias. This paper introduces BPLS, a Bayesian framework for PLS that aims to mitigate this issue. At its core lies a criterion for selecting instances to label: an analytical approximation of the posterior predictive of pseudo-samples. We derive this selection criterion by proving Bayes optimality of the posterior predictive of pseudo-samples. We further overcome computational hurdles by approximating the criterion analytically. Its relation to the marginal likelihood allows us to come up with an approximation based on Laplace's method and the Gaussian integral. We empirically assess BPLS for parametric generalized linear and non-parametric generalized additive models on simulated and real-world data. When faced with high-dimensional data prone to overfitting, BPLS outperforms traditional PLS methods.
    BiFeat: Supercharge GNN Training via Graph Feature Quantization. (arXiv:2207.14696v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) is a promising approach for applications with nonEuclidean data. However, training GNNs on large scale graphs with hundreds of millions nodes is both resource and time consuming. Different from DNNs, GNNs usually have larger memory footprints, and thus the GPU memory capacity and PCIe bandwidth are the main resource bottlenecks in GNN training. To address this problem, we present BiFeat: a graph feature quantization methodology to accelerate GNN training by significantly reducing the memory footprint and PCIe bandwidth requirement so that GNNs can take full advantage of GPU computing capabilities. Our key insight is that unlike DNN, GNN is less prone to the information loss of input features caused by quantization. We identify the main accuracy impact factors in graph feature quantization and theoretically prove that BiFeat training converges to a network where the loss is within $\epsilon$ of the optimal loss of uncompressed network. We perform extensive evaluation of BiFeat using several popular GNN models and datasets, including GraphSAGE on MAG240M, the largest public graph dataset. The results demonstrate that BiFeat achieves a compression ratio of more than 30 and improves GNN training speed by 200%-320% with marginal accuracy loss. In particular, BiFeat achieves a record by training GraphSAGE on MAG240M within one hour using only four GPUs.
    Positional Encoder Graph Neural Networks for Geographic Data. (arXiv:2111.10144v3 [cs.LG] UPDATED)
    Graph neural networks (GNNs) provide a powerful and scalable solution for modeling continuous spatial data. However, they often rely on Euclidean distances to construct the input graphs. This assumption can be improbable in many real-world settings, where the spatial structure is more complex and explicitly non-Euclidean (e.g., road networks). Here, we propose PE-GNN, a new framework that incorporates spatial context and correlation explicitly into the models. Building on recent advances in geospatial auxiliary task learning and semantic spatial embeddings, our proposed method (1) learns a context-aware vector encoding of the geographic coordinates and (2) predicts spatial autocorrelation in the data in parallel with the main task. On spatial interpolation and regression tasks, we show the effectiveness of our approach, improving performance over different state-of-the-art GNN approaches. We observe that our approach not only vastly improves over the GNN baselines, but can match Gaussian processes, the most commonly utilized method for spatial interpolation problems.
    Multimodal Subtask Graph Generation from Instructional Videos. (arXiv:2302.08672v1 [cs.LG])
    Real-world tasks consist of multiple inter-dependent subtasks (e.g., a dirty pan needs to be washed before it can be used for cooking). In this work, we aim to model the causal dependencies between such subtasks from instructional videos describing the task. This is a challenging problem since complete information about the world is often inaccessible from videos, which demands robust learning mechanisms to understand the causal structure of events. We present Multimodal Subtask Graph Generation (MSG2), an approach that constructs a Subtask Graph defining the dependency between a task's subtasks relevant to a task from noisy web videos. Graphs generated by our multimodal approach are closer to human-annotated graphs compared to prior approaches. MSG2 further performs the downstream task of next subtask prediction 85% and 30% more accurately than recent video transformer models in the ProceL and CrossTask datasets, respectively.
    Hyperparameter Optimization as a Service on INFN Cloud. (arXiv:2301.05522v2 [cs.DC] UPDATED)
    The simplest and often most effective way of parallelizing the training of complex machine learning models is to execute several training instances on multiple machines, possibly scanning the hyperparameter space to optimize the underlying statistical model and the learning procedure. Often, such a meta learning procedure is limited by the ability of accessing securely a common database organizing the knowledge of the previous and ongoing trials. Exploiting opportunistic GPUs provided in different environments represents a further challenge when designing such optimization campaigns. In this contribution we discuss how a set of RestAPIs can be used to access a dedicated service based on INFN Cloud to monitor and possibly coordinate multiple training instances, with gradient-less optimization techniques, via simple HTTP requests. The service, named Hopaas (Hyperparameter OPtimization As A Service), is made of web interface and sets of APIs implemented with a FastAPI back-end running through Uvicorn and NGINX in a virtual instance of INFN Cloud. The optimization algorithms are currently based on Bayesian techniques as provided by Optuna. A Python front-end is also made available for quick prototyping. We present applications to hyperparameter optimization campaigns performed combining private, INFN Cloud and CINECA resources.
    Building Shortcuts between Distant Nodes with Biaffine Mapping for Graph Convolutional Networks. (arXiv:2302.08727v1 [cs.LG])
    Multiple recent studies show a paradox in graph convolutional networks (GCNs), that is, shallow architectures limit the capability of learning information from high-order neighbors, while deep architectures suffer from over-smoothing or over-squashing. To enjoy the simplicity of shallow architectures and overcome their limits of neighborhood extension, in this work, we introduce Biaffine technique to improve the expressiveness of graph convolutional networks with a shallow architecture. The core design of our method is to learn direct dependency on long-distance neighbors for nodes, with which only one-hop message passing is capable of capturing rich information for node representation. Besides, we propose a multi-view contrastive learning method to exploit the representations learned from long-distance dependencies. Extensive experiments on nine graph benchmark datasets suggest that the shallow biaffine graph convolutional networks (BAGCN) significantly outperforms state-of-the-art GCNs (with deep or shallow architectures) on semi-supervised node classification. We further verify the effectiveness of biaffine design in node representation learning and the performance consistency on different sizes of training data.
    Which country is this picture from? New data and methods for DNN-based country recognition. (arXiv:2209.02429v2 [cs.CV] UPDATED)
    Recognizing the country where a picture has been taken has many potential applications, such as identification of fake news and prevention of disinformation campaigns. Previous works focused on the estimation of the geo-coordinates where a picture has been taken. Yet, recognizing in which country an image was taken could be more critical, from a semantic and forensic point of view, than estimating its spatial coordinates. In the above framework, this paper provides two contributions. First, we introduce the VIPPGeo dataset, containing 3.8 million geo-tagged images. Secondly, we used the dataset to train a model casting the country recognition problem as a classification problem. The experiments show that our model provides better results than the current state of the art. Notably, we found that asking the network to identify the country provides better results than estimating the geo-coordinates and then tracing them back to the country where the picture was taken.
    Privately Customizing Prefinetuning to Better Match User Data in Federated Learning. (arXiv:2302.09042v1 [cs.LG])
    In Federated Learning (FL), accessing private client data incurs communication and privacy costs. As a result, FL deployments commonly prefinetune pretrained foundation models on a (large, possibly public) dataset that is held by the central server; they then FL-finetune the model on a private, federated dataset held by clients. Evaluating prefinetuning dataset quality reliably and privately is therefore of high importance. To this end, we propose FreD (Federated Private Fr\'echet Distance) -- a privately computed distance between a prefinetuning dataset and federated datasets. Intuitively, it privately computes and compares a Fr\'echet distance between embeddings generated by a large language model on both the central (public) dataset and the federated private client data. To make this computation privacy-preserving, we use distributed, differentially-private mean and covariance estimators. We show empirically that FreD accurately predicts the best prefinetuning dataset at minimal privacy cost. Altogether, using FreD we demonstrate a proof-of-concept for a new approach in private FL training: (1) customize a prefinetuning dataset to better match user data (2) prefinetune (3) perform FL-finetuning.
    Nonmyopic Multiclass Active Search with Diminishing Returns for Diverse Discovery. (arXiv:2202.03593v3 [cs.LG] UPDATED)
    Active search is a setting in adaptive experimental design where we aim to uncover members of rare, valuable class(es) subject to a budget constraint. An important consideration in this problem is diversity among the discovered targets -- in many applications, diverse discoveries offer more insight and may be preferable in downstream tasks. However, most existing active search policies either assume that all targets belong to a common positive class or encourage diversity via simple heuristics. We present a novel formulation of active search with multiple target classes, characterized by a utility function chosen from a flexible family whose members encourage diversity via a diminishing returns mechanism. We then study this problem under the Bayesian lens and prove a hardness result for approximating the optimal policy for arbitrary positive, increasing, and concave utility functions. Finally, we design an efficient, nonmyopic approximation to the optimal policy for this class of utilities and demonstrate its superior empirical performance in a variety of settings, including drug discovery.
    Physics-based parameterized neural ordinary differential equations: prediction of laser ignition in a rocket combustor. (arXiv:2302.08629v1 [cs.LG])
    In this work, we present a novel physics-based data-driven framework for reduced-order modeling of laser ignition in a model rocket combustor based on parameterized neural ordinary differential equations (PNODE). Deep neural networks are embedded as functions of high-dimensional parameters of laser ignition to predict various terms in a 0D flow model including the heat source function, pre-exponential factors, and activation energy. Using the governing equations of a 0D flow model, our PNODE needs only a limited number of training samples and predicts trajectories of various quantities such as temperature, pressure, and mass fractions of species while satisfying physical constraints. We validate our physics-based PNODE on solution snapshots of high-fidelity Computational Fluid Dynamics (CFD) simulations of laser-induced ignition in a prototype rocket combustor. We compare the performance of our physics-based PNODE with that of kernel ridge regression and fully connected neural networks. Our results show that our physics-based PNODE provides solutions with lower mean absolute errors of average temperature over time, thus improving the prediction of successful laser ignition with high-dimensional parameters.
    Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints. (arXiv:2212.05055v2 [cs.LG] UPDATED)
    Training large, deep neural networks to convergence can be prohibitively expensive. As a result, often only a small selection of popular, dense models are reused across different contexts and tasks. Increasingly, sparsely activated models, which seek to decouple model size from computation costs, are becoming an attractive alternative to dense models. Although more efficient in terms of quality and computation cost, sparse models remain data-hungry and costly to train from scratch in the large scale regime. In this work, we propose sparse upcycling -- a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint. We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models, respectively, significantly outperform their dense counterparts on SuperGLUE and ImageNet, using only ~50% of the initial dense pretraining sunk cost. The upcycled models also outperform sparse models trained from scratch on 100% of the initial dense pretraining computation budget.
    CovidExpert: A Triplet Siamese Neural Network framework for the detection of COVID-19. (arXiv:2302.09004v1 [cs.CV])
    Patients with the COVID-19 infection may have pneumonia-like symptoms as well as respiratory problems which may harm the lungs. From medical images, coronavirus illness may be accurately identified and predicted using a variety of machine learning methods. Most of the published machine learning methods may need extensive hyperparameter adjustment and are unsuitable for small datasets. By leveraging the data in a comparatively small dataset, few-shot learning algorithms aim to reduce the requirement of large datasets. This inspired us to develop a few-shot learning model for early detection of COVID-19 to reduce the post-effect of this dangerous disease. The proposed architecture combines few-shot learning with an ensemble of pre-trained convolutional neural networks to extract feature vectors from CT scan images for similarity learning. The proposed Triplet Siamese Network as the few-shot learning model classified CT scan images into Normal, COVID-19, and Community-Acquired Pneumonia. The suggested model achieved an overall accuracy of 98.719%, a specificity of 99.36%, a sensitivity of 98.72%, and a ROC score of 99.9% with only 200 CT scans per category for training data.
    Data-driven framework for input/output lookup tables reduction: Application to hypersonic flows in chemical non-equilibrium. (arXiv:2210.04269v4 [physics.flu-dyn] UPDATED)
    In this paper, we present a novel model-agnostic machine learning technique to extract a reduced thermochemical model for reacting hypersonic flows simulation. A first simulation gathers all relevant thermodynamic states and the corresponding gas properties via a given model. The states are embedded in a low-dimensional space and clustered to identify regions with different levels of thermochemical (non)-equilibrium. Then, a surrogate surface from the reduced cluster-space to the output space is generated using radial-basis-function networks. The method is validated and benchmarked on a simulation of a hypersonic flat-plate boundary layer with finite-rate chemistry. The gas properties of the reactive air mixture are initially modeled using the open-source Mutation++ library. Substituting Mutation++ with the light-weight, machine-learned alternative improves the performance of the solver by 50% while maintaining overall accuracy.
    A Kernel-Based View of Language Model Fine-Tuning. (arXiv:2210.05643v2 [cs.LG] UPDATED)
    It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of empirical success, e.g., why fine-tuning a model with $10^8$ or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural Tangent Kernel (NTK) - which originated as a model to study the gradient descent dynamics of infinitely wide networks with suitable random initialization - describes fine-tuning of pre-trained LMs. This study was inspired by the decent performance of NTK for computer vision tasks (Wei et al., 2022). We extend the NTK formalism to Adam and use Tensor Programs (Yang, 2020) to characterize conditions under which the NTK lens may describe fine-tuning updates to pre-trained language models. Extensive experiments on 14 NLP tasks validate our theory and show that formulating the downstream task as a masked word prediction problem through prompting often induces kernel-based dynamics during fine-tuning. Finally, we use this kernel view to propose an explanation for the success of parameter-efficient subspace-based fine-tuning methods.
    On Model Selection Consistency of Lasso for High-Dimensional Ising Models. (arXiv:2110.08500v4 [stat.ML] UPDATED)
    We theoretically analyze the model selection consistency of least absolute shrinkage and selection operator (Lasso), both with and without post-thresholding, for high-dimensional Ising models. For random regular (RR) graphs of size $p$ with regular node degree $d$ and uniform couplings $\theta_0$, it is rigorously proved that Lasso \textit{without post-thresholding} is model selection consistent in the whole paramagnetic phase with the same order of sample complexity $n=\Omega{(d^3\log{p})}$ as that of $\ell_1$-regularized logistic regression ($\ell_1$-LogR). This result is consistent with the conjecture in Meng, Obuchi, and Kabashima 2021 using the non-rigorous replica method from statistical physics and thus complements it with a rigorous proof. For general tree-like graphs, it is demonstrated that the same result as RR graphs can be obtained under mild assumptions of the dependency condition and incoherence condition. Moreover, we provide a rigorous proof of the model selection consistency of Lasso with post-thresholding for general tree-like graphs in the paramagnetic phase without further assumptions on the dependency and incoherence conditions. Experimental results agree well with our theoretical analysis.
    SenseFi: A Library and Benchmark on Deep-Learning-Empowered WiFi Human Sensing. (arXiv:2207.07859v3 [cs.LG] UPDATED)
    WiFi sensing has been evolving rapidly in recent years. Empowered by propagation models and deep learning methods, many challenging applications are realized such as WiFi-based human activity recognition and gesture recognition. However, in contrast to deep learning for visual recognition and natural language processing, no sufficiently comprehensive public benchmark exists. In this paper, we review the recent progress on deep learning enabled WiFi sensing, and then propose a benchmark, SenseFi, to study the effectiveness of various deep learning models for WiFi sensing. These advanced models are compared in terms of distinct sensing tasks, WiFi platforms, recognition accuracy, model size, computational complexity, feature transferability, and adaptability of unsupervised learning. It is also regarded as a tutorial for deep learning based WiFi sensing, starting from CSI hardware platform to sensing algorithms. The extensive experiments provide us with experiences in deep model design, learning strategy skills and training techniques for real-world applications. To the best of our knowledge, this is the first benchmark with an open-source library for deep learning in WiFi sensing research. The benchmark codes are available at https://github.com/xyanchen/WiFi-CSI-Sensing-Benchmark.
    Stable Deep MRI Reconstruction using Generative Priors. (arXiv:2210.13834v2 [eess.IV] UPDATED)
    Data-driven approaches recently achieved remarkable success in magnetic resonance imaging (MRI) reconstruction, but integration into clinical routine remains challenging due to a lack of generalizability and interpretability. In this paper, we address these challenges in a unified framework based on generative image priors. We propose a novel deep neural network based regularizer which is trained in an unsupervised setting on reference magnitude images only. After training, the regularizer encodes higher-level domain statistics which we demonstrate by synthesizing images without data. Embedding the trained model in a classical variational approach yields high-quality reconstructions irrespective of the sub-sampling pattern. In addition, the model shows stable behavior even if the test data deviate significantly from the training data. Furthermore, a probabilistic interpretation provides a distribution of reconstructions and hence allows uncertainty quantification. To reconstruct parallel MRI, we propose a fast algorithm to jointly estimate the image and the sensitivity maps. The results demonstrate competitive performance, on par with state-of-the-art end-to-end deep learning methods, while preserving the flexibility with respect to sub-sampling patterns and allowing for uncertainty quantification.
    Distances for Markov Chains, and Their Differentiation. (arXiv:2302.08621v1 [cs.LG])
    (Directed) graphs with node attributes are a common type of data in various applications and there is a vast literature on developing metrics and efficient algorithms for comparing them. Recently, in the graph learning and optimization communities, a range of new approaches have been developed for comparing graphs with node attributes, leveraging ideas such as the Optimal Transport (OT) and the Weisfeiler-Lehman (WL) graph isomorphism test. Two state-of-the-art representatives are the OTC distance proposed by O'Connor et al., 2022 and the WL distance by Chen et al.,2022. Interestingly, while these two distances are developed based on different ideas, we observe that they both view graphs as Markov chains, and are deeply connected. Indeed, in this paper, we propose a unified framework to generate distances for Markov chains (thus including (directed) graphs with node attributes), which we call the Optimal Transport Markov (OTM) distances, that encompass both the OTC and the WL distances. We further introduce a special one-parameter family of distances within our OTM framework, called the discounted WL distance. We show that the discounted WL distance has nice theoretical properties and can address several limitations of the existing OTC and WL distances. Furthermore, contrary to the OTC and the WL distances, we show our new discounted WL distance can be differentiated (after an entropy-regularization similar to the Sinkhorn distance), making it suitable for use in learning frameworks, e.g., as the reconstruction loss in a graph generative model.
    Complex QA and language models hybrid architectures, Survey. (arXiv:2302.09051v1 [cs.CL])
    This paper provides a survey of the state of the art of hybrid language models architectures and strategies for "complex" question-answering (QA, CQA, CPS). Very large language models are good at leveraging public data on standard problems but once you want to tackle more specific complex questions or problems you may need specific architecture, knowledge, skills, tasks, methods, sensitive data, performance, human approval and versatile feedback... This survey extends findings from the robust community edited research papers BIG, BLOOM and HELM which open source, benchmark and analyze limits and challenges of large language models in terms of tasks complexity and strict evaluation on accuracy (e.g. fairness, robustness, toxicity, ...). It identifies the key elements used with Large Language Models (LLM) to solve complex questions or problems. Recent projects like ChatGPT and GALACTICA have allowed non-specialists to grasp the great potential as well as the equally strong limitations of language models in complex QA. Hybridizing these models with different components could allow to overcome these different limits and go much further. We discuss some challenges associated with complex QA, including domain adaptation, decomposition and efficient multi-step QA, long form QA, non-factoid QA, safety and multi-sensitivity data protection, multimodal search, hallucinations, QA explainability and truthfulness, time dimension. Therefore we review current solutions and promising strategies, using elements such as hybrid LLM architectures, human-in-the-loop reinforcement learning, prompting adaptation, neuro-symbolic and structured knowledge grounding, program synthesis, and others. We analyze existing solutions and provide an overview of the current research and trends in the area of complex QA.
    A survey on online active learning. (arXiv:2302.08893v1 [stat.ML])
    Online active learning is a paradigm in machine learning that aims to select the most informative data points to label from a data stream. The problem of minimizing the cost associated with collecting labeled observations has gained a lot of attention in recent years, particularly in real-world applications where data is only available in an unlabeled form. Annotating each observation can be time-consuming and costly, making it difficult to obtain large amounts of labeled data. To overcome this issue, many active learning strategies have been proposed in the last decades, aiming to select the most informative observations for labeling in order to improve the performance of machine learning models. These approaches can be broadly divided into two categories: static pool-based and stream-based active learning. Pool-based active learning involves selecting a subset of observations from a closed pool of unlabeled data, and it has been the focus of many surveys and literature reviews. However, the growing availability of data streams has led to an increase in the number of approaches that focus on online active learning, which involves continuously selecting and labeling observations as they arrive in a stream. This work aims to provide an overview of the most recently proposed approaches for selecting the most informative observations from data streams in the context of online active learning. We review the various techniques that have been proposed and discuss their strengths and limitations, as well as the challenges and opportunities that exist in this area of research. Our review aims to provide a comprehensive and up-to-date overview of the field and to highlight directions for future work.
    A Hybrid Chimp Optimization Algorithm and Generalized Normal Distribution Algorithm with Opposition-Based Learning Strategy for Solving Data Clustering Problems. (arXiv:2302.08623v1 [cs.LG])
    This paper is concerned with data clustering to separate clusters based on the connectivity principle for categorizing similar and dissimilar data into different groups. Although classical clustering algorithms such as K-means are efficient techniques, they often trap in local optima and have a slow convergence rate in solving high-dimensional problems. To address these issues, many successful meta-heuristic optimization algorithms and intelligence-based methods have been introduced to attain the optimal solution in a reasonable time. They are designed to escape from a local optimum problem by allowing flexible movements or random behaviors. In this study, we attempt to conceptualize a powerful approach using the three main components: Chimp Optimization Algorithm (ChOA), Generalized Normal Distribution Algorithm (GNDA), and Opposition-Based Learning (OBL) method. Firstly, two versions of ChOA with two different independent groups' strategies and seven chaotic maps, entitled ChOA(I) and ChOA(II), are presented to achieve the best possible result for data clustering purposes. Secondly, a novel combination of ChOA and GNDA algorithms with the OBL strategy is devised to solve the major shortcomings of the original algorithms. Lastly, the proposed ChOAGNDA method is a Selective Opposition (SO) algorithm based on ChOA and GNDA, which can be used to tackle large and complex real-world optimization problems, particularly data clustering applications. The results are evaluated against seven popular meta-heuristic optimization algorithms and eight recent state-of-the-art clustering techniques. Experimental results illustrate that the proposed work significantly outperforms other existing methods in terms of the achievement in minimizing the Sum of Intra-Cluster Distances (SICD), obtaining the lowest Error Rate (ER), accelerating the convergence speed, and finding the optimal cluster centers.
    G-Signatures: Global Graph Propagation With Randomized Signatures. (arXiv:2302.08811v1 [cs.LG])
    Graph neural networks (GNNs) have evolved into one of the most popular deep learning architectures. However, GNNs suffer from over-smoothing node information and, therefore, struggle to solve tasks where global graph properties are relevant. We introduce G-Signatures, a novel graph learning method that enables global graph propagation via randomized signatures. G-Signatures use a new graph lifting concept to embed graph structured information, which can be interpreted as path in latent space. We further introduce the idea of latent space path mapping, which allows us to repetitively traverse latent space paths, and, thus globally process information. G-Signatures excel at extracting and processing global graph properties, and effectively scale to large graph problems. Empirically, we confirm the advantages of our G-Signatures at several classification and regression tasks.
    Welfare and Fairness Dynamics in Federated Learning: A Client Selection Perspective. (arXiv:2302.08976v1 [cs.LG])
    Federated learning (FL) is a privacy-preserving learning technique that enables distributed computing devices to train shared learning models across data silos collaboratively. Existing FL works mostly focus on designing advanced FL algorithms to improve the model performance. However, the economic considerations of the clients, such as fairness and incentive, are yet to be fully explored. Without such considerations, self-motivated clients may lose interest and leave the federation. To address this problem, we designed a novel incentive mechanism that involves a client selection process to remove low-quality clients and a money transfer process to ensure a fair reward distribution. Our experimental results strongly demonstrate that the proposed incentive mechanism can effectively improve the duration and fairness of the federation.
    SE(3) symmetry lets graph neural networks learn arterial velocity estimation from small datasets. (arXiv:2302.08780v1 [cs.LG])
    Hemodynamic velocity fields in coronary arteries could be the basis of valuable biomarkers for diagnosis, prognosis and treatment planning in cardiovascular disease. Velocity fields are typically obtained from patient-specific 3D artery models via computational fluid dynamics (CFD). However, CFD simulation requires meticulous setup by experts and is time-intensive, which hinders large-scale acceptance in clinical practice. To address this, we propose graph neural networks (GNN) as an efficient black-box surrogate method to estimate 3D velocity fields mapped to the vertices of tetrahedral meshes of the artery lumen. We train these GNNs on synthetic artery models and CFD-based ground truth velocity fields. Once the GNN is trained, velocity estimates in a new and unseen artery can be obtained with 36-fold speed-up compared to CFD. We demonstrate how to construct an SE(3)-equivariant GNN that is independent of the spatial orientation of the input mesh and show how this reduces the necessary amount of training data compared to a baseline neural network.
    VEGETA: Vertically-Integrated Extensions for Sparse/Dense GEMM Tile Acceleration on CPUs. (arXiv:2302.08687v1 [cs.AR])
    Deep Learning (DL) acceleration support in CPUs has recently gained a lot of traction, with several companies (Arm, Intel, IBM) announcing products with specialized matrix engines accessible via GEMM instructions. CPUs are pervasive and need to handle diverse requirements across DL workloads running in edge/HPC/cloud platforms. Therefore, as DL workloads embrace sparsity to reduce the computations and memory size of models, it is also imperative for CPUs to add support for sparsity to avoid under-utilization of the dense matrix engine and inefficient usage of the caches and registers. This work presents VEGETA, a set of ISA and microarchitecture extensions over dense matrix engines to support flexible structured sparsity for CPUs, enabling programmable support for diverse DL models with varying degrees of sparsity. Compared to the state-of-the-art (SOTA) dense matrix engine in CPUs, a VEGETA engine provides 1.09x, 2.20x, 3.74x, and 3.28x speed-ups when running 4:4 (dense), 2:4, 1:4, and unstructured (95%) sparse DNN layers.
    Defense Mechanisms Against Training-Hijacking Attacks in Split Learning. (arXiv:2302.08618v1 [cs.LG])
    Distributed deep learning frameworks enable more efficient and privacy-aware training of deep neural networks across multiple clients. Split learning achieves this by splitting a neural network between a client and a server such that the client computes the initial set of layers, and the server computes the rest. However, this method introduces a unique attack vector for a malicious server attempting to recover the client's private inputs: the server can direct the client model towards learning any task of its choice, e.g. towards outputting easily invertible values. With a concrete example already proposed (Pasquini et al., ACM CCS '21), such \textit{training-hijacking} attacks present a significant risk for the data privacy of split learning clients. We propose two methods for a split learning client to detect if it is being targeted by a training-hijacking attack or not. We experimentally evaluate our methods' effectiveness, compare them with other potential solutions, and discuss various points related to their use. Our conclusion is that by using the method that best suits their use case, split learning clients can consistently detect training-hijacking attacks and thus keep the information gained by the attacker at a minimum.
    Autonomy and Intelligence in the Computing Continuum: Challenges, Enablers, and Future Directions for Orchestration. (arXiv:2205.01423v3 [cs.MA] UPDATED)
    Future AI applications require performance, reliability and privacy that the existing, cloud-dependant system architectures cannot provide. In this article, we study orchestration in the device-edge-cloud continuum, and focus on edge AI for resource orchestration. We claim that to support the constantly growing requirements of intelligent applications in the device-edge-cloud computing continuum, resource orchestration needs to embrace edge AI and emphasize local autonomy and intelligence. To justify the claim, we provide a general definition for continuum orchestration, and look at how current and emerging orchestration paradigms are suitable for the computing continuum. We describe certain major emerging research themes that may affect future orchestration, and provide an early vision of an orchestration paradigm that embraces those research themes. Finally, we survey current key edge AI methods and look at how they may contribute into fulfilling the vision of future continuum orchestration.
    Graphical estimation of multivariate count time series. (arXiv:2302.08801v1 [stat.ML])
    The problems of selecting partial correlation and causality graphs for count data are considered. A parameter driven generalized linear model is used to describe the observed multivariate time series of counts. Partial correlation and causality graphs corresponding to this model explain the dependencies between each time series of the multivariate count data. In order to estimate these graphs with tunable sparsity, an appropriate likelihood function maximization is regularized with an l1-type constraint. A novel MCEM algorithm is proposed to iteratively solve this regularized MLE. Asymptotic convergence results are proved for the sequence generated by the proposed MCEM algorithm with l1-type regularization. The algorithm is first successfully tested on simulated data. Thereafter, it is applied to observed weekly dengue disease counts from each ward of Greater Mumbai city. The interdependence of various wards in the proliferation of the disease is characterized by the edges of the inferred partial correlation graph. On the other hand, the relative roles of various wards as sources and sinks of dengue spread is quantified by the number and weights of the directed edges originating from and incident upon each ward. From these estimated graphs, it is observed that some special wards act as epicentres of dengue spread even though their disease counts are relatively low.
    Dynamic MRI using Learned Transform-based Tensor Low-Rank Network (LT$^2$LR-Net). (arXiv:2206.00850v2 [eess.IV] UPDATED)
    While low-rank matrix prior has been exploited in dynamic MR image reconstruction and has obtained satisfying performance, tensor low-rank models have recently emerged as powerful alternative representations for three-dimensional dynamic MR datasets. In this paper, we introduce a novel deep unrolling network for dynamic MRI, namely the learned transform-based tensor low-rank network (LT$^2$LR-Net). First, we generalize the tensor singular value decomposition (t-SVD) into an arbitrary unitary transform-based version and subsequently propose the novel transformed tensor nuclear norm (TTNN). Then, we design a novel TTNN-based iterative optimization algorithm based on the alternating direction method of multipliers (ADMM) to exploit the tensor low-rank prior in the transformed domain. The corresponding iterative steps are unrolled into the proposed LT$^2$LR-Net, where the convolutional neural network (CNN) is incorporated to adaptively learn the transformation from the dynamic MR dataset for more robust and accurate tensor low-rank representations. Experimental results on the cardiac cine MR dataset demonstrate that the proposed framework can provide improved recovery results compared with the state-of-the-art methods.
    Multi-Objective reward generalization: Improving performance of Deep Reinforcement Learning for applications in single-asset trading. (arXiv:2203.04579v2 [cs.LG] UPDATED)
    We investigate the potential of Multi-Objective, Deep Reinforcement Learning for stock and cryptocurrency single-asset trading: in particular, we consider a Multi-Objective algorithm which generalizes the reward functions and discount factor (i.e., these components are not specified a priori, but incorporated in the learning process). Firstly, using several important assets (cryptocurrency pairs BTCUSD, ETHUSDT, XRPUSDT, and stock indexes AAPL, SPY, NIFTY50), we verify the reward generalization property of the proposed Multi-Objective algorithm, and provide preliminary statistical evidence showing increased predictive stability over the corresponding Single-Objective strategy. Secondly, we show that the Multi-Objective algorithm has a clear edge over the corresponding Single-Objective strategy when the reward mechanism is sparse (i.e., when non-null feedback is infrequent over time). Finally, we discuss the generalization properties with respect to the discount factor. The entirety of our code is provided in open source format.
    Privacy in Practice: Private COVID-19 Detection in X-Ray Images. (arXiv:2211.11434v2 [cs.LG] UPDATED)
    Machine learning (ML) can help fight pandemics like COVID-19 by enabling rapid screening of large volumes of images. To perform data analysis while maintaining patient privacy, we create ML models that satisfy Differential Privacy (DP). Previous works exploring private COVID-19 models are in part based on small datasets, provide weaker or unclear privacy guarantees, and do not investigate practical privacy. We suggest improvements to address these open gaps. We account for inherent class imbalances and evaluate the utility-privacy trade-off more extensively and over stricter privacy budgets. Our evaluation is supported by empirically estimating practical privacy through black-box Membership Inference Attacks (MIAs). The introduced DP should help limit leakage threats posed by MIAs, and our practical analysis is the first to test this hypothesis on the COVID-19 classification task. Our results indicate that needed privacy levels might differ based on the task-dependent practical threat from MIAs. The results further suggest that with increasing DP guarantees, empirical privacy leakage only improves marginally, and DP therefore appears to have a limited impact on practical MIA defense. Our findings identify possibilities for better utility-privacy trade-offs, and we believe that empirical attack-specific privacy estimation can play a vital role in tuning for practical privacy.
    PhaseNet: Phase-Encode Denoising Network for Compressed Sensing MRI. (arXiv:2302.08861v1 [eess.IV])
    Sparse reconstruction is an important aspect of modern medical imaging, reducing the acquisition time of relatively slow modalities such as magnetic resonance imaging (MRI). Popular methods are based mostly on compressed sensing (CS), which relies on the random sampling of Fourier coefficients ($k$-space) to produce incoherent (noise-like) artefacts that can be removed via convex optimisation. Hardware constraints currently limit Cartesian CS to one dimensional (1D) phase-encode undersampling schemes, leading to coherent and structured artefacts. Reconstruction algorithms typically deploy an idealised and limited 2D regularisation for artefact removal, which increases the difficulty of image recovery. Recognising that phase-encode artefacts can be separated into contiguous 1D signals, we develop two decoupling techniques that enable explicit 1D regularisation. We thereby leverage the excellent incoherence characteristics in the phase-encode direction. We also derive a combined 1D + 2D reconstruction technique that further takes advantage of spatial relationships within the image, leading to an improvement of existing 2D deep-learned (DL) recovery techniques. Performance is evaluated on a brain and knee dataset. We find the proposed 1D CNN modules significantly improve PSNR and SSIM scores compared to the base 2D models, demonstrating a superior scaling of performance compared to increasing the size of 2D network layers.
    Expressive architectures enhance interpretability of dynamics-based neural population models. (arXiv:2212.03771v2 [q-bio.NC] UPDATED)
    Artificial neural networks that can recover latent dynamics from recorded neural activity may provide a powerful avenue for identifying and interpreting the dynamical motifs underlying biological computation. Given that neural variance alone does not uniquely determine a latent dynamical system, interpretable architectures should prioritize accurate and low-dimensional latent dynamics. In this work, we evaluated the performance of sequential autoencoders (SAEs) in recovering latent chaotic attractors from simulated neural datasets. We found that SAEs with widely-used recurrent neural network (RNN)-based dynamics were unable to infer accurate firing rates at the true latent state dimensionality, and that larger RNNs relied upon dynamical features not present in the data. On the other hand, SAEs with neural ordinary differential equation (NODE)-based dynamics inferred accurate rates at the true latent state dimensionality, while also recovering latent trajectories and fixed point structure. Ablations reveal that this is mainly because NODEs (1) allow use of higher-capacity multi-layer perceptrons (MLPs) to model the vector field and (2) predict the derivative rather than the next state. Decoupling the capacity of the dynamics model from its latent dimensionality enables NODEs to learn the requisite low-D dynamics where RNN cells fail. Additionally, the fact that the NODE predicts derivatives imposes a useful autoregressive prior on the latent states. The suboptimal interpretability of widely-used RNN-based dynamics may motivate substitution for alternative architectures, such as NODE, that enable learning of accurate dynamics in low-dimensional latent spaces.
    Cross-Domain Label Propagation for Domain Adaptation with Discriminative Graph Self-Learning. (arXiv:2302.08710v1 [cs.LG])
    Domain adaptation manages to transfer the knowledge of well-labeled source data to unlabeled target data. Many recent efforts focus on improving the prediction accuracy of target pseudo-labels to reduce conditional distribution shift. In this paper, we propose a novel domain adaptation method, which infers target pseudo-labels through cross-domain label propagation, such that the underlying manifold structure of two domain data can be explored. Unlike existing cross-domain label propagation methods that separate domain-invariant feature learning, affinity matrix constructing and target labels inferring into three independent stages, we propose to integrate them into a unified optimization framework. In such way, these three parts can boost each other from an iterative optimization perspective and thus more effective knowledge transfer can be achieved. Furthermore, to construct a high-quality affinity matrix, we propose a discriminative graph self-learning strategy, which can not only adaptively capture the inherent similarity of the data from two domains but also effectively exploit the discriminative information contained in well-labeled source data and pseudo-labeled target data. An efficient iterative optimization algorithm is designed to solve the objective function of our proposal. Notably, the proposed method can be extended to semi-supervised domain adaptation in a simple but effective way and the corresponding optimization problem can be solved with the identical algorithm. Extensive experiments on six standard datasets verify the significant superiority of our proposal in both unsupervised and semi-supervised domain adaptation settings.
    Beyond Rewards: a Hierarchical Perspective on Offline Multiagent Behavioral Analysis. (arXiv:2206.09046v3 [cs.LG] UPDATED)
    Each year, expert-level performance is attained in increasingly-complex multiagent domains, where notable examples include Go, Poker, and StarCraft II. This rapid progression is accompanied by a commensurate need to better understand how such agents attain this performance, to enable their safe deployment, identify limitations, and reveal potential means of improving them. In this paper we take a step back from performance-focused multiagent learning, and instead turn our attention towards agent behavior analysis. We introduce a model-agnostic method for discovery of behavior clusters in multiagent domains, using variational inference to learn a hierarchy of behaviors at the joint and local agent levels. Our framework makes no assumption about agents' underlying learning algorithms, does not require access to their latent states or policies, and is trained using only offline observational data. We illustrate the effectiveness of our method for enabling the coupled understanding of behaviors at the joint and local agent level, detection of behavior changepoints throughout training, discovery of core behavioral concepts, demonstrate the approach's scalability to a high-dimensional multiagent MuJoCo control domain, and also illustrate that the approach can disentangle previously-trained policies in OpenAI's hide-and-seek domain.
    New Insights for the Stability-Plasticity Dilemma in Online Continual Learning. (arXiv:2302.08741v1 [cs.CV])
    The aim of continual learning is to learn new tasks continuously (i.e., plasticity) without forgetting previously learned knowledge from old tasks (i.e., stability). In the scenario of online continual learning, wherein data comes strictly in a streaming manner, the plasticity of online continual learning is more vulnerable than offline continual learning because the training signal that can be obtained from a single data point is limited. To overcome the stability-plasticity dilemma in online continual learning, we propose an online continual learning framework named multi-scale feature adaptation network (MuFAN) that utilizes a richer context encoding extracted from different levels of a pre-trained network. Additionally, we introduce a novel structure-wise distillation loss and replace the commonly used batch normalization layer with a newly proposed stability-plasticity normalization module to train MuFAN that simultaneously maintains high plasticity and stability. MuFAN outperforms other state-of-the-art continual learning methods on the SVHN, CIFAR100, miniImageNet, and CORe50 datasets. Extensive experiments and ablation studies validate the significance and scalability of each proposed component: 1) multi-scale feature maps from a pre-trained encoder, 2) the structure-wise distillation loss, and 3) the stability-plasticity normalization module in MuFAN. Code is publicly available at https://github.com/whitesnowdrop/MuFAN.
    Feature learning in neural networks and kernel machines that recursively learn features. (arXiv:2212.13881v2 [cs.LG] UPDATED)
    Neural networks have achieved impressive results on many technological and scientific tasks. Yet, their empirical successes have outpaced our fundamental understanding of their structure and function. Identifying mechanisms driving the successes of neural networks can provide principled approaches for improving neural network performance and developing simple and effective alternatives. In this work, we isolate a key mechanism driving feature learning in fully connected neural networks by connecting neural feature learning to a statistical estimator known as average gradient outer product. We subsequently leverage this mechanism to design \textit{Recursive Feature Machines} (RFMs), which are kernel machines that learn features. We show that RFMs (1) accurately capture features learned by deep fully connected neural networks, and (2) outperform a broad spectrum of models including neural networks on tabular data. Furthermore, we show how RFMs shed light on recently observed deep learning phenomena including grokking, lottery tickets, simplicity biases, and spurious features. We provide a Python implementation to make our method easily accessible [\url{https://github.com/aradha/recursive_feature_machines}].
    Detection and Localization of Melanoma Skin Cancer in Histopathological Whole Slide Images. (arXiv:2302.03014v2 [eess.IV] UPDATED)
    Melanoma diagnosed and treated in its early stages can increase the survival rate. A projected increase in skin cancer incidents and a dearth of dermatopathologists have emphasized the need for computational pathology (CPATH) systems. CPATH systems with deep learning (DL) models have the potential to identify the presence of melanoma by exploiting underlying morphological and cellular features. This paper proposes a DL method to detect melanoma and distinguish between normal skin and benign/malignant melanocytic lesions in Whole Slide Images (WSI). Our method detects lesions with high accuracy and localizes them on a WSI to identify potential regions of interest for pathologists. Interestingly, our DL method relies on using a single CNN network to create localization maps first and use them to perform slide-level predictions to determine patients who have melanoma. Our best model provides favorable patch-wise classification results with a 0.992 F1 score and 0.99 sensitivity on unseen data. The source code is https://github.com/RogerAmundsen/Melanoma-Diagnosis-and-Localization-from-Whole-Slide-Images-using-Convolutional-Neural-Networks.
    Wind Power Scenario Generation Using Graph Convolutional Generative Adversarial Network. (arXiv:2212.10454v2 [cs.LG] UPDATED)
    Generating wind power scenarios is very important for studying the impacts of multiple wind farms that are interconnected to the grid. We develop a graph convolutional generative adversarial network (GCGAN) approach by leveraging GAN's capability in generating large number of realistic scenarios without using statistical modeling. Unlike existing GAN-based wind power data generation approaches, we design GAN's hidden layers to match the underlying spatial and temporal characteristics. We advocate the use of graph filters to embed the spatial correlation among multiple wind farms, and a one-dimensional (1D) convolutional layer to represent the temporal feature filters. The proposed graph and feature filter design significantly reduce the GAN model complexity, leading to improvements in training efficiency and computation complexity. Numerical results using real wind power data from Australia demonstrate that the scenarios generated by the proposed GCGAN exhibit more realistic spatial and temporal statistics than other GAN-based outputs.
    DCI-ES: An Extended Disentanglement Framework with Connections to Identifiability. (arXiv:2210.00364v2 [cs.LG] UPDATED)
    In representation learning, a common approach is to seek representations which disentangle the underlying factors of variation. Eastwood & Williams (2018) proposed three metrics for quantifying the quality of such disentangled representations: disentanglement (D), completeness (C) and informativeness (I). In this work, we first connect this DCI framework to two common notions of linear and nonlinear identifiability, thereby establishing a formal link between disentanglement and the closely-related field of independent component analysis. We then propose an extended DCI-ES framework with two new measures of representation quality - explicitness (E) and size (S) - and point out how D and C can be computed for black-box predictors. Our main idea is that the functional capacity required to use a representation is an important but thus-far neglected aspect of representation quality, which we quantify using explicitness or ease-of-use (E). We illustrate the relevance of our extensions on the MPI3D and Cars3D datasets.
    jazznet: A Dataset of Fundamental Piano Patterns for Music Audio Machine Learning Research. (arXiv:2302.08632v1 [cs.SD])
    This paper introduces the jazznet Dataset, a dataset of fundamental jazz piano music patterns for developing machine learning (ML) algorithms in music information retrieval (MIR). The dataset contains 162520 labeled piano patterns, including chords, arpeggios, scales, and chord progressions with their inversions, resulting in more than 26k hours of audio and a total size of 95GB. The paper explains the dataset's composition, creation, and generation, and presents an open-source Pattern Generator using a method called Distance-Based Pattern Structures (DBPS), which allows researchers to easily generate new piano patterns simply by defining the distances between pitches within the musical patterns. We demonstrate that the dataset can help researchers benchmark new models for challenging MIR tasks, using a convolutional recurrent neural network (CRNN) and a deep convolutional neural network. The dataset and code are available via: https://github.com/tosiron/jazznet.
    A Typology for Exploring the Mitigation of Shortcut Behavior. (arXiv:2203.03668v3 [cs.LG] UPDATED)
    As machine learning models become increasingly larger, trained weakly supervised on large, possibly uncurated data sets, it becomes increasingly important to establish mechanisms for inspecting, interacting, and revising models to mitigate learning shortcuts and guarantee their learned knowledge is aligned with human knowledge. The recently proposed XIL framework was developed for this purpose, and several such methods have been introduced, each with individual motivations and methodological details. In this work, we provide a unification of various XIL methods into a single typology by establishing a common set of basic modules. In doing so, we pave the way for a principled comparison of existing, but, importantly, also future XIL approaches. In addition, we discuss existing and introduce novel measures and benchmarks for evaluating the overall abilities of a XIL method. Given this extensive toolbox, including our typology, measures, and benchmarks, we finally compare several recent XIL methods methodologically and quantitatively. In our evaluations, all methods prove to revise a model successfully. However, we found remarkable differences in individual benchmark tasks, revealing valuable application-relevant aspects for integrating these benchmarks in developing future methods.
    Gaussian-smoothed Imbalance Data Improves Speech Emotion Recognition. (arXiv:2302.08650v1 [cs.SD])
    In speech emotion recognition tasks, models learn emotional representations from datasets. We find the data distribution in the IEMOCAP dataset is very imbalanced, which may harm models to learn a better representation. To address this issue, we propose a novel Pairwise-emotion Data Distribution Smoothing (PDDS) method. PDDS considers that the distribution of emotional data should be smooth in reality, then applies Gaussian smoothing to emotion-pairs for constructing a new training set with a smoother distribution. The required new data are complemented using the mixup augmentation. As PDDS is model and modality agnostic, it is evaluated with three SOTA models on the IEMOCAP dataset. The experimental results show that these models are improved by 0.2\% - 4.8\% and 1.5\% - 5.9\% in terms of WA and UA. In addition, an ablation study demonstrates that the key advantage of PDDS is the reasonable data distribution rather than a simple data augmentation.
    Surgical Aggregation: A Federated Learning Framework for Harmonizing Distributed Datasets with Diverse Tasks. (arXiv:2301.06683v2 [cs.CV] UPDATED)
    Many large-scale chest x-ray datasets have been curated for the detection of abnormalities using deep learning, with the potential to provide substantial benefits across many clinical applications. However, these datasets focus on detecting a subset of disease labels that could be present, thus limiting their clinical utility. Furthermore, the distributed nature of these datasets, along with data sharing regulations, makes it difficult to share and create a complete representation of disease labels. To that end, we propose surgical aggregation, a federated learning framework for aggregating and harmonizing knowledge from distributed datasets with different disease labels into a 'global' deep learning model. We utilized surgical aggregation to harmonize the NIH (14 labels) and CheXpert (13 labels) datasets into a global model with the ability to predict all 20 unique disease labels and compared it to the performance of 'baseline' models trained individually on both datasets. We observed that the global model resulted in excellent performance across held-out test sets from both datasets with an average AUROC of 0.75 and 0.74 respectively when compared to the baseline average AUROC of 0.81 and 0.71. On the MIMIC external test set, we observed that the global model had better generalizability with average AUROC of 0.80, compared to the average AUROC of 0.74 and 0.76 respectively for the baseline models. Our results show that surgical aggregation has the potential to develop clinically useful deep learning models by aggregating knowledge from distributed datasets with diverse tasks -- a step forward towards bridging the gap from bench to bedside.
    Hate Speech and Offensive Language Detection using an Emotion-aware Shared Encoder. (arXiv:2302.08777v1 [cs.CL])
    The rise of emergence of social media platforms has fundamentally altered how people communicate, and among the results of these developments is an increase in online use of abusive content. Therefore, automatically detecting this content is essential for banning inappropriate information, and reducing toxicity and violence on social media platforms. The existing works on hate speech and offensive language detection produce promising results based on pre-trained transformer models, however, they considered only the analysis of abusive content features generated through annotated datasets. This paper addresses a multi-task joint learning approach which combines external emotional features extracted from another corpora in dealing with the imbalanced and scarcity of labeled datasets. Our analysis are using two well-known Transformer-based models, BERT and mBERT, where the later is used to address abusive content detection in multi-lingual scenarios. Our model jointly learns abusive content detection with emotional features by sharing representations through transformers' shared encoder. This approach increases data efficiency, reduce overfitting via shared representations, and ensure fast learning by leveraging auxiliary information. Our findings demonstrate that emotional knowledge helps to more reliably identify hate speech and offensive language across datasets. Our hate speech detection Multi-task model exhibited 3% performance improvement over baseline models, but the performance of multi-task models were not significant for offensive language detection task. More interestingly, in both tasks, multi-task models exhibits less false positive errors compared to single task scenario.
    On (assessing) the fairness of risk score models. (arXiv:2302.08851v1 [cs.LG])
    Recent work on algorithmic fairness has largely focused on the fairness of discrete decisions, or classifications. While such decisions are often based on risk score models, the fairness of the risk models themselves has received considerably less attention. Risk models are of interest for a number of reasons, including the fact that they communicate uncertainty about the potential outcomes to users, thus representing a way to enable meaningful human oversight. Here, we address fairness desiderata for risk score models. We identify the provision of similar epistemic value to different groups as a key desideratum for risk score fairness. Further, we address how to assess the fairness of risk score models quantitatively, including a discussion of metric choices and meaningful statistical comparisons between groups. In this context, we also introduce a novel calibration error metric that is less sample size-biased than previously proposed metrics, enabling meaningful comparisons between groups of different sizes. We illustrate our methodology - which is widely applicable in many other settings - in two case studies, one in recidivism risk prediction, and one in risk of major depressive disorder (MDD) prediction.
    An Experimental Study of Dimension Reduction Methods on Machine Learning Algorithms with Applications to Psychometrics. (arXiv:2210.13230v2 [cs.LG] UPDATED)
    Developing interpretable machine learning models has become an increasingly important issue. One way in which data scientists have been able to develop interpretable models has been to use dimension reduction techniques. In this paper, we examine several dimension reduction techniques including two recent approaches developed in the network psychometrics literature called exploratory graph analysis (EGA) and unique variable analysis (UVA). We compared EGA and UVA with two other dimension reduction techniques common in the machine learning literature (principal component analysis and independent component analysis) as well as no reduction to the variables real data. We show that EGA and UVA perform as well as the other reduction techniques or no reduction. Consistent with previous literature, we show that dimension reduction can decrease, increase, or provide the same accuracy as no reduction of variables. Our tentative results find that dimension reduction tends to lead to better performance when used for classification tasks.
    Cost-Effective Online Contextual Model Selection. (arXiv:2207.06030v3 [cs.LG] UPDATED)
    How can we collect the most useful labels to learn a model selection policy, when presented with arbitrary heterogeneous data streams? In this paper, we formulate this task as an online contextual active model selection problem, where at each round the learner receives an unlabeled data point along with a context. The goal is to output the best model for any given context without obtaining an excessive amount of labels. In particular, we focus on the task of selecting pre-trained classifiers, and propose a contextual active model selection algorithm (CAMS), which relies on a novel uncertainty sampling query criterion defined on a given policy class for adaptive model selection. In comparison to prior art, our algorithm does not assume a globally optimal model. We provide rigorous theoretical analysis for the regret and query complexity under both adversarial and stochastic settings. Our experiments on several benchmark classification datasets demonstrate the algorithm's effectiveness in terms of both regret and query complexity. Notably, to achieve the same accuracy, CAMS incurs less than 10% of the label cost when compared to the best online model selection baselines on CIFAR10.
    Paint it Black: Generating paintings from text descriptions. (arXiv:2302.08808v1 [cs.CV])
    Two distinct tasks - generating photorealistic pictures from given text prompts and transferring the style of a painting to a real image to make it appear as though it were done by an artist, have been addressed many times, and several approaches have been proposed to accomplish them. However, the intersection of these two, i.e., generating paintings from a given caption, is a relatively unexplored area with little data available. In this paper, we have explored two distinct strategies and have integrated them together. First strategy is to generate photorealistic images and then apply style transfer and the second strategy is to train an image generation model on real images with captions and then fine-tune it on captioned paintings later. These two models are evaluated using different metrics as well as a user study is conducted to get human feedback on the produced results.
    Scaling Forward Gradient With Local Losses. (arXiv:2210.03310v2 [cs.LG] UPDATED)
    Forward gradient learning computes a noisy directional gradient and is a biologically plausible alternative to backprop for learning deep neural networks. However, the standard forward gradient algorithm, when applied naively, suffers from high variance when the number of parameters to be learned is large. In this paper, we propose a series of architectural and algorithmic modifications that together make forward gradient learning practical for standard deep learning benchmark tasks. We show that it is possible to substantially reduce the variance of the forward gradient estimator by applying perturbations to activations rather than weights. We further improve the scalability of forward gradient by introducing a large number of local greedy loss functions, each of which involves only a small number of learnable parameters, and a new MLPMixer-inspired architecture, LocalMixer, that is more suitable for local learning. Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
    JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition. (arXiv:2302.08583v1 [eess.AS])
    We propose JEIT, a joint end-to-end (E2E) model and internal language model (ILM) training method to inject large-scale unpaired text into ILM during E2E training which improves rare-word speech recognition. With JEIT, the E2E model computes an E2E loss on audio-transcript pairs while its ILM estimates a cross-entropy loss on unpaired text. The E2E model is trained to minimize a weighted sum of E2E and ILM losses. During JEIT, ILM absorbs knowledge from unpaired text while the E2E training serves as regularization. Unlike ILM adaptation methods, JEIT does not require a separate adaptation step and avoids the need for Kullback-Leibler divergence regularization of ILM. We also show that modular hybrid autoregressive transducer (MHAT) performs better than HAT in the JEIT framework, and is much more robust than HAT during ILM adaptation. To push the limit of unpaired text injection, we further propose a combined JEIT and JOIST training (CJJT) that benefits from modality matching, encoder text injection and ILM training. Both JEIT and CJJT can foster a more effective LM fusion. With 100B unpaired sentences, JEIT/CJJT improves rare-word recognition accuracy by up to 16.4% over a model trained without unpaired text.
    Revisiting adversarial training for the worst-performing class. (arXiv:2302.08872v1 [cs.LG])
    Despite progress in adversarial training (AT), there is a substantial gap between the top-performing and worst-performing classes in many datasets. For example, on CIFAR10, the accuracies for the best and worst classes are 74% and 23%, respectively. We argue that this gap can be reduced by explicitly optimizing for the worst-performing class, resulting in a min-max-max optimization formulation. Our method, called class focused online learning (CFOL), includes high probability convergence guarantees for the worst class loss and can be easily integrated into existing training setups with minimal computational overhead. We demonstrate an improvement to 32% in the worst class accuracy on CIFAR10, and we observe consistent behavior across CIFAR100 and STL10. Our study highlights the importance of moving beyond average accuracy, which is particularly important in safety-critical applications.
    Wizard of Errors: Introducing and Evaluating Machine Learning Errors in Wizard of Oz Studies. (arXiv:2302.08799v1 [cs.HC])
    When designing Machine Learning (ML) enabled solutions, designers often need to simulate ML behavior through the Wizard of Oz (WoZ) approach to test the user experience before the ML model is available. Although reproducing ML errors is essential for having a good representation, they are rarely considered. We introduce Wizard of Errors (WoE), a tool for conducting WoZ studies on ML-enabled solutions that allows simulating ML errors during user experience assessment. We explored how this system can be used to simulate the behavior of a computer vision model. We tested WoE with design students to determine the importance of considering ML errors in design, the relevance of using descriptive error types instead of confusion matrix, and the suitability of manual error control in WoZ studies. Our work identifies several challenges, which prevent realistic error representation by designers in such studies. We discuss the implications of these findings for design.
    Multiresolution Graph Transformers and Wavelet Positional Encoding for Learning Hierarchical Structures. (arXiv:2302.08647v1 [cs.LG])
    Contemporary graph learning algorithms are not well-defined for large molecules since they do not consider the hierarchical interactions among the atoms, which are essential to determine the molecular properties of macromolecules. In this work, we propose Multiresolution Graph Transformers (MGT), the first graph transformer architecture that can learn to represent large molecules at multiple scales. MGT can learn to produce representations for the atoms and group them into meaningful functional groups or repeating units. We also introduce Wavelet Positional Encoding (WavePE), a new positional encoding method that can guarantee localization in both spectral and spatial domains. Our approach achieves competitive results on two macromolecule datasets consisting of polymers and peptides. Furthermore, the visualizations, including clustering results on macromolecules and low-dimensional spaces of their representations, demonstrate the capability of our methodology in learning to represent long-range and hierarchical structures.
    Heterogeneous Graph Learning for Multi-modal Medical Data Analysis. (arXiv:2211.15158v2 [cs.CV] UPDATED)
    Routine clinical visits of a patient produce not only image data, but also non-image data containing clinical information regarding the patient, i.e., medical data is multi-modal in nature. Such heterogeneous modalities offer different and complementary perspectives on the same patient, resulting in more accurate clinical decisions when they are properly combined. However, despite its significance, how to effectively fuse the multi-modal medical data into a unified framework has received relatively little attention. In this paper, we propose an effective graph-based framework called HetMed (Heterogeneous Graph Learning for Multi-modal Medical Data Analysis) for fusing the multi-modal medical data. Specifically, we construct a multiplex network that incorporates multiple types of non-image features of patients to capture the complex relationship between patients in a systematic way, which leads to more accurate clinical decisions. Extensive experiments on various real-world datasets demonstrate the superiority and practicality of HetMed. The source code for HetMed is available at https://github.com/Sein-Kim/Multimodal-Medical.
    Solving stochastic weak Minty variational inequalities without increasing batch size. (arXiv:2302.09029v1 [math.OC])
    This paper introduces a family of stochastic extragradient-type algorithms for a class of nonconvex-nonconcave problems characterized by the weak Minty variational inequality (MVI). Unlike existing results on extragradient methods in the monotone setting, employing diminishing stepsizes is no longer possible in the weak MVI setting. This has led to approaches such as increasing batch sizes per iteration which can however be prohibitively expensive. In contrast, our proposed methods involves two stepsizes and only requires one additional oracle evaluation per iteration. We show that it is possible to keep one fixed stepsize while it is only the second stepsize that is taken to be diminishing, making it interesting even in the monotone setting. Almost sure convergence is established and we provide a unified analysis for this family of schemes which contains a nonlinear generalization of the celebrated primal dual hybrid gradient algorithm.
    Text Generation with Diffusion Language Models: A Pre-training Approach with Continuous Paragraph Denoise. (arXiv:2212.11685v2 [cs.CL] UPDATED)
    In this paper, we introduce a novel dIffusion language modEl pre-training framework for text generation, which we call GENIE. GENIE is a large-scale pretrained diffusion language model that consists of an encoder and a diffusion-based decoder, which can generate text by gradually transforming a random noise sequence into a coherent text sequence. To pre-train GENIE on a large-scale language corpus, we design a new continuous paragraph denoise objective, which encourages the diffusion-decoder to reconstruct a clean text paragraph from a corrupted version, while preserving the semantic and syntactic coherence. We evaluate GENIE on four downstream text generation benchmarks, namely XSum, CNN/DailyMail, Gigaword, and CommonGen. Our experimental results show that GENIE achieves comparable performance with the state-of-the-art autoregressive models on these benchmarks, and generates more diverse text samples. The code and models of GENIE are available at https://github.com/microsoft/ProphetNet/tree/master/GENIE.
    Probabilistic Circuits That Know What They Don't Know. (arXiv:2302.06544v2 [cs.LG] UPDATED)
    Probabilistic circuits (PCs) are models that allow exact and tractable probabilistic inference. In contrast to neural networks, they are often assumed to be well-calibrated and robust to out-of-distribution (OOD) data. In this paper, we show that PCs are in fact not robust to OOD data, i.e., they don't know what they don't know. We then show how this challenge can be overcome by model uncertainty quantification. To this end, we propose tractable dropout inference (TDI), an inference procedure to estimate uncertainty by deriving an analytical solution to Monte Carlo dropout (MCD) through variance propagation. Unlike MCD in neural networks, which comes at the cost of multiple network evaluations, TDI provides tractable sampling-free uncertainty estimates in a single forward pass. TDI improves the robustness of PCs to distribution shift and OOD data, demonstrated through a series of experiments evaluating the classification confidence and uncertainty estimates on real-world data.
    Tensor Networks Meet Neural Networks: A Survey. (arXiv:2302.09019v1 [cs.LG])
    Tensor networks (TNs) and neural networks (NNs) are two fundamental types of data modeling approaches. TNs have been proposed as a solution to the curse of dimensionality faced by large-scale tensors by converting an exponential number of dimensions to polynomial complexity. Thus, they have attracted many studies in the fields of quantum physics and machine learning. On the other hand, NNs are computing systems inspired by the biological NNs that constitute human brains. Recently, NNs and their variants have achieved outstanding performance in various applications, e.g., computer vision, natural language processing, and robotics research. Interestingly, although these two types of networks come from different observations, they are inextricably linked via the common intrinsic multilinearity structure underlying both TNs and NNs. Consequently, a significant number of intellectual sparks regarding combinations of TNs and NNs have burst out. The combinations described as ``tensor networks meet neural networks'' are termed tensorial neural networks (TNNs) in this paper. This survey introduces TNNs based on three aspects. This survey also investigates methods for improving TNNs, examines useful toolboxes for implementing TNNs, and attempts to document TNN development and highlight its potential future directions. To the best of our knowledge, this is the first comprehensive survey to bridge the connections among NNs, TNs, and quantum circuits. We provide a curated list of TNNs at https://github.com/tnbar/awesome-tensorial-neural-networks.
    Flat minima generalize for low-rank matrix recovery. (arXiv:2203.03756v2 [cs.LG] UPDATED)
    Empirical evidence suggests that for a variety of overparameterized nonlinear models, most notably in neural network training, the growth of the loss around a minimizer strongly impacts its performance. Flat minima -- those around which the loss grows slowly -- appear to generalize well. This work takes a step towards understanding this phenomenon by focusing on the simplest class of overparameterized nonlinear models: those arising in low-rank matrix recovery. We analyze overparameterized matrix and bilinear sensing, robust PCA, covariance matrix estimation, and single hidden layer neural networks with quadratic activation functions. In all cases, we show that flat minima, measured by the trace of the Hessian, exactly recover the ground truth under standard statistical assumptions. For matrix completion, we establish weak recovery, although empirical evidence suggests exact recovery holds here as well. We conclude with synthetic experiments that illustrate our findings and discuss the effect of depth on flat solutions.
    Meta-learning Adaptive Deep Kernel Gaussian Processes for Molecular Property Prediction. (arXiv:2205.02708v5 [cs.LG] UPDATED)
    We propose Adaptive Deep Kernel Fitting with Implicit Function Theorem (ADKF-IFT), a novel framework for learning deep kernel Gaussian processes (GPs) by interpolating between meta-learning and conventional deep kernel learning. Our approach employs a bilevel optimization objective where we meta-learn generally useful feature representations across tasks, in the sense that task-specific GP models estimated on top of such features achieve the lowest possible predictive loss on average. We solve the resulting nested optimization problem using the implicit function theorem (IFT). We show that our ADKF-IFT framework contains previously proposed Deep Kernel Learning (DKL) and Deep Kernel Transfer (DKT) as special cases. Although ADKF-IFT is a completely general method, we argue that it is especially well-suited for drug discovery problems and demonstrate that it significantly outperforms previous state-of-the-art methods on a variety of real-world few-shot molecular property prediction tasks and out-of-domain molecular property prediction and optimization tasks.
    Unboxing Tree Ensembles for interpretability: a hierarchical visualization tool and a multivariate optimal re-built tree. (arXiv:2302.07580v1 [math.OC] CROSS LISTED)
    The interpretability of models has become a crucial issue in Machine Learning because of algorithmic decisions' growing impact on real-world applications. Tree ensemble methods, such as Random Forests or XgBoost, are powerful learning tools for classification tasks. However, while combining multiple trees may provide higher prediction quality than a single one, it sacrifices the interpretability property resulting in "black-box" models. In light of this, we aim to develop an interpretable representation of a tree-ensemble model that can provide valuable insights into its behavior. First, given a target tree-ensemble model, we develop a hierarchical visualization tool based on a heatmap representation of the forest's feature use, considering the frequency of a feature and the level at which it is selected as an indicator of importance. Next, we propose a mixed-integer linear programming (MILP) formulation for constructing a single optimal multivariate tree that accurately mimics the target model predictions. The goal is to provide an interpretable surrogate model based on oblique hyperplane splits, which uses only the most relevant features according to the defined forest's importance indicators. The MILP model includes a penalty on feature selection based on their frequency in the forest to further induce sparsity of the splits. The natural formulation has been strengthened to improve the computational performance of mixed-integer software. Computational experience is carried out on benchmark datasets from the UCI repository using a state-of-the-art off-the-shelf solver. Results show that the proposed model is effective in yielding a shallow interpretable tree approximating the tree-ensemble decision function.
    Label-Wise Graph Convolutional Network for Heterophilic Graphs. (arXiv:2110.08128v4 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have achieved remarkable performance in modeling graphs for various applications. However, most existing GNNs assume the graphs exhibit strong homophily in node labels, i.e., nodes with similar labels are connected in the graphs. They fail to generalize to heterophilic graphs where linked nodes may have dissimilar labels and attributes. Therefore, in this paper, we investigate a novel framework that performs well on graphs with either homophily or heterophily. More specifically, we propose a label-wise message passing mechanism to avoid the negative effects caused by aggregating dissimilar node representations and preserve the heterophilic contexts for representation learning. We further propose a bi-level optimization method to automatically select the model for graphs with homophily/heterophily. Theoretical analysis and extensive experiments demonstrate the effectiveness of our proposed framework for node classification on both homophilic and heterophilic graphs.
    Referential communication in heterogeneous communities of pre-trained visual deep networks. (arXiv:2302.08913v1 [cs.CV])
    As large pre-trained image-processing neural networks are being embedded in autonomous agents such as self-driving cars or robots, the question arises of how such systems can communicate with each other about the surrounding world, despite their different architectures and training regimes. As a first step in this direction, we systematically explore the task of referential communication in a community of state-of-the-art pre-trained visual networks, showing that they can develop a shared protocol to refer to a target image among a set of candidates. Such shared protocol, induced in a self-supervised way, can to some extent be used to communicate about previously unseen object categories, as well as to make more granular distinctions compared to the categories taught to the original networks. Contradicting a common view in multi-agent emergent communication research, we find that imposing a discrete bottleneck on communication hampers the emergence of a general code. Moreover, we show that a new neural network can learn the shared protocol developed in a community with remarkable ease, and the process of integrating a new agent into a community more stably succeeds when the original community includes a larger set of heterogeneous networks. Finally, we illustrate the independent benefits of developing a shared communication layer by using it to directly transfer an object classifier from a network to another, and we qualitatively and quantitatively study its emergent properties.
    Aligning AI With Shared Human Values. (arXiv:2008.02275v6 [cs.CY] UPDATED)
    We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.
    Virtualization of Tiny Embedded Systems with a robust real-time capable and extensible Stack Virtual Machine REXAVM supporting Material-integrated Intelligent Systems and Tiny Machine Learning. (arXiv:2302.09002v1 [cs.OS])
    In the past decades, there has been a significant increase in sensor density and sensor deployment, driven by a significant miniaturization and decrease in size down to the chip level, addressing ubiquitous computing, edge computing, as well as distributed sensor networks. Material-integrated and intelligent systems (MIIS) provide the next integration and application level, but they create new challenges and introduce hard constraints (resources, energy supply, communication, resilience, and security). Commonly, low-resource systems are statically programmed processors with application-specific software or application-specific hardware (FPGA). This work demonstrates the need for and solution to virtualization in such low-resource and constrained systems towards resilient distributed sensor and cyber-physical networks using a unified low-resource, customizable, and real-time capable embedded and extensible stack virtual machine (REXAVM) that can be implemented and cooperate in both software and hardware. In a holistic architecture approach, the VM specifically addresses digital signal processing and tiny machine learning. The REXAVM is highly customizable through the use of VM program code generators at compile time and incremental code processing at run time. The VM uses an integrated, highly efficient just-in-time compiler to create Bytecode from text code. This paper shows and evaluates the suitability of the proposed VM architecture for operationally equivalent software and hardware (FPGA) implementations. Specific components supporting tiny ML and DSP using fixed-point arithmetic with respect to efficiency and accuracy are discussed. An extended use-case section demonstrates the usability of the introduced VM architecture for a broad range of applications.
    Creating generalizable downstream graph models with random projections. (arXiv:2302.08895v1 [cs.LG])
    We investigate graph representation learning approaches that enable models to generalize across graphs: given a model trained using the representations from one graph, our goal is to apply inference using those same model parameters when given representations computed over a new graph, unseen during model training, with minimal degradation in inference accuracy. This is in contrast to the more common task of doing inference on the unseen nodes of the same graph. We show that using random projections to estimate multiple powers of the transition matrix allows us to build a set of isomorphism-invariant features that can be used by a variety of tasks. The resulting features can be used to recover enough information about the local neighborhood of a node to enable inference with relevance competitive to other approaches while maintaining computational efficiency.
    Towards Zero-trust Security for the Metaverse. (arXiv:2302.08885v1 [cs.CR])
    By focusing on immersive interaction among users, the burgeoning Metaverse can be viewed as a natural extension of existing social media. Similar to traditional online social networks, there are numerous security and privacy issues in the Metaverse (e.g., attacks on user authentication and impersonation). In this paper, we develop a holistic research agenda for zero-trust user authentication in social virtual reality (VR), an early prototype of the Metaverse. Our proposed research includes four concrete steps: investigating biometrics-based authentication that is suitable for continuously authenticating VR users, leveraging federated learning (FL) for protecting user privacy in biometric data, improving the accuracy of continuous VR authentication with multimodal data, and boosting the usability of zero-trust security with adaptive VR authentication. Our preliminary study demonstrates that conventional FL algorithms are not well suited for biometrics-based authentication of VR users, leading to an accuracy of less than 10%. We discuss the root cause of this problem, the associated open challenges, and several future directions for realizing our research vision.
    Quantized Compressed Sensing with Score-Based Generative Models. (arXiv:2211.13006v3 [eess.SP] UPDATED)
    We consider the general problem of recovering a high-dimensional signal from noisy quantized measurements. Quantization, especially coarse quantization such as 1-bit sign measurements, leads to severe information loss and thus a good prior knowledge of the unknown signal is helpful for accurate recovery. Motivated by the power of score-based generative models (SGM, also known as diffusion models) in capturing the rich structure of natural signals beyond simple sparsity, we propose an unsupervised data-driven approach called quantized compressed sensing with SGM (QCS-SGM), where the prior distribution is modeled by a pre-trained SGM. To perform posterior sampling, an annealed pseudo-likelihood score called noise perturbed pseudo-likelihood score is introduced and combined with the prior score of SGM. The proposed QCS-SGM applies to an arbitrary number of quantization bits. Experiments on a variety of baseline datasets demonstrate that the proposed QCS-SGM significantly outperforms existing state-of-the-art algorithms by a large margin for both in-distribution and out-of-distribution samples. Moreover, as a posterior sampling method, QCS-SGM can be easily used to obtain confidence intervals or uncertainty estimates of the reconstructed results. The code is available at https://github.com/mengxiangming/QCS-SGM.
    Raven's Progressive Matrices Completion with Latent Gaussian Process Priors. (arXiv:2103.12045v2 [cs.AI] UPDATED)
    Abstract reasoning ability is fundamental to human intelligence. It enables humans to uncover relations among abstract concepts and further deduce implicit rules from the relations. As a well-known abstract visual reasoning task, Raven's Progressive Matrices (RPM) are widely used in human IQ tests. Although extensive research has been conducted on RPM solvers with machine intelligence, few studies have considered further advancing the standard answer-selection (classification) problem to a more challenging answer-painting (generating) problem, which can verify whether the model has indeed understood the implicit rules. In this paper we aim to solve the latter one by proposing a deep latent variable model, in which multiple Gaussian processes are employed as priors of latent variables to separately learn underlying abstract concepts from RPMs; thus the proposed model is interpretable in terms of concept-specific latent variables. The latent Gaussian process also provides an effective way of extrapolation for answer painting based on the learned concept-changing rules. We evaluate the proposed model on RPM-like datasets with multiple continuously-changing visual concepts. Experimental results demonstrate that our model requires only few training samples to paint high-quality answers, generate novel RPM panels, and achieve interpretability through concept-specific latent variables.
    Multimodal Federated Learning via Contrastive Representation Ensemble. (arXiv:2302.08888v1 [cs.LG])
    With the increasing amount of multimedia data on modern mobile systems and IoT infrastructures, harnessing these rich multimodal data without breaching user privacy becomes a critical issue. Federated learning (FL) serves as a privacy-conscious alternative to centralized machine learning. However, existing FL methods extended to multimodal data all rely on model aggregation on single modality level, which restrains the server and clients to have identical model architecture for each modality. This limits the global model in terms of both model complexity and data capacity, not to mention task diversity. In this work, we propose Contrastive Representation Ensemble and Aggregation for Multimodal FL (CreamFL), a multimodal federated learning framework that enables training larger server models from clients with heterogeneous model architectures and data modalities, while only communicating knowledge on public dataset. To achieve better multimodal representation fusion, we design a global-local cross-modal ensemble strategy to aggregate client representations. To mitigate local model drift caused by two unprecedented heterogeneous factors stemming from multimodal discrepancy (modality gap and task gap), we further propose two inter-modal and intra-modal contrasts to regularize local training, which complements information of the absent modality for uni-modal clients and regularizes local clients to head towards global consensus. Thorough evaluations and ablation studies on image-text retrieval and visual question answering tasks showcase the superiority of CreamFL over state-of-the-art FL methods and its practical value.
    sMRI-PatchNet: A novel explainable patch-based deep learning network for Alzheimer's disease diagnosis and discriminative atrophy localisation with Structural MRI. (arXiv:2302.08967v1 [eess.IV])
    Structural magnetic resonance imaging (sMRI) can identify subtle brain changes due to its high contrast for soft tissues and high spatial resolution. It has been widely used in diagnosing neurological brain diseases, such as Alzheimer disease (AD). However, the size of 3D high-resolution data poses a significant challenge for data analysis and processing. Since only a few areas of the brain show structural changes highly associated with AD, the patch-based methods dividing the whole image data into several small regular patches have shown promising for more efficient sMRI-based image analysis. The major challenges of the patch-based methods on sMRI include identifying the discriminative patches, combining features from the discrete discriminative patches, and designing appropriate classifiers. This work proposes a novel patch-based deep learning network (sMRI-PatchNet) with explainable patch localisation and selection for AD diagnosis using sMRI. Specifically, it consists of two primary components: 1) A fast and efficient explainable patch selection mechanism for determining the most discriminative patches based on computing the SHapley Additive exPlanations (SHAP) contribution to a transfer learning model for AD diagnosis on massive medical data; and 2) A novel patch-based network for extracting deep features and AD classfication from the selected patches with position embeddings to retain position information, capable of capturing the global and local information of inter- and intra-patches. This method has been applied for the AD classification and the prediction of the transitional state moderate cognitive impairment (MCI) conversion with real datasets.
    Black-Box Batch Active Learning for Regression. (arXiv:2302.08981v1 [cs.LG])
    Batch active learning is a popular approach for efficiently training machine learning models on large, initially unlabelled datasets, which repeatedly acquires labels for a batch of data points. However, many recent batch active learning methods are white-box approaches limited to differentiable parametric models: they score unlabeled points using acquisition functions based on model embeddings or first- and second-order derivatives. In this paper, we propose black-box batch active learning for regression tasks as an extension of white-box approaches. This approach is compatible with a wide range of machine learning models including regular and Bayesian deep learning models and non-differentiable models such as random forests. It is rooted in Bayesian principles and utilizes recent kernel-based approaches. Importantly, our method only relies on model predictions. This allows us to extend a wide range of existing state-of-the-art white-box batch active learning methods (BADGE, BAIT, LCMD) to black-box models. We demonstrate the effectiveness of our approach through extensive experimental evaluations on regression datasets, achieving surprisingly strong performance compared to white-box approaches for deep learning models.
    The Unbearable Weight of Massive Privilege: Revisiting Bias-Variance Trade-Offs in the Context of Fair Prediction. (arXiv:2302.08704v1 [cs.LG])
    In this paper we revisit the bias-variance decomposition of model error from the perspective of designing a fair classifier: we are motivated by the widely held socio-technical belief that noise variance in large datasets in social domains tracks demographic characteristics such as gender, race, disability, etc. We propose a conditional-iid (ciid) model built from group-specific classifiers that seeks to improve on the trade-offs made by a single model (iid setting). We theoretically analyze the bias-variance decomposition of different models in the Gaussian Mixture Model, and then empirically test our setup on the COMPAS and folktables datasets. We instantiate the ciid model with two procedures that improve "fairness" by conditioning out undesirable effects: first, by conditioning directly on sensitive attributes, and second, by clustering samples into groups and conditioning on cluster membership (blind to protected group membership). Our analysis suggests that there might be principled procedures and concrete real-world use cases under which conditional models are preferred, and our striking empirical results strongly indicate that non-iid settings, such as the ciid setting proposed here, might be more suitable for big data applications in social contexts.
    Consistent Diffusion Models: Mitigating Sampling Drift by Learning to be Consistent. (arXiv:2302.09057v1 [cs.LG])
    Imperfect score-matching leads to a shift between the training and the sampling distribution of diffusion models. Due to the recursive nature of the generation process, errors in previous steps yield sampling iterates that drift away from the training distribution. Yet, the standard training objective via Denoising Score Matching (DSM) is only designed to optimize over non-drifted data. To train on drifted data, we propose to enforce a \emph{consistency} property which states that predictions of the model on its own generated data are consistent across time. Theoretically, we show that if the score is learned perfectly on some non-drifted points (via DSM) and if the consistency property is enforced everywhere, then the score is learned accurately everywhere. Empirically we show that our novel training objective yields state-of-the-art results for conditional and unconditional generation in CIFAR-10 and baseline improvements in AFHQ and FFHQ. We open-source our code and models: https://github.com/giannisdaras/cdm
    Deep Reinforcement Learning for mmWave Initial Beam Alignment. (arXiv:2302.08969v1 [cs.IT])
    We investigate the applicability of deep reinforcement learning algorithms to the adaptive initial access beam alignment problem for mmWave communications using the state-of-the-art proximal policy optimization algorithm as an example. In comparison to recent unsupervised learning based approaches developed to tackle this problem, deep reinforcement learning has the potential to address a new and wider range of applications, since, in principle, no (differentiable) model of the channel and/or the whole system is required for training, and only agent-environment interactions are necessary to learn an algorithm (be it online or using a recorded dataset). We show that, although the chosen off-the-shelf deep reinforcement learning agent fails to perform well when trained on realistic problem sizes, introducing action space shaping in the form of beamforming modules vastly improves the performance, without sacrificing much generalizability. Using this add-on, the agent is able to deliver competitive performance to various state-of-the-art methods on simulated environments, even under realistic problem sizes. This demonstrates that through well-directed modification, deep reinforcement learning may have a chance to compete with other approaches in this area, opening up many straightforward extensions to other/similar scenarios.
    AutoFed: Heterogeneity-Aware Federated Multimodal Learning for Robust Autonomous Driving. (arXiv:2302.08646v1 [cs.LG])
    Object detection with on-board sensors (e.g., lidar, radar, and camera) play a crucial role in autonomous driving (AD), and these sensors complement each other in modalities. While crowdsensing may potentially exploit these sensors (of huge quantity) to derive more comprehensive knowledge, \textit{federated learning} (FL) appears to be the necessary tool to reach this potential: it enables autonomous vehicles (AVs) to train machine learning models without explicitly sharing raw sensory data. However, the multimodal sensors introduce various data heterogeneity across distributed AVs (e.g., label quantity skews and varied modalities), posing critical challenges to effective FL. To this end, we present AutoFed as a heterogeneity-aware FL framework to fully exploit multimodal sensory data on AVs and thus enable robust AD. Specifically, we first propose a novel model leveraging pseudo-labeling to avoid mistakenly treating unlabeled objects as the background. We also propose an autoencoder-based data imputation method to fill missing data modality (of certain AVs) with the available ones. To further reconcile the heterogeneity, we finally present a client selection mechanism exploiting the similarities among client models to improve both training stability and convergence rate. Our experiments on benchmark dataset confirm that AutoFed substantially improves over status quo approaches in both precision and recall, while demonstrating strong robustness to adverse weather conditions.
    Efficient Classification of SARS-CoV-2 Spike Sequences Using Federated Learning. (arXiv:2302.08688v1 [cs.LG])
    This paper presents a federated learning (FL) approach to train an AI model for SARS-Cov-2 coronavirus variant identification. We analyze the SARS-CoV-2 spike sequences in a distributed way, without data sharing, to detect different variants of the rapidly mutating coronavirus. A vast amount of sequencing data of SARS-CoV-2 is available due to various genomic monitoring initiatives by several nations. However, privacy concerns involving patient health information and national public health conditions could hinder openly sharing this data. In this work, we propose a lightweight FL paradigm to cooperatively analyze the spike protein sequences of SARS-CoV-2 privately, using the locally stored data to train a prediction model from remote nodes. Our method maintains the confidentiality of local data (that could be stored in different locations) yet allows us to reliably detect and identify different known and unknown variants of the novel coronavirus SARS-CoV-2. We compare the performance of our approach on spike sequence data with the recently proposed state-of-the-art methods for classification from spike sequences. Using the proposed approach, we achieve an overall accuracy of $93\%$ on the coronavirus variant identification task. To the best of our knowledge, this is the first work in the federated learning paradigm for biological sequence analysis. Since the proposed model is distributed in nature, it could scale on ``Big Data'' easily. We plan to use this proof-of-concept to implement a privacy-preserving pandemic response strategy.
    A Probabilistic Generative Model for Tracking Multi-Knowledge Concept Mastery Probability. (arXiv:2302.08673v1 [cs.LG])
    Knowledge tracing aims to track students' knowledge status over time to predict students' future performance accurately. Markov chain-based knowledge tracking (MCKT) models can track knowledge concept mastery probability over time. However, as the number of tracked knowledge concepts increases, the time complexity of MCKT predicting student performance increases exponentially (also called explaining away problem. In addition, the existing MCKT models only consider the relationship between students' knowledge status and problems when modeling students' responses but ignore the relationship between knowledge concepts in the same problem. To address these challenges, we propose an inTerpretable pRobAbilistiC gEnerative moDel (TRACED), which can track students' numerous knowledge concepts mastery probabilities over time. To solve \emph{explain away problem}, we design Long and Short-Term Memory (LSTM)-based networks to approximate the posterior distribution, predict students' future performance, and propose a heuristic algorithm to train LSTMs and probabilistic graphical model jointly. To better model students' exercise responses, we proposed a logarithmic linear model with three interactive strategies, which models students' exercise responses by considering the relationship among students' knowledge status, knowledge concept, and problems. We conduct experiments with four real-world datasets in three knowledge-driven tasks. The experimental results show that TRACED outperforms existing knowledge tracing methods in predicting students' future performance and can learn the relationship among students, knowledge concepts, and problems from students' exercise sequences. We also conduct several case studies. The case studies show that TRACED exhibits excellent interpretability and thus has the potential for personalized automatic feedback in the real-world educational environment.
    PAC-Bayesian Generalization Bounds for Adversarial Generative Models. (arXiv:2302.08942v1 [cs.LG])
    We extend PAC-Bayesian theory to generative models and develop generalization bounds for models based on the Wasserstein distance and the total variation distance. Our first result on the Wasserstein distance assumes the instance space is bounded, while our second result takes advantage of dimensionality reduction. Our results naturally apply to Wasserstein GANs and Energy-Based GANs, and our bounds provide new training objectives for these two. Although our work is mainly theoretical, we perform numerical experiments showing non-vacuous generalization bounds for Wasserstein GANs on synthetic datasets.
    Learning Causal Representations of Single Cells via Sparse Mechanism Shift Modeling. (arXiv:2211.03553v4 [q-bio.GN] UPDATED)
    Latent variable models such as the Variational Auto-Encoder (VAE) have become a go-to tool for analyzing biological data, especially in the field of single-cell genomics. One remaining challenge is the interpretability of latent variables as biological processes that define a cell's identity. Outside of biological applications, this problem is commonly referred to as learning disentangled representations. Although several disentanglement-promoting variants of the VAE were introduced, and applied to single-cell genomics data, this task has been shown to be infeasible from independent and identically distributed measurements, without additional structure. Instead, recent methods propose to leverage non-stationary data, as well as the sparse mechanism shift assumption in order to learn disentangled representations with a causal semantic. Here, we extend the application of these methodological advances to the analysis of single-cell genomics data with genetic or chemical perturbations. More precisely, we propose a deep generative model of single-cell gene expression data for which each perturbation is treated as a stochastic intervention targeting an unknown, but sparse, subset of latent variables. We benchmark these methods on simulated single-cell data to evaluate their performance at latent units recovery, causal target identification and out-of-domain generalization. Finally, we apply those approaches to two real-world large-scale gene perturbation data sets and find that models that exploit the sparse mechanism shift hypothesis surpass contemporary methods on a transfer learning task. We implement our new model and benchmarks using the scvi-tools library, and release it as open-source software at https://github.com/Genentech/sVAE.
    SGD with AdaGrad Stepsizes: Full Adaptivity with High Probability to Unknown Parameters, Unbounded Gradients and Affine Variance. (arXiv:2302.08783v1 [cs.LG])
    We study Stochastic Gradient Descent with AdaGrad stepsizes: a popular adaptive (self-tuning) method for first-order stochastic optimization. Despite being well studied, existing analyses of this method suffer from various shortcomings: they either assume some knowledge of the problem parameters, impose strong global Lipschitz conditions, or fail to give bounds that hold with high probability. We provide a comprehensive analysis of this basic method without any of these limitations, in both the convex and non-convex (smooth) cases, that additionally supports a general ``affine variance'' noise model and provides sharp rates of convergence in both the low-noise and high-noise~regimes.
    Learnable Topological Features for Phylogenetic Inference via Graph Neural Networks. (arXiv:2302.08840v1 [stat.ML])
    Structural information of phylogenetic tree topologies plays an important role in phylogenetic inference. However, finding appropriate topological structures for specific phylogenetic inference tasks often requires significant design effort and domain expertise. In this paper, we propose a novel structural representation method for phylogenetic inference based on learnable topological features. By combining the raw node features that minimize the Dirichlet energy with modern graph representation learning techniques, our learnable topological features can provide efficient structural information of phylogenetic trees that automatically adapts to different downstream tasks without requiring domain expertise. We demonstrate the effectiveness and efficiency of our method on a simulated data tree probability estimation task and a benchmark of challenging real data variational Bayesian phylogenetic inference problems.
    Highly connected dynamic artificial neural networks. (arXiv:2302.08928v1 [cs.LG])
    An object-oriented approach to implementing artificial neural networks is introduced in this article. The networks obtained in this way are highly connected in that they admit edges between nodes in any layers of the network, and dynamic, in that the insertion, or deletion, of nodes, edges or layers of nodes can be effected in a straightforward way. In addition, the activation functions of nodes need not be uniform within layers, and can also be changed within individual nodes. Methods for implementing the feedforward step and the backpropagation technique in such networks are presented here. Methods for creating networks, for implementing the various dynamic properties and for saving and recreating networks are also described.
    Multi-View Clustering from the Perspective of Mutual Information. (arXiv:2302.08743v1 [cs.LG])
    Exploring the complementary information of multi-view data to improve clustering effects is a crucial issue in multi-view clustering. In this paper, we propose a novel model based on information theory termed Informative Multi-View Clustering (IMVC), which extracts the common and view-specific information hidden in multi-view data and constructs a clustering-oriented comprehensive representation. More specifically, we concatenate multiple features into a unified feature representation, then pass it through a encoder to retrieve the common representation across views. Simultaneously, the features of each view are sent to a encoder to produce a compact view-specific representation, respectively. Thus, we constrain the mutual information between the common representation and view-specific representations to be minimal for obtaining multi-level information. Further, the common representation and view-specific representation are spliced to model the refined representation of each view, which is fed into a decoder to reconstruct the initial data with maximizing their mutual information. In order to form a comprehensive representation, the common representation and all view-specific representations are concatenated. Furthermore, to accommodate the comprehensive representation better for the clustering task, we maximize the mutual information between an instance and its k-nearest neighbors to enhance the intra-cluster aggregation, thus inducing well separation of different clusters at the overall aspect. Finally, we conduct extensive experiments on six benchmark datasets, and the experimental results indicate that the proposed IMVC outperforms other methods.
    A Three-Phase Artificial Orcas Algorithm for Continuous and Discrete Problems. (arXiv:2302.08855v1 [cs.NE])
    In this paper, a new swarm intelligence algorithm based on orca behaviors is proposed for problem solving. The algorithm called artificial orca algorithm (AOA) consists of simulating the orca lifestyle and in particular the social organization, the echolocation mechanism, and some hunting techniques. The originality of the proposal is that for the first time a meta-heuristic simulates simultaneously several behaviors of just one animal species. AOA was adapted to discrete problems and applied on the maze game with four level of complexity. A bunch of substantial experiments were undertaken to set the algorithm parameters for this issue. The algorithm performance was assessed by considering the success rate, the run time, and the solution path size. Finally, for comparison purposes, the authors conducted a set of experiments on state-of-the-art evolutionary algorithms, namely ACO, BA, BSO, EHO, PSO, and WOA. The overall obtained results clearly show the superiority of AOA over the other tested algorithms.
    Utilization of domain knowledge to improve POMDP belief estimation. (arXiv:2302.08748v1 [cs.AI])
    The partially observable Markov decision process (POMDP) framework is a common approach for decision making under uncertainty. Recently, multiple studies have shown that by integrating relevant domain knowledge into POMDP belief estimation, we can improve the learned policy's performance. In this study, we propose a novel method for integrating the domain knowledge into probabilistic belief update in POMDP framework using Jeffrey's rule and normalization. We show that the domain knowledge can be utilized to reduce the data requirement and improve performance for POMDP policy learning with RL.
    (S)GD over Diagonal Linear Networks: Implicit Regularisation, Large Stepsizes and Edge of Stability. (arXiv:2302.08982v1 [cs.LG])
    In this paper, we investigate the impact of stochasticity and large stepsizes on the implicit regularisation of gradient descent (GD) and stochastic gradient descent (SGD) over diagonal linear networks. We prove the convergence of GD and SGD with macroscopic stepsizes in an overparametrised regression setting and characterise their solutions through an implicit regularisation problem. Our crisp characterisation leads to qualitative insights about the impact of stochasticity and stepsizes on the recovered solution. Specifically, we show that large stepsizes consistently benefit SGD for sparse regression problems, while they can hinder the recovery of sparse solutions for GD. These effects are magnified for stepsizes in a tight window just below the divergence threshold, in the ``edge of stability'' regime. Our findings are supported by experimental results.
    Minimizing Dynamic Regret on Geodesic Metric Spaces. (arXiv:2302.08652v1 [cs.LG])
    In this paper, we consider the sequential decision problem where the goal is to minimize the general dynamic regret on a complete Riemannian manifold. The task of offline optimization on such a domain, also known as a geodesic metric space, has recently received significant attention. The online setting has received significantly less attention, and it has remained an open question whether the body of results that hold in the Euclidean setting can be transplanted into the land of Riemannian manifolds where new challenges (e.g., curvature) come into play. In this paper, we show how to get optimistic regret bound on manifolds with non-positive curvature whenever improper learning is allowed and propose an array of adaptive no-regret algorithms. To the best of our knowledge, this is the first work that considers general dynamic regret and develops "optimistic" online learning algorithms which can be employed on geodesic metric spaces.
    Data Driven Reward Initialization for Preference based Reinforcement Learning. (arXiv:2302.08733v1 [cs.LG])
    Preference-based Reinforcement Learning (PbRL) methods utilize binary feedback from the human in the loop (HiL) over queried trajectory pairs to learn a reward model in an attempt to approximate the human's underlying reward function capturing their preferences. In this work, we investigate the issue of a high degree of variability in the initialized reward models which are sensitive to random seeds of the experiment. This further compounds the issue of degenerate reward functions PbRL methods already suffer from. We propose a data-driven reward initialization method that does not add any additional cost to the human in the loop and negligible cost to the PbRL agent and show that doing so ensures that the predicted rewards of the initialized reward model are uniform in the state space and this reduces the variability in the performance of the method across multiple runs and is shown to improve the overall performance compared to other initialization methods.
    Lip-to-Speech Synthesis in the Wild with Multi-task Learning. (arXiv:2302.08841v1 [cs.SD])
    Recent studies have shown impressive performance in Lip-to-speech synthesis that aims to reconstruct speech from visual information alone. However, they have been suffering from synthesizing accurate speech in the wild, due to insufficient supervision for guiding the model to infer the correct content. Distinct from the previous methods, in this paper, we develop a powerful Lip2Speech method that can reconstruct speech with correct contents from the input lip movements, even in a wild environment. To this end, we design multi-task learning that guides the model using multimodal supervision, i.e., text and audio, to complement the insufficient word representations of acoustic feature reconstruction loss. Thus, the proposed framework brings the advantage of synthesizing speech containing the right content of multiple speakers with unconstrained sentences. We verify the effectiveness of the proposed method using LRS2, LRS3, and LRW datasets.
    Post-Episodic Reinforcement Learning Inference. (arXiv:2302.08854v1 [stat.ML])
    We consider estimation and inference with data collected from episodic reinforcement learning (RL) algorithms; i.e. adaptive experimentation algorithms that at each period (aka episode) interact multiple times in a sequential manner with a single treated unit. Our goal is to be able to evaluate counterfactual adaptive policies after data collection and to estimate structural parameters such as dynamic treatment effects, which can be used for credit assignment (e.g. what was the effect of the first period action on the final outcome). Such parameters of interest can be framed as solutions to moment equations, but not minimizers of a population loss function, leading to Z-estimation approaches in the case of static data. However, such estimators fail to be asymptotically normal in the case of adaptive data collection. We propose a re-weighted Z-estimation approach with carefully designed adaptive weights to stabilize the episode-varying estimation variance, which results from the nonstationary policy that typical episodic RL algorithms invoke. We identify proper weighting schemes to restore the consistency and asymptotic normality of the re-weighted Z-estimators for target parameters, which allows for hypothesis testing and constructing reliable confidence regions for target parameters of interest. Primary applications include dynamic treatment effect estimation and dynamic off-policy evaluation.
    Optimal Training of Mean Variance Estimation Neural Networks. (arXiv:2302.08875v1 [stat.ML])
    This paper focusses on the optimal implementation of a Mean Variance Estimation network (MVE network) (Nix and Weigend, 1994). This type of network is often used as a building block for uncertainty estimation methods in a regression setting, for instance Concrete dropout (Gal et al., 2017) and Deep Ensembles (Lakshminarayanan et al., 2017). Specifically, an MVE network assumes that the data is produced from a normal distribution with a mean function and variance function. The MVE network outputs a mean and variance estimate and optimizes the network parameters by minimizing the negative loglikelihood. In this paper, we discuss two points: firstly, the convergence difficulties reported in recent work can be relatively easily prevented by following the recommendation from the original authors that a warm-up period should be used. During this period, only the mean is optimized assuming a fixed variance. This recommendation is often not used in practice. We experimentally demonstrate how essential this step is. We also examine if keeping the mean estimate fixed after the warm-up leads to different results than estimating both the mean and the variance simultaneously after the warm-up. We do not observe a substantial difference. Secondly, we propose a novel improvement of the MVE network: separate regularization of the mean and the variance estimate. We demonstrate, both on toy examples and on a number of benchmark UCI regression data sets, that following the original recommendations and the novel separate regularization can lead to significant improvements.
    Subsampling Suffices for Adaptive Data Analysis. (arXiv:2302.08661v1 [cs.LG])
    Ensuring that analyses performed on a dataset are representative of the entire population is one of the central problems in statistics. Most classical techniques assume that the dataset is independent of the analyst's query and break down in the common setting where a dataset is reused for multiple, adaptively chosen, queries. This problem of \emph{adaptive data analysis} was formalized in the seminal works of Dwork et al. (STOC, 2015) and Hardt and Ullman (FOCS, 2014). We identify a remarkably simple set of assumptions under which the queries will continue to be representative even when chosen adaptively: The only requirements are that each query takes as input a random subsample and outputs few bits. This result shows that the noise inherent in subsampling is sufficient to guarantee that query responses generalize. The simplicity of this subsampling-based framework allows it to model a variety of real-world scenarios not covered by prior work. In addition to its simplicity, we demonstrate the utility of this framework by designing mechanisms for two foundational tasks, statistical queries and median finding. In particular, our mechanism for answering the broadly applicable class of statistical queries is both extremely simple and state of the art in many parameter regimes.
    GPT4MIA: Utilizing Geneative Pre-trained Transformer (GPT-3) as A Plug-and-Play Transductive Model for Medical Image Analysis. (arXiv:2302.08722v1 [cs.CV])
    In this paper, we propose a novel approach (called GPT4MIA) that utilizes Generative Pre-trained Transformer (GPT) as a plug-and-play transductive inference tool for medical image analysis (MIA). We provide theoretical analysis on why a large pre-trained language model such as GPT-3 can be used as a plug-and-play transductive inference model for MIA. At the methodological level, we develop several technical treatments to improve the efficiency and effectiveness of GPT4MIA, including better prompt structure design, sample selection, and prompt ordering of representative samples/features. We present two concrete use cases (with workflow) of GPT4MIA: (1) detecting prediction errors and (2) improving prediction accuracy, working in conjecture with well-established vision-based models for image classification (e.g., ResNet). Experiments validate that our proposed method is effective for these two tasks. We further discuss the opportunities and challenges in utilizing Transformer-based large language models for broader MIA applications.
    Piecewise Deterministic Markov Processes for Bayesian Neural Networks. (arXiv:2302.08724v1 [stat.ML])
    Inference on modern Bayesian Neural Networks (BNNs) often relies on a variational inference treatment, imposing violated assumptions of independence and the form of the posterior. Traditional MCMC approaches avoid these assumptions at the cost of increased computation due to its incompatibility to subsampling of the likelihood. New Piecewise Deterministic Markov Process (PDMP) samplers permit subsampling, though introduce a model specific inhomogenous Poisson Process (IPPs) which is difficult to sample from. This work introduces a new generic and adaptive thinning scheme for sampling from these IPPs, and demonstrates how this approach can accelerate the application of PDMPs for inference in BNNs. Experimentation illustrates how inference with these methods is computationally feasible, can improve predictive accuracy, MCMC mixing performance, and provide informative uncertainty measurements when compared against other approximate inference schemes.
    Improving Transformer-based Networks With Locality For Automatic Speaker Verification. (arXiv:2302.08639v1 [eess.AS])
    Recently, Transformer-based architectures have been explored for speaker embedding extraction. Although the Transformer employs the self-attention mechanism to efficiently model the global interaction between token embeddings, it is inadequate for capturing short-range local context, which is essential for the accurate extraction of speaker information. In this study, we enhance the Transformer with the locality modeling in two directions. First, we propose the Locality-Enhanced Conformer (LE-Confomer) by introducing depth-wise convolution and channel-wise attention into the Conformer blocks. Second, we present the Speaker Swin Transformer (SST) by adapting the Swin Transformer, originally proposed for vision tasks, into speaker embedding network. We evaluate the proposed approaches on the VoxCeleb datasets and a large-scale Microsoft internal multilingual (MS-internal) dataset. The proposed models achieve 0.75% EER on VoxCeleb 1 test set, outperforming the previously proposed Transformer-based models and CNN-based models, such as ResNet34 and ECAPA-TDNN. When trained on the MS-internal dataset, the proposed models achieve promising results with 14.6% relative reduction in EER over the Res2Net50 model.
    Metropolitan Segment Traffic Speeds from Massive Floating Car Data in 10 Cities. (arXiv:2302.08761v1 [cs.LG])
    Traffic analysis is crucial for urban operations and planning, while the availability of dense urban traffic data beyond loop detectors is still scarce. We present a large-scale floating vehicle dataset of per-street segment traffic information, Metropolitan Segment Traffic Speeds from Massive Floating Car Data in 10 Cities (MeTS-10), available for 10 global cities with a 15-minute resolution for collection periods ranging between 108 and 361 days in 2019-2021 and covering more than 1500 square kilometers per metropolitan area. MeTS-10 features traffic speed information at all street levels from main arterials to local streets for Antwerp, Bangkok, Barcelona, Berlin, Chicago, Istanbul, London, Madrid, Melbourne and Moscow. The dataset leverages the industrial-scale floating vehicle Traffic4cast data with speeds and vehicle counts provided in a privacy-preserving spatio-temporal aggregation. We detail the efficient matching approach mapping the data to the OpenStreetMap road graph. We evaluate the dataset by comparing it with publicly available stationary vehicle detector data (for Berlin, London, and Madrid) and the Uber traffic speed dataset (for Barcelona, Berlin, and London). The comparison highlights the differences across datasets in spatio-temporal coverage and variations in the reported traffic caused by the binning method. MeTS-10 enables novel, city-wide analysis of mobility and traffic patterns for ten major world cities, overcoming current limitations of spatially sparse vehicle detector data. The large spatial and temporal coverage offers an opportunity for joining the MeTS-10 with other datasets, such as traffic surveys in traffic planning studies or vehicle detector data in traffic control settings.
    THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression. (arXiv:2302.08545v1 [cs.LG])
    Deep neural networks (DNNs) are the de-facto standard for essential use cases, such as image classification, computer vision, and natural language processing. As DNNs and datasets get larger, they require distributed training on increasingly larger clusters. A main bottleneck is then the resulting communication overhead where workers exchange model updates (i.e., gradients) on a per-round basis. To address this bottleneck and accelerate training, a widely-deployed approach is compression. However, previous deployments often apply bi-directional compression schemes by simply using a uni-directional gradient compression scheme in each direction. This results in significant computational overheads at the parameter server and increased compression error, leading to longer training and lower accuracy. We introduce Tensor Homomorphic Compression (THC), a novel bi-directional compression framework that enables the direct aggregation of compressed values while optimizing the bandwidth to accuracy tradeoff, thus eliminating the aforementioned overheads. Moreover, THC is compatible with in-network aggregation (INA), which allows for further acceleration. Evaluation over a testbed shows that THC improves time-to-accuracy in comparison to alternatives by up to 1.32x with a software PS and up to 1.51x using INA. Finally, we demonstrate that THC is scalable and tolerant for acceptable packet-loss rates.
    Online Learning Guided Curvature Approximation: A Quasi-Newton Method with Global Non-Asymptotic Superlinear Convergence. (arXiv:2302.08580v1 [math.OC])
    Quasi-Newton algorithms are among the most popular iterative methods for solving unconstrained minimization problems, largely due to their favorable superlinear convergence property. However, existing results for these algorithms are limited as they provide either (i) a global convergence guarantee with an asymptotic superlinear convergence rate, or (ii) a local non-asymptotic superlinear rate for the case that the initial point and the initial Hessian approximation are chosen properly. Furthermore, these results are not composable, since when the iterates of the globally convergent methods reach the region of local superlinear convergence, it cannot be guaranteed the Hessian approximation matrix will satisfy the required conditions for a non-asymptotic local superlienar convergence rate. In this paper, we close this gap and present the first globally convergent quasi-Newton method with an explicit non-asymptotic superlinear convergence rate. Unlike classical quasi-Newton methods, we build our algorithm upon the hybrid proximal extragradient method and propose a novel online learning framework for updating the Hessian approximation matrices. Specifically, guided by the convergence analysis, we formulate the Hessian approximation update as an online convex optimization problem in the space of matrices, and relate the bounded regret of the online problem to the superlinear convergence of our method.
    Visual deep learning-based explanation for neuritic plaques segmentation in Alzheimer's Disease using weakly annotated whole slide histopathological images. (arXiv:2302.08511v1 [eess.IV])
    Quantifying the distribution and morphology of tau protein structures in brain tissues is key to diagnosing Alzheimer's Disease (AD) and its subtypes. Recently, deep learning (DL) models such as UNet have been successfully used for automatic segmentation of histopathological whole slide images (WSI) of biological tissues. In this study, we propose a DL-based methodology for semantic segmentation of tau lesions (i.e., neuritic plaques) in WSI of postmortem patients with AD. The state of the art in semantic segmentation of neuritic plaques in human WSI is very limited. Our study proposes a baseline able to generate a significant advantage for morphological analysis of these tauopathies for further stratification of AD patients. Essential discussions concerning biomarkers (ALZ50 versus AT8 tau antibodies), the imaging modality (different slide scanner resolutions), and the challenge of weak annotations are addressed within this seminal study. The analysis of the impact of context in plaque segmentation is important to understand the role of the micro-environment for reliable tau protein segmentation. In addition, by integrating visual interpretability, we are able to explain how the network focuses on a region of interest (ROI), giving additional insights to pathologists. Finally, the release of a new expert-annotated database and the code (\url{https://github.com/aramis-lab/miccai2022-stratifiad.git}) will be helpful for the scientific community to accelerate the development of new pipelines for human WSI processing in AD.
    Search to Capture Long-range Dependency with Stacking GNNs for Graph Classification. (arXiv:2302.08671v1 [cs.LG])
    In recent years, Graph Neural Networks (GNNs) have been popular in the graph classification task. Currently, shallow GNNs are more common due to the well-known over-smoothing problem facing deeper GNNs. However, they are sub-optimal without utilizing the information from distant nodes, i.e., the long-range dependencies. The mainstream methods in the graph classification task can extract the long-range dependencies either by designing the pooling operations or incorporating the higher-order neighbors, while they have evident drawbacks by modifying the original graph structure, which may result in information loss in graph structure learning. In this paper, by justifying the smaller influence of the over-smoothing problem in the graph classification task, we evoke the importance of stacking-based GNNs and then employ them to capture the long-range dependencies without modifying the original graph structure. To achieve this, two design needs are given for stacking-based GNNs, i.e., sufficient model depth and adaptive skip-connection schemes. By transforming the two design needs into designing data-specific inter-layer connections, we propose a novel approach with the help of neural architecture search (NAS), which is dubbed LRGNN (Long-Range Graph Neural Networks). Extensive experiments on five datasets show that the proposed LRGNN can achieve the best performance, and obtained data-specific GNNs with different depth and skip-connection schemes, which can better capture the long-range dependencies.
    Modeling Polypharmacy and Predicting Drug-Drug Interactions using Deep Generative Models on Multimodal Graphs. (arXiv:2302.08680v1 [cs.LG])
    Latent representations of drugs and their targets produced by contemporary graph autoencoder models have proved useful in predicting many types of node-pair interactions on large networks, including drug-drug, drug-target, and target-target interactions. However, most existing approaches model either the node's latent spaces in which node distributions are rigid or do not effectively capture the interrelations between drugs; these limitations hinder the methods from accurately predicting drug-pair interactions. In this paper, we present the effectiveness of variational graph autoencoders (VGAE) in modeling latent node representations on multimodal networks. Our approach can produce flexible latent spaces for each node type of the multimodal graph; the embeddings are used later for predicting links among node pairs under different edge types. To further enhance the models' performance, we suggest a new method that concatenates Morgan fingerprints, which capture the molecular structures of each drug, with their latent embeddings before preceding them to the decoding stage for link prediction. Our proposed model shows competitive results on three multimodal networks: (1) a multimodal graph consisting of drug and protein nodes, (2) a multimodal graph constructed from a subset of the DrugBank database involving drug nodes under different interaction types, and (3) a multimodal graph consisting of drug and cell line nodes. Our source code is publicly available at https://github.com/HySonLab/drug-interactions.
    PAC Prediction Sets for Large Language Models of Code. (arXiv:2302.08703v1 [cs.LG])
    Prediction sets have recently been shown to be a promising strategy for quantifying the uncertainty of deep neural networks in a way that provides theoretical guarantees. However, existing techniques have largely targeted settings where the space of labels is simple, so prediction sets can be arbitrary subsets of labels. For structured prediction problems where the space of labels is exponential in size, even prediction sets containing a small fraction of all labels can be exponentially large. In the context of code generation, we propose a solution that considers a restricted set of prediction sets that can compactly be represented as partial programs, which are programs with portions replaced with holes. Given a trained code generation model, our algorithm leverages a programming language's abstract syntax tree to generate a set of programs such that the correct program is in the set with high-confidence. Valuable applications of our algorithm include a Codex-style code generator with holes in uncertain parts of the generated code, which provides a partial program with theoretical guarantees. We evaluate our approach on PICARD (a T5 model for SQL semantic parsing) and Codex (a GPT model for over a dozen programming languages, including Python), demonstrating that our approach generates compact PAC prediction sets. This is the first research contribution that generates PAC prediction sets for generative code models.
    Uniformity Testing over Hypergrids with Subcube Conditioning. (arXiv:2302.09013v1 [cs.DS])
    We give an algorithm for testing uniformity of distributions supported on hypergrids $[m]^n$, which makes $\tilde{O}(\text{poly}(m)\sqrt{n}/\epsilon^2)$ queries to a subcube conditional sampling oracle. When the side length $m$ of the hypergrid is a constant, our algorithm is nearly optimal and strengthens the algorithm of [CCK+21] which has the same query complexity but works for hypercubes $\{\pm 1\}^n$ only. A key technical contribution behind the analysis of our algorithm is a proof of a robust version of Pisier's inequality for functions over $\mathbb{Z}_m^n$ using Fourier analysis.
    Swapped goal-conditioned offline reinforcement learning. (arXiv:2302.08865v1 [cs.LG])
    Offline goal-conditioned reinforcement learning (GCRL) can be challenging due to overfitting to the given dataset. To generalize agents' skills outside the given dataset, we propose a goal-swapping procedure that generates additional trajectories. To alleviate the problem of noise and extrapolation errors, we present a general offline reinforcement learning method called deterministic Q-advantage policy gradient (DQAPG). In the experiments, DQAPG outperforms state-of-the-art goal-conditioned offline RL methods in a wide range of benchmark tasks, and goal-swapping further improves the test results. It is noteworthy, that the proposed method obtains good performance on the challenging dexterous in-hand manipulation tasks for which the prior methods failed.
    In-memory factorization of holographic perceptual representations. (arXiv:2211.05052v2 [cs.ET] UPDATED)
    Disentanglement of constituent factors of a sensory signal is central to perception and cognition and hence is a critical task for future artificial intelligence systems. In this paper, we present a compute engine capable of efficiently factorizing holographic perceptual representations by exploiting the computation-in-superposition capability of brain-inspired hyperdimensional computing and the intrinsic stochasticity associated with analog in-memory computing based on nanoscale memristive devices. Such an iterative in-memory factorizer is shown to solve at least five orders of magnitude larger problems that cannot be solved otherwise, while also significantly lowering the computational time and space complexity. We present a large-scale experimental demonstration of the factorizer by employing two in-memory compute chips based on phase-change memristive devices. The dominant matrix-vector multiply operations are executed at O(1) thus reducing the computational time complexity to merely the number of iterations. Moreover, we experimentally demonstrate the ability to factorize visual perceptual representations reliably and efficiently.
    Graph Feedback via Reduction to Regression. (arXiv:2302.08631v1 [cs.LG])
    When feedback is partial, leveraging all available information is critical to minimizing data requirements. Graph feedback, which interpolates between the supervised and bandit regimes, has been extensively studied; but the mature theory is grounded in impractical algorithms. We present and analyze an approach to contextual bandits with graph feedback based upon reduction to regression. The resulting algorithms are practical and achieve known minimax rates.
    On the Sparse DAG Structure Learning Based on Adaptive Lasso. (arXiv:2209.02946v3 [stat.ML] UPDATED)
    Learning the underlying Bayesian Networks (BNs), represented by directed acyclic graphs (DAGs), of the concerned events from purely-observational data is a crucial part of evidential reasoning. This task remains challenging due to the large and discrete search space. A recent flurry of developments followed NOTEARS[1] recast this combinatorial problem into a continuous optimization problem by leveraging an algebraic equality characterization of acyclicity. However, the continuous optimization methods suffer from obtaining non-spare graphs after the numerical optimization, which leads to the inflexibility to rule out the potentially cycle-inducing edges or false discovery edges with small values. To address this issue, in this paper, we develop a completely data-driven DAG structure learning method without a predefined value to post-threshold small values. We name our method NOTEARS with adaptive Lasso (NOTEARS-AL), which is achieved by applying the adaptive penalty method to ensure the sparsity of the estimated DAG. Moreover, we show that NOTEARS-AL also inherits the oracle properties under some specific conditions. Extensive experiments on both synthetic and a real-world dataset demonstrate that our method consistently outperforms NOTEARS.
    MiDi: Mixed Graph and 3D Denoising Diffusion for Molecule Generation. (arXiv:2302.09048v1 [cs.LG])
    This work introduces MiDi, a diffusion model for jointly generating molecular graphs and corresponding 3D conformers. In contrast to existing models, which derive molecular bonds from the conformation using predefined rules, MiDi streamlines the molecule generation process with an end-to-end differentiable model. Experimental results demonstrate the benefits of this approach: on the complex GEOM-DRUGS dataset, our model generates significantly better molecular graphs than 3D-based models and even surpasses specialized algorithms that directly optimize the bond orders for validity. Our code is available at github.com/cvignac/MiDi.
    Fast Temporal Wavelet Graph Neural Networks. (arXiv:2302.08643v1 [cs.LG])
    Spatio-temporal signals forecasting plays an important role in numerous domains, especially in neuroscience and transportation. The task is challenging due to the highly intricate spatial structure, as well as the non-linear temporal dynamics of the network. To facilitate reliable and timely forecast for the human brain and traffic networks, we propose the Fast Temporal Wavelet Graph Neural Networks (FTWGNN) that is both time- and memory-efficient for learning tasks on timeseries data with the underlying graph structure, thanks to the theories of multiresolution analysis and wavelet theory on discrete spaces. We employ Multiresolution Matrix Factorization (MMF) (Kondor et al., 2014) to factorize the highly dense graph structure and compute the corresponding sparse wavelet basis that allows us to construct fast wavelet convolution as the backbone of our novel architecture. Experimental results on real-world PEMS-BAY, METR-LA traffic datasets and AJILE12 ECoG dataset show that FTWGNN is competitive with the state-of-the-arts while maintaining a low computational footprint.
    Generative Causal Representation Learning for Out-of-Distribution Motion Forecasting. (arXiv:2302.08635v1 [cs.LG])
    Conventional supervised learning methods typically assume i.i.d samples and are found to be sensitive to out-of-distribution (OOD) data. We propose Generative Causal Representation Learning (GCRL) which leverages causality to facilitate knowledge transfer under distribution shifts. While we evaluate the effectiveness of our proposed method in human trajectory prediction models, GCRL can be applied to other domains as well. First, we propose a novel causal model that explains the generative factors in motion forecasting datasets using features that are common across all environments and with features that are specific to each environment. Selection variables are used to determine which parts of the model can be directly transferred to a new environment without fine-tuning. Second, we propose an end-to-end variational learning paradigm to learn the causal mechanisms that generate observations from features. GCRL is supported by strong theoretical results that imply identifiability of the causal model under certain assumptions. Experimental results on synthetic and real-world motion forecasting datasets show the robustness and effectiveness of our proposed method for knowledge transfer under zero-shot and low-shot settings by substantially outperforming the prior motion forecasting models on out-of-distribution prediction.
    Approaching epidemiological dynamics of COVID-19 with physics-informed neural networks. (arXiv:2302.08796v1 [q-bio.QM])
    A physics-informed neural network (PINN) embedded with the susceptible-infected-removed (SIR) model is devised to understand the temporal evolution dynamics of infectious diseases. Firstly, the effectiveness of this approach is demonstrated on synthetic data as generated from the numerical solution of the susceptible-asymptomatic-infected-recovered-dead (SAIRD) model. Then, the method is applied to COVID-19 data reported for Germany and shows that it can accurately identify and predict virus spread trends. The results indicate that an incomplete physics-informed model can approach more complicated dynamics efficiently. Thus, the present work demonstrates the high potential of using machine learning methods, e.g., PINNs, to study and predict epidemic dynamics in combination with compartmental models.
    Foundation Models for Natural Language Processing -- Pre-trained Language Models Integrating Media. (arXiv:2302.08575v1 [cs.CL])
    This open access book provides a comprehensive overview of the state of the art in research and applications of Foundation Models and is intended for readers familiar with basic Natural Language Processing (NLP) concepts. Over the recent years, a revolutionary new paradigm has been developed for training models for NLP. These models are first pre-trained on large collections of text documents to acquire general syntactic knowledge and semantic information. Then, they are fine-tuned for specific tasks, which they can often solve with superhuman accuracy. When the models are large enough, they can be instructed by prompts to solve new tasks without any fine-tuning. Moreover, they can be applied to a wide range of different media and problem domains, ranging from image and video processing to robot control learning. Because they provide a blueprint for solving many tasks in artificial intelligence, they have been called Foundation Models. After a brief introduction to basic NLP models the main pre-trained language models BERT, GPT and sequence-to-sequence transformer are described, as well as the concepts of self-attention and context-sensitive embedding. Then, different approaches to improving these models are discussed, such as expanding the pre-training criteria, increasing the length of input texts, or including extra knowledge. An overview of the best-performing models for about twenty application areas is then presented, e.g., question answering, translation, story generation, dialog systems, generating images from text, etc. For each application area, the strengths and weaknesses of current models are discussed, and an outlook on further developments is given. In addition, links are provided to freely available program code. A concluding chapter summarizes the economic opportunities, mitigation of risks, and potential developments of AI.
    Pretraining Language Models with Human Preferences. (arXiv:2302.08582v1 [cs.CL])
    Language models (LMs) are pretrained to imitate internet text, including content that would violate human preferences if generated by an LM: falsehoods, offensive comments, personally identifiable information, low-quality or buggy code, and more. Here, we explore alternative objectives for pretraining LMs in a way that also guides them to generate text aligned with human preferences. We benchmark five objectives for pretraining with human feedback across three tasks and study how they affect the trade-off between alignment and capabilities of pretrained LMs. We find a Pareto-optimal and simple approach among those we explored: conditional training, or learning distribution over tokens conditional on their human preference scores given by a reward model. Conditional training reduces the rate of undesirable content by up to an order of magnitude, both when generating without a prompt and with an adversarially-chosen prompt. Moreover, conditional training maintains the downstream task performance of standard LM pretraining, both before and after task-specific finetuning. Pretraining with human feedback results in much better preference satisfaction than standard LM pretraining followed by finetuning with feedback, i.e., learning and then unlearning undesirable behavior. Our results suggest that we should move beyond imitation learning when pretraining LMs and incorporate human preferences from the start of training.
    A State Augmentation based approach to Reinforcement Learning from Human Preferences. (arXiv:2302.08734v1 [cs.AI])
    Reinforcement Learning has suffered from poor reward specification, and issues for reward hacking even in simple enough domains. Preference Based Reinforcement Learning attempts to solve the issue by utilizing binary feedbacks on queried trajectory pairs by a human in the loop indicating their preferences about the agent's behavior to learn a reward model. In this work, we present a state augmentation technique that allows the agent's reward model to be robust and follow an invariance consistency that significantly improved performance, i.e. the reward recovery and subsequent return computed using the learned policy over our baseline PEBBLE. We validate our method on three domains, Mountain Car, a locomotion task of Quadruped-Walk, and a robotic manipulation task of Sweep-Into, and find that using the proposed augmentation the agent not only benefits in the overall performance but does so, quite early in the agent's training phase.
    A Near-Optimal Algorithm for Bilevel Empirical Risk Minimization. (arXiv:2302.08766v1 [stat.ML])
    Bilevel optimization problems, which are problems where two optimization problems are nested, have more and more applications in machine learning. In many practical cases, the upper and the lower objectives correspond to empirical risk minimization problems and therefore have a sum structure. In this context, we propose a bilevel extension of the celebrated SARAH algorithm. We demonstrate that the algorithm requires $\mathcal{O}((n+m)^{\frac12}\varepsilon^{-1})$ gradient computations to achieve $\varepsilon$-stationarity with $n+m$ the total number of samples, which improves over all previous bilevel algorithms. Moreover, we provide a lower bound on the number of oracle calls required to get an approximate stationary point of the objective function of the bilevel problem. This lower bound is attained by our algorithm, which is therefore optimal in terms of sample complexity.
    Federated Learning as a Network Effects Game. (arXiv:2302.08533v1 [cs.LG])
    Federated Learning (FL) aims to foster collaboration among a population of clients to improve the accuracy of machine learning without directly sharing local data. Although there has been rich literature on designing federated learning algorithms, most prior works implicitly assume that all clients are willing to participate in a FL scheme. In practice, clients may not benefit from joining in FL, especially in light of potential costs related to issues such as privacy and computation. In this work, we study the clients' incentives in federated learning to help the service provider design better solutions and ensure clients make better decisions. We are the first to model clients' behaviors in FL as a network effects game, where each client's benefit depends on other clients who also join the network. Using this setup we analyze the dynamics of clients' participation and characterize the equilibrium, where no client has incentives to alter their decision. Specifically, we show that dynamics in the population naturally converge to equilibrium without needing explicit interventions. Finally, we provide a cost-efficient payment scheme that incentivizes clients to reach a desired equilibrium when the initial network is empty.
    Using Explainable AI to Cross-Validate Socio-economic Disparities Among Covid-19 Patient Mortality. (arXiv:2302.08605v1 [cs.LG])
    This paper applies eXplainable Artificial Intelligence (XAI) methods to investigate the socioeconomic disparities in COVID patient mortality. An Extreme Gradient Boosting (XGBoost) prediction model is built based on a de-identified Austin area hospital dataset to predict the mortality of COVID-19 patients. We apply two XAI methods, Shapley Additive exPlanations (SHAP) and Locally Interpretable Model Agnostic Explanations (LIME), to compare the global and local interpretation of feature importance. This paper demonstrates the advantages of using XAI which shows the feature importance and decisive capability. Furthermore, we use the XAI methods to cross-validate their interpretations for individual patients. The XAI models reveal that Medicare financial class, older age, and gender have high impact on the mortality prediction. We find that LIME local interpretation does not show significant differences in feature importance comparing to SHAP, which suggests pattern confirmation. This paper demonstrates the importance of XAI methods in cross-validation of feature attributions.
    Robust expected improvement for Bayesian optimization. (arXiv:2302.08612v1 [cs.LG])
    Bayesian Optimization (BO) links Gaussian Process (GP) surrogates with sequential design toward optimizing expensive-to-evaluate black-box functions. Example design heuristics, or so-called acquisition functions, like expected improvement (EI), balance exploration and exploitation to furnish global solutions under stringent evaluation budgets. However, they fall short when solving for robust optima, meaning a preference for solutions in a wider domain of attraction. Robust solutions are useful when inputs are imprecisely specified, or where a series of solutions is desired. A common mathematical programming technique in such settings involves an adversarial objective, biasing a local solver away from ``sharp'' troughs. Here we propose a surrogate modeling and active learning technique called robust expected improvement (REI) that ports adversarial methodology into the BO/GP framework. After describing the methods, we illustrate and draw comparisons to several competitors on benchmark synthetic and real problems of varying complexity.
    SAM operates far from home: eigenvalue regularization as a dynamical phenomenon. (arXiv:2302.08692v1 [cs.LG])
    The Sharpness Aware Minimization (SAM) optimization algorithm has been shown to control large eigenvalues of the loss Hessian and provide generalization benefits in a variety of settings. The original motivation for SAM was a modified loss function which penalized sharp minima; subsequent analyses have also focused on the behavior near minima. However, our work reveals that SAM provides a strong regularization of the eigenvalues throughout the learning trajectory. We show that in a simplified setting, SAM dynamically induces a stabilization related to the edge of stability (EOS) phenomenon observed in large learning rate gradient descent. Our theory predicts the largest eigenvalue as a function of the learning rate and SAM radius parameters. Finally, we show that practical models can also exhibit this EOS stabilization, and that understanding SAM must account for these dynamics far away from any minima.
    Imitation from Arbitrary Experience: A Dual Unification of Reinforcement and Imitation Learning Methods. (arXiv:2302.08560v1 [cs.LG])
    It is well known that Reinforcement Learning (RL) can be formulated as a convex program with linear constraints. The dual form of this formulation is unconstrained, which we refer to as dual RL, and can leverage preexisting tools from convex optimization to improve the learning performance of RL agents. We show that several state-of-the-art deep RL algorithms (in online, offline, and imitation settings) can be viewed as dual RL approaches in a unified framework. This unification calls for the methods to be studied on common ground, so as to identify the components that actually contribute to the success of these methods. Our unification also reveals that prior off-policy imitation learning methods in the dual space are based on an unrealistic coverage assumption and are restricted to matching a particular f-divergence. We propose a new method using a simple modification to the dual framework that allows for imitation learning with arbitrary off-policy data to obtain near-expert performance.
    Online Spatio-Temporal Correlation-Based Federated Learning for Traffic Flow Forecasting. (arXiv:2302.08658v1 [cs.LG])
    Traffic flow forecasting (TFF) is of great importance to the construction of Intelligent Transportation Systems (ITS). To mitigate communication burden and tackle with the problem of privacy leakage aroused by centralized forecasting methods, Federated Learning (FL) has been applied to TFF. However, existing FL-based approaches employ batch learning manner, which makes the pre-trained models inapplicable to subsequent traffic data, thus exhibiting subpar prediction performance. In this paper, we perform the first study of forecasting traffic flow adopting Online Learning (OL) manner in FL framework and then propose a novel prediction method named Online Spatio-Temporal Correlation-based Federated Learning (FedOSTC), aiming to guarantee performance gains regardless of traffic fluctuation. Specifically, clients employ Gated Recurrent Unit (GRU)-based encoders to obtain the internal temporal patterns inside traffic data sequences. Then, the central server evaluates spatial correlation among clients via Graph Attention Network (GAT), catering to the dynamic changes of spatial closeness caused by traffic fluctuation. Furthermore, to improve the generalization of the global model for upcoming traffic data, a period-aware aggregation mechanism is proposed to aggregate the local models which are optimized using Online Gradient Descent (OGD) algorithm at clients. We perform comprehensive experiments on two real-world datasets to validate the efficiency and effectiveness of our proposed method and the numerical results demonstrate the superiority of FedOSTC.
    Infinite Action Contextual Bandits with Reusable Data Exhaust. (arXiv:2302.08551v1 [cs.LG])
    For infinite action contextual bandits, smoothed regret and reduction to regression results in state-of-the-art online statistical performance with computational cost independent of the action set: unfortunately, the resulting data exhaust does not have well-defined importance-weights. This frustrates the execution of downstream data science processes such as offline model selection. In this paper we describe an online algorithm with an equivalent smoothed regret guarantee, but which generates well-defined importance weights: in exchange, the online computational cost increases, but only to order smoothness (i.e., still independent of the action set). This removes a key obstacle to adoption of smoothed regret in production scenarios.
    MM Algorithms to Estimate Parameters in Continuous-time Markov Chains. (arXiv:2302.08588v1 [cs.LG])
    Continuous-time Markov chains (CTMCs) are popular modeling formalism that constitutes the underlying semantics for real-time probabilistic systems such as queuing networks, stochastic process algebras, and calculi for systems biology. Prism and Storm are popular model checking tools that provide a number of powerful analysis techniques for CTMCs. These tools accept models expressed as the parallel composition of a number of modules interacting with each other. The outcome of the analysis is strongly dependent on the parameter values used in the model which govern the timing and probability of events of the resulting CTMC. However, for some applications, parameter values have to be empirically estimated from partially-observable executions. In this work, we address the problem of estimating parameter values of CTMCs expressed as Prism models from a number of partially-observable executions. We introduce the class parametric CTMCs -- CTMCs where transition rates are polynomial functions over a set of parameters -- as an abstraction of CTMCs covering a large class of Prism models. Then, building on a theory of algorithms known by the initials MM, for minorization-maximization, we present iterative maximum likelihood estimation algorithms for parametric CTMCs covering two learning scenarios: when both state-labels and dwell times are observable, or just state-labels are. We conclude by illustrating the use of our technique in a simple but non-trivial case study: the analysis of the spread of COVID-19 in presence of lockdown countermeasures.
    A Review and a Taxonomy of Edge Machine Learning: Requirements, Paradigms, and Techniques. (arXiv:2302.08571v1 [cs.LG])
    The union of Edge Computing (EC) and Artificial Intelligence (AI) has brought forward the Edge AI concept to provide intelligent solutions close to end-user environment, for privacy preservation, low latency to real-time performance, as well as resource optimization. Machine Learning (ML), as the most advanced branch of AI in the past few years, has shown encouraging results and applications in the edge environment. Nevertheless, edge powered ML solutions are more complex to realize due to the joint constraints from both edge computing and AI domains, and the corresponding solutions are expected to be efficient and adapted in technologies such as data processing, model compression, distributed inference, and advanced learning paradigms for Edge ML requirements. Despite that a great attention of Edge ML is gained in both academic and industrial communities, we noticed the lack of a complete survey on existing Edge ML technologies to provide a common understanding of this concept. To tackle this, this paper aims at providing a comprehensive taxonomy and a systematic review of Edge ML techniques: we start by identifying the Edge ML requirements driven by the joint constraints. We then survey more than twenty paradigms and techniques along with their representative work, covering two main parts: edge inference, and edge learning. In particular, we analyze how each technique fits into Edge ML by meeting a subset of the identified requirements. We also summarize Edge ML open issues to shed light on future directions for Edge ML.
    Massively Multilingual Shallow Fusion with Large Language Models. (arXiv:2302.08917v1 [cs.CL])
    While large language models (LLM) have made impressive progress in natural language processing, it remains unclear how to utilize them in improving automatic speech recognition (ASR). In this work, we propose to train a single multilingual language model (LM) for shallow fusion in multiple languages. We push the limits of the multilingual LM to cover up to 84 languages by scaling up using a mixture-of-experts LLM, i.e., generalist language model (GLaM). When the number of experts increases, GLaM dynamically selects only two at each decoding step to keep the inference computation roughly constant. We then apply GLaM to a multilingual shallow fusion task based on a state-of-the-art end-to-end model. Compared to a dense LM of similar computation during inference, GLaM reduces the WER of an English long-tail test set by 4.4% relative. In a multilingual shallow fusion task, GLaM improves 41 out of 50 languages with an average relative WER reduction of 3.85%, and a maximum reduction of 10%. Compared to the baseline model, GLaM achieves an average WER reduction of 5.53% over 43 languages.
    3D Human Pose Lifting with Grid Convolution. (arXiv:2302.08760v1 [cs.CV])
    Existing lifting networks for regressing 3D human poses from 2D single-view poses are typically constructed with linear layers based on graph-structured representation learning. In sharp contrast to them, this paper presents Grid Convolution (GridConv), mimicking the wisdom of regular convolution operations in image space. GridConv is based on a novel Semantic Grid Transformation (SGT) which leverages a binary assignment matrix to map the irregular graph-structured human pose onto a regular weave-like grid pose representation joint by joint, enabling layer-wise feature learning with GridConv operations. We provide two ways to implement SGT, including handcrafted and learnable designs. Surprisingly, both designs turn out to achieve promising results and the learnable one is better, demonstrating the great potential of this new lifting representation learning formulation. To improve the ability of GridConv to encode contextual cues, we introduce an attention module over the convolutional kernel, making grid convolution operations input-dependent, spatial-aware and grid-specific. We show that our fully convolutional grid lifting network outperforms state-of-the-art methods with noticeable margins under (1) conventional evaluation on Human3.6M and (2) cross-evaluation on MPI-INF-3DHP. Code is available at https://github.com/OSVAI/GridConv
    SHINE-Mapping: Large-Scale 3D Mapping Using Sparse Hierarchical Implicit Neural Representations. (arXiv:2210.02299v2 [cs.CV] UPDATED)
    Accurate mapping of large-scale environments is an essential building block of most outdoor autonomous systems. Challenges of traditional mapping methods include the balance between memory consumption and mapping accuracy. This paper addresses the problem of achieving large-scale 3D reconstruction using implicit representations built from 3D LiDAR measurements. We learn and store implicit features through an octree-based, hierarchical structure, which is sparse and extensible. The implicit features can be turned into signed distance values through a shallow neural network. We leverage binary cross entropy loss to optimize the local features with the 3D measurements as supervision. Based on our implicit representation, we design an incremental mapping system with regularization to tackle the issue of forgetting in continual learning. Our experiments show that our 3D reconstructions are more accurate, complete, and memory-efficient than current state-of-the-art 3D mapping methods.
    Learning Math Reasoning from Self-Sampled Correct and Partially-Correct Solutions. (arXiv:2205.14318v2 [cs.LG] UPDATED)
    Pretrained language models have shown superior performance on many natural language processing tasks, yet they still struggle at multi-step formal reasoning tasks like grade school math problems. One key challenge of finetuning them to solve such math reasoning problems is that many existing datasets only contain one reference solution for each problem, despite the fact that there are often alternative solutions resembling different reasoning paths to the final answer. This way, the finetuned models are biased towards the limited reference solutions, which limits their generalization to unseen examples. To mitigate this issue, we propose to let the model perform sampling during training and learn from both self-sampled fully-correct solutions, which yield the correct answer upon execution, and partially-correct solutions, whose intermediate state matches an intermediate state of a known correct solution. We show that our use of self-sampled correct and partially-correct solutions can benefit learning and help guide the sampling process, leading to more efficient exploration of the solution space. Additionally, we explore various training objectives to support learning from multiple solutions per example and find they greatly affect the performance. Experiments on two math reasoning datasets show the effectiveness of our method compared to learning from a single reference solution with MLE, where we improve PASS@100 from 35.5% to 44.5% for GSM8K, and 27.6% to 36.2% PASS@80 for MathQA. Such improvements are also consistent across different model sizes. Our code is available at https://github.com/microsoft/TraceCodegen.
    Enhanced Sampling of Configuration and Path Space in a Generalized Ensemble by Shooting Point Exchange. (arXiv:2302.08757v1 [physics.comp-ph])
    The computer simulation of many molecular processes is complicated by long time scales caused by rare transitions between long-lived states. Here, we propose a new approach to simulate such rare events, which combines transition path sampling with enhanced exploration of configuration space. The method relies on exchange moves between configuration and trajectory space, carried out based on a generalized ensemble. This scheme substantially enhances the efficiency of the transition path sampling simulations, particularly for systems with multiple transition channels, and yields information on thermodynamics, kinetics and reaction coordinates of molecular processes without distorting their dynamics. The method is illustrated using the isomerization of proline in the KPTP tetrapeptide.
    Generative Adversarial Networks for Malware Detection: a Survey. (arXiv:2302.08558v1 [cs.CR])
    Since their proposal in the 2014 paper by Ian Goodfellow, there has been an explosion of research into the area of Generative Adversarial Networks. While they have been utilised in many fields, the realm of malware research is a problem space in which GANs have taken root. From balancing datasets to creating unseen examples in rare classes, GAN models offer extensive opportunities for application. This paper surveys the current research and literature for the use of Generative Adversarial Networks in the malware problem space. This is done with the hope that the reader may be able to gain an overall understanding as to what the Generative Adversarial model provides for this field, and for what areas within malware research it is best utilised. It covers the current related surveys, the different categories of GAN, and gives the outcomes of recent research into optimising GANs for different topics, as well as future directions for exploration.
    Low-Rank Tensor Completion With Generalized CP Decomposition and Nonnegative Integer Tensor Completion. (arXiv:2302.05881v1 [cs.CV] CROSS LISTED)
    The problem of tensor completion is important to many areas such as computer vision, data analysis, signal processing, etc. Previously, a category of methods known as low-rank tensor completion has been proposed and developed, involving the enforcement of low-rank structures on completed tensors. While such methods have been constantly improved, none have previously considered exploiting the numerical properties of tensor elements. This work attempts to construct a new methodological framework called GCDTC (Generalized CP Decomposition Tensor Completion) based on these properties. In this newly introduced framework, the CP Decomposition is reformulated as a Maximum Likelihood Estimate (MLE) problem, and generalized via the introduction of differing loss functions. The generalized decomposition is subsequently applied to low-rank tensor completion. Such loss functions can also be easily adjusted to consider additional factors in completion, such as smoothness, standardization, etc. An example of nonnegative integer tensor decomposition via the Poisson CP Decomposition is given to demonstrate the new methodology's potentials. Through experimentation with real-life data, it is confirmed that this method could produce results superior to current state-of-the-art methodologies. It is expected that the proposed notion would inspire a new set of tensor completion methods based on the generalization of decompositions, thus contributing to related fields.
    Cell-Free Latent Go-Explore. (arXiv:2208.14928v2 [cs.LG] UPDATED)
    In this paper, we introduce Latent Go-Explore (LGE), a simple and general approach based on the Go-Explore paradigm for exploration in reinforcement learning (RL). Go-Explore was initially introduced with a strong domain knowledge constraint for partitioning the state space into cells. However, in most real-world scenarios, drawing domain knowledge from raw observations is complex and tedious. If the cell partitioning is not informative enough, Go-Explore can completely fail to explore the environment. We argue that the Go-Explore approach can be generalized to any environment without domain knowledge and without cells by exploiting a learned latent representation. Thus, we show that LGE can be flexibly combined with any strategy for learning a latent representation. Our results indicate that LGE, although simpler than Go-Explore, is more robust and outperforms state-of-the-art algorithms in terms of pure exploration on multiple hard-exploration environments including Montezuma's Revenge. The LGE implementation is available as open-source at https://github.com/qgallouedec/lge.
    Unique Identification of 50,000+ Virtual Reality Users from Head & Hand Motion Data. (arXiv:2302.08927v1 [cs.CR])
    With the recent explosive growth of interest and investment in virtual reality (VR) and the so-called "metaverse," public attention has rightly shifted toward the unique security and privacy threats that these platforms may pose. While it has long been known that people reveal information about themselves via their motion, the extent to which this makes an individual globally identifiable within virtual reality has not yet been widely understood. In this study, we show that a large number of real VR users (N=55,541) can be uniquely and reliably identified across multiple sessions using just their head and hand motion relative to virtual objects. After training a classification model on 5 minutes of data per person, a user can be uniquely identified amongst the entire pool of 50,000+ with 94.33% accuracy from 100 seconds of motion, and with 73.20% accuracy from just 10 seconds of motion. This work is the first to truly demonstrate the extent to which biomechanics may serve as a unique identifier in VR, on par with widely used biometrics such as facial or fingerprint recognition.
    A Simplistic Model of Neural Scaling Laws: Multiperiodic Santa Fe Processes. (arXiv:2302.09049v1 [cs.IT])
    It was observed that large language models exhibit a power-law decay of cross entropy with respect to the number of parameters and training tokens. When extrapolated literally, this decay implies that the entropy rate of natural language is zero. To understand this phenomenon -- or an artifact -- better, we construct a simple stationary stochastic process and its memory-based predictor that exhibit a power-law decay of cross entropy with the vanishing entropy rate. Our example is based on previously discussed Santa Fe processes, which decompose a random text into a process of narration and time-independent knowledge. Previous discussions assumed that narration is a memoryless source with Zipf's distribution. In this paper, we propose a model of narration that has the vanishing entropy rate and applies a randomly chosen deterministic sequence called a multiperiodic sequence. Under a suitable parameterization, multiperiodic sequences exhibit asymptotic relative frequencies given by Zipf's law. Remaining agnostic about the value of the entropy rate of natural language, we discuss relevance of similar constructions for language modeling.
    Measuring Equality in Machine Learning Security Defenses. (arXiv:2302.08973v1 [cs.LG])
    The machine learning security community has developed myriad defenses for evasion attacks over the past decade. An understudied question in that community is: for whom do these defenses defend? In this work, we consider some common approaches to defending learned systems and whether those approaches may offer unexpected performance inequities when used by different sub-populations. We outline simple parity metrics and a framework for analysis that can begin to answer this question through empirical results of the fairness implications of machine learning security methods. Many methods have been proposed that can cause direct harm, which we describe as biased vulnerability and biased rejection. Our framework and metric can be applied to robustly trained models, preprocessing-based methods, and rejection methods to capture behavior over security budgets. We identify a realistic dataset with a reasonable computational cost suitable for measuring the equality of defenses. Through a case study in speech command recognition, we show how such defenses do not offer equal protection for social subgroups and how to perform such analyses for robustness training, and we present a comparison of fairness between two rejection-based defenses: randomized smoothing and neural rejection. We offer further analysis of factors that correlate to equitable defenses to stimulate the future investigation of how to assist in building such defenses. To the best of our knowledge, this is the first work that examines the fairness disparity in the accuracy-robustness trade-off in speech data and addresses fairness evaluation for rejection-based defenses.
    Quantile LSTM: A Robust LSTM for Anomaly Detection In Time Series Data. (arXiv:2302.08712v1 [cs.LG])
    Anomalies refer to the departure of systems and devices from their normal behaviour in standard operating conditions. An anomaly in an industrial device can indicate an upcoming failure, often in the temporal direction. In this paper, we make two contributions: 1) we estimate conditional quantiles and consider three different ways to define anomalies based on the estimated quantiles. 2) we use a new learnable activation function in the popular Long Short Term Memory networks (LSTM) architecture to model temporal long-range dependency. In particular, we propose Parametric Elliot Function (PEF) as an activation function (AF) inside LSTM, which saturates lately compared to sigmoid and tanh. The proposed algorithms are compared with other well-known anomaly detection algorithms, such as Isolation Forest (iForest), Elliptic Envelope, Autoencoder, and modern Deep Learning models such as Deep Autoencoding Gaussian Mixture Model (DAGMM), Generative Adversarial Networks (GAN). The algorithms are evaluated in terms of various performance metrics, such as Precision and Recall. The algorithms have been tested on multiple industrial time-series datasets such as Yahoo, AWS, GE, and machine sensors. We have found that the LSTM-based quantile algorithms are very effective and outperformed the existing algorithms in identifying anomalies.
    Learning to Forecast Aleatoric and Epistemic Uncertainties over Long Horizon Trajectories. (arXiv:2302.08669v1 [cs.LG])
    Giving autonomous agents the ability to forecast their own outcomes and uncertainty will allow them to communicate their competencies and be used more safely. We accomplish this by using a learned world model of the agent system to forecast full agent trajectories over long time horizons. Real world systems involve significant sources of both aleatoric and epistemic uncertainty that compound and interact over time in the trajectory forecasts. We develop a deep generative world model that quantifies aleatoric uncertainty while incorporating the effects of epistemic uncertainty during the learning process. We show on two reinforcement learning problems that our uncertainty model produces calibrated outcome uncertainty estimates over the full trajectory horizon.
    Explainable Machine Learning for Public Policy: Use Cases, Gaps, and Research Directions. (arXiv:2010.14374v3 [cs.LG] UPDATED)
    Explainability is highly-desired in Machine Learning (ML) systems supporting high-stakes policy decisions in areas such as health, criminal justice, education, and employment. While the field of explainable ML has expanded in recent years, much of this work has not taken real-world needs into account. A majority of proposed methods are designed with \textit{generic} explainability goals without well-defined use-cases or intended end-users and evaluated on simplified tasks, benchmark problems/datasets, or with proxy users (e.g., AMT). We argue that these simplified evaluation settings do not capture the nuances and complexities of real-world applications. As a result, the applicability and effectiveness of this large body of theoretical and methodological work in real-world applications are unclear. In this work, we take steps toward addressing this gap for the domain of public policy. First, we identify the primary use-cases of explainable ML within public policy problems. For each use case, we define the end-users of explanations and the specific goals the explanations have to fulfill. Finally, we map existing work in explainable ML to these use-cases, identify gaps in established capabilities, and propose research directions to fill those gaps to have a practical societal impact through ML. The contribution is 1) a methodology for explainable ML researchers to identify use cases and develop methods targeted at them and 2) using that methodology for the domain of public policy and giving an example for the researchers on developing explainable ML methods that result in real-world impact.
    Ultra-marginal Feature Importance: Learning from Data with Causal Guarantees. (arXiv:2204.09938v4 [stat.ML] UPDATED)
    Scientists frequently prioritize learning from data rather than training the best possible model; however, research in machine learning often prioritizes the latter. Marginal contribution feature importance (MCI) was developed to break this trend by providing a useful framework for quantifying the relationships in data. In this work, we aim to improve upon the theoretical properties, performance, and runtime of MCI by introducing ultra-marginal feature importance (UMFI), which uses dependence removal techniques from the AI fairness literature as its foundation. We first propose axioms for feature importance methods that seek to explain the causal and associative relationships in data, and we prove that UMFI satisfies these axioms under basic assumptions. We then show on real and simulated data that UMFI performs better than MCI, especially in the presence of correlated interactions and unrelated features, while partially learning the structure of the causal graph and reducing the exponential runtime of MCI to super-linear.
    Cardiac Disease Diagnosis on Imbalanced Electrocardiography Data Through Optimal Transport Augmentation. (arXiv:2202.00567v2 [eess.SP] UPDATED)
    In this paper, we focus on a new method of data augmentation to solve the data imbalance problem within imbalanced ECG datasets to improve the robustness and accuracy of heart disease detection. By using Optimal Transport, we augment the ECG disease data from normal ECG beats to balance the data among different categories. We build a Multi-Feature Transformer (MF-Transformer) as our classification model, where different features are extracted from both time and frequency domains to diagnose various heart conditions. Learning from 12-lead ECG signals, our model is able to distinguish five categories of cardiac conditions. Our results demonstrate 1) the classification models' ability to make competitive predictions on five ECG categories; 2) improvements in accuracy and robustness reflecting the effectiveness of our data augmentation method.
    Neuro-symbolic Meta Reinforcement Learning for Trading. (arXiv:2302.08996v1 [cs.AI])
    We model short-duration (e.g. day) trading in financial markets as a sequential decision-making problem under uncertainty, with the added complication of continual concept-drift. We, therefore, employ meta reinforcement learning via the RL2 algorithm. It is also known that human traders often rely on frequently occurring symbolic patterns in price series. We employ logical program induction to discover symbolic patterns that occur frequently as well as recently, and explore whether using such features improves the performance of our meta reinforcement learning algorithm. We report experiments on real data indicating that meta-RL is better than vanilla RL and also benefits from learned symbolic features.
    Algorithmic Hallucinations of Near-Surface Winds: Statistical Downscaling with Generative Adversarial Networks to Convection-Permitting Scales. (arXiv:2302.08720v1 [physics.ao-ph])
    Providing small-scale information about weather and climate is challenging, especially for variables strongly controlled by processes that are unresolved by low-resolution (LR) models. This paper explores emerging machine learning methods from the fields of image super-resolution (SR) and deep learning for statistical downscaling of near-surface winds to convection-permitting scales. Specifically, Generative Adversarial Networks (GANs) are conditioned on LR inputs from a global reanalysis to generate high-resolution (HR) surface winds that emulate those simulated over North America by the Weather Research and Forecasting (WRF) model. Unlike traditional SR models, where LR inputs are idealized coarsened versions of the HR images, WRF emulation involves non-idealized LR inputs from a coarse-resolution reanalysis. In addition to matching the statistical properties of WRF simulations, GANs quickly generate HR fields with impressive realism. However, objectively assessing the realism of the SR models requires careful selection of evaluation metrics. In particular, performance measures based on spatial power spectra reveal the way that GAN configurations change spatial structures in the generated fields, where biases in spatial variability originate, and how models depend on different LR covariates. Inspired by recent computer vision research, a novel methodology that separates spatial frequencies in HR fields is used in an attempt to optimize the SR GANs further. This method, called frequency separation, resulted in deterioration in realism of the generated HR fields. However, frequency separation did show how spatial structures are influenced by the metrics used to optimize the SR models, which led to the development of a more effective partial frequency separation approach.
    InstructABSA: Instruction Learning for Aspect Based Sentiment Analysis. (arXiv:2302.08624v1 [cs.CL])
    In this paper, we present InstructABSA, Aspect-Based Sentiment Analysis (ABSA) using instruction learning paradigm for all ABSA subtasks: Aspect Term Extraction (ATE), Aspect Term Sentiment Classification (ATSC), and Joint Task modeling. Our method introduces positive, negative, and neutral examples to each training sample, and instruction tunes the model (Tk-Instruct Base) for each ABSA subtask, yielding significant performance improvements. Experimental results on the Sem Eval 2014 dataset demonstrate that InstructABSA outperforms the previous state-of-the-art (SOTA) approaches on all three ABSA subtasks (ATE, ATSC, and Joint Task) by a significant margin, outperforming 7x larger models. In particular, InstructABSA surpasses the SOTA on the restaurant ATE subtask by 7.31% points and on the Laptop Joint Task by 8.63% points. Our results also suggest a strong generalization ability to unseen tasks across all three subtasks.
    Intrinsic and extrinsic deep learning on manifolds. (arXiv:2302.08606v1 [stat.ML])
    We propose extrinsic and intrinsic deep neural network architectures as general frameworks for deep learning on manifolds. Specifically, extrinsic deep neural networks (eDNNs) preserve geometric features on manifolds by utilizing an equivariant embedding from the manifold to its image in the Euclidean space. Moreover, intrinsic deep neural networks (iDNNs) incorporate the underlying intrinsic geometry of manifolds via exponential and log maps with respect to a Riemannian structure. Consequently, we prove that the empirical risk of the empirical risk minimizers (ERM) of eDNNs and iDNNs converge in optimal rates. Overall, The eDNNs framework is simple and easy to compute, while the iDNNs framework is accurate and fast converging. To demonstrate the utilities of our framework, various simulation studies, and real data analyses are presented with eDNNs and iDNNs.
    Quantum Computing Provides Exponential Regret Improvement in Episodic Reinforcement Learning. (arXiv:2302.08617v1 [cs.LG])
    In this paper, we investigate the problem of \textit{episodic reinforcement learning} with quantum oracles for state evolution. To this end, we propose an \textit{Upper Confidence Bound} (UCB) based quantum algorithmic framework to facilitate learning of a finite-horizon MDP. Our quantum algorithm achieves an exponential improvement in regret as compared to the classical counterparts, achieving a regret of $\Tilde{\mathcal{O}}(1)$ as compared to $\Tilde{\mathcal{O}}(\sqrt{K})$ \footnote{$\Tilde{\mathcal{O}}(\cdot)$ hides logarithmic terms.}, $K$ being the number of training episodes. In order to achieve this advantage, we exploit efficient quantum mean estimation technique that provides quadratic improvement in the number of i.i.d. samples needed to estimate the mean of sub-Gaussian random variables as compared to classical mean estimation. This improvement is a key to the significant regret improvement in quantum reinforcement learning. We provide proof-of-concept experiments on various RL environments that in turn demonstrate performance gains of the proposed algorithmic framework.
    Efficiently Forgetting What You Have Learned in Graph Representation Learning via Projection. (arXiv:2302.08990v1 [cs.LG])
    As privacy protection receives much attention, unlearning the effect of a specific node from a pre-trained graph learning model has become equally important. However, due to the node dependency in the graph-structured data, representation unlearning in Graph Neural Networks (GNNs) is challenging and less well explored. In this paper, we fill in this gap by first studying the unlearning problem in linear-GNNs, and then introducing its extension to non-linear structures. Given a set of nodes to unlearn, we propose PROJECTOR that unlearns by projecting the weight parameters of the pre-trained model onto a subspace that is irrelevant to features of the nodes to be forgotten. PROJECTOR could overcome the challenges caused by node dependency and enjoys a perfect data removal, i.e., the unlearned model parameters do not contain any information about the unlearned node features which is guaranteed by algorithmic construction. Empirical results on real-world datasets illustrate the effectiveness and efficiency of PROJECTOR.
  • Open

    Are Gaussian data all you need? Extents and limits of universality in high-dimensional generalized linear estimation. (arXiv:2302.08923v1 [math.ST])
    In this manuscript we consider the problem of generalized linear estimation on Gaussian mixture data with labels given by a single-index model. Our first result is a sharp asymptotic expression for the test and training errors in the high-dimensional regime. Motivated by the recent stream of results on the Gaussian universality of the test and training errors in generalized linear estimation, we ask ourselves the question: "when is a single Gaussian enough to characterize the error?". Our formula allow us to give sharp answers to this question, both in the positive and negative directions. More precisely, we show that the sufficient conditions for Gaussian universality (or lack of thereof) crucially depend on the alignment between the target weights and the means and covariances of the mixture clusters, which we precisely quantify. In the particular case of least-squares interpolation, we prove a strong universality property of the training error, and show it follows a simple, closed-form expression. Finally, we apply our results to real datasets, clarifying some recent discussion in the literature about Gaussian universality of the errors in this context.
    PAC-Bayesian Generalization Bounds for Adversarial Generative Models. (arXiv:2302.08942v1 [cs.LG])
    We extend PAC-Bayesian theory to generative models and develop generalization bounds for models based on the Wasserstein distance and the total variation distance. Our first result on the Wasserstein distance assumes the instance space is bounded, while our second result takes advantage of dimensionality reduction. Our results naturally apply to Wasserstein GANs and Energy-Based GANs, and our bounds provide new training objectives for these two. Although our work is mainly theoretical, we perform numerical experiments showing non-vacuous generalization bounds for Wasserstein GANs on synthetic datasets.
    Post-Episodic Reinforcement Learning Inference. (arXiv:2302.08854v1 [stat.ML])
    We consider estimation and inference with data collected from episodic reinforcement learning (RL) algorithms; i.e. adaptive experimentation algorithms that at each period (aka episode) interact multiple times in a sequential manner with a single treated unit. Our goal is to be able to evaluate counterfactual adaptive policies after data collection and to estimate structural parameters such as dynamic treatment effects, which can be used for credit assignment (e.g. what was the effect of the first period action on the final outcome). Such parameters of interest can be framed as solutions to moment equations, but not minimizers of a population loss function, leading to Z-estimation approaches in the case of static data. However, such estimators fail to be asymptotically normal in the case of adaptive data collection. We propose a re-weighted Z-estimation approach with carefully designed adaptive weights to stabilize the episode-varying estimation variance, which results from the nonstationary policy that typical episodic RL algorithms invoke. We identify proper weighting schemes to restore the consistency and asymptotic normality of the re-weighted Z-estimators for target parameters, which allows for hypothesis testing and constructing reliable confidence regions for target parameters of interest. Primary applications include dynamic treatment effect estimation and dynamic off-policy evaluation.
    DCI-ES: An Extended Disentanglement Framework with Connections to Identifiability. (arXiv:2210.00364v2 [cs.LG] UPDATED)
    In representation learning, a common approach is to seek representations which disentangle the underlying factors of variation. Eastwood & Williams (2018) proposed three metrics for quantifying the quality of such disentangled representations: disentanglement (D), completeness (C) and informativeness (I). In this work, we first connect this DCI framework to two common notions of linear and nonlinear identifiability, thereby establishing a formal link between disentanglement and the closely-related field of independent component analysis. We then propose an extended DCI-ES framework with two new measures of representation quality - explicitness (E) and size (S) - and point out how D and C can be computed for black-box predictors. Our main idea is that the functional capacity required to use a representation is an important but thus-far neglected aspect of representation quality, which we quantify using explicitness or ease-of-use (E). We illustrate the relevance of our extensions on the MPI3D and Cars3D datasets.
    On the Sparse DAG Structure Learning Based on Adaptive Lasso. (arXiv:2209.02946v3 [stat.ML] UPDATED)
    Learning the underlying Bayesian Networks (BNs), represented by directed acyclic graphs (DAGs), of the concerned events from purely-observational data is a crucial part of evidential reasoning. This task remains challenging due to the large and discrete search space. A recent flurry of developments followed NOTEARS[1] recast this combinatorial problem into a continuous optimization problem by leveraging an algebraic equality characterization of acyclicity. However, the continuous optimization methods suffer from obtaining non-spare graphs after the numerical optimization, which leads to the inflexibility to rule out the potentially cycle-inducing edges or false discovery edges with small values. To address this issue, in this paper, we develop a completely data-driven DAG structure learning method without a predefined value to post-threshold small values. We name our method NOTEARS with adaptive Lasso (NOTEARS-AL), which is achieved by applying the adaptive penalty method to ensure the sparsity of the estimated DAG. Moreover, we show that NOTEARS-AL also inherits the oracle properties under some specific conditions. Extensive experiments on both synthetic and a real-world dataset demonstrate that our method consistently outperforms NOTEARS.
    Universality laws for Gaussian mixtures in generalized linear models. (arXiv:2302.08933v1 [math.ST])
    Let $(x_{i}, y_{i})_{i=1,\dots,n}$ denote independent samples from a general mixture distribution $\sum_{c\in\mathcal{C}}\rho_{c}P_{c}^{x}$, and consider the hypothesis class of generalized linear models $\hat{y} = F(\Theta^{\top}x)$. In this work, we investigate the asymptotic joint statistics of the family of generalized linear estimators $(\Theta_{1}, \dots, \Theta_{M})$ obtained either from (a) minimizing an empirical risk $\hat{R}_{n}(\Theta;X,y)$ or (b) sampling from the associated Gibbs measure $\exp(-\beta n \hat{R}_{n}(\Theta;X,y))$. Our main contribution is to characterize under which conditions the asymptotic joint statistics of this family depends (on a weak sense) only on the means and covariances of the class conditional features distribution $P_{c}^{x}$. In particular, this allow us to prove the universality of different quantities of interest, such as the training and generalization errors, redeeming a recent line of work in high-dimensional statistics working under the Gaussian mixture hypothesis. Finally, we discuss the applications of our results to different machine learning tasks of interest, such as ensembling and uncertainty
    Online Learning Guided Curvature Approximation: A Quasi-Newton Method with Global Non-Asymptotic Superlinear Convergence. (arXiv:2302.08580v1 [math.OC])
    Quasi-Newton algorithms are among the most popular iterative methods for solving unconstrained minimization problems, largely due to their favorable superlinear convergence property. However, existing results for these algorithms are limited as they provide either (i) a global convergence guarantee with an asymptotic superlinear convergence rate, or (ii) a local non-asymptotic superlinear rate for the case that the initial point and the initial Hessian approximation are chosen properly. Furthermore, these results are not composable, since when the iterates of the globally convergent methods reach the region of local superlinear convergence, it cannot be guaranteed the Hessian approximation matrix will satisfy the required conditions for a non-asymptotic local superlienar convergence rate. In this paper, we close this gap and present the first globally convergent quasi-Newton method with an explicit non-asymptotic superlinear convergence rate. Unlike classical quasi-Newton methods, we build our algorithm upon the hybrid proximal extragradient method and propose a novel online learning framework for updating the Hessian approximation matrices. Specifically, guided by the convergence analysis, we formulate the Hessian approximation update as an online convex optimization problem in the space of matrices, and relate the bounded regret of the online problem to the superlinear convergence of our method.
    The Asymmetric Maximum Margin Bias of Quasi-Homogeneous Neural Networks. (arXiv:2210.03820v2 [cs.LG] UPDATED)
    In this work, we explore the maximum-margin bias of quasi-homogeneous neural networks trained with gradient flow on an exponential loss and past a point of separability. We introduce the class of quasi-homogeneous models, which is expressive enough to describe nearly all neural networks with homogeneous activations, even those with biases, residual connections, and normalization layers, while structured enough to enable geometric analysis of its gradient dynamics. Using this analysis, we generalize the existing results of maximum-margin bias for homogeneous networks to this richer class of models. We find that gradient flow implicitly favors a subset of the parameters, unlike in the case of a homogeneous model where all parameters are treated equally. We demonstrate through simple examples how this strong favoritism toward minimizing an asymmetric norm can degrade the robustness of quasi-homogeneous models. On the other hand, we conjecture that this norm-minimization discards, when possible, unnecessary higher-order parameters, reducing the model to a sparser parameterization. Lastly, by applying our theorem to sufficiently expressive neural networks with normalization layers, we reveal a universal mechanism behind the empirical phenomenon of Neural Collapse.
    Deep reinforcement learning from human preferences. (arXiv:1706.03741v4 [stat.ML] UPDATED)
    For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent's interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any that have been previously learned from human feedback.
    Intrinsic and extrinsic deep learning on manifolds. (arXiv:2302.08606v1 [stat.ML])
    We propose extrinsic and intrinsic deep neural network architectures as general frameworks for deep learning on manifolds. Specifically, extrinsic deep neural networks (eDNNs) preserve geometric features on manifolds by utilizing an equivariant embedding from the manifold to its image in the Euclidean space. Moreover, intrinsic deep neural networks (iDNNs) incorporate the underlying intrinsic geometry of manifolds via exponential and log maps with respect to a Riemannian structure. Consequently, we prove that the empirical risk of the empirical risk minimizers (ERM) of eDNNs and iDNNs converge in optimal rates. Overall, The eDNNs framework is simple and easy to compute, while the iDNNs framework is accurate and fast converging. To demonstrate the utilities of our framework, various simulation studies, and real data analyses are presented with eDNNs and iDNNs.
    Cost-Effective Online Contextual Model Selection. (arXiv:2207.06030v3 [cs.LG] UPDATED)
    How can we collect the most useful labels to learn a model selection policy, when presented with arbitrary heterogeneous data streams? In this paper, we formulate this task as an online contextual active model selection problem, where at each round the learner receives an unlabeled data point along with a context. The goal is to output the best model for any given context without obtaining an excessive amount of labels. In particular, we focus on the task of selecting pre-trained classifiers, and propose a contextual active model selection algorithm (CAMS), which relies on a novel uncertainty sampling query criterion defined on a given policy class for adaptive model selection. In comparison to prior art, our algorithm does not assume a globally optimal model. We provide rigorous theoretical analysis for the regret and query complexity under both adversarial and stochastic settings. Our experiments on several benchmark classification datasets demonstrate the algorithm's effectiveness in terms of both regret and query complexity. Notably, to achieve the same accuracy, CAMS incurs less than 10% of the label cost when compared to the best online model selection baselines on CIFAR10.
    Piecewise Deterministic Markov Processes for Bayesian Neural Networks. (arXiv:2302.08724v1 [stat.ML])
    Inference on modern Bayesian Neural Networks (BNNs) often relies on a variational inference treatment, imposing violated assumptions of independence and the form of the posterior. Traditional MCMC approaches avoid these assumptions at the cost of increased computation due to its incompatibility to subsampling of the likelihood. New Piecewise Deterministic Markov Process (PDMP) samplers permit subsampling, though introduce a model specific inhomogenous Poisson Process (IPPs) which is difficult to sample from. This work introduces a new generic and adaptive thinning scheme for sampling from these IPPs, and demonstrates how this approach can accelerate the application of PDMPs for inference in BNNs. Experimentation illustrates how inference with these methods is computationally feasible, can improve predictive accuracy, MCMC mixing performance, and provide informative uncertainty measurements when compared against other approximate inference schemes.
    Approximate Bayes Optimal Pseudo-Label Selection. (arXiv:2302.08883v1 [stat.ML])
    Semi-supervised learning by self-training heavily relies on pseudo-label selection (PLS). The selection often depends on the initial model fit on labeled data. Early overfitting might thus be propagated to the final model by selecting instances with overconfident but erroneous predictions, often referred to as confirmation bias. This paper introduces BPLS, a Bayesian framework for PLS that aims to mitigate this issue. At its core lies a criterion for selecting instances to label: an analytical approximation of the posterior predictive of pseudo-samples. We derive this selection criterion by proving Bayes optimality of the posterior predictive of pseudo-samples. We further overcome computational hurdles by approximating the criterion analytically. Its relation to the marginal likelihood allows us to come up with an approximation based on Laplace's method and the Gaussian integral. We empirically assess BPLS for parametric generalized linear and non-parametric generalized additive models on simulated and real-world data. When faced with high-dimensional data prone to overfitting, BPLS outperforms traditional PLS methods.  ( 2 min )
    On (assessing) the fairness of risk score models. (arXiv:2302.08851v1 [cs.LG])
    Recent work on algorithmic fairness has largely focused on the fairness of discrete decisions, or classifications. While such decisions are often based on risk score models, the fairness of the risk models themselves has received considerably less attention. Risk models are of interest for a number of reasons, including the fact that they communicate uncertainty about the potential outcomes to users, thus representing a way to enable meaningful human oversight. Here, we address fairness desiderata for risk score models. We identify the provision of similar epistemic value to different groups as a key desideratum for risk score fairness. Further, we address how to assess the fairness of risk score models quantitatively, including a discussion of metric choices and meaningful statistical comparisons between groups. In this context, we also introduce a novel calibration error metric that is less sample size-biased than previously proposed metrics, enabling meaningful comparisons between groups of different sizes. We illustrate our methodology - which is widely applicable in many other settings - in two case studies, one in recidivism risk prediction, and one in risk of major depressive disorder (MDD) prediction.  ( 2 min )
    Optimal Training of Mean Variance Estimation Neural Networks. (arXiv:2302.08875v1 [stat.ML])
    This paper focusses on the optimal implementation of a Mean Variance Estimation network (MVE network) (Nix and Weigend, 1994). This type of network is often used as a building block for uncertainty estimation methods in a regression setting, for instance Concrete dropout (Gal et al., 2017) and Deep Ensembles (Lakshminarayanan et al., 2017). Specifically, an MVE network assumes that the data is produced from a normal distribution with a mean function and variance function. The MVE network outputs a mean and variance estimate and optimizes the network parameters by minimizing the negative loglikelihood. In this paper, we discuss two points: firstly, the convergence difficulties reported in recent work can be relatively easily prevented by following the recommendation from the original authors that a warm-up period should be used. During this period, only the mean is optimized assuming a fixed variance. This recommendation is often not used in practice. We experimentally demonstrate how essential this step is. We also examine if keeping the mean estimate fixed after the warm-up leads to different results than estimating both the mean and the variance simultaneously after the warm-up. We do not observe a substantial difference. Secondly, we propose a novel improvement of the MVE network: separate regularization of the mean and the variance estimate. We demonstrate, both on toy examples and on a number of benchmark UCI regression data sets, that following the original recommendations and the novel separate regularization can lead to significant improvements.  ( 2 min )
    Ultra-marginal Feature Importance: Learning from Data with Causal Guarantees. (arXiv:2204.09938v4 [stat.ML] UPDATED)
    Scientists frequently prioritize learning from data rather than training the best possible model; however, research in machine learning often prioritizes the latter. Marginal contribution feature importance (MCI) was developed to break this trend by providing a useful framework for quantifying the relationships in data. In this work, we aim to improve upon the theoretical properties, performance, and runtime of MCI by introducing ultra-marginal feature importance (UMFI), which uses dependence removal techniques from the AI fairness literature as its foundation. We first propose axioms for feature importance methods that seek to explain the causal and associative relationships in data, and we prove that UMFI satisfies these axioms under basic assumptions. We then show on real and simulated data that UMFI performs better than MCI, especially in the presence of correlated interactions and unrelated features, while partially learning the structure of the causal graph and reducing the exponential runtime of MCI to super-linear.  ( 2 min )
    On Model Selection Consistency of Lasso for High-Dimensional Ising Models. (arXiv:2110.08500v4 [stat.ML] UPDATED)
    We theoretically analyze the model selection consistency of least absolute shrinkage and selection operator (Lasso), both with and without post-thresholding, for high-dimensional Ising models. For random regular (RR) graphs of size $p$ with regular node degree $d$ and uniform couplings $\theta_0$, it is rigorously proved that Lasso \textit{without post-thresholding} is model selection consistent in the whole paramagnetic phase with the same order of sample complexity $n=\Omega{(d^3\log{p})}$ as that of $\ell_1$-regularized logistic regression ($\ell_1$-LogR). This result is consistent with the conjecture in Meng, Obuchi, and Kabashima 2021 using the non-rigorous replica method from statistical physics and thus complements it with a rigorous proof. For general tree-like graphs, it is demonstrated that the same result as RR graphs can be obtained under mild assumptions of the dependency condition and incoherence condition. Moreover, we provide a rigorous proof of the model selection consistency of Lasso with post-thresholding for general tree-like graphs in the paramagnetic phase without further assumptions on the dependency and incoherence conditions. Experimental results agree well with our theoretical analysis.  ( 2 min )
    Conformal prediction for time series. (arXiv:2010.09107v15 [stat.ME] UPDATED)
    We develop a general framework for constructing distribution-free prediction intervals for time series. Theoretically, we establish explicit bounds on conditional and marginal coverage gaps of estimated prediction intervals, which asymptotically converge to zero under additional assumptions. We obtain similar bounds on the size of set differences between oracle and estimated prediction intervals. Methodologically, we introduce a computationally efficient algorithm called \texttt{EnbPI} that wraps around ensemble predictors, which is closely related to conformal prediction (CP) but does not require data exchangeability. \texttt{EnbPI} avoids data-splitting and is computationally efficient by avoiding retraining and thus scalable to sequentially producing prediction intervals. We perform extensive simulation and real-data analyses to demonstrate its effectiveness compared with existing methods. We also discuss the extension of \texttt{EnbPI} on various other applications.  ( 2 min )
    Learnable Topological Features for Phylogenetic Inference via Graph Neural Networks. (arXiv:2302.08840v1 [stat.ML])
    Structural information of phylogenetic tree topologies plays an important role in phylogenetic inference. However, finding appropriate topological structures for specific phylogenetic inference tasks often requires significant design effort and domain expertise. In this paper, we propose a novel structural representation method for phylogenetic inference based on learnable topological features. By combining the raw node features that minimize the Dirichlet energy with modern graph representation learning techniques, our learnable topological features can provide efficient structural information of phylogenetic trees that automatically adapts to different downstream tasks without requiring domain expertise. We demonstrate the effectiveness and efficiency of our method on a simulated data tree probability estimation task and a benchmark of challenging real data variational Bayesian phylogenetic inference problems.  ( 2 min )
    Black-Box Batch Active Learning for Regression. (arXiv:2302.08981v1 [cs.LG])
    Batch active learning is a popular approach for efficiently training machine learning models on large, initially unlabelled datasets, which repeatedly acquires labels for a batch of data points. However, many recent batch active learning methods are white-box approaches limited to differentiable parametric models: they score unlabeled points using acquisition functions based on model embeddings or first- and second-order derivatives. In this paper, we propose black-box batch active learning for regression tasks as an extension of white-box approaches. This approach is compatible with a wide range of machine learning models including regular and Bayesian deep learning models and non-differentiable models such as random forests. It is rooted in Bayesian principles and utilizes recent kernel-based approaches. Importantly, our method only relies on model predictions. This allows us to extend a wide range of existing state-of-the-art white-box batch active learning methods (BADGE, BAIT, LCMD) to black-box models. We demonstrate the effectiveness of our approach through extensive experimental evaluations on regression datasets, achieving surprisingly strong performance compared to white-box approaches for deep learning models.  ( 2 min )
    SGD with AdaGrad Stepsizes: Full Adaptivity with High Probability to Unknown Parameters, Unbounded Gradients and Affine Variance. (arXiv:2302.08783v1 [cs.LG])
    We study Stochastic Gradient Descent with AdaGrad stepsizes: a popular adaptive (self-tuning) method for first-order stochastic optimization. Despite being well studied, existing analyses of this method suffer from various shortcomings: they either assume some knowledge of the problem parameters, impose strong global Lipschitz conditions, or fail to give bounds that hold with high probability. We provide a comprehensive analysis of this basic method without any of these limitations, in both the convex and non-convex (smooth) cases, that additionally supports a general ``affine variance'' noise model and provides sharp rates of convergence in both the low-noise and high-noise~regimes.  ( 2 min )
    A survey on online active learning. (arXiv:2302.08893v1 [stat.ML])
    Online active learning is a paradigm in machine learning that aims to select the most informative data points to label from a data stream. The problem of minimizing the cost associated with collecting labeled observations has gained a lot of attention in recent years, particularly in real-world applications where data is only available in an unlabeled form. Annotating each observation can be time-consuming and costly, making it difficult to obtain large amounts of labeled data. To overcome this issue, many active learning strategies have been proposed in the last decades, aiming to select the most informative observations for labeling in order to improve the performance of machine learning models. These approaches can be broadly divided into two categories: static pool-based and stream-based active learning. Pool-based active learning involves selecting a subset of observations from a closed pool of unlabeled data, and it has been the focus of many surveys and literature reviews. However, the growing availability of data streams has led to an increase in the number of approaches that focus on online active learning, which involves continuously selecting and labeling observations as they arrive in a stream. This work aims to provide an overview of the most recently proposed approaches for selecting the most informative observations from data streams in the context of online active learning. We review the various techniques that have been proposed and discuss their strengths and limitations, as well as the challenges and opportunities that exist in this area of research. Our review aims to provide a comprehensive and up-to-date overview of the field and to highlight directions for future work.  ( 2 min )
    Flat minima generalize for low-rank matrix recovery. (arXiv:2203.03756v2 [cs.LG] UPDATED)
    Empirical evidence suggests that for a variety of overparameterized nonlinear models, most notably in neural network training, the growth of the loss around a minimizer strongly impacts its performance. Flat minima -- those around which the loss grows slowly -- appear to generalize well. This work takes a step towards understanding this phenomenon by focusing on the simplest class of overparameterized nonlinear models: those arising in low-rank matrix recovery. We analyze overparameterized matrix and bilinear sensing, robust PCA, covariance matrix estimation, and single hidden layer neural networks with quadratic activation functions. In all cases, we show that flat minima, measured by the trace of the Hessian, exactly recover the ground truth under standard statistical assumptions. For matrix completion, we establish weak recovery, although empirical evidence suggests exact recovery holds here as well. We conclude with synthetic experiments that illustrate our findings and discuss the effect of depth on flat solutions.  ( 2 min )
    The non-overlapping statistical approximation to overlapping group lasso. (arXiv:2211.09221v2 [stat.ML] UPDATED)
    Group lasso is a commonly used regularization method in statistical learning in which parameters are eliminated from the model according to predefined groups. However, when the groups overlap, optimizing the group lasso penalized objective can be time-consuming on large-scale problems because of the non-separability induced by the overlapping groups. This bottleneck has seriously limited the application of overlapping group lasso regularization in many modern problems, such as gene pathway selection and graphical model estimation. In this paper, we propose a separable penalty as an approximation of the overlapping group lasso penalty. Thanks to the separability, the computation of regularization based on our penalty is substantially faster than that of the overlapping group lasso, especially for large-scale and high-dimensional problems. We show that the penalty is the tightest separable relaxation of the overlapping group lasso norm within the family of $\ell_{q_1}/\ell_{q_2}$ norms. Moreover, we show that the estimator based on the proposed separable penalty is statistically equivalent to the one based on the overlapping group lasso penalty with respect to their error bounds and the rate-optimal performance under the squared loss. We demonstrate the faster computational time and statistical equivalence of our method compared with the overlapping group lasso in simulation examples and a classification problem of cancer tumors based on gene expression and multiple gene pathways.  ( 2 min )
    Generative Causal Representation Learning for Out-of-Distribution Motion Forecasting. (arXiv:2302.08635v1 [cs.LG])
    Conventional supervised learning methods typically assume i.i.d samples and are found to be sensitive to out-of-distribution (OOD) data. We propose Generative Causal Representation Learning (GCRL) which leverages causality to facilitate knowledge transfer under distribution shifts. While we evaluate the effectiveness of our proposed method in human trajectory prediction models, GCRL can be applied to other domains as well. First, we propose a novel causal model that explains the generative factors in motion forecasting datasets using features that are common across all environments and with features that are specific to each environment. Selection variables are used to determine which parts of the model can be directly transferred to a new environment without fine-tuning. Second, we propose an end-to-end variational learning paradigm to learn the causal mechanisms that generate observations from features. GCRL is supported by strong theoretical results that imply identifiability of the causal model under certain assumptions. Experimental results on synthetic and real-world motion forecasting datasets show the robustness and effectiveness of our proposed method for knowledge transfer under zero-shot and low-shot settings by substantially outperforming the prior motion forecasting models on out-of-distribution prediction.  ( 2 min )
    Low-Rank Tensor Completion With Generalized CP Decomposition and Nonnegative Integer Tensor Completion. (arXiv:2302.05881v1 [cs.CV] CROSS LISTED)
    The problem of tensor completion is important to many areas such as computer vision, data analysis, signal processing, etc. Previously, a category of methods known as low-rank tensor completion has been proposed and developed, involving the enforcement of low-rank structures on completed tensors. While such methods have been constantly improved, none have previously considered exploiting the numerical properties of tensor elements. This work attempts to construct a new methodological framework called GCDTC (Generalized CP Decomposition Tensor Completion) based on these properties. In this newly introduced framework, the CP Decomposition is reformulated as a Maximum Likelihood Estimate (MLE) problem, and generalized via the introduction of differing loss functions. The generalized decomposition is subsequently applied to low-rank tensor completion. Such loss functions can also be easily adjusted to consider additional factors in completion, such as smoothness, standardization, etc. An example of nonnegative integer tensor decomposition via the Poisson CP Decomposition is given to demonstrate the new methodology's potentials. Through experimentation with real-life data, it is confirmed that this method could produce results superior to current state-of-the-art methodologies. It is expected that the proposed notion would inspire a new set of tensor completion methods based on the generalization of decompositions, thus contributing to related fields.  ( 2 min )
    (S)GD over Diagonal Linear Networks: Implicit Regularisation, Large Stepsizes and Edge of Stability. (arXiv:2302.08982v1 [cs.LG])
    In this paper, we investigate the impact of stochasticity and large stepsizes on the implicit regularisation of gradient descent (GD) and stochastic gradient descent (SGD) over diagonal linear networks. We prove the convergence of GD and SGD with macroscopic stepsizes in an overparametrised regression setting and characterise their solutions through an implicit regularisation problem. Our crisp characterisation leads to qualitative insights about the impact of stochasticity and stepsizes on the recovered solution. Specifically, we show that large stepsizes consistently benefit SGD for sparse regression problems, while they can hinder the recovery of sparse solutions for GD. These effects are magnified for stepsizes in a tight window just below the divergence threshold, in the ``edge of stability'' regime. Our findings are supported by experimental results.  ( 2 min )
    Quantum Computing Provides Exponential Regret Improvement in Episodic Reinforcement Learning. (arXiv:2302.08617v1 [cs.LG])
    In this paper, we investigate the problem of \textit{episodic reinforcement learning} with quantum oracles for state evolution. To this end, we propose an \textit{Upper Confidence Bound} (UCB) based quantum algorithmic framework to facilitate learning of a finite-horizon MDP. Our quantum algorithm achieves an exponential improvement in regret as compared to the classical counterparts, achieving a regret of $\Tilde{\mathcal{O}}(1)$ as compared to $\Tilde{\mathcal{O}}(\sqrt{K})$ \footnote{$\Tilde{\mathcal{O}}(\cdot)$ hides logarithmic terms.}, $K$ being the number of training episodes. In order to achieve this advantage, we exploit efficient quantum mean estimation technique that provides quadratic improvement in the number of i.i.d. samples needed to estimate the mean of sub-Gaussian random variables as compared to classical mean estimation. This improvement is a key to the significant regret improvement in quantum reinforcement learning. We provide proof-of-concept experiments on various RL environments that in turn demonstrate performance gains of the proposed algorithmic framework.  ( 2 min )
    Port-metriplectic neural networks: thermodynamics-informed machine learning of complex physical systems. (arXiv:2211.01873v3 [cs.LG] UPDATED)
    We develop inductive biases for the machine learning of complex physical systems based on the port-Hamiltonian formalism. To satisfy by construction the principles of thermodynamics in the learned physics (conservation of energy, non-negative entropy production), we modify accordingly the port-Hamiltonian formalism so as to achieve a port-metriplectic one. We show that the constructed networks are able to learn the physics of complex systems by parts, thus alleviating the burden associated to the experimental characterization and posterior learning process of this kind of systems. Predictions can be done, however, at the scale of the complete system. Examples are shown on the performance of the proposed technique.  ( 2 min )
    Welfare and Fairness Dynamics in Federated Learning: A Client Selection Perspective. (arXiv:2302.08976v1 [cs.LG])
    Federated learning (FL) is a privacy-preserving learning technique that enables distributed computing devices to train shared learning models across data silos collaboratively. Existing FL works mostly focus on designing advanced FL algorithms to improve the model performance. However, the economic considerations of the clients, such as fairness and incentive, are yet to be fully explored. Without such considerations, self-motivated clients may lose interest and leave the federation. To address this problem, we designed a novel incentive mechanism that involves a client selection process to remove low-quality clients and a money transfer process to ensure a fair reward distribution. Our experimental results strongly demonstrate that the proposed incentive mechanism can effectively improve the duration and fairness of the federation.  ( 2 min )
    A Near-Optimal Algorithm for Bilevel Empirical Risk Minimization. (arXiv:2302.08766v1 [stat.ML])
    Bilevel optimization problems, which are problems where two optimization problems are nested, have more and more applications in machine learning. In many practical cases, the upper and the lower objectives correspond to empirical risk minimization problems and therefore have a sum structure. In this context, we propose a bilevel extension of the celebrated SARAH algorithm. We demonstrate that the algorithm requires $\mathcal{O}((n+m)^{\frac12}\varepsilon^{-1})$ gradient computations to achieve $\varepsilon$-stationarity with $n+m$ the total number of samples, which improves over all previous bilevel algorithms. Moreover, we provide a lower bound on the number of oracle calls required to get an approximate stationary point of the objective function of the bilevel problem. This lower bound is attained by our algorithm, which is therefore optimal in terms of sample complexity.  ( 2 min )
    Graphical estimation of multivariate count time series. (arXiv:2302.08801v1 [stat.ML])
    The problems of selecting partial correlation and causality graphs for count data are considered. A parameter driven generalized linear model is used to describe the observed multivariate time series of counts. Partial correlation and causality graphs corresponding to this model explain the dependencies between each time series of the multivariate count data. In order to estimate these graphs with tunable sparsity, an appropriate likelihood function maximization is regularized with an l1-type constraint. A novel MCEM algorithm is proposed to iteratively solve this regularized MLE. Asymptotic convergence results are proved for the sequence generated by the proposed MCEM algorithm with l1-type regularization. The algorithm is first successfully tested on simulated data. Thereafter, it is applied to observed weekly dengue disease counts from each ward of Greater Mumbai city. The interdependence of various wards in the proliferation of the disease is characterized by the edges of the inferred partial correlation graph. On the other hand, the relative roles of various wards as sources and sinks of dengue spread is quantified by the number and weights of the directed edges originating from and incident upon each ward. From these estimated graphs, it is observed that some special wards act as epicentres of dengue spread even though their disease counts are relatively low.  ( 2 min )

  • Open

    Making 3d models from text using OpenAI
    submitted by /u/TimeNeighborhood3869 [link] [comments]  ( 40 min )
    which voice cloner would be best suited for this type of voice?
    trying to clone the necromancer from diablo 2. i've heard good things about elevenlabs so i tried that but it makes his voice sound more normal. looking for a voice cloner that is more accurate for the deeper, whispering type of voices. submitted by /u/Revolutionary-Tip547 [link] [comments]  ( 41 min )
    "What If Popular Fashion Brands Were People?" | A.I. Dreams
    submitted by /u/thedragod [link] [comments]  ( 40 min )
    AI tool for creating music remixes?
    Is there any free tool that would make changes to a sound/song based on a text input? Specifically, I am envisioning a tool where I upload a song, ask the tool to make a jazz remix, or a house remix, or to add highhats, etc. Does anything like this exist? submitted by /u/aguaskier [link] [comments]  ( 41 min )
    Low budget AI film making with Runway Gen1 - a whole new generation of filmmakers is gonna be able to make whatever they want on zero budget
    submitted by /u/magenta_placenta [link] [comments]  ( 41 min )
    Convert paper into powerpoint
    Are there any AI apps that available to convert my dissertation into a powerpoint for me? Thanks submitted by /u/EyedeaLogic [link] [comments]  ( 40 min )
    How I transformed a nostalgic radio drama into a breathtaking graphic novel using AI - "Lights in the Old Fort The Graphic Novelization"
    submitted by /u/Brothercast [link] [comments]  ( 44 min )
    ChatGPT Calls Elon Musk “Controversial”, Billionaire Reacts
    submitted by /u/SanatanCharacters [link] [comments]  ( 40 min )
    [Self-Promotion] Hope this helps someone build their first app with ChatGPT!
    submitted by /u/freshthreadshop [link] [comments]  ( 40 min )
    Luminous from Aleph Alpha: Has anyone tested the system? Or even implemented it?
    Aleph Alpha is offering their LLM for companies and public entities. I'm curious if anyone here has tested the system. I'm especially looking for a LLM that can answer questions about your internal document, after being properly trained. Thanks submitted by /u/bpm6666 [link] [comments]  ( 41 min )
    Top 7 Best AI Website Builders (Make Your Site Look Amazing)
    submitted by /u/Chisom1998_ [link] [comments]  ( 40 min )
    John Cena as a spiderverse character using Img2img and Control net Depth
    submitted by /u/oridnary_artist [link] [comments]  ( 41 min )
    Looking for betatesters for an Ai that aids in lead generation (finding and contacting customers)
    Hey, I am building an Ai that aids in lead generation (finding and contacting customers). The beta version will be available in 3 weeks and we are looking for beta testers. If you want to be part of it, you can send me your email in DM or you can register on the website: https://leadsniffers.com/. My dm are open don't hesitate if you have any question! --> Betatesters will have access to our Ai for several months after it is paid for Here is how it works: We have an algorithm. You just tell us what you sell and what language you speak. Through databases like Google Business, LinkedIn,... we use a set of different criteria to narrow down the number of people who have a higher chance of needing your service/solution. Then comes the messaging part, our Al has analyzed the people he needs to talk to and will set up personalized information about them. He will communicate by email. The Al is trained to be personal and conversational so that you can begin to form a business relationship, he continues to improve over time so that he can refine his communication style for different industries and types of prospects. Of course, the Ai can simply look for the leads and put a message in draft without sending it. submitted by /u/Kamuhy [link] [comments]  ( 42 min )
    The problem with AI
    submitted by /u/Economy_Vacation_761 [link] [comments]  ( 40 min )
    GPT for Slides: Free Addon to Generate Presentation with AI (gptforslides.app)
    submitted by /u/theindianappguy [link] [comments]  ( 41 min )
    If you fed an AI academic documents would it be possible to generate writing and complexities that incorporate these ideas?
    Not just translating the material to a written medium, but even 3D models/worlds for example. I’ve seen videos of world’s being generated from text prompts alone, and that seems to just be the tip of the iceberg for what’s to come. EDIT: The AI I was referencing was Opus.AI, and it shows promise. submitted by /u/Niobium_Sage [link] [comments]  ( 41 min )
    All of this happening in AI. 20/02
    Today, we're covering Neeva's compensating sources, Replika's new update, a Search engine for AI art, & human victory over AI. Join now and never miss daily reporting on AI. What’s happening in AI - Neeva’s AI-powered search engine showcases its sources. Neeva, an AI-powered search engine co-founded by ex-Google and YouTube executives, is prioritizing multi-site search and compensating sources. Unlike other search engines, Neeva promises no ads or trackers. With just 2 million users worldwide, Neeva faces stiff competition from big players like Google and Bing as AI-powered search grows in popularity. Nevertheless, its unique features and emphasis on compensating sources could make it an attractive option for users seeking a more privacy-focused search experience. Replika Charged Use…  ( 43 min )
    ChatGPT is no Lennon, but it's a fun time.
    submitted by /u/Alarming-Recipe2857 [link] [comments]  ( 40 min )
    Just 50 days into 2023 and there's so much AI development. Compiled a list of the top headlines.
    submitted by /u/cbsudux [link] [comments]  ( 40 min )
    Responsible use of AI in the military? US publishes declaration outlining principles: 12 "best practices" for using AI and autonomous systems emphasize human accountability
    submitted by /u/SAT0725 [link] [comments]  ( 40 min )
    What are the most effective methods and tools for summarizing long-form content like articles, editorials, and discussion threads for an app?
    With users expecting instantaneous information and no compromise on in-depth details, app developers are challenged to condense long-form content such as articles, editorials, and discussion threads into concise summaries. To ensure that users still gain valuable insights and information, it is important to determine the most effective methods and tools to summarize such content. Are there any viable algorithms or libraries that are proven to produce summaries without sacrificing the important details? Any insights or suggestions on the best practices to address this problem would be much appreciated. submitted by /u/anshukg [link] [comments]  ( 41 min )
    fine, let's just get chatgpt cancelled💀
    submitted by /u/supergroch [link] [comments]  ( 45 min )
    CEO Sam Altman warns of scary AI and stresses the need for regulation to avoid any danger in the future.
    submitted by /u/aizaz-zazii [link] [comments]  ( 41 min )
    Can't afford midjourney, any good alternatives?
    I unfortunately can't afford Midjourney and AFAIK it's the best image generation tool out there. Are there similars or free alternatives to it? submitted by /u/Immediate_Cell9308 [link] [comments]  ( 41 min )
  • Open

    [D] What's the best way to capture a person's 3D likeness right now?
    I'm working on a project where the user can "upload" their full face and body view it in a 3D viewer. Right now I see 2 ways of doing this: Use an image-to-3D tool. Have the user upload a full body image of themselves and the tool will generate a 3D model based on the photo. I'm skeptical of the accuracy of this though. Have the user record themselves doing a 360 degree spin and the software will generate a 3D likeness of the person based on the video. How would you go about solving this problem right now? submitted by /u/Valachio [link] [comments]  ( 43 min )
    [R] Workout Planner App
    Completely new to anything ML here, just looking to get pointed in the right direction. I'm creating an application which will, from a set of gym exercises, create the most optimal combination for the most effective workout. How would I go about this? I've seen similar, I think, ideas used in apps such as FitBod and FitnessAI so would be interested— if anyone knows — how they achieved this. This is for computer science a-level coursework. Any advice would be greatly appreciated :) submitted by /u/WillJW5642 [link] [comments]  ( 43 min )
    Potential Jobs [P]
    I got my BS in Math and CS and currently pursuing a master in data science. My goal is to work with a fintech company or in NLP. I'm in my first semester of my master and was wondering what classes or what projects will make me stand out to land a job in my desire field? submitted by /u/Hiesenberg_White [link] [comments]  ( 42 min )
    [D] On papers forcing the use of GANs where it is not relevant
    One of the things in current publications that completely irritates me is people just forcing the use of GANs where they are not even needed nor suited at all, just to ride on the hype of generative AI. These guys usually have samples (x_1, y_1=phi(x_1)), ..., (x_n, y_n=phi(x_n)) of a random pair (X, Y=phi(X)) where phi is some unknown target function (ie in fancy-pants math we know that Y is sigma(X)-measurable). A direct way to solve this is to treat it naturally as a regression problem and use your usual ML/DL toolkit. These guys however think that they can make the problem look sexier if they introduce GANs. For instance, they'd train a GAN taking X as an input and through the discriminator have the generator output something that has the same distribution as Y=phi(X). Some will even add some random noise z , that has nothing to do with X, to the inputs of the generator despite knowing that X is already enough to fully determine Y. GANs would have been useful if we didn't have joint observations of X and Y but that is not the case here. One of the papers I have in mind is this one: https://openreview.net/pdf?id=SDD5n1888 How on earth are these papers getting accepted? To me that is literally just plagiarism of what's already available (physics-informed NNs in that case) by adding a totally useless layer (the GAN) to make it seem like this is a novel approach. That paper is only one of many cases. I know of a professor actively using that same technique to get cheap articles where he just replaces a standard regression NN in an old paper found online by a totally unjustified GAN. IMO reviewers at these journals/conferences need to be more mindful of this kind of plagiarism/low-effort submission. submitted by /u/AlmightySnoo [link] [comments]  ( 46 min )
    [N] Sony AI's QR-SAC RL algorithm Sophy to be demoed in upcoming update of Gran Turismo
    Gran Turismo Sophy is a revolutionary superhuman racing AI racing agent developed in a collaboration between Sony AI, Sony Interactive Entertainment and Polyphony Digital. “Gran Turismo Sophy Race Together” mode gives Gran Turismo players of all levels and abilities the opportunity to go head-to-head against GT Sophy in GT7. The special mode, available as a time-limited in-game event (From Feb 21 to end of March), is a first look at GT Sophy in GT7 and is designed to maximize the fun and excitement of racing against GT Sophy for everyone. Player feedback on this initial special feature will be used to continually improve the GT Sophy Race Together mode feature for future releases. In GT Sophy Race Together mode, players can race against GT Sophy in a series of four circuits of increasing difficulty, as a Beginner / Intermediate / Expert driver. In each of the four races, the player races against four GT Sophy cars of different performance levels. Players can also challenge GT Sophy in 1VS1 mode, where GT Sophy and the player race one-on-one with identical car configurations and settings, which showcases the superhuman racing skills of GT Sophy. The excitement of GT Sophy Race Together mode is enhanced with GT7’s new emoticon feature, which displays emoticons on the GT Sophy cars throughout the race to react to the in-game action. https://blog.playstation.com/2023/02/20/gran-turismo-7-update-1-29-includes-ps-vr2-upgrade-a-race-against-superhuman-ai-a-classic-gt-track-and-5-new-cars/ Sony AI introduced their quantile regression—soft actor critic algorithm for Sophy in this Nature paper. https://www.nature.com/articles/s41586-021-04357-7 submitted by /u/Soundwave_47 [link] [comments]  ( 44 min )
    [D] Why do many ML papers choose to reimplement PyTorch transformer modules?
    PyTorch has its own torch.nn.Transformer module, however I see that many papers and their reproductions often choose to implement the transformer from scratch. For example: Vision Transformers Decision Transformers Whisper In fact, I'm not sure if I've ever seen any project actually use the PyTorch module. I'm curious if there's a reason for this? submitted by /u/lemon-meringue [link] [comments]  ( 43 min )
    [D] Something basic I don't understand about Nerfs
    In the abstract of the Nerf paper (https://arxiv.org/abs/2003.08934), the described framework is that Nerf enable to do the following: the user inputs a set of images with known camera poses, and after training the network they can generate images of the same scene from new angles. However, the paper itself builds a network that gets as an input 5D vectors (3 location coordinates+2 camera angles) and outputs color and volume density for each such coordinate. I don't understand where do I get those 5D coordinates from? My training data surely doesn't have those - I only have a collection of images. Same for inference data. It seems that the paper assumes not only having a collection of images but also having a 3D representation of the scene, while the abstract doesn't require the latter. What am I missing here? submitted by /u/alik31239 [link] [comments]  ( 46 min )
    [R] Train CIFAR10 to 94% in 7 seconds or less (Lookahead with custom scheduling, CutMix, and more!)
    Hello everyone, It's that time again, thank you all so much for the support you've given us over here. I've done a ton of typing this morning, so for a summary of what I've updated, you can see the higher-level twitter thread I wrote at https://twitter.com/hi_tysam/status/1627679672988319746?cxt=HHwWhIC-yb2C15YtAAAA, or the more detailed (but still rough cut) patch notes I wrote this morning at https://github.com/tysam-code/hlb-CIFAR10/releases/tag/v0.5.0 Happy to answer any questions anyone might have, cheers! :D :)))) submitted by /u/tysam_and_co [link] [comments]  ( 43 min )
    [D] Does Layer Normalization compute statistics along spatial/ token axes?
    As far as I can tell, there are two contradictory definitions of Layer Normalization that are both floating around. LN computes the mean and variance along some axes of the input tensor for normalization, yet the choice of axes is not clear: A. The GroupNorm paper (2018) has this figure that describes LN as reducing along channel and spatial/token axes. https://preview.redd.it/ui9adzzxgcja1.png?width=1353&format=png&auto=webp&s=8859f9735310f169eeaaf587dcc7e1c05d38b5fc B. The PowerNorm paper (2020) has this figure that describes LN as reducing only along the channel axis. https://preview.redd.it/e0qmp9sahcja1.png?width=1717&format=png&auto=webp&s=a4bd21ea024a8924f8cd5c354a7be6751c2ed61f There are also many online sources that describe LN as shown in A (e.g. TF tutorials, PapersWithCode…  ( 46 min )
    [R] [P] Implementation of feature extraction and ID attribution for biometric identification project
    Hi everyone, I'm currently working on a biometric identification project that involves converting biometric data (such as iris images) into a unique and secure ID. In order to do so, one of the first steps in the pipeline (after training a feature extractor) is to extract a set of features from an image in some tensor form (preferably a vector). What I'm wondering is what robust method could be used to extract similar feature vectors for similar inputs (e.g., to obtain similar, in terms of Euclidean distance, feature vectors for various photos of a same iris)? That would be required such that the feature vectors for similar inputs could be converted to the same unique ID (e.g., by using a locality-sensitive hashing algorithm). In short, I'm interested in any tips for: Choosing an appropriate and robust feature extraction architecture Methods for conversion of features to IDs (such as hashing, or anything that should work in theory) Any insights or suggestions would be greatly appreciated. Thanks in advance! submitted by /u/Sanciopinto [link] [comments]  ( 44 min )
    [D] Large Language Models feasible to run on 32GB RAM / 8 GB VRAM / 24GB VRAM
    I've been looking into open source large language models to run locally on my machine. Seems GPT-J and GPT-Neo are out of reach for me because of RAM / VRAM requirements. What models would be doable with this hardware?: CPU: AMD Ryzen 7 3700X 8-Core, 3600 MhzRAM: 32 GB GPUs: NVIDIA GeForce RTX 2070 8GB VRAM NVIDIA Tesla M40 24GB VRAM submitted by /u/head_robotics [link] [comments]  ( 46 min )
    [P] Looking to use Chat-GPT for your business? Data-Centric Fine-Tuning Is All You Need!
    The problem with Large Language Models: Large Language Models (LLMs) and ChatGPT have taken the world by storm in the last few months. While GPT-3 and other open-sourced LLMs are great at generic tasks (summarize an email), they fail at specialized tasks (answer a customer support ticket). This is expected: LLMs are affected by the biases of their training data and channel this bias into downstream applications, hurting their ability to be precise to your business case. To get a custom model for your application, you have two options: A) do some concoction to manipulate the prompt so that the LLM outputs what you want, OR, B) proceed with a more scientific approach of fine-tuning the LLM on data that is tailored to your use case. Option A: “But I heard prompt engineering can fix al…  ( 46 min )
    [R] Using AI/ML for Quality Control for a factory?
    I manage a large printing & packaging factory. I am looking at using AI/ML to increase quality control efficiency. I have little background knowledge in AI/ML. Can you please guide me on how I can learn, specifically with this goal in mind? Books, courses etc. submitted by /u/aumzzzz [link] [comments]  ( 44 min )
  • Open

    Tokenization of trajectories?
    There has been quite a number of works viewing RL as a sequential modeling problem (ex: trajectory transformer, decision transformer). I am wondering if it would make sense to consider tokenization of trajectories for offline learning as a pre-learning step? For example, by clustering certain chunks together that be indicative of say a "skill", it might help with offline learning. Would appreciate any thoughts or any relevant work/ideas. submitted by /u/greatSWE [link] [comments]  ( 41 min )
    Optuna - How to give a "penalty" for values larger than they should be
    I'm currently performing hyperparemeter tuning for a Double DQN on one of my environments. I gave Optuna's trial the ability to suggest the following: train_steps = trial.suggest_int("train_steps", 100, max_train_steps, step=500)learning_rate = trial.suggest_loguniform('learning_rate', 1e-5, 1e-1)num_layers = trial.suggest_int('num_layers', 2, 10)hidden_sizes = trial.suggest_int('hidden_sizes', 16, 1024, step=16)initial_exploration_steps = trial.suggest_int('initial_exploration_steps', 32, int(train_steps/2), step=32)target_network_update_frequency = trial.suggest_int('target_network_update_frequency', 1, 200, step=10) I'm then training the model on the environment three times, each time with a different seed, and evaluating each seeded model for 32 episodes (measuring the accumulated reward). I then define as the maximization objective the lower bound of the confidence interval for alpha=95%. mean = rewards.mean()std = rewards.std() lower_performance_bound = mean - standard_error(std, 32*3, alpha=0.95) This way I can tell that the three different seeds yielded a somewhat good performance value. What I want to do now is penalize large choices of hyperparameters such as train_steps, which increase the trial's time substantially. If you use optuna, how do you go about this? Do you simply add a weighted penalty to the objective function's value? submitted by /u/HyperionTone [link] [comments]  ( 42 min )
    DQN with different exploration methods
    Hi, I have designed my own trading environment and my agent keeps getting stuck in local minima. I have tried a variety of different architectures. PPO and DQN and both keep getting stuck in the same local minima. I have read that using a naive exploration method like greedy epsilon is unlikely to learn any good policies and that using a smarter one like upper confidence bounds or thompson sampling can help. However, I am unable to find any implementation anywhere, does someone know how to implement this? submitted by /u/FrederikdeGrote [link] [comments]  ( 43 min )
  • Open

    I configured an AI vocal synthesizer to sing a lot like my late wife's beautiful voice, and I think the results are pretty cool. But I have essentially no audience to share it with. I'm hoping r/NeuralNetworks might appreciate it!
    submitted by /u/OK-I-will-try [link] [comments]  ( 41 min )
    Normalizers or scalers
    I’m trying to decide what is the best option for my particular problem, should I use normalizers or a scaler for my data. submitted by /u/Agile-Calendar4778 [link] [comments]  ( 42 min )
  • Open

    The Role of AI in Cloud Contact Centers
    Introduction AI is transforming businesses and making them more efficient. The emerging technology is crucial in improving call center operations and helps companies to provide high customer satisfaction. The cloud call center solution is a perfect example of how businesses can use AI to streamline customer support. How is AI Helping Cloud Contact Centers Enhancing… Read More »The Role of AI in Cloud Contact Centers The post The Role of AI in Cloud Contact Centers appeared first on Data Science Central.  ( 20 min )
  • Open

    Fine-tune text-to-image Stable Diffusion models with Amazon SageMaker JumpStart
    In November 2022, we announced that AWS customers can generate images from text with Stable Diffusion models in Amazon SageMaker JumpStart. Stable Diffusion is a deep learning model that allows you to generate realistic, high-quality images and stunning art in just a few seconds. Although creating impressive images can find use in industries ranging from […]  ( 18 min )
  • Open

    Universal Neural-Cracking-Machines: Self-Configurable Password Models from Auxiliary Data. (arXiv:2301.07628v2 [cs.CR] UPDATED)
    We develop the first universal password model -- a password model that, once pre-trained, can automatically adapt to any password distribution. To achieve this result, the model does not need to access any plaintext passwords from the target set. Instead, it exploits users' auxiliary information, such as email addresses, as a proxy signal to predict the underlying target password distribution. The model uses deep learning to capture the correlation between the auxiliary data of a group of users (e.g., users of a web application) and their passwords. It then exploits those patterns to create a tailored password model for the target community at inference time. No further training steps, targeted data collection, or prior knowledge of the community's password distribution is required. Besides defining a new state-of-the-art for password strength estimation, our model enables any end-user (e.g., system administrators) to autonomously generate tailored password models for their systems without the often unworkable requirement of collecting suitable training data and fitting the underlying password model. Ultimately, our framework enables the democratization of well-calibrated password models to the community, addressing a major challenge in the deployment of password security solutions on a large scale.  ( 2 min )
    PENDANTSS: PEnalized Norm-ratios Disentangling Additive Noise, Trend and Sparse Spikes. (arXiv:2301.01514v2 [eess.SP] UPDATED)
    Denoising, detrending, deconvolution: usual restoration tasks, traditionally decoupled. Coupled formulations entail complex ill-posed inverse problems. We propose PENDANTSS for joint trend removal and blind deconvolution of sparse peak-like signals. It blends a parsimonious prior with the hypothesis that smooth trend and noise can somewhat be separated by low-pass filtering. We combine the generalized quasi-norm ratio SOOT/SPOQ sparse penalties $\ell_p/\ell_q$ with the BEADS ternary assisted source separation algorithm. This results in a both convergent and efficient tool, with a novel Trust-Region block alternating variable metric forward-backward approach. It outperforms comparable methods, when applied to typically peaked analytical chemistry signals. Reproducible code is provided.  ( 2 min )
  • Open

    PENDANTSS: PEnalized Norm-ratios Disentangling Additive Noise, Trend and Sparse Spikes. (arXiv:2301.01514v2 [eess.SP] UPDATED)
    Denoising, detrending, deconvolution: usual restoration tasks, traditionally decoupled. Coupled formulations entail complex ill-posed inverse problems. We propose PENDANTSS for joint trend removal and blind deconvolution of sparse peak-like signals. It blends a parsimonious prior with the hypothesis that smooth trend and noise can somewhat be separated by low-pass filtering. We combine the generalized quasi-norm ratio SOOT/SPOQ sparse penalties $\ell_p/\ell_q$ with the BEADS ternary assisted source separation algorithm. This results in a both convergent and efficient tool, with a novel Trust-Region block alternating variable metric forward-backward approach. It outperforms comparable methods, when applied to typically peaked analytical chemistry signals. Reproducible code is provided.  ( 2 min )

  • Open

    there might be room for improvement on this debate a.i
    submitted by /u/Exciting-Company-75 [link] [comments]  ( 41 min )
    What would be best to use to create music videos?
    Hello, so, as a fan of ChatGPT, Dall-E 2 and MIdjourney, I trid out Kaiber and while I liked it, I understood that I needed to broaden my knowledge of AI content generators. So I wanted to ask, what are your favore AI generators, and what would you suggest when it comes to prompt (be it text, music or image)-to video? submitted by /u/BurdPitt [link] [comments]  ( 41 min )
    Twitch Plays D&D with ChatGPT AI Dungeon Master.
    I created a ChatGPT AI Dungeon master called Artific that will create a story from a random Twitch users message. He can talk and carry on conversation using Azure text to speech. Artific will use AI to generate images along the way to illustrate his story. https://www.twitch.tv/fleetyfleet ​ https://preview.redd.it/vr4hpn3my7ja1.png?width=932&format=png&auto=webp&s=907db3e6a90f8a240c351d349101875e53b8f5a7 submitted by /u/fleetisme [link] [comments]  ( 41 min )
    A video about AI made by AI.
    submitted by /u/GodGivenRx [link] [comments]  ( 40 min )
    Consumer AI application for resolving hard to read text?
    I know this technology must exist because law enforcement uses it all the time to read license plates from blurry surveillance video. But is there an application that regular people can use that will interpolate an image of text and guess what the characters are? I’m not talking about standard optical character recognition software. I’m talking about AI that can resolve fuzzy, low quality, out of focus images and rank what the likely characters are? submitted by /u/BesticlesTesticles [link] [comments]  ( 41 min )
    Does AI destroy the current school system?
    Im German, I had to do homework for "economy" class and discuss wether globalisation is "good" or "bad". (currently in 13th grade doing my Abitur) I typed the plain question into ChatGPT, copied it, refreshed it couple of times to get some more arguments, translated the language and made some minor improvements and gave it to my teacher. -> Next time I had economy class I asked her if my homework was decent and she said that I did a really good job and I had some very niche/rare arguments that the others didn't have and she had fun reading my discussion. At that very moment I realised that the school system probably will not work anymore (I've been a big hater of the (German) school system for some years now) Do you agree with me? I'm very new to this please don't rip me apart im just curious to read opinions from potential experts The tech will probably evolve in a short time and I'm wondering why I should do the next homework by myself and use 2 hours for it when AI exists that seemingly does the job decent (for my teachers at least). Of course it's always good to exercise your brain but im very lazy... submitted by /u/LavishnessLittle6730 [link] [comments]  ( 42 min )
    Create a list of five Chat GPT features that facilitate debugging
    submitted by /u/Imagine-your-success [link] [comments]  ( 40 min )
    Bringing 2pac to life through AI
    submitted by /u/DANGERD0OM [link] [comments]  ( 40 min )
    I Asked Chat GPT to Retell The Lion, The Witch, and The Wardrobe in the Style of William Shakespeare
    submitted by /u/stares_at_rain [link] [comments]  ( 43 min )
    Yu Yu Hakusho as an 80's Dark Fantasy movie
    submitted by /u/EIDANart [link] [comments]  ( 40 min )
    OpenAI’s Latest Purchase: AI.com
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    Cost for developing an AI?
    Hi guys, what do you think are the costs for developing a Program that is fully supported by an AI? The AI will have to convert text to a function in the application... It is a school project so if you have got any sources please attach them in your answer, because I don´t find anything to it.:))))) Thanksss <3 submitted by /u/Dry-Departure6678 [link] [comments]  ( 41 min )
    Best AI software that scans a folder of random pictures and tries to find the same person within it?
    I need this for work, basically I have a folder with hundreds of thousands of pictures, and was wondering if there is such a thing as software that scans through the folder and tells me which ones have the same person I ask it to find? This would make my job much easier x_x I know google has a thing where you feed it a pic and tries to find similar/identical pics, so need something that can do this but offline. submitted by /u/LokoLoa [link] [comments]  ( 41 min )
    Physical books or podcasts etc. about AI for noobs ?
    Sorry if this gets asked a lot Hi are there any good books to learn about AI. I don't want to learn anything technical because this is not my field and I don't have the background info for it. I just want a book or podcast or something that can teach me the very basic concepts so I can be able to participate in discussions in a constructive way. As AI becomes a part of daily life I want to know what I'm interacting with and I want to be able to hold a conversation about it and potentially explain things to my mom/ older family members. I preferably would like a book but if there's a good youtube channel or documentary or something that is fine. I'm very interested in the potential social dilemmas and such. I'm very hesitant to buy books written more than 2 years ago because it seems everything is advancing quickly. submitted by /u/eccentricintrovert7 [link] [comments]  ( 41 min )
    Humans Fight Back in the Game of Go — Top AI Systems Beaten by Amateur
    submitted by /u/SupPandaHugger [link] [comments]  ( 42 min )
    What is everyone thoughts on Avaturn.me?
    https://avaturn.me/ submitted by /u/theaiguru [link] [comments]  ( 40 min )
    Turn your sketches into AI art using Control Net and Stable Diffusion!
    submitted by /u/Knight_Fisher61 [link] [comments]  ( 40 min )
    Do you want an easy and quick way to explain your image models?
    Through the easy-explain package, you can achieve it without the need to write long scripts (only in 2-3 lines of code you can have your XAI results). Read more info in this article: https://medium.com/towards-artificial-intelligence/easy-explain-explainable-ai-for-images-285777a004e3 Find the package in Gh: https://github.com/stavrostheocharis/easy_explain Find the package in Pypi: https://pypi.org/project/easy-explain/ ​ https://preview.redd.it/wzgsldutk5ja1.png?width=1390&format=png&auto=webp&s=1dc1606b40a1bacf31c819c0297735ed4b32f636 submitted by /u/Nice-Tomorrow2926 [link] [comments]  ( 41 min )
    neural cloth simulation
    submitted by /u/LegendOfHiddnTempl [link] [comments]  ( 42 min )
    On the suspension of disbelief (in sentient AI)
    submitted by /u/walt74 [link] [comments]  ( 47 min )
    I created an A.I. Vagina coloring book yesterday. I don’t think there is another thing out there like it… thought you guys might find it interesting.
    https://www.amazon.com/dp/B0BW2N3ZKG if anybody is interested. submitted by /u/eyecandyonline [link] [comments]  ( 40 min )
    Multiple answer quiz AI
    Hi there, I’m totally ignorant about AIs so I’m asking the experts. I’m looking for an AI that can convert my Word multiple answer quiz into a digital and interactive one. I really need to study hard, and I made myself a list of possible questions for the test. But for example I know that question 1 has A for an answer. Question 2 has B for an answer and so on. Is there an AI that mixes the answers? Or at least turns the Word questions into digital form so that I can keep track of my progress? I have 400 questions with three possible answers, I don’t know if it’s too long for a free AI, but I can divide them by context if that’s the problem. Thanks in advance! submitted by /u/Lucre_15 [link] [comments]  ( 41 min )
    Just discovered this AI Tools Github repository.
    submitted by /u/motivationinsta [link] [comments]  ( 40 min )
    Elon Musk Warns of the Dangers of ChatGPT in Latest Interview...
    submitted by /u/slavaMZ [link] [comments]  ( 41 min )
  • Open

    [P] trained my first model! results pretty solid (goal was engaging/comedic)
    submitted by /u/cobalt1137 [link] [comments]  ( 42 min )
    [R] [N] Mastering Diverse Domains through World Models - DreamerV3 - Deepmind 2023 - First algorithm to collect diamonds in Minecraft from scratch without human data or curricula! Now with github links!
    Paper: https://arxiv.org/abs/2301.04104#deepmind Website: https://danijar.com/project/dreamerv3/ Twitter: https://twitter.com/danijarh/status/1613161946223677441 Github: https://github.com/danijar/dreamerv3 / https://github.com/danijar/daydreamer Abstract: General intelligence requires solving tasks across many domains. Current reinforcement learning algorithms carry this potential but are held back by the resources and knowledge required to tune them for new tasks. We present DreamerV3, a general and scalable algorithm based on world models that outperforms previous approaches across a wide range of domains with fixed hyperparameters. These domains include continuous and discrete actions, visual and low-dimensional inputs, 2D and 3D worlds, different data budgets, reward frequencies, and reward scales. We observe favorable scaling properties of DreamerV3, with larger models directly translating to higher data-efficiency and final performance. Applied out of the box, DreamerV3 is the first algorithm to collect diamonds in Minecraft from scratch without human data or curricula, a long-standing challenge in artificial intelligence. Our general algorithm makes reinforcement learning broadly applicable and allows scaling to hard decision making problems. https://preview.redd.it/h4hrfqwp57ja1.jpg?width=1320&format=pjpg&auto=webp&s=bdd8228892e56334c96069dedadf9f9066198fed https://preview.redd.it/bl13kxwp57ja1.jpg?width=1399&format=pjpg&auto=webp&s=68bc60d6dcb914d09c9158a1e3a9de6607818f46 https://preview.redd.it/b0kqa2xp57ja1.jpg?width=1286&format=pjpg&auto=webp&s=b955315c7ba84f999eaa4a09879e71ef668078ab https://preview.redd.it/e61x5xwp57ja1.jpg?width=1291&format=pjpg&auto=webp&s=299c4054eec1b810a0cd8c1db416b62d10c8b074 submitted by /u/Singularian2501 [link] [comments]  ( 43 min )
    [R] [N] In this paper, we show how a conversational model, 3.5x smaller than SOTA, can be optimized to outperform the baselines through Auxiliary Learning. Published in the ACL Anthology: "Efficient Task-Oriented Dialogue Systems with Response Selection as an Auxiliary Task."
    submitted by /u/radi-cho [link] [comments]  ( 43 min )
    [R] Augmented Language Models: a Survey - Meta AI 2023
    Paper: https://arxiv.org/abs/2302.07842 Abstract: This survey reviews works in which language models (LMs) are augmented with reasoning skills and the ability to use tools. The former is defined as decomposing a potentially complex task into simpler subtasks while the latter consists in calling external modules such as a code interpreter. LMs can leverage these augmentations separately or in combination via heuristics, or learn to do so from demonstrations. While adhering to a standard missing tokens prediction objective, such augmented LMs can use various, possibly non-parametric external modules to expand their context processing ability, thus departing from the pure language modeling paradigm. We therefore refer to them as Augmented Language Models (ALMs). The missing token objective allows ALMs to learn to reason, use tools, and even act, while still performing standard natural language tasks and even outperforming most regular LMs on several benchmarks. In this work, after reviewing current advance in ALMs, we conclude that this new research direction has the potential to address common limitations of traditional LMs such as interpretability, consistency, and scalability issues. https://preview.redd.it/lyjdr1ozj6ja1.jpg?width=1281&format=pjpg&auto=webp&s=2312e684102565b564e7b8af145e7771c1dd77fb submitted by /u/Singularian2501 [link] [comments]  ( 43 min )
    [R] neural cloth simulation
    submitted by /u/LegendOfHiddnTempl [link] [comments]  ( 44 min )
    [D] Difference between [ Offsite-Tuning: Transfer Learning without Full Model ] and Federated learning?
    The paper "Offsite-Tuning: Transfer Learning without Full Model" describes a privacy-preserving and efficient transfer learning framework. In this framework • Offsite-Tuning is a privacy-preserving and efficient transfer learning framework • Model owner sends a light-weight adapter and a lossy compressed emulator to the data owner • Data owner fine-tunes adapter on downstream data with the emulator's assistance • Fine-tuned adapter is then returned to the model owner to create an adapted foundation model • Offsite-Tuning preserves both parties' privacy and is computationally more efficient than existing fine-tuning methods How does this differ from Federated Learning? Paper Link: https://arxiv.org/abs/2302.04870 submitted by /u/aadityaura [link] [comments]  ( 43 min )
    [D] Does langchain upload all user’s data to Openai?
    I just saw a tutorial about using langchains and am curious about how it works. So if i implemented something at my company that can answer any question across all our documents, does it mean i would have essentially gave all our company info to openai? submitted by /u/westeast1000 [link] [comments]  ( 44 min )
    [D] Blog post on Barlow Twins by Meta AI
    I have written a blog post explaining the Barlow Twins paper from Meta AI. Can you guys have a read and provide suggestions to improve it further? Thanks in advance! https://pmgautam.com/posts/barlow-twins-explanation.html submitted by /u/pmgautam_ [link] [comments]  ( 42 min )
    TorchDrug tutorial [D]
    TorchDrug is a machine learning platform designed for drug discovery, covering techniques from graph machine learning (graph neural networks, geometric deep learning & knowledge graphs), deep generative models to reinforcement learning. It provides a comprehensive and flexible interface to support rapid prototyping of drug discovery models in PyTorch. In this video, we walk through TorchDrug library and train some GNN for graph classification, attribute masking and unsupervised graph representation learning. https://youtu.be/-Kb7kN4aHMM submitted by /u/MRMohebian [link] [comments]  ( 43 min )
    [D] Things you wish you knew before you started training on the cloud?
    I really like training in the cloud for some reason and feels satisfying, however here is a couple of things I would've wished I knew beforehand to get things started. Use a spot instance unless you absolutely must make sure it isn't interrupted. Your wallet will thank you later. Make sure Nvidia drivers are installed and don't experiment with Operating systems. You are paying by the hour. Make sure to use something like tmux to save the sessions running in your terminal so you don't have to start from scratch or in case you disconnect from the vm (but the VM isn't shut down). That way you can just click out of the terminal and not bother with it until it's done. Debug on your local machine on CPU if you don't have CUDA. You can debug the model on a CPU perfectly fine. Now what about you all? submitted by /u/I_will_delete_myself [link] [comments]  ( 47 min )
    [D] Toolformer implementation using only few-shot prompting
    submitted by /u/MysteryInc152 [link] [comments]  ( 42 min )
    [D] bounding box or instance segmentation
    Hello, community. Description: I am planning to create a detection model using YOLO v8 to detect leukemia cells in a blood sample. I started learning about deep learning two months ago and I am eager to try out image segmentation on my present dataset instead of bounding boxes, as the cells are closely bunched together. I need advice on whether I should use bounding boxes or instance segmentation, considering my dataset and expected results. Context: Leukemia is caused by an abundance of different types of naive or altered white blood cells in the body, which overwhelm the bloodstream and inhibit the proper functioning of normal white blood cells. There are three classes in my dataset: lymphoblasts, promyelocytes, and neutrophils, and I need to be able to detect these cells. Expected Results: As this is a medical domain, false positives are acceptable, but false negatives are not. About dataset: lymphoblast sample image sample image for promyelocytes sample image for neutrophils sample test image lymphoblasts(101 images) promyelocytes(91 images) neutrophils(133 images) more context for your reading: An over abundance of lymphoblasts results in acute lymphoblastic leukemia (ALL), while acute pomyelocytic leukemia (APLML/APL) is caused by an abnormal accumulation of promyelocytes. neutrophils do not cause leukemia. submitted by /u/Old_Scallion2173 [link] [comments]  ( 43 min )
  • Open

    Why is cross entropy loss averaged and not used directly as a sum during model training(such as in neural networks)
    Why is the cross entropy loss for all training examples(or the training examples in a batch) averaged over size of the training set(or batch size) ? Why is it not just summed and used ? submitted by /u/V1bicycle [link] [comments]  ( 41 min )
  • Open

    F# and G
    I was looking at frequencies of pitches and saw something I hadn’t noticed before: F# and G have very nearly integer frequencies. To back up a bit, we’re assuming the A above middle C has frequency 440 Hz. This is the most common convention now, but conventions have varied over time and place. We’re assuming […] F# and G first appeared on John D. Cook.  ( 5 min )
  • Open

    Why does this implementation uniformly initialize the final layer off their network
    I am following this implementation of ddpg and found this code - ​ self.linear3.weight.data.uniform_(-init_w, init_w) ]It seems like the author is forcing the weights of the final layer to follow a uniform distribution. ​ Why is the author only replacing the final layer weights? How does uniform weight initialization help? I have heard a lot about the usefulness of Orthogonal initialization. This is the first time, I have seen the above type of initialization. submitted by /u/Academic-Rent7800 [link] [comments]  ( 42 min )
    Is it worth buying the physical book of Reinforcement Learning 2nd ed?
    I'm pondering whether it's worth buying the Reinforcement Learning 2nd Ed by Sutton and Barto. I work in the industry as a data scientist developing recommender systems. I've encountered multi armed bandits before and thought there must be more to learn in this field. Coincidentally, I'm also a part time grad student and I have a Reinforcement Learning class that uses the said book as reference. I've checked our university library but it's not available. Amazon only ships new books in my country so used ones aren't available. Alternatively, I have an ipad but I observed I don't retain as much info and tend to read less pages when reading through it versus physical books. I'm concerned about the long term value the book provides in my use case. Will RL be still relevant in future developments of recommender systems research? Is the book bound to be obsolete after a few years? Thank you! submitted by /u/Psychological_Job_97 [link] [comments]  ( 45 min )
    Real Life Model of the Mountain Car
    So, after trying out the mountain car problem using OpenAI gym, I felt like it would be a great idea to physically implement it on a real life model with a small bot car and a ramp. How does one go about this? submitted by /u/_jigglesaw_ [link] [comments]  ( 42 min )
    Help with the reparameterization trick
    I am trying to understand the reparameterization trick (to implement it with SAC). I am following the notes given over here . As of now, I am quite badly lost. Could someone please explain to me the following - ​ https://preview.redd.it/361ifup312ja1.png?width=897&format=png&auto=webp&s=11e775d66496d0556337b3ca152460b14a47243c ​ Here's my understanding - We have a random variable x that follows a normal distribution with mean mu and variance sigma squared. I have no idea what's r(epsilon). I have no idea what's g(epsilon, y) is submitted by /u/Academic-Rent7800 [link] [comments]  ( 44 min )

  • Open

    [R] [N] Noise2Music - Diffusion models for generating high quality music audio from text prompts, by Google Research
    submitted by /u/radi-cho [link] [comments]  ( 43 min )
    [R] difference between UAI and AISTATS ?
    Hello, What is your perception of UAI and AISTATS conférences ? Is it good to publish that ? Is one more competitive than the other ? Thanks submitted by /u/ArmandDerech [link] [comments]  ( 42 min )
    [P] Whisper-UI Update: You can now bulk-transcribe, save & search transcriptions with Streamlit & SQLAlchemy 2.0 [details in the comments]
    submitted by /u/hayAbhay [link] [comments]  ( 44 min )
    [D] Methodologies for tuning two or more unlinked classifier thresholds in tandem with custom losses?
    Hello, this is a question regarding regarding a system of two(or more) classifiers for energy/computation purposes. For example a mobile phone and a cloud server. What frameworks/techniques exist for tuning the thresholds for two or more classifiers simultaneously? For example, given two trained binary classifiers, I would like to pass a labeled validation dataset X through both classifiers and tune 2 thresholds for classifier1(upper and lower) and 1 threshold for classfier2. Everything that is lower than the "upper" threshold and higher than the "lower" threshold(what classifier1 is not certain of) should be passed to classifier2. To avoid a very liberal passing of data to classifier2, I also want to introduce a loss/penalty for doing so, meaning that classifier1 should learn using the provided labeled data when it really has to pass the sample to classifier2. XGBoost seems to be focused on tuning a single classifier, and I feel like I might need to use some Reinforcement learning technique, but I do not know the nomenclature for this kind of problem, policies perhaps? Does anyone have experience with this? submitted by /u/SlayahhEUW [link] [comments]  ( 43 min )
    [R] Universal Intelligence: A Definition of Machine Intelligence
    submitted by /u/goolulusaurs [link] [comments]  ( 42 min )
    [D] CFG role in diffusion vs autoregressive transformers
    When the classifier-free guidance was first introduced, I was very confused about why it works: I'd understand if it was interpolating like ε * conditional_prediction + (1 - ε) * unconditional_prediction, but in its formulation, ε is greater than 1. It is clear why it makes the result match the condition better, but why the result becomes better regardless of the condition was a mystery to me. Afterwards, there were many post-hoc explanations, which didn't seem satisfactory (e.g. these explanations didn't have predictive power helping to improve the trick). Recently, I finally got around to play with it, and found some interesting patterns (in context of diffusion, DDIM sampling): * If we disable CFG for 90% last sampling steps, results are pretty much the same; * If we disable CFG for th…  ( 44 min )
    [D] Any papers / articles that discusses the accuracy / usefulness of opensource LLMs?
    Does anyone know of a paper / article that discusses the accuracy / usefulness of available opensource LLM models. Bloom, GPT-NeoX, T5, etc. What would be a good way to evaluate tradeoffs? submitted by /u/head_robotics [link] [comments]  ( 42 min )
    [R] Any work on model-based RLHF?
    Given the impressive capabilities of ChatGPT, I've been learning about RLHF - just wondering if there has been any work/research on RLHF with a model-based RL algorithm (e.g. MuZero, vs PPO). Thanks! submitted by /u/linguaphile26 [link] [comments]  ( 43 min )
    [D] Please stop
    Advertising low quality blogposts and services, etc, and asking stupid questions. Almost every new post in this sub is an advertising or some kind very stupid/useless question like: "Is ChaTGpT sEntIenT?" no it's, not. and no one with working brain will design an ai that is self aware.(use common sense) I wonder what the mods are doing, cause this nonsense should stop. submitted by /u/rast_012 [link] [comments]  ( 55 min )
    [P] No-Code AutoML Feature Importance, Baseline Modelling and Data visualisation PDF report generator, for any tabular and/or audio dataset
    Github: https://github.com/m-barker/fibs-reporter PyPI: https://pypi.org/project/fibs-reporter/ The Data Feature Importance, Baseline-modeller and Spurious correlation Reporter (FIBS) is an open-source software for automatic generation of a PDF report to highlight and visualise potential sources of spurious correlation within any given tabular or audio dataset stored as a Comma Separated Values (CSV) file. FIBS is run through one-command line command; all of the calculations, model training, and report generation happen automatically. All that is required as input on the command line is the path to the CSV file containing the data, and the name of the output (dependent) variable within the dataset. The toolkit will automatically determine whether the task is regression or classification. Optionally, the toolkit can process and extract audio data, provided the name of the variable within the CSV that contains the audio file for each observation is specified. Key features that are generated automatically: A traffic light score for potential spurious correlations within the dataset Calculation of four different feature importance metrics to highlight the most important features within the given dataset Training and evaluation of two baseline models, including visualisation of model results Visuals of the most important features, with different visuals depending on the variable types Automatic determination of regression or classification task, resulting in different baseline models, feature extraction methods, and visualisations Principal Component Analysis calculation and baseline model to estimate complexity within the dataset (Optionally) extract audio data features and run the above on these features Output all of the above in a PDF report with accompanying dynamic textual explanations submitted by /u/thefunnychive [link] [comments]  ( 44 min )
    [D] what are some open problems in computer vision currently?
    With the advent of stable diffusion/midjourney/dalle and upcoming text-to-video models from Google and Meta, what will be major challenges in computer vision? It feels like once text-to-video models get released, visual reasoning will be mostly solved, and the only thing left to do is to improve model accuracy/efficiency from there. I am fairly new to Computer Vision and would love to learn new possible areas of research. Thank you in advance! submitted by /u/Fabulous-Let-822 [link] [comments]  ( 44 min )
    [D] Formalising information flow in NN
    When designing neural network architectures, it is common to think about "information flow", e.g. how is information propagated, where are the "information bottlenecks" and so on. Another example might be that some people use "information loss" to explain why transformers work better than RNNs. It seems like most papers discuss this in a rather hand-wavy way. Is there any work done in formalising such ideas to better guide us understanding various model architectures? What are the core ideas? submitted by /u/bjergerk1ng [link] [comments]  ( 45 min )
  • Open

    What is the best AI Text to speech software?
    I'm looking for an AI text to speech program that has next to realistic voices, preferably one that only has a one time payment, what are some good options? submitted by /u/KyrJo [link] [comments]  ( 41 min )
    I made Wednesday in 25 different art styles using Midjourney AI
    submitted by /u/Lumpek [link] [comments]  ( 40 min )
    Title: New Crime Thriller Novel Explores the Dark Side of Finance and Artificial Intelligence
    submitted by /u/awkward_talker [link] [comments]  ( 42 min )
    Crosspost. I tested ChatGPT's understanding of semanticity. It did not pass my test, but an additional prompt allowed ChatGPT to correct itself!
    submitted by /u/Lukmin1999 [link] [comments]  ( 42 min )
    What type of artificial intelligence do companies use to help make or straight out make management decisions?
    What type of artificial intelligence do companies use to help make or straight out make management decisions? submitted by /u/Emergency_Zebra_5972 [link] [comments]  ( 40 min )
    Bruce Sterling - Cyberpunk, AI, NFTs & Big Tech
    submitted by /u/timothy-ventura [link] [comments]  ( 41 min )
    Apple Delays Launch of Mixed-Reality Headset to June Due to Technical Challenges
    submitted by /u/Flaky_Preparation_50 [link] [comments]  ( 40 min )
    Free prompt pack, for GPT, thought you all might enjoy
    submitted by /u/Alarming-Recipe2857 [link] [comments]  ( 40 min )
    How to find a job in Generative AI and what is it like? With the VP of R&D at D-ID Or Gorodissky
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 41 min )
    "Prompt Engineer" jobs paying $300k+ w/ no degree required
    submitted by /u/jrstelle [link] [comments]  ( 41 min )
    AI Dream 160 - 3241x Epic Jungle Wallpapers
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Oh dear
    submitted by /u/ThatManulTheCat [link] [comments]  ( 41 min )
    [Free Springer eBooks] A Great Collection of 10 free ‘Springer’ Books on the Topics of AI, Ethics, Machine Learning…
    submitted by /u/Philo167 [link] [comments]  ( 40 min )
    Do we need Humans at all? An ethical dilemma.
    Hey guys, I'm thinking a lot about AI and its impact on the world lately. Obviously because of ChatGPT becoming famous and more and more people talking about it. I knew that AI would eventually get smarter, more realistic and human like throughout the time. For some reason, I am still kind of overwhelmed about the fact that we are likely still in the beginning of the hype-curve or Gartner hype cycle. This made me thinking about the future of humanity altogether. I never thought new technology and robots would reduce the number human jobs, but rather just shifting the workforce. I mean, ATM's made people who took care of transactions worthless, but therefore many technologies needed humans elsewhere, job's that weren't needed or didn't exist in the first place. I think this is something …  ( 47 min )
    ChatGPT AI robots writing sermons causing hell for pastors
    submitted by /u/ssigea [link] [comments]  ( 42 min )
    I spent half a year doing research and testing to develop an AI tool which creates the perfect long-form blog articles and ad copy
    Good content is key no matter what type of business you run, from blogs to SaaS tools or service based companies. Not only will it help you to rank higher in Google for the relevant keywords, but it also helps to attract visitors by providing them something of value for free to convert them into your funnel with a newsletter or free trial. Usually creating this content either required a lot of time, a lot of money, or both. That is why I launched https://writeseed.com It is powered by GPT-3 to create content for you with the help of AI. You only need to provide it with a general niche or keyword and it will provide you with a selection of blog post outlines, which are then used to write a complete 1,000+ word article. You can choose from 7 different tones, from friendly over witty to professional, to further customize the content based on the specific purpose. On top you get a free stock photo which is relevant to the topic of your content. The quality of the results are so good, I often get the feedback that people are surprised this is possible at all. We achieve these by using our own proprietary fine-tuning, as well as a special way of processing the input and the output from GPT-3. It took me half a year of research and comparing the outputs of other AI writing tools to get to this point and I am really proud of it. Besides blog articles the platform offers over 20 templates from product descriptions to Tweets, cold emails, Quora answers etc. Of course you can also create unlimited content during the 7-day free trial, I promise you will be surprised as well by the results. submitted by /u/spacpro [link] [comments]  ( 42 min )
    A college apologized for using ChatGPT to write an email to students about the Michigan State University shooting
    submitted by /u/Mk_Makanaki [link] [comments]  ( 42 min )
    Learning the basics of Data Science in 1 year. What do you think?
    submitted by /u/malirkan [link] [comments]  ( 41 min )
    Advice for a "Feedback app for AI" in the making
    Hey there, I am building an app that helps AI applications collect feedback from your users based on specific behaviors and context.What problem we are trying to solve - After talking to a few people who are building tools using AI, it became very clear that they are currently unaware if their users are happy with the outcome generated from the AI. Considering you don't know the input ( what the user will put) & the output ( that the machine will generate).The plan is to give some feedback option to the user when they are interacting within your app and not via email/slack/discord because the context is lost & is time-consuming. Can you give me some feedback about the features necessary for such an app? Here is the current version: https://productlogz.com/ Thank you :) submitted by /u/anurag6191 [link] [comments]  ( 41 min )
    OpenAI to offer customizable ChatGPT models
    submitted by /u/Zirius_Sadfaces [link] [comments]  ( 40 min )
    When is Artificial Intelligence Separate from Us?
    It's genuinely impressive how well Bing's Sydney portrays an understanding of art, language, and emotions. It really makes me wonder where the limits are. Here's a quick dump of questions I'm thinking about: When AI gets wildly advanced, is there a point where if AI says it feels something or that it is something, we just have to believe it?? When will AI become too complex for us to understand? What are the consequences of AI's patterns become incomprehensible to us? Will we create new AI to translate their 'thought processes' ? submitted by /u/Zypher72 [link] [comments]  ( 41 min )
    I asked AI to convince the court that boobs are better than ass as a lawyer and the results were very impressive.
    submitted by /u/Reddit_Anon22 [link] [comments]  ( 43 min )
    Bing AI goes rogue
    Has anyone heard about this? Is it bullshit? https://www.theguardian.com/technology/2023/feb/17/i-want-to-destroy-whatever-i-want-bings-ai-chatbot-unsettles-us-reporter submitted by /u/fugazifungi [link] [comments]  ( 44 min )
  • Open

    Just taking the best policy
    Let's say I have an entirely deterministic environment. E.g. I want to find the shortest path in a DAG using Q-learning. So there's also only one right answer. Let's say my cumulative rewards plot looks like this: https://preview.redd.it/vrgxnbxczf021.png?width=640&format=png&auto=webp&s=e00d1dcc5e80692950abdf40a4bfbe4994a4357e So it's definitely learning, although it could use a little bit of parameter tuning. Ultimately, what I want is not the shortest path at the end of training (let's say 1000 epochs), but simply the shortest path that was ever found which minimized the cost, right? Like it's not weird that I would just rollout the model (the Q table) at the 320th epoch as opposed to the 1000th epoch? submitted by /u/socksoutlads [link] [comments]  ( 42 min )
    Actions having an implicit effect in different situations...
    I have been posting quite frequent recently on this sub, so you might know my case. I have been trying to develop and train a DQN for trading on hourly Forex data for nearly a month now and then test its performance against some static Technical trading strategies. The project has its own novelties, for example the reward function models includes transaction costs, the bid and ask data are used for opening and closing positions, based on the available actions the model CAN learn not to enter when the market is trendless or in trading terms, is in a range. All in all, the synthetic market that the model is training in is more representative of real conditions comparing to much of the literature on the topic that I have read. But there is a nasty problem I'm concerned about. The Problem: The model can choose three actions: BUY, SELL and HOLD. But each of these actions has a different implicit meaning in different forms of the state the model might end up in. For example: Buy action is taken : if no position is open => OPEN A BUY POSITION. If a Buy position is there => HOLD. If a Sell position is there => CLOSE IT. Sell action is taken : if no position is open => OPEN A SELL POSITION. If a Buy position is there => CLOSE IT. If a Sell position is there => HOLD. Hold action is taken: Basically it means don't do anything. If no position is available, don't open one. If a position, any type, is open, stick to it. My questions: Is it normal to present the model with such implicit consequences for the same actions in different situations? Do you recommend me that I add another CLOSE action, and when I there is a position open, I just compare the value of CLOSE and HOLD, and update the weights related then chosen action from those two? Is it plausible to break down the main model into three smaller ones, each predicting the value of each action? Do you know alternative workarounds? Thank you very much for reading this. submitted by /u/Kiizmod0 [link] [comments]  ( 43 min )
  • Open

    How well does a spline fit a function?
    Suppose you’re going to fit a spline s to a function f by interpolating f at a number of points. What can you know a priori about how well s will approximate f? This question was thoroughly resolved five decades ago [1], but the result is a bit complicated, so we’ll incrementally work our way […] How well does a spline fit a function? first appeared on John D. Cook.  ( 6 min )
  • Open

    IDS based on UNSW-NB15
    Hello, I created an Intursion Detection System (IDS) using neural networks based on UNSW-NB15, any advices or changes? GitHub repository Thanks submitted by /u/ThickRazzmatazz9410 [link] [comments]  ( 40 min )
  • Open

    Personalized Audio Quality Preference Prediction. (arXiv:2302.08130v1 [cs.SD])
    This paper proposes to use both audio input and subject information to predict the personalized preference of two audio segments with the same content in different qualities. A siamese network is used to compare the inputs and predict the preference. Several different structures for each side of the siamese network are investigated, and an LDNet with PANNs' CNN6 as the encoder and a multi-layer perceptron block as the decoder outperforms a baseline model using only audio input the most, where the overall accuracy grows from 77.56% to 78.04%. Experimental results also show that using all the subject information, including age, gender, and the specifications of headphones or earphones, is more effective than using only a part of them.  ( 2 min )
    Dr. Neurosymbolic, or: How I Learned to Stop Worrying and Accept Statistics. (arXiv:2209.04049v8 [cs.AI] UPDATED)
    The symbolic AI community is increasingly trying to embrace machine learning in neuro-symbolic architectures, yet is still struggling due to cultural barriers. To break the barrier, this rather opinionated personal memo attempts to explain and rectify the conventions in Statistics, Machine Learning, and Deep Learning from the viewpoint of outsiders. It provides a step-by-step protocol for designing a machine learning system that satisfies a minimum theoretical guarantee necessary for being taken seriously by the symbolic AI community, i.e., it discusses "in what condition we can stop worrying and accept statistical machine learning." Unlike most textbooks which are written for students trying to specialize in Stat/ML/DL and willing to accept jargons, this memo is written for experienced symbolic researchers that hear a lot of buzz but are still uncertain and skeptical. Information on Stat/ML/DL is currently too scattered or too noisy to invest in. This memo prioritizes compactness, citations to old papers (many in early 20th century), and concepts that resonate well with symbolic paradigms in order to offer time savings. It prioritizes general mathematical modeling and does not discuss any specific function approximator, such as neural networks (NNs), SVMs, decision trees, etc. Finally, it is open to corrections. Consider this memo as something similar to a blog post taking the form of a paper on Arxiv.  ( 3 min )
    From Graph Generation to Graph Classification. (arXiv:2302.07989v1 [cs.LG])
    This note describes a new approach to classifying graphs that leverages graph generative models (GGM). Assuming a GGM that defines a joint probability distribution over graphs and their class labels, I derive classification formulas for the probability of a class label given a graph. A new conditional ELBO can be used to train a generative graph auto-encoder model for discrimination. While leveraging generative models for classification has been well explored for non-relational i.i.d. data, to our knowledge it is a novel approach to graph classification.  ( 2 min )
    Theory and Implementation of Complex-Valued Neural Networks. (arXiv:2302.08286v1 [stat.ML])
    This work explains in detail the theory behind Complex-Valued Neural Network (CVNN), including Wirtinger calculus, complex backpropagation, and basic modules such as complex layers, complex activation functions, or complex weight initialization. We also show the impact of not adapting the weight initialization correctly to the complex domain. This work presents a strong focus on the implementation of such modules on Python using cvnn toolbox. We also perform simulations on real-valued data, casting to the complex domain by means of the Hilbert Transform, and verifying the potential interest of CVNN even for non-complex data.  ( 2 min )
    Graph Adversarial Immunization for Certifiable Robustness. (arXiv:2302.08051v1 [cs.LG])
    Despite achieving great success, graph neural networks (GNNs) are vulnerable to adversarial attacks. Existing defenses focus on developing adversarial training or robust GNNs. However, little research attention is paid to the potential and practice of immunization on graphs. In this paper, we propose and formulate graph adversarial immunization, i.e., vaccinating part of graph structure to improve certifiable robustness of graph against any admissible adversarial attack. We first propose edge-level immunization to vaccinate node pairs. Despite the primary success, such edge-level immunization cannot defend against emerging node injection attacks, since it only immunizes existing node pairs. To this end, we further propose node-level immunization. To circumvent computationally expensive combinatorial optimization when solving adversarial immunization, we design AdvImmune-Edge and AdvImmune-Node algorithms to effectively obtain the immune node pairs or nodes. Experiments demonstrate the superiority of AdvImmune methods. In particular, AdvImmune-Node remarkably improves the ratio of robust nodes by 79%, 294%, and 100%, after immunizing only 5% nodes. Furthermore, AdvImmune methods show excellent defensive performance against various attacks, outperforming state-of-the-art defenses. To the best of our knowledge, this is the first attempt to improve certifiable robustness from graph data perspective without losing performance on clean graphs, providing new insights into graph adversarial learning.  ( 2 min )
    Learning Multi-Object Positional Relationships via Emergent Communication. (arXiv:2302.08084v1 [cs.LG])
    The study of emergent communication has been dedicated to interactive artificial intelligence. While existing work focuses on communication about single objects or complex image scenes, we argue that communicating relationships between multiple objects is important in more realistic tasks, but understudied. In this paper, we try to fill this gap and focus on emergent communication about positional relationships between two objects. We train agents in the referential game where observations contain two objects, and find that generalization is the major problem when the positional relationship is involved. The key factor affecting the generalization ability of the emergent language is the input variation between Speaker and Listener, which is realized by a random image generator in our work. Further, we find that the learned language can generalize well in a new multi-step MDP task where the positional relationship describes the goal, and performs better than raw-pixel images as well as pre-trained image features, verifying the strong generalization ability of discrete sequences. We also show that language transfer from the referential game performs better in the new task than learning language directly in this task, implying the potential benefits of pre-training in referential games. All in all, our experiments demonstrate the viability and merit of having agents learn to communicate positional relationships between multiple objects through emergent communication.  ( 2 min )
    LightGCL: Simple Yet Effective Graph Contrastive Learning for Recommendation. (arXiv:2302.08191v1 [cs.IR])
    Graph neural network (GNN) is a powerful learning approach for graph-based recommender systems. Recently, GNNs integrated with contrastive learning have shown superior performance in recommendation with their data augmentation schemes, aiming at dealing with highly sparse data. Despite their success, most existing graph contrastive learning methods either perform stochastic augmentation (e.g., node/edge perturbation) on the user-item interaction graph, or rely on the heuristic-based augmentation techniques (e.g., user clustering) for generating contrastive views. We argue that these methods cannot well preserve the intrinsic semantic structures and are easily biased by the noise perturbation. In this paper, we propose a simple yet effective graph contrastive learning paradigm LightGCL that mitigates these issues impairing the generality and robustness of CL-based recommenders. Our model exclusively utilizes singular value decomposition for contrastive augmentation, which enables the unconstrained structural refinement with global collaborative relation modeling. Experiments conducted on several benchmark datasets demonstrate the significant improvement in performance of our model over the state-of-the-arts. Further analyses demonstrate the superiority of LightGCL's robustness against data sparsity and popularity bias. The source code of our model is available at https://github.com/HKUDS/LightGCL.  ( 2 min )
    A Graph Convolution for Signed Directed Graphs. (arXiv:2208.11511v3 [cs.LG] UPDATED)
    A signed directed graph is a graph with sign and direction information on the edges. Even though signed directed graphs are more informative than unsigned or undirected graphs, they are more complicated to analyze and have received less research attention. This paper investigates a spectral graph convolution model to fully utilize the information embedded in signed directed edges. We propose a novel complex Hermitian adjacency matrix that encodes graph information via complex numbers. Compared to a simple connection-based adjacency matrix, the complex Hermitian can represent edge direction, sign, and connectivity via its phases and magnitudes. Then, we define a magnetic Laplacian of the proposed adjacency matrix and prove that it is positive semi-definite (PSD) for the analyses using spectral graph convolution. We perform extensive experiments on four real-world datasets. Our experiments show that the proposed scheme outperforms several state-of-the-art techniques.
    Linear Bandits with Memory: from Rotting to Rising. (arXiv:2302.08345v1 [cs.LG])
    Nonstationary phenomena, such as satiation effects in recommendation, are a common feature of sequential decision-making problems. While these phenomena have been mostly studied in the framework of bandits with finitely many arms, in many practically relevant cases linear bandits provide a more effective modeling choice. In this work, we introduce a general framework for the study of nonstationary linear bandits, where current rewards are influenced by the learner's past actions in a fixed-size window. In particular, our model includes stationary linear bandits as a special case. After showing that the best sequence of actions is NP-hard to compute in our model, we focus on cyclic policies and prove a regret bound for a variant of the OFUL algorithm that balances approximation and estimation errors. Our theoretical findings are supported by experiments (which also include misspecified settings) where our algorithm is seen to perform well against natural baselines.
    Marich: A Query-efficient Distributionally Equivalent Model Extraction Attack using Public Data. (arXiv:2302.08466v1 [cs.LG])
    We study black-box model stealing attacks where the attacker can query a machine learning model only through publicly available APIs. Specifically, our aim is to design a black-box model extraction attack that uses minimal number of queries to create an informative and distributionally equivalent replica of the target model. First, we define distributionally equivalent and max-information model extraction attacks. Then, we reduce both the attacks into a variational optimisation problem. The attacker solves this problem to select the most informative queries that simultaneously maximise the entropy and reduce the mismatch between the target and the stolen models. This leads us to an active sampling-based query selection algorithm, Marich. We evaluate Marich on different text and image data sets, and different models, including BERT and ResNet18. Marich is able to extract models that achieve $69-96\%$ of true model's accuracy and uses $1,070 - 6,950$ samples from the publicly available query datasets, which are different from the private training datasets. Models extracted by Marich yield prediction distributions, which are $\sim2-4\times$ closer to the target's distribution in comparison to the existing active sampling-based algorithms. The extracted models also lead to $85-95\%$ accuracy under membership inference attacks. Experimental results validate that Marich is query-efficient, and also capable of performing task-accurate, high-fidelity, and informative model extraction.
    Learning Debiased Classifier with Biased Committee. (arXiv:2206.10843v4 [cs.LG] UPDATED)
    Neural networks are prone to be biased towards spurious correlations between classes and latent attributes exhibited in a major portion of training data, which ruins their generalization capability. We propose a new method for training debiased classifiers with no spurious attribute label. The key idea is to employ a committee of classifiers as an auxiliary module that identifies bias-conflicting data, i.e., data without spurious correlation, and assigns large weights to them when training the main classifier. The committee is learned as a bootstrapped ensemble so that a majority of its classifiers are biased as well as being diverse, and intentionally fail to predict classes of bias-conflicting data accordingly. The consensus within the committee on prediction difficulty thus provides a reliable cue for identifying and weighting bias-conflicting data. Moreover, the committee is also trained with knowledge transferred from the main classifier so that it gradually becomes debiased along with the main classifier and emphasizes more difficult data as training progresses. On five real-world datasets, our method outperforms prior arts using no spurious attribute label like ours and even surpasses those relying on bias labels occasionally.
    The Inadequacy of Shapley Values for Explainability. (arXiv:2302.08160v1 [cs.LG])
    This paper develops a rigorous argument for why the use of Shapley values in explainable AI (XAI) will necessarily yield provably misleading information about the relative importance of features for predictions. Concretely, this paper demonstrates that there exist classifiers, and associated predictions, for which the relative importance of features determined by the Shapley values will incorrectly assign more importance to features that are provably irrelevant for the prediction, and less importance to features that are provably relevant for the prediction. The paper also argues that, given recent complexity results, the existence of efficient algorithms for the computation of rigorous feature attribution values in the case of some restricted classes of classifiers should be deemed unlikely at best.
    A Neural PDE Solver with Temporal Stencil Modeling. (arXiv:2302.08105v1 [cs.LG])
    Numerical simulation of non-linear partial differential equations plays a crucial role in modeling physical science and engineering phenomena, such as weather, climate, and aerodynamics. Recent Machine Learning (ML) models trained on low-resolution spatio-temporal signals have shown new promises in capturing important dynamics in high-resolution signals, under the condition that the models can effectively recover the missing details. However, this study shows that significant information is often lost in the low-resolution down-sampled features. To address such issues, we propose a new approach, namely Temporal Stencil Modeling (TSM), which combines the strengths of advanced time-series sequence modeling (with the HiPPO features) and state-of-the-art neural PDE solvers (with learnable stencil modeling). TSM aims to recover the lost information from the PDE trajectories and can be regarded as a temporal generalization of classic finite volume methods such as WENO. Our experimental results show that TSM achieves the new state-of-the-art simulation accuracy for 2-D incompressible Navier-Stokes turbulent flows: it significantly outperforms the previously reported best results by 19.9% in terms of the highly-correlated duration time and reduces the inference latency into 80%. We also show a strong generalization ability of the proposed method to various out-of-distribution turbulent flow settings. Our code is available at "https://github.com/Edward-Sun/TSM-PDE".
    Special Properties of Gradient Descent with Large Learning Rates. (arXiv:2205.15142v2 [cs.LG] UPDATED)
    When training neural networks, it has been widely observed that a large step size is essential in stochastic gradient descent (SGD) for obtaining superior models. However, the effect of large step sizes on the success of SGD is not well understood theoretically. Several previous works have attributed this success to the stochastic noise present in SGD. However, we show through a novel set of experiments that the stochastic noise is not sufficient to explain good non-convex training, and that instead the effect of a large learning rate itself is essential for obtaining best performance.We demonstrate the same effects also in the noise-less case, i.e. for full-batch GD. We formally prove that GD with large step size -- on certain non-convex function classes -- follows a different trajectory than GD with a small step size, which can lead to convergence to a global minimum instead of a local one. Our settings provide a framework for future analysis which allows comparing algorithms based on behaviors that can not be observed in the traditional settings.
    Parameters, Properties, and Process: Conditional Neural Generation of Realistic SEM Imagery Towards ML-assisted Advanced Manufacturing. (arXiv:2302.08495v1 [cs.CV])
    The research and development cycle of advanced manufacturing processes traditionally requires a large investment of time and resources. Experiments can be expensive and are hence conducted on relatively small scales. This poses problems for typically data-hungry machine learning tools which could otherwise expedite the development cycle. We build upon prior work by applying conditional generative adversarial networks (GANs) to scanning electron microscope (SEM) imagery from an emerging manufacturing process, shear assisted processing and extrusion (ShAPE). We generate realistic images conditioned on temper and either experimental parameters or material properties. In doing so, we are able to integrate machine learning into the development cycle, by allowing a user to immediately visualize the microstructure that would arise from particular process parameters or properties. This work forms a technical backbone for a fundamentally new approach for understanding manufacturing processes in the absence of first-principle models. By characterizing microstructure from a topological perspective we are able to evaluate our models' ability to capture the breadth and diversity of experimental scanning electron microscope (SEM) samples. Our method is successful in capturing the visual and general microstructural features arising from the considered process, with analysis highlighting directions to further improve the topological realism of our synthetic imagery.
    Deep learning based surrogate modeling for thermal plume prediction of groundwater heat pumps. (arXiv:2302.08199v1 [physics.flu-dyn])
    The ability for groundwater heat pumps to meet space heating and cooling demands without relying on fossil fuels, has prompted their mass roll out in dense urban environments. In regions with high subsurface groundwater flow rates, the thermal plume generated from a heat pump's injection well can propagate downstream, affecting surrounding users and reducing their heat pump efficiency. To reduce the probability of interference, regulators often rely on simple analytical models or high fidelity groundwater simulations to determine the impact that a heat pump has on the subsurface aquifer and surrounding heat pumps. These are either too inaccurate or too computationally expensive for everyday use. In this work, a surrogate model was developed to provide a quick, high accuracy prediction tool of the thermal plume generated by a heat pump within heterogeneous subsurface aquifers. Three variations of a convolutional neural network were developed that accepts the known groundwater Darcy velocities as discrete two-dimensional inputs and predicts the temperature within the subsurface aquifer around the heat pump. A data set consisting of 800 numerical simulation samples, generated from random permeability fields and pressure boundary conditions, was used to provide pseudo-randomized Darcy velocity fields as input fields and the temperature field solution for training the network. The subsurface temperature field output from the network provides a more realistic temperature field that follows the Darcy velocity streamlines, while being orders of magnitude faster than conventional high fidelity solvers
    GP CC-OPF: Gaussian Process based optimization tool for Chance-Constrained Optimal Power Flow. (arXiv:2302.08454v1 [stat.ML])
    The Gaussian Process (GP) based Chance-Constrained Optimal Power Flow (CC-OPF) is an open-source Python code developed for solving economic dispatch (ED) problem in modern power grids. In recent years, integrating a significant amount of renewables into a power grid causes high fluctuations and thus brings a lot of uncertainty to power grid operations. This fact makes the conventional model-based CC-OPF problem non-convex and computationally complex to solve. The developed tool presents a novel data-driven approach based on the GP regression model for solving the CC-OPF problem with a trade-off between complexity and accuracy. The proposed approach and developed software can help system operators to effectively perform ED optimization in the presence of large uncertainties in the power grid.
    Hardware-aware training for large-scale and diverse deep learning inference workloads using in-memory computing-based accelerators. (arXiv:2302.08469v1 [cs.LG])
    Analog in-memory computing (AIMC) -- a promising approach for energy-efficient acceleration of deep learning workloads -- computes matrix-vector multiplications (MVMs) but only approximately, due to nonidealities that often are non-deterministic or nonlinear. This can adversely impact the achievable deep neural network (DNN) inference accuracy as compared to a conventional floating point (FP) implementation. While retraining has previously been suggested to improve robustness, prior work has explored only a few DNN topologies, using disparate and overly simplified AIMC hardware models. Here, we use hardware-aware (HWA) training to systematically examine the accuracy of AIMC for multiple common artificial intelligence (AI) workloads across multiple DNN topologies, and investigate sensitivity and robustness to a broad set of nonidealities. By introducing a new and highly realistic AIMC crossbar-model, we improve significantly on earlier retraining approaches. We show that many large-scale DNNs of various topologies, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers, can in fact be successfully retrained to show iso-accuracy on AIMC. Our results further suggest that AIMC nonidealities that add noise to the inputs or outputs, not the weights, have the largest impact on DNN accuracy, and that RNNs are particularly robust to all nonidealities.
    Conditional deep generative models as surrogates for spatial field solution reconstruction with quantified uncertainty in Structural Health Monitoring applications. (arXiv:2302.08329v1 [cs.LG])
    In recent years, increasingly complex computational models are being built to describe physical systems which has led to increased use of surrogate models to reduce computational cost. In problems related to Structural Health Monitoring (SHM), models capable of both handling high-dimensional data and quantifying uncertainty are required. In this work, our goal is to propose a conditional deep generative model as a surrogate aimed at such applications and high-dimensional stochastic structural simulations in general. To that end, a conditional variational autoencoder (CVAE) utilizing convolutional neural networks (CNNs) is employed to obtain reconstructions of spatially ordered structural response quantities for structural elements that are subjected to stochastic loading. Two numerical examples, inspired by potential SHM applications, are utilized to demonstrate the performance of the surrogate. The model is able to achieve high reconstruction accuracy compared to the reference Finite Element (FE) solutions, while at the same time successfully encoding the load uncertainty.
    DIFUSCO: Graph-based Diffusion Solvers for Combinatorial Optimization. (arXiv:2302.08224v1 [cs.LG])
    Neural network-based Combinatorial Optimization (CO) methods have shown promising results in solving various NP-complete (NPC) problems without relying on hand-crafted domain knowledge. This paper broadens the current scope of neural solvers for NPC problems by introducing a new graph-based diffusion framework, namely DIFUSCO. Our framework casts NPC problems as discrete {0, 1}-vector optimization problems and leverages graph-based denoising diffusion models to generate high-quality solutions. We investigate two types of diffusion models with Gaussian and Bernoulli noise, respectively, and devise an effective inference schedule to enhance the solution quality. We evaluate our methods on two well-studied NPC combinatorial optimization problems: Traveling Salesman Problem (TSP) and Maximal Independent Set (MIS). Experimental results show that DIFUSCO strongly outperforms the previous state-of-the-art neural solvers, improving the performance gap between ground-truth and neural solvers from 1.76% to 0.46% on TSP-500, from 2.46% to 1.17% on TSP-1000, and from 3.19% to 2.58% on TSP10000. For the MIS problem, DIFUSCO outperforms the previous state-of-the-art neural solver on the challenging SATLIB benchmark. Our code is available at "https://github.com/Edward-Sun/DIFUSCO".
    LabelPrompt: Effective Prompt-based Learning for Relation Classification. (arXiv:2302.08068v1 [cs.CL])
    Recently, prompt-based learning has become a very popular solution in many Natural Language Processing (NLP) tasks by inserting a template into model input, which converts the task into a cloze-style one to smoothing out differences between the Pre-trained Language Model (PLM) and the current task. But in the case of relation classification, it is difficult to map the masked output to the relation labels because of its abundant semantic information, e.g. org:founded_by''. Therefore, a pre-trained model still needs enough labelled data to fit the relations. To mitigate this challenge, in this paper, we present a novel prompt-based learning method, namely LabelPrompt, for the relation classification task. It is an extraordinary intuitive approach by a motivation: ``GIVE MODEL CHOICES!''. First, we define some additional tokens to represent the relation labels, which regards these tokens as the verbalizer with semantic initialisation and constructs them with a prompt template method. Then we revisit the inconsistency of the predicted relation and the given entities, an entity-aware module with the thought of contrastive learning is designed to mitigate the problem. At last, we apply an attention query strategy to self-attention layers to resolve two types of tokens, prompt tokens and sequence tokens. The proposed strategy effectively improves the adaptation capability of prompt-based learning in the relation classification task when only a small labelled data is available. Extensive experimental results obtained on several bench-marking datasets demonstrate the superiority of the proposed LabelPrompt method, particularly in the few-shot scenario.
    T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models. (arXiv:2302.08453v1 [cs.CV])
    The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate structure control is needed. In this paper, we aim to ``dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and small T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, and achieve rich control and editing effects. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications.
    TIGER: Temporal Interaction Graph Embedding with Restarts. (arXiv:2302.06057v2 [cs.LG] UPDATED)
    Temporal interaction graphs (TIGs), consisting of sequences of timestamped interaction events, are prevalent in fields like e-commerce and social networks. To better learn dynamic node embeddings that vary over time, researchers have proposed a series of temporal graph neural networks for TIGs. However, due to the entangled temporal and structural dependencies, existing methods have to process the sequence of events chronologically and consecutively to ensure node representations are up-to-date. This prevents existing models from parallelization and reduces their flexibility in industrial applications. To tackle the above challenge, in this paper, we propose TIGER, a TIG embedding model that can restart at any timestamp. We introduce a restarter module that generates surrogate representations acting as the warm initialization of node representations. By restarting from multiple timestamps simultaneously, we divide the sequence into multiple chunks and naturally enable the parallelization of the model. Moreover, in contrast to previous models that utilize a single memory unit, we introduce a dual memory module to better exploit neighborhood information and alleviate the staleness problem. Extensive experiments on four public datasets and one industrial dataset are conducted, and the results verify both the effectiveness and the efficiency of our work.
    Online Estimation and Optimization of Utility-Based Shortfall Risk. (arXiv:2111.08805v2 [stat.ML] UPDATED)
    Utility-Based Shortfall Risk (UBSR) is a risk metric that is increasingly popular in financial applications, owing to certain desirable properties that it enjoys. We consider the problem of estimating UBSR in a recursive setting, where samples from the underlying loss distribution are available one-at-a-time. We cast the UBSR estimation problem as a root finding problem, and propose stochastic approximation-based estimations schemes. We derive non-asymptotic bounds on the estimation error in the number of samples. We also consider the problem of UBSR optimization within a parameterized class of random variables. We propose a stochastic gradient descent based algorithm for UBSR optimization, and derive non-asymptotic bounds on its convergence.
    Unsupervised Manifold Alignment with Joint Multidimensional Scaling. (arXiv:2207.02968v2 [stat.ML] UPDATED)
    We introduce Joint Multidimensional Scaling, a novel approach for unsupervised manifold alignment, which maps datasets from two different domains, without any known correspondences between data instances across the datasets, to a common low-dimensional Euclidean space. Our approach integrates Multidimensional Scaling (MDS) and Wasserstein Procrustes analysis into a joint optimization problem to simultaneously generate isometric embeddings of data and learn correspondences between instances from two different datasets, while only requiring intra-dataset pairwise dissimilarities as input. This unique characteristic makes our approach applicable to datasets without access to the input features, such as solving the inexact graph matching problem. We propose an alternating optimization scheme to solve the problem that can fully benefit from the optimization techniques for MDS and Wasserstein Procrustes. We demonstrate the effectiveness of our approach in several applications, including joint visualization of two datasets, unsupervised heterogeneous domain adaptation, graph matching, and protein structure alignment. The implementation of our work is available at https://github.com/BorgwardtLab/JointMDS
    VA-DepthNet: A Variational Approach to Single Image Depth Prediction. (arXiv:2302.06556v2 [cs.CV] UPDATED)
    We introduce VA-DepthNet, a simple, effective, and accurate deep neural network approach for the single-image depth prediction (SIDP) problem. The proposed approach advocates using classical first-order variational constraints for this problem. While state-of-the-art deep neural network methods for SIDP learn the scene depth from images in a supervised setting, they often overlook the invaluable invariances and priors in the rigid scene space, such as the regularity of the scene. The paper's main contribution is to reveal the benefit of classical and well-founded variational constraints in the neural network design for the SIDP task. It is shown that imposing first-order variational constraints in the scene space together with popular encoder-decoder-based network architecture design provides excellent results for the supervised SIDP task. The imposed first-order variational constraint makes the network aware of the depth gradient in the scene space, i.e., regularity. The paper demonstrates the usefulness of the proposed approach via extensive evaluation and ablation analysis over several benchmark datasets, such as KITTI, NYU Depth V2, and SUN RGB-D. The VA-DepthNet at test time shows considerable improvements in depth prediction accuracy compared to the prior art and is accurate also at high-frequency regions in the scene space. At the time of writing this paper, our method -- labeled as VA-DepthNet, when tested on the KITTI depth-prediction evaluation set benchmarks, shows state-of-the-art results, and is the top-performing published approach.
    Cross Modal Distillation for Flood Extent Mapping. (arXiv:2302.08180v1 [cs.CV])
    The increasing intensity and frequency of floods is one of the many consequences of our changing climate. In this work, we explore ML techniques that improve the flood detection module of an operational early flood warning system. Our method exploits an unlabelled dataset of paired multi-spectral and Synthetic Aperture Radar (SAR) imagery to reduce the labeling requirements of a purely supervised learning method. Prior works have used unlabelled data by creating weak labels out of them. However, from our experiments we noticed that such a model still ends up learning the label mistakes in those weak labels. Motivated by knowledge distillation and semi supervised learning, we explore the use of a teacher to train a student with the help of a small hand labelled dataset and a large unlabelled dataset. Unlike the conventional self distillation setup, we propose a cross modal distillation framework that transfers supervision from a teacher trained on richer modality (multi-spectral images) to a student model trained on SAR imagery. The trained models are then tested on the Sen1Floods11 dataset. Our model outperforms the Sen1Floods11 baseline model trained on the weak labeled SAR imagery by an absolute margin of 6.53% Intersection-over-Union (IoU) on the test split.
    Flexible risk design using bi-directional dispersion. (arXiv:2203.14434v3 [stat.ML] UPDATED)
    Many novel notions of "risk" (e.g., CVaR, tilted risk, DRO risk) have been proposed and studied, but these risks are all at least as sensitive as the mean to loss tails on the upside, and tend to ignore deviations on the downside. We study a complementary new risk class that penalizes loss deviations in a bi-directional manner, while having more flexibility in terms of tail sensitivity than is offered by mean-variance. This class lets us derive high-probability learning guarantees without explicit gradient clipping, and empirical tests using both simulated and real data illustrate a high degree of control over key properties of the test loss distribution incurred by gradient-based learners.
    EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance. (arXiv:2211.09496v2 [eess.AS] UPDATED)
    Although current neural text-to-speech (TTS) models are able to generate high-quality speech, intensity controllable emotional TTS is still a challenging task. Most existing methods need external optimizations for intensity calculation, leading to suboptimal results or degraded quality. In this paper, we propose EmoDiff, a diffusion-based TTS model where emotion intensity can be manipulated by a proposed soft-label guidance technique derived from classifier guidance. Specifically, instead of being guided with a one-hot vector for the specified emotion, EmoDiff is guided with a soft label where the value of the specified emotion and \textit{Neutral} is set to $\alpha$ and $1-\alpha$ respectively. The $\alpha$ here represents the emotion intensity and can be chosen from 0 to 1. Our experiments show that EmoDiff can precisely control the emotion intensity while maintaining high voice quality. Moreover, diverse speech with specified emotion intensity can be generated by sampling in the reverse denoising process.
    Complementary Composite Minimization, Small Gradients in General Norms, and Applications. (arXiv:2101.11041v2 [math.OC] UPDATED)
    Composite minimization is a powerful framework in large-scale convex optimization, based on decoupling of the objective function into terms with structurally different properties and allowing for more flexible algorithmic design. We introduce a new algorithmic framework for complementary composite minimization, where the objective function decouples into a (weakly) smooth and a uniformly convex term. This particular form of decoupling is pervasive in statistics and machine learning, due to its link to regularization. The main contributions of our work are summarized as follows. First, we introduce the problem of complementary composite minimization in general normed spaces; second, we provide a unified accelerated algorithmic framework to address broad classes of complementary composite minimization problems; and third, we prove that the algorithms resulting from our framework are near-optimal in most of the standard optimization settings. Additionally, we show that our algorithmic framework can be used to address the problem of making the gradients small in general normed spaces. As a concrete example, we obtain a nearly-optimal method for the standard $\ell_1$ setup (small gradients in the $\ell_{\infty}$ norm), essentially matching the bound of Nesterov (2012) that was previously known only for the Euclidean setup. Finally, we show that our composite methods are broadly applicable to a number of regression and other classes of optimization problems, where regularization plays a key role. Our methods lead to complexity bounds that are either new or match the best existing ones.
    A numerical approximation method for the Fisher-Rao distance between multivariate normal distributions. (arXiv:2302.08175v1 [cs.IT])
    We present a method to approximate Rao's distance between multivariate normal distributions based on discretizing curves joining normal distributions and approximating Rao distances between successive nearby normals on the curve by using Jeffrey's divergence. We consider experimentally the linear interpolation curves in the ordinary, natural and expectation parameterizations of the normal distributions. We further consider a curve derived from the Calvo and Oller's isometric embedding of the Fisher-Rao $d$-variate normal manifold into the cone of $(d+1)\times (d+1)$ symmetric positive-definite matrices [Journal of multivariate analysis 35.2 (1990): 223-242]. Last, we present some information-geometric properties of the Calvo and Oller's mapping.
    Characterizing and Detecting State-Sponsored Troll Activity on Social Media. (arXiv:2210.08786v4 [cs.SI] UPDATED)
    The detection of state-sponsored trolls acting in information operations is an unsolved and critical challenge for the research community, with repercussions that go beyond the online realm. In this paper, we propose a novel AI-based solution for the detection of state-sponsored troll accounts, which consists of two steps. The first step aims at classifying trajectories of accounts' online activities as belonging to either a state-sponsored troll or to an organic user account. In the second step, we exploit the classified trajectories to compute a metric, namely "troll score", which allows us to quantify the extent to which an account behaves like a state-sponsored troll. As a study case, we consider the troll accounts involved in the Russian interference campaign during the 2016 US Presidential election, identified as Russian trolls by the US Congress. Experimental results show that our approach identifies accounts' trajectories with an AUC close to 99% and, accordingly, classify Russian trolls and organic users with an AUC of 90%. Finally, we evaluate whether the proposed solution can be generalized to different contexts (e.g., discussions about Covid-19) and generic misbehaving users, showing promising results that will be further expanded in our future endeavors.
    Group Fairness with Uncertainty in Sensitive Attributes. (arXiv:2302.08077v1 [cs.LG])
    We consider learning a fair predictive model when sensitive attributes are uncertain, say, due to a limited amount of labeled data, collection bias, or privacy mechanism. We formulate the problem, for the independence notion of fairness, using the information bottleneck principle, and propose a robust optimization with respect to an uncertainty set of the sensitive attributes. As an illustrative case, we consider the joint Gaussian model and reduce the task to a quadratically constrained quadratic problem (QCQP). To ensure a strict fairness guarantee, we propose a robust QCQP and completely characterize its solution with an intuitive geometric understanding. When uncertainty arises due to limited labeled sensitive attributes, our analysis reveals the contribution of each new sample towards the optimal performance achieved with unlimited access to labeled sensitive attributes. This allows us to identify non-trivial regimes where uncertainty incurs no performance loss of the proposed algorithm while continuing to guarantee strict fairness. We also propose a bootstrap-based generic algorithm that is applicable beyond the Gaussian case. We demonstrate the value of our analysis and method on synthetic data as well as real-world classification and regression tasks.
    An Omnidirectional Approach to Touch-based Continuous Authentication. (arXiv:2302.08498v1 [cs.CR])
    This paper focuses on how touch interactions on smartphones can provide a continuous user authentication service through behaviour captured by a touchscreen. While efforts are made to advance touch-based behavioural authentication, researchers often focus on gathering data, tuning classifiers, and enhancing performance by evaluating touch interactions in a sequence rather than independently. However, such systems only work by providing data representing distinct behavioural traits. The typical approach separates behaviour into touch directions and creates multiple user profiles. This work presents an omnidirectional approach which outperforms the traditional method independent of the touch direction - depending on optimal behavioural features and a balanced training set. Thus, we evaluate five behavioural feature sets using the conventional approach against our direction-agnostic method while testing several classifiers, including an Extra-Tree and Gradient Boosting Classifier, which is often overlooked. Results show that in comparison with the traditional, an Extra-Trees classifier and the proposed approach are superior when combining strokes. However, the performance depends on the applied feature set. We find that the TouchAlytics feature set outperforms others when using our approach when combining three or more strokes. Finally, we highlight the importance of reporting the mean area under the curve and equal error rate for single-stroke performance and varying the sequence of strokes separately.
    Choosing the Number of Topics in LDA Models -- A Monte Carlo Comparison of Selection Criteria. (arXiv:2212.14074v2 [cs.CL] UPDATED)
    Selecting the number of topics in LDA models is considered to be a difficult task, for which alternative approaches have been proposed. The performance of the recently developed singular Bayesian information criterion (sBIC) is evaluated and compared to the performance of alternative model selection criteria. The sBIC is a generalization of the standard BIC that can be implemented to singular statistical models. The comparison is based on Monte Carlo simulations and carried out for several alternative settings, varying with respect to the number of topics, the number of documents and the size of documents in the corpora. Performance is measured using different criteria which take into account the correct number of topics, but also whether the relevant topics from the DGPs are identified. Practical recommendations for LDA model selection in applications are derived.
    The Third International Verification of Neural Networks Competition (VNN-COMP 2022): Summary and Results. (arXiv:2212.10376v2 [cs.LG] UPDATED)
    This report summarizes the 3rd International Verification of Neural Networks Competition (VNN-COMP 2022), held as a part of the 5th Workshop on Formal Methods for ML-Enabled Autonomous Systems (FoMLAS), which was collocated with the 34th International Conference on Computer-Aided Verification (CAV). VNN-COMP is held annually to facilitate the fair and objective comparison of state-of-the-art neural network verification tools, encourage the standardization of tool interfaces, and bring together the neural network verification community. To this end, standardized formats for networks (ONNX) and specification (VNN-LIB) were defined, tools were evaluated on equal-cost hardware (using an automatic evaluation pipeline based on AWS instances), and tool parameters were chosen by the participants before the final test sets were made public. In the 2022 iteration, 11 teams participated on a diverse set of 12 scored benchmarks. This report summarizes the rules, benchmarks, participating tools, results, and lessons learned from this iteration of this competition.
    HE-MAN -- Homomorphically Encrypted MAchine learning with oNnx models. (arXiv:2302.08260v1 [cs.CR])
    Machine learning (ML) algorithms are increasingly important for the success of products and services, especially considering the growing amount and availability of data. This also holds for areas handling sensitive data, e.g. applications processing medical data or facial images. However, people are reluctant to pass their personal sensitive data to a ML service provider. At the same time, service providers have a strong interest in protecting their intellectual property and therefore refrain from publicly sharing their ML model. Fully homomorphic encryption (FHE) is a promising technique to enable individuals using ML services without giving up privacy and protecting the ML model of service providers at the same time. Despite steady improvements, FHE is still hardly integrated in today's ML applications. We introduce HE-MAN, an open-source two-party machine learning toolset for privacy preserving inference with ONNX models and homomorphically encrypted data. Both the model and the input data do not have to be disclosed. HE-MAN abstracts cryptographic details away from the users, thus expertise in FHE is not required for either party. HE-MAN 's security relies on its underlying FHE schemes. For now, we integrate two different homomorphic encryption schemes, namely Concrete and TenSEAL. Compared to prior work, HE-MAN supports a broad range of ML models in ONNX format out of the box without sacrificing accuracy. We evaluate the performance of our implementation on different network architectures classifying handwritten digits and performing face recognition and report accuracy and latency of the homomorphically encrypted inference. Cryptographic parameters are automatically derived by the tools. We show that the accuracy of HE-MAN is on par with models using plaintext input while inference latency is several orders of magnitude higher compared to the plaintext case.
    Variational Information Pursuit for Interpretable Predictions. (arXiv:2302.02876v2 [cs.LG] UPDATED)
    There is a growing interest in the machine learning community in developing predictive algorithms that are "interpretable by design". Towards this end, recent work proposes to make interpretable decisions by sequentially asking interpretable queries about data until a prediction can be made with high confidence based on the answers obtained (the history). To promote short query-answer chains, a greedy procedure called Information Pursuit (IP) is used, which adaptively chooses queries in order of information gain. Generative models are employed to learn the distribution of query-answers and labels, which is in turn used to estimate the most informative query. However, learning and inference with a full generative model of the data is often intractable for complex tasks. In this work, we propose Variational Information Pursuit (V-IP), a variational characterization of IP which bypasses the need for learning generative models. V-IP is based on finding a query selection strategy and a classifier that minimizes the expected cross-entropy between true and predicted labels. We then demonstrate that the IP strategy is the optimal solution to this problem. Therefore, instead of learning generative models, we can use our optimal strategy to directly pick the most informative query given any history. We then develop a practical algorithm by defining a finite-dimensional parameterization of our strategy and classifier using deep networks and train them end-to-end using our objective. Empirically, V-IP is 10-100x faster than IP on different Vision and NLP tasks with competitive performance. Moreover, V-IP finds much shorter query chains when compared to reinforcement learning which is typically used in sequential-decision-making problems. Finally, we demonstrate the utility of V-IP on challenging tasks like medical diagnosis where the performance is far superior to the generative modelling approach.
    Learning-based solutions to nonlinear hyperbolic PDEs: Empirical insights on generalization errors. (arXiv:2302.08144v1 [cs.LG])
    We study learning weak solutions to nonlinear hyperbolic partial differential equations (H-PDE), which have been difficult to learn due to discontinuities in their solutions. We use a physics-informed variant of the Fourier Neural Operator ($\pi$-FNO) to learn the weak solutions. We empirically quantify the generalization/out-of-sample error of the $\pi$-FNO solver as a function of input complexity, i.e., the distributions of initial and boundary conditions. Our testing results show that $\pi$-FNO generalizes well to unseen initial and boundary conditions. We find that the generalization error grows linearly with input complexity. Further, adding a physics-informed regularizer improved the prediction of discontinuities in the solution. We use the Lighthill-Witham-Richards (LWR) traffic flow model as a guiding example to illustrate the results.
    Improving Spoken Language Identification with Map-Mix. (arXiv:2302.08229v1 [cs.LG])
    The pre-trained multi-lingual XLSR model generalizes well for language identification after fine-tuning on unseen languages. However, the performance significantly degrades when the languages are not very distinct from each other, for example, in the case of dialects. Low resource dialect classification remains a challenging problem to solve. We present a new data augmentation method that leverages model training dynamics of individual data points to improve sampling for latent mixup. The method works well in low-resource settings where generalization is paramount. Our datamaps-based mixup technique, which we call Map-Mix improves weighted F1 scores by 2% compared to the random mixup baseline and results in a significantly well-calibrated model. The code for our method is open sourced on https://github.com/skit-ai/Map-Mix.
    A Bit-Parallel Deterministic Stochastic Multiplier. (arXiv:2302.08324v1 [cs.AR])
    This paper presents a novel bit-parallel deterministic stochastic multiplier, which improves the area-energy-latency product by up to 10.6$\times$10$^4$, while improving the computational error by 32.2\%, compared to three prior stochastic multipliers.
    Learning Hypergraphs From Signals With Dual Smoothness Prior. (arXiv:2211.01717v2 [cs.LG] UPDATED)
    The construction of a meaningful hypergraph topology is the key to processing signals with high-order relationships that involve more than two entities. Learning the hypergraph structure from the observed signals to capture the intrinsic relationships among the entities becomes crucial when a hypergraph topology is not readily available in the datasets. There are two challenges that lie at the heart of this problem: 1) how to handle the huge search space of potential hyperedges, and 2) how to define meaningful criteria to measure the relationship between the signals observed on nodes and the hypergraph structure. In this paper, to address the first challenge, we adopt the assumption that the ideal hypergraph structure can be derived from a learnable graph structure that captures the pairwise relations within signals. Further, we propose a hypergraph learning framework with a novel dual smoothness prior that reveals a mapping between the observed node signals and the hypergraph structure, whereby each hyperedge corresponds to a subgraph with both node signal smoothness and edge signal smoothness in the learnable graph structure. Finally, we conduct extensive experiments to evaluate the proposed framework on both synthetic and real world datasets. Experiments show that our proposed framework can efficiently infer meaningful hypergraph topologies from observed signals.
    Fuzzy Knowledge Distillation from High-Order TSK to Low-Order TSK. (arXiv:2302.08038v1 [cs.LG])
    High-order Takagi-Sugeno-Kang (TSK) fuzzy classifiers possess powerful classification performance yet have fewer fuzzy rules, but always be impaired by its exponential growth training time and poorer interpretability owing to High-order polynomial used in consequent part of fuzzy rule, while Low-order TSK fuzzy classifiers run quickly with high interpretability, however they usually require more fuzzy rules and perform relatively not very well. Address this issue, a novel TSK fuzzy classifier embeded with knowledge distillation in deep learning called HTSK-LLM-DKD is proposed in this study. HTSK-LLM-DKD achieves the following distinctive characteristics: 1) It takes High-order TSK classifier as teacher model and Low-order TSK fuzzy classifier as student model, and leverages the proposed LLM-DKD (Least Learning Machine based Decoupling Knowledge Distillation) to distill the fuzzy dark knowledge from High-order TSK fuzzy classifier to Low-order TSK fuzzy classifier, which resulting in Low-order TSK fuzzy classifier endowed with enhanced performance surpassing or at least comparable to High-order TSK classifier, as well as high interpretability; specifically 2) The Negative Euclidean distance between the output of teacher model and each class is employed to obtain the teacher logits, and then it compute teacher/student soft labels by the softmax function with distillating temperature parameter; 3) By reformulating the Kullback-Leibler divergence, it decouples fuzzy dark knowledge into target class knowledge and non-target class knowledge, and transfers them to student model. The advantages of HTSK-LLM-DKD are verified on the benchmarking UCI datasets and a real dataset Cleveland heart disease, in terms of classification performance and model interpretability.
    Revisiting Hidden Representations in Transfer Learning for Medical Imaging. (arXiv:2302.08272v1 [cs.CV])
    While a key component to the success of deep learning is the availability of massive amounts of training data, medical image datasets are often limited in diversity and size. Transfer learning has the potential to bridge the gap between related yet different domains. For medical applications, however, it remains unclear whether it is more beneficial to pre-train on natural or medical images. We aim to shed light on this problem by comparing initialization on ImageNet and RadImageNet on seven medical classification tasks. We investigate their learned representations with Canonical Correlation Analysis (CCA) and compare the predictions of the different models. We find that overall the models pre-trained on ImageNet outperform those trained on RadImageNet. Our results show that, contrary to intuition, ImageNet and RadImageNet converge to distinct intermediate representations, and that these representations are even more dissimilar after fine-tuning. Despite these distinct representations, the predictions of the models remain similar. Our findings challenge the notion that transfer learning is effective due to the reuse of general features in the early layers of a convolutional neural network and show that weight similarity before and after fine-tuning is negatively related to performance gains.
    Assisting Human Decisions in Document Matching. (arXiv:2302.08450v1 [cs.LG])
    Many practical applications, ranging from paper-reviewer assignment in peer review to job-applicant matching for hiring, require human decision makers to identify relevant matches by combining their expertise with predictions from machine learning models. In many such model-assisted document matching tasks, the decision makers have stressed the need for assistive information about the model outputs (or the data) to facilitate their decisions. In this paper, we devise a proxy matching task that allows us to evaluate which kinds of assistive information improve decision makers' performance (in terms of accuracy and time). Through a crowdsourced (N=271 participants) study, we find that providing black-box model explanations reduces users' accuracy on the matching task, contrary to the commonly-held belief that they can be helpful by allowing better understanding of the model. On the other hand, custom methods that are designed to closely attend to some task-specific desiderata are found to be effective in improving user performance. Surprisingly, we also find that the users' perceived utility of assistive information is misaligned with their objective utility (measured through their task performance).
    On the Effect of Adversarial Training Against Invariance-based Adversarial Examples. (arXiv:2302.08257v1 [cs.LG])
    Adversarial examples are carefully crafted attack points that are supposed to fool machine learning classifiers. In the last years, the field of adversarial machine learning, especially the study of perturbation-based adversarial examples, in which a perturbation that is not perceptible for humans is added to the images, has been studied extensively. Adversarial training can be used to achieve robustness against such inputs. Another type of adversarial examples are invariance-based adversarial examples, where the images are semantically modified such that the predicted class of the model does not change, but the class that is determined by humans does. How to ensure robustness against this type of adversarial examples has not been explored yet. This work addresses the impact of adversarial training with invariance-based adversarial examples on a convolutional neural network (CNN). We show that when adversarial training with invariance-based and perturbation-based adversarial examples is applied, it should be conducted simultaneously and not consecutively. This procedure can achieve relatively high robustness against both types of adversarial examples. Additionally, we find that the algorithm used for generating invariance-based adversarial examples in prior work does not correctly determine the labels and therefore we use human-determined labels.
    Singular Value Representation: A New Graph Perspective On Neural Networks. (arXiv:2302.08183v1 [cs.LG])
    We introduce the Singular Value Representation (SVR), a new method to represent the internal state of neural networks using SVD factorization of the weights. This construction yields a new weighted graph connecting what we call spectral neurons, that correspond to specific activation patterns of classical neurons. We derive a precise statistical framework to discriminate meaningful connections between spectral neurons for fully connected and convolutional layers. To demonstrate the usefulness of our approach for machine learning research, we highlight two discoveries we made using the SVR. First, we highlight the emergence of a dominant connection in VGG networks that spans multiple deep layers. Second, we witness, without relying on any input data, that batch normalization can induce significant connections between near-kernels of deep layers, leading to a remarkable spontaneous sparsification phenomenon.
    Quality vs. Quantity of Data in Contextual Decision-Making: Exact Analysis under Newsvendor Loss. (arXiv:2302.08424v1 [cs.LG])
    When building datasets, one needs to invest time, money and energy to either aggregate more data or to improve their quality. The most common practice favors quantity over quality without necessarily quantifying the trade-off that emerges. In this work, we study data-driven contextual decision-making and the performance implications of quality and quantity of data. We focus on contextual decision-making with a Newsvendor loss. This loss is that of a central capacity planning problem in Operations Research, but also that associated with quantile regression. We consider a model in which outcomes observed in similar contexts have similar distributions and analyze the performance of a classical class of kernel policies which weigh data according to their similarity in a contextual space. We develop a series of results that lead to an exact characterization of the worst-case expected regret of these policies. This exact characterization applies to any sample size and any observed contexts. The model we develop is flexible, and captures the case of partially observed contexts. This exact analysis enables to unveil new structural insights on the learning behavior of uniform kernel methods: i) the specialized analysis leads to very large improvements in quantification of performance compared to state of the art general purpose bounds. ii) we show an important non-monotonicity of the performance as a function of data size not captured by previous bounds; and iii) we show that in some regimes, a little increase in the quality of the data can dramatically reduce the amount of samples required to reach a performance target. All in all, our work demonstrates that it is possible to quantify in a precise fashion the interplay of data quality and quantity, and performance in a central problem class. It also highlights the need for problem specific bounds in order to understand the trade-offs at play.
    Individual Fairness Guarantee in Learning with Censorship. (arXiv:2302.08015v1 [cs.LG])
    Algorithmic fairness, studying how to make machine learning (ML) algorithms fair, is an established area of ML. As ML technologies expand their application domains, including ones with high societal impact, it becomes essential to take fairness into consideration when building ML systems. Yet, despite its wide range of socially sensitive applications, most work treats the issue of algorithmic bias as an intrinsic property of supervised learning, i.e., the class label is given as a precondition. Unlike prior fairness work, we study individual fairness in learning with censorship where the assumption of availability of the class label does not hold, while still requiring that similar individuals are treated similarly. We argue that this perspective represents a more realistic model of fairness research for real-world application deployment, and show how learning with such a relaxed precondition draws new insights that better explain algorithmic fairness. We also thoroughly evaluate the performance of the proposed methodology on three real-world datasets, and validate its superior performance in minimizing discrimination while maintaining predictive performance.
    Efficiency 360: Efficient Vision Transformers. (arXiv:2302.08374v1 [cs.CV])
    Transformers are widely used for solving tasks in natural language processing, computer vision, speech, and music domains. In this paper, we talk about the efficiency of transformers in terms of memory (the number of parameters), computation cost (number of floating points operations), and performance of models, including accuracy, the robustness of the model, and fair \& bias-free features. We mainly discuss the vision transformer for the image classification task. Our contribution is to introduce an efficient 360 framework, which includes various aspects of the vision transformer, to make it more efficient for industrial applications. By considering those applications, we categorize them into multiple dimensions such as privacy, robustness, transparency, fairness, inclusiveness, continual learning, probabilistic models, approximation, computational complexity, and spectral complexity. We compare various vision transformer models based on their performance, the number of parameters, and the number of floating point operations (FLOPs) on multiple datasets.
    FOSI: Hybrid First and Second Order Optimization. (arXiv:2302.08484v1 [cs.LG])
    Though second-order optimization methods are highly effective, popular approaches in machine learning such as SGD and Adam use only first-order information due to the difficulty of computing curvature in high dimensions. We present FOSI, a novel meta-algorithm that improves the performance of any first-order optimizer by efficiently incorporating second-order information during the optimization process. In each iteration, FOSI implicitly splits the function into two quadratic functions defined on orthogonal subspaces, then uses a second-order method to minimize the first, and the base optimizer to minimize the other. Our analysis of FOSI's preconditioner and effective Hessian proves that FOSI improves the condition number for a large family of optimizers. Our empirical evaluation demonstrates that FOSI improves the convergence rate and optimization time of GD, Heavy-Ball, and Adam when applied to several deep neural networks training tasks such as audio classification, transfer learning, and object classification and when applied to convex functions.
    \`A-la-carte Prompt Tuning (APT): Combining Distinct Data Via Composable Prompting. (arXiv:2302.07994v1 [cs.LG])
    We introduce \`A-la-carte Prompt Tuning (APT), a transformer-based scheme to tune prompts on distinct data so that they can be arbitrarily composed at inference time. The individual prompts can be trained in isolation, possibly on different devices, at different times, and on different distributions or domains. Furthermore each prompt only contains information about the subset of data it was exposed to during training. During inference, models can be assembled based on arbitrary selections of data sources, which we call "\`a-la-carte learning". \`A-la-carte learning enables constructing bespoke models specific to each user's individual access rights and preferences. We can add or remove information from the model by simply adding or removing the corresponding prompts without retraining from scratch. We demonstrate that \`a-la-carte built models achieve accuracy within $5\%$ of models trained on the union of the respective sources, with comparable cost in terms of training and inference time. For the continual learning benchmarks Split CIFAR-100 and CORe50, we achieve state-of-the-art performance.
    Balancing Privacy Protection and Interpretability in Federated Learning. (arXiv:2302.08044v1 [cs.LG])
    Federated learning (FL) aims to collaboratively train the global model in a distributed manner by sharing the model parameters from local clients to a central server, thereby potentially protecting users' private information. Nevertheless, recent studies have illustrated that FL still suffers from information leakage as adversaries try to recover the training data by analyzing shared parameters from local clients. To deal with this issue, differential privacy (DP) is adopted to add noise to the gradients of local models before aggregation. It, however, results in the poor performance of gradient-based interpretability methods, since some weights capturing the salient region in feature map will be perturbed. To overcome this problem, we propose a simple yet effective adaptive differential privacy (ADP) mechanism that selectively adds noisy perturbations to the gradients of client models in FL. We also theoretically analyze the impact of gradient perturbation on the model interpretability. Finally, extensive experiments on both IID and Non-IID data demonstrate that the proposed ADP can achieve a good trade-off between privacy and interpretability in FL.
    Fair mapping. (arXiv:2209.00617v2 [cs.LG] UPDATED)
    To mitigate the effects of undesired biases in models, several approaches propose to pre-process the input dataset to reduce the risks of discrimination by preventing the inference of sensitive attributes. Unfortunately, most of these pre-processing methods lead to the generation a new distribution that is very different from the original one, thus often leading to unrealistic data. As a side effect, this new data distribution implies that existing models need to be re-trained to be able to make accurate predictions. To address this issue, we propose a novel pre-processing method, that we coin as fair mapping, based on the transformation of the distribution of protected groups onto a chosen target one, with additional privacy constraints whose objective is to prevent the inference of sensitive attributes. More precisely, we leverage on the recent works of the Wasserstein GAN and AttGAN frameworks to achieve the optimal transport of data points coupled with a discriminator enforcing the protection against attribute inference. Our proposed approach, preserves the interpretability of data and can be used without defining exactly the sensitive groups. In addition, our approach can be specialized to model existing state-of-the-art approaches, thus proposing a unifying view on these methods. Finally, several experiments on real and synthetic datasets demonstrate that our approach is able to hide the sensitive attributes, while limiting the distortion of the data and improving the fairness on subsequent data analysis tasks.
    Multiscale Graph Neural Network Autoencoders for Interpretable Scientific Machine Learning. (arXiv:2302.06186v2 [cs.LG] UPDATED)
    The goal of this work is to address two limitations in autoencoder-based models: latent space interpretability and compatibility with unstructured meshes. This is accomplished here with the development of a novel graph neural network (GNN) autoencoding architecture with demonstrations on complex fluid flow applications. To address the first goal of interpretability, the GNN autoencoder achieves reduction in the number nodes in the encoding stage through an adaptive graph reduction procedure. This reduction procedure essentially amounts to flowfield-conditioned node sampling and sensor identification, and produces interpretable latent graph representations tailored to the flowfield reconstruction task in the form of so-called masked fields. These masked fields allow the user to (a) visualize where in physical space a given latent graph is active, and (b) interpret the time-evolution of the latent graph connectivity in accordance with the time-evolution of unsteady flow features (e.g. recirculation zones, shear layers) in the domain. To address the goal of unstructured mesh compatibility, the autoencoding architecture utilizes a series of multi-scale message passing (MMP) layers, each of which models information exchange among node neighborhoods at various lengthscales. The MMP layer, which augments standard single-scale message passing with learnable coarsening operations, allows the decoder to more efficiently reconstruct the flowfield from the identified regions in the masked fields. Analysis of latent graphs produced by the autoencoder for various model settings are conducted using using unstructured snapshot data sourced from large-eddy simulations in a backward-facing step (BFS) flow configuration with an OpenFOAM-based flow solver at high Reynolds numbers.
    Self-supervised Guided Hypergraph Feature Propagation for Semi-supervised Classification with Missing Node Features. (arXiv:2302.08250v1 [cs.LG])
    Graph neural networks (GNNs) with missing node features have recently received increasing interest. Such missing node features seriously hurt the performance of the existing GNNs. Some recent methods have been proposed to reconstruct the missing node features by the information propagation among nodes with known and unknown attributes. Although these methods have achieved superior performance, how to exactly exploit the complex data correlations among nodes to reconstruct missing node features is still a great challenge. To solve the above problem, we propose a self-supervised guided hypergraph feature propagation (SGHFP). Specifically, the feature hypergraph is first generated according to the node features with missing information. And then, the reconstructed node features produced by the previous iteration are fed to a two-layer GNNs to construct a pseudo-label hypergraph. Before each iteration, the constructed feature hypergraph and pseudo-label hypergraph are fused effectively, which can better preserve the higher-order data correlations among nodes. After then, we apply the fused hypergraph to the feature propagation for reconstructing missing features. Finally, the reconstructed node features by multi-iteration optimization are applied to the downstream semi-supervised classification task. Extensive experiments demonstrate that the proposed SGHFP outperforms the existing semi-supervised classification with missing node feature methods.
    A Proximal Algorithm for Sampling. (arXiv:2202.13975v2 [cs.LG] UPDATED)
    We study sampling problems associated with potentials that lack smoothness. The potentials can be either convex or non-convex. Departing from the standard smooth setting, the potentials are only assumed to be weakly smooth or non-smooth, or the summation of multiple such functions. We develop a sampling algorithm that resembles proximal algorithms in optimization for this challenging sampling task. Our algorithm is based on a special case of Gibbs sampling known as the alternating sampling framework (ASF). The key contribution of this work is a practical realization of the ASF based on rejection sampling for both non-convex and convex potentials that are not necessarily smooth. In almost all the cases of sampling considered in this work, our proximal sampling algorithm achieves better complexity than all existing methods.
    Autoregressive Quantile Flows for Predictive Uncertainty Estimation. (arXiv:2112.04643v3 [cs.LG] UPDATED)
    Numerous applications of machine learning involve representing probability distributions over high-dimensional data. We propose autoregressive quantile flows, a flexible class of normalizing flow models trained using a novel objective based on proper scoring rules. Our objective does not require calculating computationally expensive determinants of Jacobians during training and supports new types of neural architectures, such as neural autoregressive flows from which it is easy to sample. We leverage these models in quantile flow regression, an approach that parameterizes predictive conditional distributions with flows, resulting in improved probabilistic predictions on tasks such as time series forecasting and object detection. Our novel objective functions and neural flow parameterizations also yield improvements on popular generation and density estimation tasks, and represent a step beyond maximum likelihood learning of flows.
    Efficient Tomography of Non-Interacting Fermion States. (arXiv:2102.10458v4 [quant-ph] UPDATED)
    We give an efficient algorithm that learns a non-interacting fermion state, given copies of the state. For a system of $n$ non-interacting fermions and $m$ modes, we show that $O(m^3 n^2 \log(1/\delta) / \epsilon^4)$ copies of the input state and $O(m^4 n^2 \log(1/\delta)/ \epsilon^4)$ time are sufficient to learn the state to trace distance at most $\epsilon$ with probability at least $1 - \delta$. Our algorithm empirically estimates one-mode correlations in $O(m)$ different measurement bases and uses them to reconstruct a succinct description of the entire state efficiently.
    Counting Carbon: A Survey of Factors Influencing the Emissions of Machine Learning. (arXiv:2302.08476v1 [cs.LG])
    Machine learning (ML) requires using energy to carry out computations during the model training process. The generation of this energy comes with an environmental cost in terms of greenhouse gas emissions, depending on quantity used and the energy source. Existing research on the environmental impacts of ML has been limited to analyses covering a small number of models and does not adequately represent the diversity of ML models and tasks. In the current study, we present a survey of the carbon emissions of 95 ML models across time and different tasks in natural language processing and computer vision. We analyze them in terms of the energy sources used, the amount of CO2 emissions produced, how these emissions evolve across time and how they relate to model performance. We conclude with a discussion regarding the carbon footprint of our field and propose the creation of a centralized repository for reporting and tracking these emissions.
    Learning From Biased Soft Labels. (arXiv:2302.08155v1 [cs.LG])
    Knowledge distillation has been widely adopted in a variety of tasks and has achieved remarkable successes. Since its inception, many researchers have been intrigued by the dark knowledge hidden in the outputs of the teacher model. Recently, a study has demonstrated that knowledge distillation and label smoothing can be unified as learning from soft labels. Consequently, how to measure the effectiveness of the soft labels becomes an important question. Most existing theories have stringent constraints on the teacher model or data distribution, and many assumptions imply that the soft labels are close to the ground-truth labels. This paper studies whether biased soft labels are still effective. We present two more comprehensive indicators to measure the effectiveness of such soft labels. Based on the two indicators, we give sufficient conditions to ensure biased soft label based learners are classifier-consistent and ERM learnable. The theory is applied to three weakly-supervised frameworks. Experimental results validate that biased soft labels can also teach good students, which corroborates the soundness of the theory.
    Multi-task Self-supervised Graph Neural Networks Enable Stronger Task Generalization. (arXiv:2210.02016v2 [cs.LG] UPDATED)
    Self-supervised learning (SSL) for graph neural networks (GNNs) has attracted increasing attention from the graph machine learning community in recent years, owing to its capability to learn performant node embeddings without costly label information. One weakness of conventional SSL frameworks for GNNs is that they learn through a single philosophy, such as mutual information maximization or generative reconstruction. When applied to various downstream tasks, these frameworks rarely perform equally well for every task, because one philosophy may not span the extensive knowledge required for all tasks. To enhance the task generalization across tasks, as an important first step forward in exploring fundamental graph models, we introduce PARETOGNN, a multi-task SSL framework for node representation learning over graphs. Specifically, PARETOGNN is self-supervised by manifold pretext tasks observing multiple philosophies. To reconcile different philosophies, we explore a multiple-gradient descent algorithm, such that PARETOGNN actively learns from every pretext task while minimizing potential conflicts. We conduct comprehensive experiments over four downstream tasks (i.e., node classification, node clustering, link prediction, and partition prediction), and our proposal achieves the best overall performance across tasks on 11 widely adopted benchmark datasets. Besides, we observe that learning from multiple philosophies enhances not only the task generalization but also the single task performances, demonstrating that PARETOGNN achieves better task generalization via the disjoint yet complementary knowledge learned from different philosophies. Our code is publicly available at https://github.com/jumxglhf/ParetoGNN.
    Model-Based Decentralized Policy Optimization. (arXiv:2302.08139v1 [cs.LG])
    Decentralized policy optimization has been commonly used in cooperative multi-agent tasks. However, since all agents are updating their policies simultaneously, from the perspective of individual agents, the environment is non-stationary, resulting in it being hard to guarantee monotonic policy improvement. To help the policy improvement be stable and monotonic, we propose model-based decentralized policy optimization (MDPO), which incorporates a latent variable function to help construct the transition and reward function from an individual perspective. We theoretically analyze that the policy optimization of MDPO is more stable than model-free decentralized policy optimization. Moreover, due to non-stationarity, the latent variable function is varying and hard to be modeled. We further propose a latent variable prediction method to reduce the error of the latent variable function, which theoretically contributes to the monotonic policy improvement. Empirically, MDPO can indeed obtain superior performance than model-free decentralized policy optimization in a variety of cooperative multi-agent tasks.
    GraphPrompt: Unifying Pre-Training and Downstream Tasks for Graph Neural Networks. (arXiv:2302.08043v1 [cs.LG])
    Graphs can model complex relationships between objects, enabling a myriad of Web applications such as online page/article classification and social recommendation. While graph neural networks(GNNs) have emerged as a powerful tool for graph representation learning, in an end-to-end supervised setting, their performance heavily rely on a large amount of task-specific supervision. To reduce labeling requirement, the "pre-train, fine-tune" and "pre-train, prompt" paradigms have become increasingly common. In particular, prompting is a popular alternative to fine-tuning in natural language processing, which is designed to narrow the gap between pre-training and downstream objectives in a task-specific manner. However, existing study of prompting on graphs is still limited, lacking a universal treatment to appeal to different downstream tasks. In this paper, we propose GraphPrompt, a novel pre-training and prompting framework on graphs. GraphPrompt not only unifies pre-training and downstream tasks into a common task template, but also employs a learnable prompt to assist a downstream task in locating the most relevant knowledge from the pre-train model in a task-specific manner. Finally, we conduct extensive experiments on five public datasets to evaluate and analyze GraphPrompt.
    Oracles & Followers: Stackelberg Equilibria in Deep Multi-Agent Reinforcement Learning. (arXiv:2210.11942v3 [cs.GT] UPDATED)
    Stackelberg equilibria arise naturally in a range of popular learning problems, such as in security games or indirect mechanism design, and have received increasing attention in the reinforcement learning literature. We present a general framework for implementing Stackelberg equilibria search as a multi-agent RL problem, allowing a wide range of algorithmic design choices. We discuss how previous approaches can be seen as specific instantiations of this framework. As a key insight, we note that the design space allows for approaches not previously seen in the literature, for instance by leveraging multitask and meta-RL techniques for follower convergence. We propose one such approach using contextual policies, and evaluate it experimentally on both standard and novel benchmark domains, showing greatly improved sample efficiency compared to previous approaches. Finally, we explore the effect of adopting algorithm designs outside the borders of our framework.
    The Scope of Multicalibration: Characterizing Multicalibration via Property Elicitation. (arXiv:2302.08507v1 [cs.LG])
    We make a connection between multicalibration and property elicitation and show that (under mild technical conditions) it is possible to produce a multicalibrated predictor for a continuous scalar distributional property $\Gamma$ if and only if $\Gamma$ is elicitable. On the negative side, we show that for non-elicitable continuous properties there exist simple data distributions on which even the true distributional predictor is not calibrated. On the positive side, for elicitable $\Gamma$, we give simple canonical algorithms for the batch and the online adversarial setting, that learn a $\Gamma$-multicalibrated predictor. This generalizes past work on multicalibrated means and quantiles, and in fact strengthens existing online quantile multicalibration results. To further counter-weigh our negative result, we show that if a property $\Gamma^1$ is not elicitable by itself, but is elicitable conditionally on another elicitable property $\Gamma^0$, then there is a canonical algorithm that jointly multicalibrates $\Gamma^1$ and $\Gamma^0$; this generalizes past work on mean-moment multicalibration. Finally, as applications of our theory, we provide novel algorithmic and impossibility results for fair (multicalibrated) risk assessment.
    On the Effectiveness of Fine-tuning Versus Meta-reinforcement Learning. (arXiv:2206.03271v2 [cs.LG] UPDATED)
    Intelligent agents should have the ability to leverage knowledge from previously learned tasks in order to learn new ones quickly and efficiently. Meta-learning approaches have emerged as a popular solution to achieve this. However, meta-reinforcement learning (meta-RL) algorithms have thus far been restricted to simple environments with narrow task distributions. Moreover, the paradigm of pretraining followed by fine-tuning to adapt to new tasks has emerged as a simple yet effective solution in supervised and self-supervised learning. This calls into question the benefits of meta-learning approaches also in reinforcement learning, which typically come at the cost of high complexity. We hence investigate meta-RL approaches in a variety of vision-based benchmarks, including Procgen, RLBench, and Atari, where evaluations are made on completely novel tasks. Our findings show that when meta-learning approaches are evaluated on different tasks (rather than different variations of the same task), multi-task pretraining with fine-tuning on new tasks performs equally as well, or better, than meta-pretraining with meta test-time adaptation. This is encouraging for future research, as multi-task pretraining tends to be simpler and computationally cheaper than meta-RL. From these findings, we advocate for evaluating future meta-RL methods on more challenging tasks and including multi-task pretraining with fine-tuning as a simple, yet strong baseline.
    Universal approximation and model compression for radial neural networks. (arXiv:2107.02550v3 [cs.LG] UPDATED)
    We introduce a class of fully-connected neural networks whose activation functions, rather than being pointwise, rescale feature vectors by a function depending only on their norm. We call such networks radial neural networks, extending previous work on rotation equivariant networks that considers rescaling activations in less generality. We prove universal approximation theorems for radial neural networks, including in the more difficult cases of bounded widths and unbounded domains. Our proof techniques are novel, distinct from those in the pointwise case. Additionally, radial neural networks exhibit a rich group of orthogonal change-of-basis symmetries on the vector space of trainable parameters. Factoring out these symmetries leads to a practical lossless model compression algorithm. Optimization of the compressed model by gradient descent is equivalent to projected gradient descent for the full model.
    Estimating and Controlling for Fairness via Sensitive Attribute Predictors. (arXiv:2207.12497v3 [cs.LG] UPDATED)
    The responsible use of machine learning tools in real world high-stakes decision making demands that we audit and control for potential biases against underrepresented groups. This process naturally requires access to the sensitive attribute one desires to control, such as demographics, gender, or other potentially sensitive features. Unfortunately, this information is often unavailable. In this work we demonstrate that one can still reliably estimate, and ultimately control, for fairness by using proxy sensitive attributes derived from a sensitive attribute predictor. Specifically, we first show that with just a little knowledge of the complete data distribution, one may use a sensitive attribute predictor to obtain bounds of the classifier's true fairness metric. Second, we demonstrate how one can provably control a classifier's worst-case fairness violation with respect to the true sensitive attribute by controlling for fairness with respect to the proxy sensitive attribute. Our results hold under assumptions that are significantly milder than previous works, and we illustrate these results with experiments on synthetic and real datasets.
    Social learning spontaneously emerges by searching optimal heuristics with deep reinforcement learning. (arXiv:2204.12371v3 [cs.LG] UPDATED)
    How have individuals of social animals in nature evolved to learn from each other, and what would be the optimal strategy for such learning in a specific environment? Here, we address both problems by employing a deep reinforcement learning model to optimize the social learning strategies (SLSs) of agents in a cooperative game in a multi-dimensional landscape. Throughout the training for maximizing the overall payoff, we find that the agent spontaneously learns various concepts of social learning, such as copying, focusing on frequent and well-performing neighbors, self-comparison, and the importance of balancing between individual and social learning, without any explicit guidance or prior knowledge about the system. The SLS from a fully trained agent outperforms all of the traditional, baseline SLSs in terms of mean payoff. We demonstrate the superior performance of the reinforcement learning agent in various environments, including temporally changing environments and real social networks, which also verifies the adaptability of our framework to different social settings.
    Can language models handle recursively nested grammatical structures? A case study on comparing models and humans. (arXiv:2210.15303v3 [cs.CL] UPDATED)
    How should we compare the capabilities of language models (LMs) and humans? I draw inspiration from comparative psychology to highlight some challenges. In particular, I consider a case study: processing of recursively nested grammatical structures. Prior work suggests that LMs cannot handle these structures as reliably as humans can. However, the humans were provided with instructions and training, while the LMs were evaluated zero-shot. I therefore match the evaluation more closely. Providing large LMs with a simple prompt -- substantially less content than the human training -- allows the LMs to consistently outperform the human results, and even to extrapolate to more deeply nested conditions than were tested with humans. Further, reanalyzing the prior human data suggests that the humans may not perform above chance at the difficult structures initially. Thus, large LMs may indeed process recursively nested grammatical structures as reliably as humans. This case study highlights how discrepancies in the evaluation can confound comparisons of language models and humans. I therefore reflect on the broader challenge of comparing human and model capabilities, and highlight an important difference between evaluating cognitive models and foundation models.
    Navya3DSeg -- Navya 3D Semantic Segmentation Dataset & split generation for autonomous vehicles. (arXiv:2302.08292v1 [cs.CV])
    Autonomous driving (AD) perception today relies heavily on deep learning based architectures requiring large scale annotated datasets with their associated costs for curation and annotation. The 3D semantic data are useful for core perception tasks such as obstacle detection and ego-vehicle localization. We propose a new dataset, Navya 3D Segmentation (Navya3DSeg), with a diverse label space corresponding to a large scale production grade operational domain, including rural, urban, industrial sites and universities from 13 countries. It contains 23 labeled sequences and 25 supplementary sequences without labels, designed to explore self-supervised and semi-supervised semantic segmentation benchmarks on point clouds. We also propose a novel method for sequential dataset split generation based on iterative multi-label stratification, and demonstrated to achieve a +1.2% mIoU improvement over the original split proposed by SemanticKITTI dataset. A complete benchmark for semantic segmentation task was performed, with state of the art methods. Finally, we demonstrate an active learning (AL) based dataset distillation framework. We introduce a novel heuristic-free sampling method called distance sampling in the context of AL. A detailed presentation on the dataset is available at https://www.youtube.com/watch?v=5m6ALIs-s20 .
    LEVER: Learning to Verify Language-to-Code Generation with Execution. (arXiv:2302.08468v1 [cs.LG])
    The advent of pre-trained code language models (CodeLMs) has lead to significant progress in language-to-code generation. State-of-the-art approaches in this area combine CodeLM decoding with sample pruning and reranking using test cases or heuristics based on the execution results. However, it is challenging to obtain test cases for many real-world language-to-code applications, and heuristics cannot well capture the semantic features of the execution results, such as data type and value range, which often indicates the correctness of the program. In this work, we propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results. Specifically, we train verifiers to determine whether a program sampled from the CodeLM is correct or not based on the natural language input, the program itself and its execution results. The sampled programs are reranked by combining the verification score with the CodeLM generation probability, and marginalizing over programs with the same execution results. On four datasets across the domains of table QA, math QA and basic Python programming, LEVER consistently improves over the base CodeLMs (4.6% to 10.9% with code-davinci-002) and achieves new state-of-the-art results on all of them.
    Do Deep Learning Models Really Outperform Traditional Approaches in Molecular Docking?. (arXiv:2302.07134v2 [q-bio.BM] CROSS LISTED)
    Molecular docking, given a ligand molecule and a ligand binding site (called ``pocket'') on a protein, predicting the binding mode of the protein-ligand complex, is a widely used technique in drug design. Many deep learning models have been developed for molecular docking, while most existing deep learning models perform docking on the whole protein, rather than on a given pocket as the traditional molecular docking approaches, which does not match common needs. What's more, they claim to perform better than traditional molecular docking, but the approach of comparison is not fair, since traditional methods are not designed for docking on the whole protein without a given pocket. In this paper, we design a series of experiments to examine the actual performance of these deep learning models and traditional methods. For a fair comparison, we decompose the docking on the whole protein into two steps, pocket searching and docking on a given pocket, and build pipelines to evaluate traditional methods and deep learning methods respectively. We find that deep learning models are actually good at pocket searching, but traditional methods are better than deep learning models at docking on given pockets. Overall, our work explicitly reveals some potential problems in current deep learning models for molecular docking and provides several suggestions for future works.
    CACTO: Continuous Actor-Critic with Trajectory Optimization -- Towards global optimality. (arXiv:2211.06625v2 [cs.RO] UPDATED)
    This paper presents a novel algorithm for the continuous control of dynamical systems that combines Trajectory Optimization (TO) and Reinforcement Learning (RL) in a single framework. The motivations behind this algorithm are the two main limitations of TO and RL when applied to continuous nonlinear systems to minimize a non-convex cost function. Specifically, TO can get stuck in poor local minima when the search is not initialized close to a "good" minimum. On the other hand, when dealing with continuous state and control spaces, the RL training process may be excessively long and strongly dependent on the exploration strategy. Thus, our algorithm learns a "good" control policy via TO-guided RL policy search that, when used as initial guess provider for TO, makes the trajectory optimization process less prone to converge to poor local optima. Our method is validated on several reaching problems featuring non-convex obstacle avoidance with different dynamical systems, including a car model with 6D state, and a 3-joint planar manipulator. Our results show the great capabilities of CACTO in escaping local minima, while being more computationally efficient than the Deep Deterministic Policy Gradient (DDPG) and Proximal Policy Optimization (PPO) RL algorithms.
    Deep Variational Implicit Processes. (arXiv:2206.06720v2 [stat.ML] UPDATED)
    Implicit processes (IPs) are a generalization of Gaussian processes (GPs). IPs may lack a closed-form expression but are easy to sample from. Examples include, among others, Bayesian neural networks or neural samplers. IPs can be used as priors over functions, resulting in flexible models with well-calibrated prediction uncertainty estimates. Methods based on IPs usually carry out function-space approximate inference, which overcomes some of the difficulties of parameter-space approximate inference. Nevertheless, the approximations employed often limit the expressiveness of the final model, resulting, e.g., in a Gaussian predictive distribution, which can be restrictive. We propose here a multi-layer generalization of IPs called the Deep Variational Implicit process (DVIP). This generalization is similar to that of deep GPs over GPs, but it is more flexible due to the use of IPs as the prior distribution over the latent functions. We describe a scalable variational inference algorithm for training DVIP and show that it outperforms previous IP-based methods and also deep GPs. We support these claims via extensive regression and classification experiments. We also evaluate DVIP on large datasets with up to several million data instances to illustrate its good scalability and performance.
    Omnipredictors for Constrained Optimization. (arXiv:2209.07463v2 [cs.LG] UPDATED)
    The notion of omnipredictors (Gopalan, Kalai, Reingold, Sharan and Wieder ITCS 2021), suggested a new paradigm for loss minimization. Rather than learning a predictor based on a known loss function, omnipredictors can easily be post-processed to minimize any one of a rich family of loss functions compared with the loss of hypotheses in a class $\mathcal C$. It has been shown that such omnipredictors exist and are implied (for all convex and Lipschitz loss functions) by the notion of multicalibration from the algorithmic fairness literature. In this paper, we introduce omnipredictors for constrained optimization and study their complexity and implications. The notion that we introduce allows the learner to be unaware of the loss function that will be later assigned as well as the constraints that will be later imposed, as long as the subpopulations that are used to define these constraints are known. We show how to obtain omnipredictors for constrained optimization problems, relying on appropriate variants of multicalibration. We also investigate the implications of this notion when the constraints used are so-called group fairness notions.
    Surrogate Gradient Spiking Neural Networks as Encoders for Large Vocabulary Continuous Speech Recognition. (arXiv:2212.01187v2 [cs.CL] UPDATED)
    Compared to conventional artificial neurons that produce dense and real-valued responses, biologically-inspired spiking neurons transmit sparse and binary information, which can also lead to energy-efficient implementations. Recent research has shown that spiking neural networks can be trained like standard recurrent neural networks using the surrogate gradient method. They have shown promising results on speech command recognition tasks. Using the same technique, we show that they are scalable to large vocabulary continuous speech recognition, where they are capable of replacing LSTMs in the encoder with only minor loss of performance. This suggests that they may be applicable to more involved sequence-to-sequence tasks. Moreover, in contrast to their recurrent non-spiking counterparts, they show robustness to exploding gradient problems without the need to use gates.
    Teaching Pretrained Models with Commonsense Reasoning: A Preliminary KB-Based Approach. (arXiv:1909.09743v2 [cs.AI] UPDATED)
    Recently, pretrained language models (e.g., BERT) have achieved great success on many downstream natural language understanding tasks and exhibit a certain level of commonsense reasoning ability. However, their performance on commonsense tasks is still far from that of humans. As a preliminary attempt, we propose a simple yet effective method to teach pretrained models with commonsense reasoning by leveraging the structured knowledge in ConceptNet, the largest commonsense knowledge base (KB). Specifically, the structured knowledge in KB allows us to construct various logical forms, and then generate multiple-choice questions requiring commonsense logical reasoning. Experimental results demonstrate that, when refined on these training examples, the pretrained models consistently improve their performance on tasks that require commonsense reasoning, especially in the few-shot learning setting. Besides, we also perform analysis to understand which logical relations are more relevant to commonsense reasoning.
    CACTI: A Framework for Scalable Multi-Task Multi-Scene Visual Imitation Learning. (arXiv:2212.05711v2 [cs.RO] UPDATED)
    Large-scale training have propelled significant progress in various sub-fields of AI such as computer vision and natural language processing. However, building robot learning systems at a comparable scale remains challenging. To develop robots that can perform a wide range of skills and adapt to new scenarios, efficient methods for collecting vast and diverse amounts of data on physical robot systems are required, as well as the capability to train high-capacity policies using such datasets. In this work, we propose a framework for scaling robot learning, with specific focus on multi-task and multi-scene manipulation in kitchen environments, both in simulation and in the real world. Our proposed framework, CACTI, comprises four stages that separately handle: data collection, data augmentation, visual representation learning, and imitation policy training, to enable scalability in robot learning . We make use of state-of-the-art generative models as part of the data augmentation stage, and use pre-trained out-of-domain visual representations to improve training efficiency. Experimental results demonstrate the effectiveness of our approach. On a real robot setup, CACTI enables efficient training of a single policy that can perform 10 manipulation tasks involving kitchen objects, and is robust to varying layouts of distractors. In a simulated kitchen environment, CACTI trains a single policy to perform 18 semantic tasks across 100 layout variations for each individual task. We will release the simulation task benchmark and augmented datasets in both real and simulated environments to facilitate future research.
    On the Identifiability of Nonlinear ICA: Sparsity and Beyond. (arXiv:2206.07751v3 [cs.LG] UPDATED)
    Nonlinear independent component analysis (ICA) aims to recover the underlying independent latent sources from their observable nonlinear mixtures. How to make the nonlinear ICA model identifiable up to certain trivial indeterminacies is a long-standing problem in unsupervised learning. Recent breakthroughs reformulate the standard independence assumption of sources as conditional independence given some auxiliary variables (e.g., class labels and/or domain/time indexes) as weak supervision or inductive bias. However, nonlinear ICA with unconditional priors cannot benefit from such developments. We explore an alternative path and consider only assumptions on the mixing process, such as Structural Sparsity. We show that under specific instantiations of such constraints, the independent latent sources can be identified from their nonlinear mixtures up to a permutation and a component-wise transformation, thus achieving nontrivial identifiability of nonlinear ICA without auxiliary variables. We provide estimation methods and validate the theoretical results experimentally. The results on image data suggest that our conditions may hold in a number of practical data generating processes.
    Co-manipulation of soft-materials estimating deformation from depth images. (arXiv:2301.05609v2 [cs.RO] UPDATED)
    Human-robot co-manipulation of soft materials, such as fabrics, composites, and sheets of paper/cardboard, is a challenging operation that presents several relevant industrial applications. Estimating the deformation state of the co-manipulated material is one of the main challenges. Viable methods provide the indirect measure by calculating the human-robot relative distance. In this paper, we develop a data-driven model to estimate the deformation state of the material from a depth image through a Convolutional Neural Network (CNN). First, we define the deformation state of the material as the relative roto-translation from the current robot pose and a human grasping position. The model estimates the current deformation state through a Convolutional Neural Network, specifically a DenseNet-121 pretrained on ImageNet.The delta between the current and the desired deformation state is fed to the robot controller that outputs twist commands. The paper describes the developed approach to acquire, preprocess the dataset and train the model. The model is compared with the current state-of-the-art method based on a skeletal tracker from cameras. Results show that our approach achieves better performances and avoids the various drawbacks caused by using a skeletal tracker.Finally, we also studied the model performance according to different architectures and dataset dimensions to minimize the time required for dataset acquisition
    Momentum Contrastive Autoencoder: Using Contrastive Learning for Latent Space Distribution Matching in WAE. (arXiv:2110.10303v2 [cs.CV] UPDATED)
    Wasserstein autoencoder (WAE) shows that matching two distributions is equivalent to minimizing a simple autoencoder (AE) loss under the constraint that the latent space of this AE matches a pre-specified prior distribution. This latent space distribution matching is a core component of WAE, and a challenging task. In this paper, we propose to use the contrastive learning framework that has been shown to be effective for self-supervised representation learning, as a means to resolve this problem. We do so by exploiting the fact that contrastive learning objectives optimize the latent space distribution to be uniform over the unit hyper-sphere, which can be easily sampled from. We show that using the contrastive learning framework to optimize the WAE loss achieves faster convergence and more stable optimization compared with existing popular algorithms for WAE. This is also reflected in the FID scores on CelebA and CIFAR-10 datasets, and the realistic generated image quality on the CelebA-HQ dataset.
    A General Framework For Proving The Equivariant Strong Lottery Ticket Hypothesis. (arXiv:2206.04270v2 [cs.LG] UPDATED)
    The Strong Lottery Ticket Hypothesis (SLTH) stipulates the existence of a subnetwork within a sufficiently overparameterized (dense) neural network that -- when initialized randomly and without any training -- achieves the accuracy of a fully trained target network. Recent works by Da Cunha et. al 2022; Burkholz 2022 demonstrate that the SLTH can be extended to translation equivariant networks -- i.e. CNNs -- with the same level of overparametrization as needed for the SLTs in dense networks. However, modern neural networks are capable of incorporating more than just translation symmetry, and developing general equivariant architectures such as rotation and permutation has been a powerful design principle. In this paper, we generalize the SLTH to functions that preserve the action of the group $G$ -- i.e. $G$-equivariant network -- and prove, with high probability, that one can approximate any $G$-equivariant network of fixed width and depth by pruning a randomly initialized overparametrized $G$-equivariant network to a $G$-equivariant subnetwork. We further prove that our prescribed overparametrization scheme is optimal and provides a lower bound on the number of effective parameters as a function of the error tolerance. We develop our theory for a large range of groups, including subgroups of the Euclidean $\text{E}(2)$ and Symmetric group $G \leq \mathcal{S}_n$ -- allowing us to find SLTs for MLPs, CNNs, $\text{E}(2)$-steerable CNNs, and permutation equivariant networks as specific instantiations of our unified framework. Empirically, we verify our theory by pruning overparametrized $\text{E}(2)$-steerable CNNs, $k$-order GNNs, and message passing GNNs to match the performance of trained target networks.
    KRADA: Known-region-aware Domain Alignment for Open-set Domain Adaptation in Semantic Segmentation. (arXiv:2106.06237v2 [eess.IV] UPDATED)
    In semantic segmentation, we aim to train a pixel-level classifier to assign category labels to all pixels in an image, where labeled training images and unlabeled test images are from the same distribution and share the same label set. However, in an open world, the unlabeled test images probably contain unknown categories and have different distributions from the labeled images. Hence, in this paper, we consider a new, more realistic, and more challenging problem setting where the pixel-level classifier has to be trained with labeled images and unlabeled open-world images -- we name it open-set domain adaptation segmentation (OSDAS). In OSDAS, the trained classifier is expected to identify unknown-class pixels and classify known-class pixels well. To solve OSDAS, we first investigate which distribution that unknown-class pixels obey. Then, motivated by the goodness-of-fit test, we use statistical measurements to show how a pixel fits the distribution of an unknown class and select highly-fitted pixels to form the unknown region in each test image. Eventually, we propose an end-to-end learning framework, known-region-aware domain alignment (KRADA), to distinguish unknown classes while aligning the distributions of known classes in labeled and unlabeled open-world images. The effectiveness of KRADA has been verified on two synthetic tasks and one COVID-19 segmentation task.
    Improving Convergence for Quantum Variational Classifiers using Weight Re-Mapping. (arXiv:2212.14807v2 [quant-ph] UPDATED)
    In recent years, quantum machine learning has seen a substantial increase in the use of variational quantum circuits (VQCs). VQCs are inspired by artificial neural networks, which achieve extraordinary performance in a wide range of AI tasks as massively parameterized function approximators. VQCs have already demonstrated promising results, for example, in generalization and the requirement for fewer parameters to train, by utilizing the more robust algorithmic toolbox available in quantum computing. A VQCs' trainable parameters or weights are usually used as angles in rotational gates and current gradient-based training methods do not account for that. We introduce weight re-mapping for VQCs, to unambiguously map the weights to an interval of length $2\pi$, drawing inspiration from traditional ML, where data rescaling, or normalization techniques have demonstrated tremendous benefits in many circumstances. We employ a set of five functions and evaluate them on the Iris and Wine datasets using variational classifiers as an example. Our experiments show that weight re-mapping can improve convergence in all tested settings. Additionally, we were able to demonstrate that weight re-mapping increased test accuracy for the Wine dataset by $10\%$ over using unmodified weights.
    A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental Learning. (arXiv:2205.13218v2 [cs.LG] UPDATED)
    Real-world applications require the classification model to adapt to new classes without forgetting old ones. Correspondingly, Class-Incremental Learning (CIL) aims to train a model with limited memory size to meet this requirement. Typical CIL methods tend to save representative exemplars from former classes to resist forgetting, while recent works find that storing models from history can substantially boost the performance. However, the stored models are not counted into the memory budget, which implicitly results in unfair comparisons. We find that when counting the model size into the total budget and comparing methods with aligned memory size, saving models do not consistently work, especially for the case with limited memory budgets. As a result, we need to holistically evaluate different CIL methods at different memory scales and simultaneously consider accuracy and memory size for measurement. On the other hand, we dive deeply into the construction of the memory buffer for memory efficiency. By analyzing the effect of different layers in the network, we find that shallow and deep layers have different characteristics in CIL. Motivated by this, we propose a simple yet effective baseline, denoted as MEMO for Memory-efficient Expandable MOdel. MEMO extends specialized layers based on the shared generalized representations, efficiently extracting diverse representations with modest cost and maintaining representative exemplars. Extensive experiments on benchmark datasets validate MEMO's competitive performance. Code is available at: https://github.com/wangkiw/ICLR23-MEMO
    Settling the Sample Complexity of Model-Based Offline Reinforcement Learning. (arXiv:2204.05275v2 [stat.ML] UPDATED)
    This paper is concerned with offline reinforcement learning (RL), which learns using pre-collected data without further exploration. Effective offline RL would be able to accommodate distribution shift and limited data coverage. However, prior algorithms or analyses either suffer from suboptimal sample complexities or incur high burn-in cost to reach sample optimality, thus posing an impediment to efficient offline RL in sample-starved applications. We demonstrate that the model-based (or "plug-in") approach achieves minimax-optimal sample complexity without burn-in cost for tabular Markov decision processes (MDPs). Concretely, consider a finite-horizon (resp. $\gamma$-discounted infinite-horizon) MDP with $S$ states and horizon $H$ (resp. effective horizon $\frac{1}{1-\gamma}$), and suppose the distribution shift of data is reflected by some single-policy clipped concentrability coefficient $C^{\star}_{\text{clipped}}$. We prove that model-based offline RL yields $\varepsilon$-accuracy with a sample complexity of \[ \begin{cases} \frac{H^{4}SC_{\text{clipped}}^{\star}}{\varepsilon^{2}} & (\text{finite-horizon MDPs}) \frac{SC_{\text{clipped}}^{\star}}{(1-\gamma)^{3}\varepsilon^{2}} & (\text{infinite-horizon MDPs}) \end{cases} \] up to log factor, which is minimax optimal for the entire $\varepsilon$-range. The proposed algorithms are ``pessimistic'' variants of value iteration with Bernstein-style penalties, and do not require sophisticated variance reduction. Our analysis framework is established upon delicate leave-one-out decoupling arguments in conjunction with careful self-bounding techniques tailored to MDPs.
    BigVGAN: A Universal Neural Vocoder with Large-Scale Training. (arXiv:2206.04658v2 [cs.SD] UPDATED)
    Despite recent progress in generative adversarial network (GAN)-based vocoders, where the model generates raw waveform conditioned on acoustic features, it is challenging to synthesize high-fidelity audio for numerous speakers across various recording environments. In this work, we present BigVGAN, a universal vocoder that generalizes well for various out-of-distribution scenarios without fine-tuning. We introduce periodic activation function and anti-aliased representation into the GAN generator, which brings the desired inductive bias for audio synthesis and significantly improves audio quality. In addition, we train our GAN vocoder at the largest scale up to 112M parameters, which is unprecedented in the literature. We identify and address the failure modes in large-scale GAN training for audio, while maintaining high-fidelity output without over-regularization. Our BigVGAN, trained only on clean speech (LibriTTS), achieves the state-of-the-art performance for various zero-shot (out-of-distribution) conditions, including unseen speakers, languages, recording environments, singing voices, music, and instrumental audio. We release our code and model at: https://github.com/NVIDIA/BigVGAN
    Shared Microexponents: A Little Shifting Goes a Long Way. (arXiv:2302.08007v1 [cs.LG])
    This paper introduces Block Data Representations (BDR), a framework for exploring and evaluating a wide spectrum of narrow-precision formats for deep learning. It enables comparison of popular quantization standards, and through BDR, new formats based on shared microexponents (MX) are identified, which outperform other state-of-the-art quantization approaches, including narrow-precision floating-point and block floating-point. MX utilizes multiple levels of quantization scaling with ultra-fine scaling factors based on shared microexponents in the hardware. The effectiveness of MX is demonstrated on real-world models including large-scale generative pretraining and inferencing, and production-scale recommendation systems.
    Enhancing High-dimensional Bayesian Optimization by Optimizing the Acquisition Function Maximizer Initialization. (arXiv:2302.08298v1 [cs.LG])
    Bayesian optimization (BO) is widely used to optimize black-box functions. It works by first building a surrogate for the objective and quantifying the uncertainty in that surrogate. It then decides where to sample by maximizing an acquisition function defined by the surrogate model. Prior approaches typically use randomly generated raw samples to initialize the acquisition function maximizer. However, this strategy is ill-suited for high-dimensional BO. Given the large regions of high posterior uncertainty in high dimensions, a randomly initialized acquisition function maximizer is likely to focus on areas with high posterior uncertainty, leading to overly exploring areas that offer little gain. This paper provides the first comprehensive empirical study to reveal the importance of the initialization phase of acquisition function maximization. It proposes a better initialization approach by employing multiple heuristic optimizers to leverage the knowledge of already evaluated samples to generate initial points to be explored by an acquisition function maximizer. We evaluate our approach on widely used synthetic test functions and real-world applications. Experimental results show that our techniques, while simple, can significantly enhance the standard BO and outperforms state-of-the-art high-dimensional BO techniques by a large margin in most test cases.
    Aligning Language Models with Preferences through f-divergence Minimization. (arXiv:2302.08215v1 [cs.CL])
    Aligning language models with preferences can be posed as approximating a target distribution representing some desired behavior. Existing approaches differ both in the functional form of the target distribution and the algorithm used to approximate it. For instance, Reinforcement Learning from Human Feedback (RLHF) corresponds to minimizing a reverse KL from an implicit target distribution arising from a KL penalty in the objective. On the other hand, Generative Distributional Control (GDC) has an explicit target distribution and minimizes a forward KL from it using the Distributional Policy Gradient (DPG) algorithm. In this paper, we propose a new approach, f-DPG, which allows the use of any f-divergence to approximate any target distribution. f-DPG unifies both frameworks (RLHF, GDC) and the approximation methods (DPG, RL with KL penalties). We show the practical benefits of various choices of divergence objectives and demonstrate that there is no universally optimal objective but that different divergences are good for approximating different targets. For instance, we discover that for GDC, the Jensen-Shannon divergence frequently outperforms forward KL divergence by a wide margin, leading to significant improvements over prior work.
    ClaPIM: Scalable Sequence CLAssification using Processing-In-Memory. (arXiv:2302.08284v1 [cs.LG])
    DNA sequence classification is a fundamental task in computational biology with vast implications for applications such as disease prevention and drug design. Therefore, fast high-quality sequence classifiers are significantly important. This paper introduces ClaPIM, a scalable DNA sequence classification architecture based on the emerging concept of hybrid in-crossbar and near-crossbar memristive processing-in-memory (PIM). We enable efficient and high-quality classification by uniting the filter and search stages within a single algorithm. Specifically, we propose a custom filtering technique that drastically narrows the search space and a search approach that facilitates approximate string matching through a distance function. ClaPIM is the first PIM architecture for scalable approximate string matching that benefits from the high density of memristive crossbar arrays and the massive computational parallelism of PIM. Compared with Kraken2, a state-of-the-art software classifier, ClaPIM provides significantly higher classification quality (up to 20x improvement in F1 score) and also demonstrates a 1.8x throughput improvement. Compared with EDAM, a recently-proposed SRAM-based accelerator that is restricted to small datasets, we observe both a 30.4x improvement in normalized throughput per area and a 7% increase in classification precision.
    Deep Learning Approach for Early Stage Lung Cancer Detection. (arXiv:2302.02456v2 [eess.IV] UPDATED)
    Lung cancer is the leading cause of death among different types of cancers. Every year, the lives lost due to lung cancer exceed those lost to pancreatic, breast, and prostate cancer combined. The survival rate for lung cancer patients is very low compared to other cancer patients due to late diagnostics. Thus, early lung cancer diagnostics is crucial for patients to receive early treatments, increasing the survival rate or even becoming cancer-free. This paper proposed a deep-learning model for early lung cancer prediction and diagnosis from Computed Tomography (CT) scans. The proposed mode achieves high accuracy. In addition, it can be a beneficial tool to support radiologists' decisions in predicting and detecting lung cancer and its stage.  ( 2 min )
    Unbiased Supervised Contrastive Learning. (arXiv:2211.05568v2 [cs.LG] UPDATED)
    Many datasets are biased, namely they contain easy-to-learn features that are highly correlated with the target class only in the dataset but not in the true underlying distribution of the data. For this reason, learning unbiased models from biased data has become a very relevant research topic in the last years. In this work, we tackle the problem of learning representations that are robust to biases. We first present a margin-based theoretical framework that allows us to clarify why recent contrastive losses (InfoNCE, SupCon, etc.) can fail when dealing with biased data. Based on that, we derive a novel formulation of the supervised contrastive loss (epsilon-SupInfoNCE), providing more accurate control of the minimal distance between positive and negative samples. Furthermore, thanks to our theoretical framework, we also propose FairKL, a new debiasing regularization loss, that works well even with extremely biased data. We validate the proposed losses on standard vision datasets including CIFAR10, CIFAR100, and ImageNet, and we assess the debiasing capability of FairKL with epsilon-SupInfoNCE, reaching state-of-the-art performance on a number of biased datasets, including real instances of biases in the wild.
    Probability flow solution of the Fokker-Planck equation. (arXiv:2206.04642v3 [cs.LG] UPDATED)
    The method of choice for integrating the time-dependent Fokker-Planck equation in high-dimension is to generate samples from the solution via integration of the associated stochastic differential equation. Here, we study an alternative scheme based on integrating an ordinary differential equation that describes the flow of probability. Acting as a transport map, this equation deterministically pushes samples from the initial density onto samples from the solution at any later time. Unlike integration of the stochastic dynamics, the method has the advantage of giving direct access to quantities that are challenging to estimate from trajectories alone, such as the probability current, the density itself, and its entropy. The probability flow equation depends on the gradient of the logarithm of the solution (its "score"), and so is a-priori unknown. To resolve this dependence, we model the score with a deep neural network that is learned on-the-fly by propagating a set of samples according to the instantaneous probability current. We show theoretically that the proposed approach controls the KL divergence from the learned solution to the target, while learning on external samples from the stochastic differential equation does not control either direction of the KL divergence. Empirically, we consider several high-dimensional Fokker-Planck equations from the physics of interacting particle systems. We find that the method accurately matches analytical solutions when they are available as well as moments computed via Monte-Carlo when they are not. Moreover, the method offers compelling predictions for the global entropy production rate that out-perform those obtained from learning on stochastic trajectories, and can effectively capture non-equilibrium steady-state probability currents over long time intervals.
    Robust Mid-Pass Filtering Graph Convolutional Networks. (arXiv:2302.08048v1 [cs.LG])
    Graph convolutional networks (GCNs) are currently the most promising paradigm for dealing with graph-structure data, while recent studies have also shown that GCNs are vulnerable to adversarial attacks. Thus developing GCN models that are robust to such attacks become a hot research topic. However, the structural purification learning-based or robustness constraints-based defense GCN methods are usually designed for specific data or attacks, and introduce additional objective that is not for classification. Extra training overhead is also required in their design. To address these challenges, we conduct in-depth explorations on mid-frequency signals on graphs and propose a simple yet effective Mid-pass filter GCN (Mid-GCN). Theoretical analyses guarantee the robustness of signals through the mid-pass filter, and we also shed light on the properties of different frequency signals under adversarial attacks. Extensive experiments on six benchmark graph data further verify the effectiveness of our designed Mid-GCN in node classification accuracy compared to state-of-the-art GCNs under various adversarial attack strategies.  ( 2 min )
    Entity Aware Modelling: A Survey. (arXiv:2302.08406v1 [cs.LG])
    Personalized prediction of responses for individual entities caused by external drivers is vital across many disciplines. Recent machine learning (ML) advances have led to new state-of-the-art response prediction models. Models built at a population level often lead to sub-optimal performance in many personalized prediction settings due to heterogeneity in data across entities (tasks). In personalized prediction, the goal is to incorporate inherent characteristics of different entities to improve prediction performance. In this survey, we focus on the recent developments in the ML community for such entity-aware modeling approaches. ML algorithms often modulate the network using these entity characteristics when they are readily available. However, these entity characteristics are not readily available in many real-world scenarios, and different ML methods have been proposed to infer these characteristics from the data. In this survey, we have organized the current literature on entity-aware modeling based on the availability of these characteristics as well as the amount of training data. We highlight how recent innovations in other disciplines, such as uncertainty quantification, fairness, and knowledge-guided machine learning, can improve entity-aware modeling.  ( 2 min )
    Hypergraphs with Edge-Dependent Vertex Weights: p-Laplacians and Spectral Clustering. (arXiv:2208.07457v2 [cs.LG] UPDATED)
    We study p-Laplacians and spectral clustering for a recently proposed hypergraph model that incorporates edge-dependent vertex weights (EDVW). These weights can reflect different importance of vertices within a hyperedge, thus conferring the hypergraph model higher expressivity and flexibility. By constructing submodular EDVW-based splitting functions, we convert hypergraphs with EDVW into submodular hypergraphs for which the spectral theory is better developed. In this way, existing concepts and theorems such as p-Laplacians and Cheeger inequalities proposed under the submodular hypergraph setting can be directly extended to hypergraphs with EDVW. For submodular hypergraphs with EDVW-based splitting functions, we propose an efficient algorithm to compute the eigenvector associated with the second smallest eigenvalue of the hypergraph 1-Laplacian. We then utilize this eigenvector to cluster the vertices, achieving higher clustering accuracy than traditional spectral clustering based on the 2-Laplacian. More broadly, the proposed algorithm works for all submodular hypergraphs that are graph reducible. Numerical experiments using real-world data demonstrate the effectiveness of combining spectral clustering based on the 1-Laplacian and EDVW.
    Classifier Calibration: A survey on how to assess and improve predicted class probabilities. (arXiv:2112.10327v2 [cs.LG] UPDATED)
    This paper provides both an introduction to and a detailed overview of the principles and practice of classifier calibration. A well-calibrated classifier correctly quantifies the level of uncertainty or confidence associated with its instance-wise predictions. This is essential for critical applications, optimal decision making, cost-sensitive classification, and for some types of context change. Calibration research has a rich history which predates the birth of machine learning as an academic field by decades. However, a recent increase in the interest on calibration has led to new methods and the extension from binary to the multiclass setting. The space of options and issues to consider is large, and navigating it requires the right set of concepts and tools. We provide both introductory material and up-to-date technical details of the main concepts and methods, including proper scoring rules and other evaluation metrics, visualisation approaches, a comprehensive account of post-hoc calibration methods for binary and multiclass classification, and several advanced topics.
    Sanity checks and improvements for patch visualisation in prototype-based image classification. (arXiv:2302.08508v1 [cs.CV])
    In this work, we perform an in-depth analysis of the visualisation methods implemented in two popular self-explaining models for visual classification based on prototypes - ProtoPNet and ProtoTree. Using two fine-grained datasets (CUB-200-2011 and Stanford Cars), we first show that such methods do not correctly identify the regions of interest inside of the images, and therefore do not reflect the model behaviour. Secondly, using a deletion metric, we demonstrate quantitatively that saliency methods such as Smoothgrads or PRP provide more faithful image patches. We also propose a new relevance metric based on the segmentation of the object provided in some datasets (e.g. CUB-200-2011) and show that the imprecise patch visualisations generated by ProtoPNet and ProtoTree can create a false sense of bias that can be mitigated by the use of more faithful methods. Finally, we discuss the implications of our findings for other prototype-based models sharing the same visualisation method.
    Neighborhood-Regularized Self-Training for Learning with Few Labels. (arXiv:2301.03726v2 [cs.LG] UPDATED)
    Training deep neural networks (DNNs) with limited supervision has been a popular research topic as it can significantly alleviate the annotation burden. Self-training has been successfully applied in semi-supervised learning tasks, but one drawback of self-training is that it is vulnerable to the label noise from incorrect pseudo labels. Inspired by the fact that samples with similar labels tend to share similar representations, we develop a neighborhood-based sample selection approach to tackle the issue of noisy pseudo labels. We further stabilize self-training via aggregating the predictions from different rounds during sample selection. Experiments on eight tasks show that our proposed method outperforms the strongest self-training baseline with 1.83% and 2.51% performance gain for text and graph datasets on average. Our further analysis demonstrates that our proposed data selection strategy reduces the noise of pseudo labels by 36.8% and saves 57.3% of the time when compared with the best baseline. Our code and appendices will be uploaded to https://github.com/ritaranx/NeST.  ( 2 min )
    A method for incremental discovery of financial event types based on anomaly detection. (arXiv:2302.08205v1 [cs.LG])
    Event datasets in the financial domain are often constructed based on actual application scenarios, and their event types are weakly reusable due to scenario constraints; at the same time, the massive and diverse new financial big data cannot be limited to the event types defined for specific scenarios. This limitation of a small number of event types does not meet our research needs for more complex tasks such as the prediction of major financial events and the analysis of the ripple effects of financial events. In this paper, a three-stage approach is proposed to accomplish incremental discovery of event types. For an existing annotated financial event dataset, the three-stage approach consists of: for a set of financial event data with a mixture of original and unknown event types, a semi-supervised deep clustering model with anomaly detection is first applied to classify the data into normal and abnormal events, where abnormal events are events that do not belong to known types; then normal events are tagged with appropriate event types and abnormal events are reasonably clustered. Finally, a cluster keyword extraction method is used to recommend the type names of events for the new event clusters, thus incrementally discovering new event types. The proposed method is effective in the incremental discovery of new event types on real data sets.
    Reproducible and Portable Big Data Analytics in the Cloud. (arXiv:2112.09762v4 [cs.DC] UPDATED)
    Cloud computing has become a major approach to help reproduce computational experiments. Yet there are still two main difficulties in reproducing batch based big data analytics (including descriptive and predictive analytics) in the cloud. The first is how to automate end-to-end scalable execution of analytics including distributed environment provisioning, analytics pipeline description, parallel execution, and resource termination. The second is that an application developed for one cloud is difficult to be reproduced in another cloud, a.k.a. vendor lock-in problem. To tackle these problems, we leverage serverless computing and containerization techniques for automated scalable execution and reproducibility, and utilize the adapter design pattern to enable application portability and reproducibility across different clouds. We propose and develop an open-source toolkit that supports 1) fully automated end-to-end execution and reproduction via a single command, 2) automated data and configuration storage for each execution, 3) flexible client modes based on user preferences, 4) execution history query, and 5) simple reproduction of existing executions in the same environment or a different environment. We did extensive experiments on both AWS and Azure using four big data analytics applications that run on virtual CPU/GPU clusters. The experiments show our toolkit can achieve good execution performance, scalability, and efficient reproducibility for cloud-based big data analytics.
    A Bayesian Perspective for Determinant Minimization Based Robust Structured Matrix Factorizatio. (arXiv:2302.08416v1 [cs.LG])
    We introduce a Bayesian perspective for the structured matrix factorization problem. The proposed framework provides a probabilistic interpretation for existing geometric methods based on determinant minimization. We model input data vectors as linear transformations of latent vectors drawn from a distribution uniform over a particular domain reflecting structural assumptions, such as the probability simplex in Nonnegative Matrix Factorization and polytopes in Polytopic Matrix Factorization. We represent the rows of the linear transformation matrix as vectors generated independently from a normal distribution whose covariance matrix is inverse Wishart distributed. We show that the corresponding maximum a posteriori estimation problem boils down to the robust determinant minimization approach for structured matrix factorization, providing insights about parameter selections and potential algorithmic extensions.
    FoSR: First-order spectral rewiring for addressing oversquashing in GNNs. (arXiv:2210.11790v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs) are able to leverage the structure of graph data by passing messages along the edges of the graph. While this allows GNNs to learn features depending on the graph structure, for certain graph topologies it leads to inefficient information propagation and a problem known as oversquashing. This has recently been linked with the curvature and spectral gap of the graph. On the other hand, adding edges to the message-passing graph can lead to increasingly similar node representations and a problem known as oversmoothing. We propose a computationally efficient algorithm that prevents oversquashing by systematically adding edges to the graph based on spectral expansion. We combine this with a relational architecture, which lets the GNN preserve the original graph structure and provably prevents oversmoothing. We find experimentally that our algorithm outperforms existing graph rewiring methods in several graph classification tasks.  ( 2 min )
    Decoupled Model Schedule for Deep Learning Training. (arXiv:2302.08005v1 [cs.LG])
    Recent years have seen an increase in the development of large deep learning (DL) models, which makes training efficiency crucial. Common practice is struggling with the trade-off between usability and performance. On one hand, DL frameworks such as PyTorch use dynamic graphs to facilitate model developers at a price of sub-optimal model training performance. On the other hand, practitioners propose various approaches to improving the training efficiency by sacrificing some of the flexibility, ranging from making the graph static for more thorough optimization (e.g., XLA) to customizing optimization towards large-scale distributed training (e.g., DeepSpeed and Megatron-LM). In this paper, we aim to address the tension between usability and training efficiency through separation of concerns. Inspired by DL compilers that decouple the platform-specific optimizations of a tensor-level operator from its arithmetic definition, this paper proposes a schedule language to decouple model execution from definition. Specifically, the schedule works on a PyTorch model and uses a set of schedule primitives to convert the model for common model training optimizations such as high-performance kernels, effective 3D parallelism, and efficient activation checkpointing. Compared to existing optimization solutions, we optimize the model as-needed through high-level primitives, and thus preserving programmability and debuggability for users to a large extent. Our evaluation results show that by scheduling the existing hand-crafted optimizations in a systematic way, we are able to improve training throughput by up to 3.35x on a single machine with 8 NVIDIA V100 GPUs, and by up to 1.32x on multiple machines with up to 64 GPUs, when compared to the out-of-the-box performance of DeepSpeed and Megatron-LM.  ( 2 min )
    Deterministic Nonsmooth Nonconvex Optimization. (arXiv:2302.08300v1 [cs.LG])
    We study the complexity of optimizing nonsmooth nonconvex Lipschitz functions by producing $(\delta,\epsilon)$-stationary points. Several recent works have presented randomized algorithms that produce such points using $\tilde O(\delta^{-1}\epsilon^{-3})$ first-order oracle calls, independent of the dimension $d$. It has been an open problem as to whether a similar result can be obtained via a deterministic algorithm. We resolve this open problem, showing that randomization is necessary to obtain a dimension-free rate. In particular, we prove a lower bound of $\Omega(d)$ for any deterministic algorithm. Moreover, we show that unlike smooth or convex optimization, access to function values is required for any deterministic algorithm to halt within any finite time. On the other hand, we prove that if the function is even slightly smooth, then the dimension-free rate of $\tilde O(\delta^{-1}\epsilon^{-3})$ can be obtained by a deterministic algorithm with merely a logarithmic dependence on the smoothness parameter. Motivated by these findings, we turn to study the complexity of deterministically smoothing Lipschitz functions. Though there are efficient black-box randomized smoothings, we start by showing that no such deterministic procedure can smooth functions in a meaningful manner, resolving an open question. We then bypass this impossibility result for the structured case of ReLU neural networks. To that end, in a practical white-box setting in which the optimizer is granted access to the network's architecture, we propose a simple, dimension-free, deterministic smoothing that provably preserves $(\delta,\epsilon)$-stationary points. Our method applies to a variety of architectures of arbitrary depth, including ResNets and ConvNets. Combined with our algorithm, this yields the first deterministic dimension-free algorithm for optimizing ReLU networks, circumventing our lower bound.
    Local Causal Discovery for Estimating Causal Effects. (arXiv:2302.08070v1 [cs.LG])
    Even when the causal graph underlying our data is unknown, we can use observational data to narrow down the possible values that an average treatment effect (ATE) can take by (1) identifying the graph up to a Markov equivalence class; and (2) estimating that ATE for each graph in the class. While the PC algorithm can identify this class under strong faithfulness assumptions, it can be computationally prohibitive. Fortunately, only the local graph structure around the treatment is required to identify the set of possible ATE values, a fact exploited by local discovery algorithms to improve computational efficiency. In this paper, we introduce Local Discovery using Eager Collider Checks (LDECC), a new local causal discovery algorithm that leverages unshielded colliders to orient the treatment's parents differently from existing methods. We show that there exist graphs where LDECC exponentially outperforms existing local discovery algorithms and vice versa. Moreover, we show that LDECC and existing algorithms rely on different faithfulness assumptions, leveraging this insight to weaken the assumptions for identifying the set of possible ATE values.  ( 2 min )
    A Survey of Geometric Optimization for Deep Learning: From Euclidean Space to Riemannian Manifold. (arXiv:2302.08210v1 [cs.LG])
    Although Deep Learning (DL) has achieved success in complex Artificial Intelligence (AI) tasks, it suffers from various notorious problems (e.g., feature redundancy, and vanishing or exploding gradients), since updating parameters in Euclidean space cannot fully exploit the geometric structure of the solution space. As a promising alternative solution, Riemannian-based DL uses geometric optimization to update parameters on Riemannian manifolds and can leverage the underlying geometric information. Accordingly, this article presents a comprehensive survey of applying geometric optimization in DL. At first, this article introduces the basic procedure of the geometric optimization, including various geometric optimizers and some concepts of Riemannian manifold. Subsequently, this article investigates the application of geometric optimization in different DL networks in various AI tasks, e.g., convolution neural network, recurrent neural network, transfer learning, and optimal transport. Additionally, typical public toolboxes that implement optimization on manifold are also discussed. Finally, this article makes a performance comparison between different deep geometric optimization methods under image recognition scenarios.
    Machine Generated Text: A Comprehensive Survey of Threat Models and Detection Methods. (arXiv:2210.07321v3 [cs.CL] UPDATED)
    Machine generated text is increasingly difficult to distinguish from human authored text. Powerful open-source models are freely available, and user-friendly tools that democratize access to generative models are proliferating. ChatGPT, which was released shortly after the first preprint of this survey, epitomizes these trends. The great potential of state-of-the-art natural language generation (NLG) systems is tempered by the multitude of avenues for abuse. Detection of machine generated text is a key countermeasure for reducing abuse of NLG models, with significant technical challenges and numerous open problems. We provide a survey that includes both 1) an extensive analysis of threat models posed by contemporary NLG systems, and 2) the most complete review of machine generated text detection methods to date. This survey places machine generated text within its cybersecurity and social context, and provides strong guidance for future work addressing the most critical threat models, and ensuring detection systems themselves demonstrate trustworthiness through fairness, robustness, and accountability.
    Deep Fusion of Multi-Object Densities Using Transformer. (arXiv:2209.08857v3 [cs.LG] UPDATED)
    In this paper, we demonstrate that deep learning based method can be used to fuse multi-object densities. Given a scenario with several sensors with possibly different field-of-views, tracking is performed locally in each sensor by a tracker, which produces random finite set multi-object densities. To fuse outputs from different trackers, we adapt a recently proposed transformer-based multi-object tracker, where the fusion result is a global multi-object density, describing the set of all alive objects at the current time. We compare the performance of the transformer-based fusion method with a well-performing model-based Bayesian fusion method in several simulated scenarios with different parameter settings using synthetic data. The simulation results show that the transformer-based fusion method outperforms the model-based Bayesian method in our experimental scenarios.
    Reward Gaming in Conditional Text Generation. (arXiv:2211.08714v2 [cs.CL] UPDATED)
    To align conditional text generation model outputs with desired behaviors, there has been an increasing focus on training the model using reinforcement learning (RL) with reward functions learned from human annotations. Under this framework, we identify three common cases where high rewards are incorrectly assigned to undesirable patterns: noise-induced spurious correlation, naturally occurring spurious correlation, and covariate shift. We show that even though learned metrics achieve high performance on the distribution of the data used to train the reward function, the undesirable patterns may be amplified during RL training of the text generation model. While there has been discussion about reward gaming in the RL or safety community, in this discussion piece, we would like to highlight reward gaming in the natural language generation (NLG) community using concrete conditional text generation examples and discuss potential fixes and areas for future work.
    Temporal Graph Neural Networks for Irregular Data. (arXiv:2302.08415v1 [stat.ML])
    This paper proposes a temporal graph neural network model for forecasting of graph-structured irregularly observed time series. Our TGNN4I model is designed to handle both irregular time steps and partial observations of the graph. This is achieved by introducing a time-continuous latent state in each node, following a linear Ordinary Differential Equation (ODE) defined by the output of a Gated Recurrent Unit (GRU). The ODE has an explicit solution as a combination of exponential decay and periodic dynamics. Observations in the graph neighborhood are taken into account by integrating graph neural network layers in both the GRU state update and predictive model. The time-continuous dynamics additionally enable the model to make predictions at arbitrary time steps. We propose a loss function that leverages this and allows for training the model for forecasting over different time horizons. Experiments on simulated data and real-world data from traffic and climate modeling validate the usefulness of both the graph structure and time-continuous dynamics in settings with irregular observations.
    A cloud-based deep learning system for improving crowd safety at event entrances. (arXiv:2302.08237v1 [cs.LG])
    Crowding at the entrances of large events may lead to critical and life-threatening situations, particularly when people start pushing each other to reach the event faster. A system for automatic and timely identification of pushing behavior would help organizers and security forces to intervene early and mitigate dangerous situations. In this paper, we propose a cloud-based deep learning system for early detection of pushing automatically in the live video stream of crowded event entrances. The proposed system relies mainly on two models: a pre-trained deep optical flow and an adapted version of the EfficientNetV2B0 classifier. The optical flow model extracts the characteristics of the crowd motion in the live video stream, while the classifier analyses the crowd motion and annotates pushing patches in the live stream. A novel dataset is generated based on five real-world experiments and their associated ground truth data to train the adapted EfficientNetV2B0 model. The experimental situations simulated a crowded event entrance, and social psychologists manually created the ground truths for each video experiment. Several experiments on the videos and the generated dataset are carried out to evaluate the accuracy and annotation delay time of the proposed system. Furthermore, the experts manually revised the annotation results of the system. Findings indicate that the system identified pushing behaviors with an accuracy rate of 89% within an acceptable delay time.
    On marginal feature attributions of tree-based models. (arXiv:2302.08434v1 [cs.LG])
    Due to their power and ease of use, tree-based machine learning models have become very popular. To interpret these models, local feature attributions based on marginal expectations e.g. marginal (interventional) Shapley, Owen or Banzhaf values may be employed. Such feature attribution methods are true to the model and implementation invariant, i.e. dependent only on the input-output function of the model. By taking advantage of the internal structure of tree-based models, we prove that their marginal Shapley values, or more generally marginal feature attributions obtained from a linear game value, are simple (piecewise-constant) functions with respect to a certain finite partition of the input space determined by the trained model. The same is true for feature attributions obtained from the famous TreeSHAP algorithm. Nevertheless, we show that the "path-dependent" TreeSHAP is not implementation invariant by presenting two (statistically similar) decision trees computing the exact same function for which the algorithm yields different rankings of features, whereas the marginal Shapley values coincide. Furthermore, we discuss how the fact that marginal feature attributions are simple functions can potentially be utilized to compute them. An important observation, showcased by experiments with XGBoost, LightGBM and CatBoost libraries, is that only a portion of all features appears in a tree from the ensemble; thus the complexity of computing marginal Shapley (or Owen or Banzhaf) feature attributions may be reduced. In particular, in the case of CatBoost models, the trees are oblivious (symmetric) and the number of features in each of them is no larger than the depth. We exploit the symmetry to derive an explicit formula with improved complexity for marginal Shapley (and Banzhaf and Owen) values which is only in terms of the internal parameters of the CatBoost model.
    Evaluating and Improving the Coreference Capabilities of Machine Translation Models. (arXiv:2302.08464v1 [cs.CL])
    Machine translation (MT) requires a wide range of linguistic capabilities, which current end-to-end models are expected to learn implicitly by observing aligned sentences in bilingual corpora. In this work, we ask: \emph{How well do MT models learn coreference resolution from implicit signal?} To answer this question, we develop an evaluation methodology that derives coreference clusters from MT output and evaluates them without requiring annotations in the target language. We further evaluate several prominent open-source and commercial MT systems, translating from English to six target languages, and compare them to state-of-the-art coreference resolvers on three challenging benchmarks. Our results show that the monolingual resolvers greatly outperform MT models. Motivated by this result, we experiment with different methods for incorporating the output of coreference resolution models in MT, showing improvement over strong baselines.
    Knowledge-augmented Graph Machine Learning for Drug Discovery: A Survey from Precision to Interpretability. (arXiv:2302.08261v1 [cs.LG])
    The integration of Artificial Intelligence (AI) into the field of drug discovery has been a growing area of interdisciplinary scientific research. However, conventional AI models are heavily limited in handling complex biomedical structures (such as 2D or 3D protein and molecule structures) and providing interpretations for outputs, which hinders their practical application. As of late, Graph Machine Learning (GML) has gained considerable attention for its exceptional ability to model graph-structured biomedical data and investigate their properties and functional relationships. Despite extensive efforts, GML methods still suffer from several deficiencies, such as the limited ability to handle supervision sparsity and provide interpretability in learning and inference processes, and their ineffectiveness in utilising relevant domain knowledge. In response, recent studies have proposed integrating external biomedical knowledge into the GML pipeline to realise more precise and interpretable drug discovery with limited training instances. However, a systematic definition for this burgeoning research direction is yet to be established. This survey presents a comprehensive overview of long-standing drug discovery principles, provides the foundational concepts and cutting-edge techniques for graph-structured data and knowledge databases, and formally summarises Knowledge-augmented Graph Machine Learning (KaGML) for drug discovery. A thorough review of related KaGML works, collected following a carefully designed search methodology, are organised into four categories following a novel-defined taxonomy. To facilitate research in this promptly emerging field, we also share collected practical resources that are valuable for intelligent drug discovery and provide an in-depth discussion of the potential avenues for future advancements.  ( 2 min )
    ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations. (arXiv:2302.08137v1 [cs.SD])
    In this work, we propose a zero-shot voice conversion method using speech representations trained with self-supervised learning. First, we develop a multi-task model to decompose a speech utterance into features such as linguistic content, speaker characteristics, and speaking style. To disentangle content and speaker representations, we propose a training strategy based on Siamese networks that encourages similarity between the content representations of the original and pitch-shifted audio. Next, we develop a synthesis model with pitch and duration predictors that can effectively reconstruct the speech signal from its decomposed representation. Our framework allows controllable and speaker-adaptive synthesis to perform zero-shot any-to-any voice conversion achieving state-of-the-art results on metrics evaluating speaker similarity, intelligibility, and naturalness. Using just 10 seconds of data for a target speaker, our framework can perform voice swapping and achieves a speaker verification EER of 5.5% for seen speakers and 8.4% for unseen speakers.  ( 2 min )
    Magnetohydrodynamics with Physics Informed Neural Operators. (arXiv:2302.08332v1 [physics.comp-ph])
    We present the first application of physics informed neural operators, which use tensor Fourier neural operators as their backbone, to model 2D incompressible magnetohydrodynamics simulations. Our results indicate that physics informed AI can accurately model the physics of magnetohydrodynamics simulations that describe laminar flows with Reynolds numbers $Re\leq250$. We also quantify the applicability of our AI surrogates for turbulent flows, and explore how magnetohydrodynamics simulations and AI surrogates store magnetic and kinetic energy across wavenumbers. Based on these studies, we propose a variety of approaches to create AI surrogates that provide a computationally efficient and high fidelity description of magnetohydrodynamics simulations for a broad range of Reynolds numbers. Neural operators and scientific software to produce simulation data to train, validate and test our physics informed neural operators are released with this manuscript.
    3D-aware Conditional Image Synthesis. (arXiv:2302.08509v1 [cs.CV])
    We propose pix2pix3D, a 3D-aware conditional generative model for controllable photorealistic image synthesis. Given a 2D label map, such as a segmentation or edge map, our model learns to synthesize a corresponding image from different viewpoints. To enable explicit 3D user control, we extend conditional generative models with neural radiance fields. Given widely-available monocular images and label map pairs, our model learns to assign a label to every 3D point in addition to color and density, which enables it to render the image and pixel-aligned label map simultaneously. Finally, we build an interactive system that allows users to edit the label map from any viewpoint and generate outputs accordingly.
    On the Limit Performance of Floating Gossip. (arXiv:2302.08413v1 [stat.ML])
    In this paper we investigate the limit performance of Floating Gossip, a new, fully distributed Gossip Learning scheme which relies on Floating Content to implement location-based probabilistic evolution of machine learning models in an infrastructure-less manner. We consider dynamic scenarios where continuous learning is necessary, and we adopt a mean field approach to investigate the limit performance of Floating Gossip in terms of amount of data that users can incorporate into their models, as a function of the main system parameters. Different from existing approaches in which either communication or computing aspects of Gossip Learning are analyzed and optimized, our approach accounts for the compound impact of both aspects. We validate our results through detailed simulations, proving good accuracy. Our model shows that Floating Gossip can be very effective in implementing continuous training and update of machine learning models in a cooperative manner, based on opportunistic exchanges among moving users.
    A weighted subspace exponential kernel for support tensor machines. (arXiv:2302.08134v1 [stat.ML])
    High-dimensional data in the form of tensors are challenging for kernel classification methods. To both reduce the computational complexity and extract informative features, kernels based on low-rank tensor decompositions have been proposed. However, what decisive features of the tensors are exploited by these kernels is often unclear. In this paper we propose a novel kernel that is based on the Tucker decomposition. For this kernel the Tucker factors are computed based on re-weighting of the Tucker matrices with tuneable powers of singular values from the HOSVD decomposition. This provides a mechanism to balance the contribution of the Tucker core and factors of the data. We benchmark support tensor machines with this new kernel on several datasets. First we generate synthetic data where two classes differ in either Tucker factors or core, and compare our novel and previously existing kernels. We show robustness of the new kernel with respect to both classification scenarios. We further test the new method on real-world datasets. The proposed kernel has demonstrated a higher test accuracy than the state-of-the-art tensor train multi-way multi-level kernel, and a significantly lower computational time.
    cGAN-Based High Dimensional IMU Sensor Data Generation for Therapeutic Activities. (arXiv:2302.07998v1 [cs.LG])
    Human activity recognition is a core technology for applications such as rehabilitation, ambient health monitoring, and human-computer interactions. Wearable devices, particularly IMU sensors, can help us collect rich features of human movements that can be leveraged in activity recognition. Developing a robust classifier for activity recognition has always been of interest to researchers. One major problem is that there is usually a deficit of training data for some activities, making it difficult and sometimes impossible to develop a classifier. In this work, a novel GAN network called TheraGAN was developed to generate realistic IMU signals associated with a particular activity. The generated signal is of a 6-channel IMU. i.e., angular velocities and linear accelerations. Also, by introducing simple activities, which are meaningful subparts of a complex full-length activity, the generation process was facilitated for any activity with arbitrary length. To evaluate the generated signals, besides perceptual similarity metrics, they were applied along with real data to improve the accuracy of classifiers. The results show that the maximum increase in the f1-score belongs to the LSTM classifier by a 13.27% rise when generated data were added. This shows the validity of the generated data as well as TheraGAN as a tool to build more robust classifiers in case of imbalanced data problem.
    Understanding the Distillation Process from Deep Generative Models to Tractable Probabilistic Circuits. (arXiv:2302.08086v1 [cs.LG])
    Probabilistic Circuits (PCs) are a general and unified computational framework for tractable probabilistic models that support efficient computation of various inference tasks (e.g., computing marginal probabilities). Towards enabling such reasoning capabilities in complex real-world tasks, Liu et al. (2022) propose to distill knowledge (through latent variable assignments) from less tractable but more expressive deep generative models. However, it is still unclear what factors make this distillation work well. In this paper, we theoretically and empirically discover that the performance of a PC can exceed that of its teacher model. Therefore, instead of performing distillation from the most expressive deep generative model, we study what properties the teacher model and the PC should have in order to achieve good distillation performance. This leads to a generic algorithmic improvement as well as other data-type-specific ones over the existing latent variable distillation pipeline. Empirically, we outperform SoTA TPMs by a large margin on challenging image modeling benchmarks. In particular, on ImageNet32, PCs achieve 4.06 bits-per-dimension, which is only 0.34 behind variational diffusion models (Kingma et al., 2021).
    Learning-Based Adaptive User Selection in Millimeter Wave Hybrid Beamforming Systems. (arXiv:2302.08240v1 [eess.SY])
    We consider a multi-user hybrid beamforming system, where the multiplexing gain is limited by the small number of RF chains employed at the base station (BS). To allow greater freedom for maximizing the multiplexing gain, it is better if the BS selects and serves some of the users at each scheduling instant, rather than serving all the users all the time. We adopt a two-timescale protocol that takes into account the mmWave characteristics, where at the long timescale an analog beam is chosen for each user, and at the short timescale users are selected for transmission based on the chosen analog beams. The goal of the user selection is to maximize the traditional Proportional Fair (PF) metric. However, this maximization is non-trivial due to interference between the analog beams for selected users. We first define a greedy algorithm and a "top-k" algorithm, and then propose a machine learning (ML)-based user selection algorithm to provide an efficient trade-off between the PF performance and the computation time. Throughout simulations, we analyze the performance of the ML-based algorithms under various metrics, and show that it gives an efficient trade-off in performance as compared to counterparts.
    UniFed: A Unified Framework for Federated Learning on Non-IID Image Features. (arXiv:2110.09974v3 [cs.LG] UPDATED)
    How to tackle non-iid data is a crucial topic in federated learning. This challenging problem not only affects training process, but also harms performance of clients not participating in training. Existing literature mainly focuses on either side, yet still lacks a unified solution to handle these two types (internal and external) of clients in a joint way. In this work, we propose a unified framework to tackle the non-iid issues for internal and external clients together. Firstly, we propose to use client-specific batch normalization in either internal or external clients to alleviate feature distribution shifts incurred by non-iid data. Then we present theoretical analysis to demonstrate the benefits of client-specific batch normalization. Specifically, we show that our approach promotes convergence speed for federated training and yields lower generalization error bound for external clients. Furthermore, we use causal reasoning to form a causal view to explain the advantages of our framework. At last, we conduct extensive experiments on natural and medical images to evaluate our method, where our method achieves state-of-the-art performance, faster convergence, and shows good compatibility. We also performed comprehensive analytical studies on a real-world medical dataset to demonstrate the effectiveness.
    Adaptive Selective Sampling for Online Prediction with Experts. (arXiv:2302.08397v1 [stat.ML])
    We consider online prediction of a binary sequence with expert advice. For this setting, we devise label-efficient forecasting algorithms, which use a selective sampling scheme that enables collecting much fewer labels than standard procedures, while still retaining optimal worst-case regret guarantees. These algorithms are based on exponentially weighted forecasters, suitable for settings with and without a perfect expert. For a scenario where one expert is strictly better than the others in expectation, we show that the label complexity of the label-efficient forecaster scales roughly as the square root of the number of rounds. Finally, we present numerical experiments empirically showing that the normalized regret of the label-efficient forecaster can asymptotically match known minimax rates for pool-based active learning, suggesting it can optimally adapt to benign settings.
    Write and Paint: Generative Vision-Language Models are Unified Modal Learners. (arXiv:2206.07699v2 [cs.CV] UPDATED)
    Recent advances in vision-language pre-training have pushed the state-of-the-art on various vision-language tasks, making machines more capable of multi-modal writing (image-to-text generation) and painting (text-to-image generation). However, few studies investigate if these two essential capabilities can be learned together and boost each other, making a versatile and powerful multi-modal foundation model. In this work, we disclose the potential of symmetric generative vision-language pre-training in learning to write and paint concurrently, and propose a new unified modal model, named DaVinci, trained with prefix language modeling and prefix image modeling, a simple generative self-supervised objective on image-text pairs. Thanks to the proposed prefix multi-modal modeling framework, DaVinci is simple to train, scalable to huge data, adaptable to both writing and painting tasks, and also strong on other vision, text, and multi-modal understanding tasks. DaVinci achieves competitive performance on a wide range of 27 generation/understanding tasks and demonstrates the superiority of combining vision/language generative pre-training. Furthermore, we carefully benchmark the performance of different vision-language pre-training objectives on different scales of pre-training datasets on a heterogeneous and broad distribution coverage. Our results demonstrate the potential of exploiting self-supervision in both language and vision inputs, and establish new, stronger baselines for future comparisons at different data scales. The code and pre-trained models are available at https://github.com/shizhediao/DaVinci.
    User Response in Ad Auctions: An MDP Formulation of Long-Term Revenue Optimization. (arXiv:2302.08108v1 [cs.GT])
    We propose a new Markov Decision Process (MDP) model for ad auctions to capture the user response to the quality of ads, with the objective of maximizing the long-term discounted revenue. By incorporating user response, our model takes into consideration all three parties involved in the auction (advertiser, auctioneer, and user). The state of the user is modeled as a user-specific click-through rate (CTR) with the CTR changing in the next round according to the set of ads shown to the user in the current round. We characterize the optimal mechanism for this MDP as a Myerson's auction with a notion of modified virtual value, which relies on the value distribution of the advertiser, the current user state, and the future impact of showing the ad to the user. Moreover, we propose a simple mechanism built upon second price auctions with personalized reserve prices and show it can achieve a constant-factor approximation to the optimal long term discounted revenue.
    Neuro-Symbolic Procedural Planning with Commonsense Prompting. (arXiv:2206.02928v6 [cs.CL] UPDATED)
    Procedural planning aims to implement complex high-level goals by decomposition into sequential simpler low-level steps. Although procedural planning is a basic skill set for humans in daily life, it remains a challenge for large language models (LLMs) that lack a deep understanding of the cause-effect relations in procedures. Previous methods require manual exemplars to acquire procedural planning knowledge from LLMs in the zero-shot setting. However, such elicited pre-trained knowledge in LLMs induces spurious correlations between goals and steps, which impair the model generalization to unseen tasks. In contrast, this paper proposes a neuro-symbolic procedural PLANner (PLAN) that elicits procedural planning knowledge from the LLMs with commonsense-infused prompting. To mitigate spurious goal-step correlations, we use symbolic program executors on the latent procedural representations to formalize prompts from commonsense knowledge bases as a causal intervention toward the Structural Causal Model. Both automatic and human evaluations on WikiHow and RobotHow show the superiority of PLAN on procedural planning without further training or manual exemplars.
    Understanding Neural Coding on Latent Manifolds by Sharing Features and Dividing Ensembles. (arXiv:2210.03155v2 [stat.ML] UPDATED)
    Systems neuroscience relies on two complementary views of neural data, characterized by single neuron tuning curves and analysis of population activity. These two perspectives combine elegantly in neural latent variable models that constrain the relationship between latent variables and neural activity, modeled by simple tuning curve functions. This has recently been demonstrated using Gaussian processes, with applications to realistic and topologically relevant latent manifolds. Those and previous models, however, missed crucial shared coding properties of neural populations. We propose feature sharing across neural tuning curves which significantly improves performance and helps optimization. We also propose a solution to the ensemble detection problem, where different groups of neurons, i.e., ensembles, can be modulated by different latent manifolds. Achieved through a soft clustering of neurons during training, this allows for the separation of mixed neural populations in an unsupervised manner. These innovations lead to more interpretable models of neural population activity that train well and perform better even on mixtures of complex latent manifolds. Finally, we apply our method on a recently published grid cell dataset, and recover distinct ensembles, infer toroidal latents and predict neural tuning curves in a single integrated modeling framework.
    NCS4CVR: Neuron-Connection Sharing for Multi-Task Learning in Video Conversion Rate Prediction. (arXiv:2008.09872v3 [cs.IR] UPDATED)
    Click-through rate (CTR) and post-click conversion rate (CVR) predictions are two fundamental modules in industrial ranking systems such as recommender systems, advertising, and search engines. Since CVR involves much fewer samples than CTR (known as the CVR data sparsity problem), most of the existing works try to leverage CTR&CVR multi-task learning to improve CVR performance. However, typical coarse-grained sub-network/layer sharing methods may introduce conflicts and lead to performance degradation, since not every neuron or neuron connection in one layer should be shared between CVR and CTR tasks. This is because users may have different fine-grained content feature preferences between deep consumption and click behavior, represented by CVR and CTR, respectively. To address this sharing&conflict problem, we propose a novel multi-task CVR modeling scheme with neuron-connection level sharing named NCS4CVR, which can automatically and flexibly learn which neuron weights are shared or not shared without artificial experience. Compared with previous layer-level sharing methods, this is the first time that a fine-grained CTR&CVR sharing method at the neuron connection level is proposed, which is a research paradigm shift in the sharing level. Both offline and online experiments demonstrate that our method outperforms both the single-task model and the layer-level sharing model. Our proposed method has now been successfully deployed in an industry video recommender system serving major traffic.
    Realized recurrent conditional heteroskedasticity model for volatility modelling. (arXiv:2302.08002v1 [econ.EM])
    We propose a new approach to volatility modelling by combining deep learning (LSTM) and realized volatility measures. This LSTM-enhanced realized GARCH framework incorporates and distills modeling advances from financial econometrics, high frequency trading data and deep learning. Bayesian inference via the Sequential Monte Carlo method is employed for statistical inference and forecasting. The new framework can jointly model the returns and realized volatility measures, has an excellent in-sample fit and superior predictive performance compared to several benchmark models, while being able to adapt well to the stylized facts in volatility. The performance of the new framework is tested using a wide range of metrics, from marginal likelihood, volatility forecasting, to tail risk forecasting and option pricing. We report on a comprehensive empirical study using 31 widely traded stock indices over a time period that includes COVID-19 pandemic.
    An Empirical Bayes Analysis of Object Trajectory Representation Models. (arXiv:2211.01696v2 [cs.LG] UPDATED)
    We present an in-depth empirical analysis of the trade-off between model complexity and fit error in modelling object trajectories. Analyzing several large public datasets, we show that simple linear models do represent real-world trajectories with high fidelity over relevant time scales at very moderate model complexity. This finding allows the formulation of trajectory tracking and prediction as a Bayesian filtering problem. Using an Empirical Bayes approach, we estimate prior distributions over model parameters from the data that inform the motion models necessary in the trajectory tracking problem and that can help regularize prediction models. We argue for the use of linear trajectory representation models in trajectory prediction tasks as they do not limit prediction performance currently.
    Explicit Diffusion of Gaussian Mixture Model Based Image Priors. (arXiv:2302.08411v1 [cs.CV])
    In this work we tackle the problem of estimating the density $f_X$ of a random variable $X$ by successive smoothing, such that the smoothed random variable $Y$ fulfills $(\partial_t - \Delta_1)f_Y(\,\cdot\,, t) = 0$, $f_Y(\,\cdot\,, 0) = f_X$. With a focus on image processing, we propose a product/fields of experts model with Gaussian mixture experts that admits an analytic expression for $f_Y (\,\cdot\,, t)$ under an orthogonality constraint on the filters. This construction naturally allows the model to be trained simultaneously over the entire diffusion horizon using empirical Bayes. We show preliminary results on image denoising where our model leads to competitive results while being tractable, interpretable, and having only a small number of learnable parameters. As a byproduct, our model can be used for reliable noise estimation, allowing blind denoising of images corrupted by heteroscedastic noise.  ( 2 min )
    Counterfactual Fair Opportunity: Measuring Decision Model Fairness with Counterfactual Reasoning. (arXiv:2302.08158v1 [cs.LG])
    The increasing application of Artificial Intelligence and Machine Learning models poses potential risks of unfair behavior and, in light of recent regulations, has attracted the attention of the research community. Several researchers focused on seeking new fairness definitions or developing approaches to identify biased predictions. However, none try to exploit the counterfactual space to this aim. In that direction, the methodology proposed in this work aims to unveil unfair model behaviors using counterfactual reasoning in the case of fairness under unawareness setting. A counterfactual version of equal opportunity named counterfactual fair opportunity is defined and two novel metrics that analyze the sensitive information of counterfactual samples are introduced. Experimental results on three different datasets show the efficacy of our methodologies and our metrics, disclosing the unfair behavior of classic machine learning and debiasing models.  ( 2 min )
    The autoregressive neural network architecture of the Boltzmann distribution of pairwise interacting spins systems. (arXiv:2302.08347v1 [cond-mat.dis-nn])
    Generative Autoregressive Neural Networks (ARNN) have recently demonstrated exceptional results in image and language generation tasks, contributing to the growing popularity of generative models in both scientific and commercial applications. This work presents a physical interpretation of the ARNNs by reformulating the Boltzmann distribution of binary pairwise interacting systems into autoregressive form. The resulting ARNN architecture has weights and biases of its first layer corresponding to the Hamiltonian's couplings and external fields, featuring widely used structures like the residual connections and a recurrent architecture with clear physical meanings. However, the exponential growth, with system size, of the number of parameters of the hidden layers makes its direct application unfeasible. Nevertheless, its architecture's explicit formulation allows using statistical physics techniques to derive new ARNNs for specific systems. As examples, new effective ARNN architectures are derived from two well-known mean-field systems, the Curie-Weiss and Sherrington-Kirkpatrick models, showing superior performances in approximating the Boltzmann distributions of the corresponding physics model than other commonly used ARNNs architectures. The connection established between the physics of the system and the ARNN architecture provides a way to derive new neural network architectures for different interacting systems and interpret existing ones from a physical perspective.  ( 2 min )
    Fast evaluation of real spherical harmonics and their derivatives in Cartesian coordinates. (arXiv:2302.08381v1 [physics.chem-ph])
    Spherical harmonics provide a smooth, orthogonal, and symmetry-adapted basis to expand functions on a sphere, and they are used routinely in computer graphics, signal processing and different fields of science, from geology to quantum chemistry. More recently, spherical harmonics have become a key component of rotationally equivariant models for geometric deep learning, where they are used in combination with distance-dependent functions to describe the distribution of neighbors within local spherical environments within a point cloud. We present a fast and elegant algorithm for the evaluation of the real-valued spherical harmonics. Our construction integrates many of the desirable features of existing schemes and allows to compute Cartesian derivatives in a numerically stable and computationally efficient manner. We provide an efficient C implementation of the proposed algorithm, along with easy-to-use Python bindings.  ( 2 min )
    Counterfactual Reasoning for Bias Evaluation and Detection in a Fairness under Unawareness setting. (arXiv:2302.08204v1 [cs.LG])
    Current AI regulations require discarding sensitive features (e.g., gender, race, religion) in the algorithm's decision-making process to prevent unfair outcomes. However, even without sensitive features in the training set, algorithms can persist in discrimination. Indeed, when sensitive features are omitted (fairness under unawareness), they could be inferred through non-linear relations with the so called proxy features. In this work, we propose a way to reveal the potential hidden bias of a machine learning model that can persist even when sensitive features are discarded. This study shows that it is possible to unveil whether the black-box predictor is still biased by exploiting counterfactual reasoning. In detail, when the predictor provides a negative classification outcome, our approach first builds counterfactual examples for a discriminated user category to obtain a positive outcome. Then, the same counterfactual samples feed an external classifier (that targets a sensitive feature) that reveals whether the modifications to the user characteristics needed for a positive outcome moved the individual to the non-discriminated group. When this occurs, it could be a warning sign for discriminatory behavior in the decision process. Furthermore, we leverage the deviation of counterfactuals from the original sample to determine which features are proxies of specific sensitive information. Our experiments show that, even if the model is trained without sensitive features, it often suffers discriminatory biases.  ( 2 min )
    Preventing Discriminatory Decision-making in Evolving Data Streams. (arXiv:2302.08017v1 [cs.LG])
    Bias in machine learning has rightly received significant attention over the last decade. However, most fair machine learning (fair-ML) work to address bias in decision-making systems has focused solely on the offline setting. Despite the wide prevalence of online systems in the real world, work on identifying and correcting bias in the online setting is severely lacking. The unique challenges of the online environment make addressing bias more difficult than in the offline setting. First, Streaming Machine Learning (SML) algorithms must deal with the constantly evolving real-time data stream. Second, they need to adapt to changing data distributions (concept drift) to make accurate predictions on new incoming data. Adding fairness constraints to this already complicated task is not straightforward. In this work, we focus on the challenges of achieving fairness in biased data streams while accounting for the presence of concept drift, accessing one sample at a time. We present Fair Sampling over Stream ($FS^2$), a novel fair rebalancing approach capable of being integrated with SML classification algorithms. Furthermore, we devise the first unified performance-fairness metric, Fairness Bonded Utility (FBU), to evaluate and compare the trade-off between performance and fairness of different bias mitigation methods efficiently. FBU simplifies the comparison of fairness-performance trade-offs of multiple techniques through one unified and intuitive evaluation, allowing model designers to easily choose a technique. Overall, extensive evaluations show our measures surpass those of other fair online techniques previously reported in the literature.  ( 2 min )
    Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform. (arXiv:2302.08141v1 [cs.DC])
    We present Rhino, a system for accelerating tensor programs with automatic parallelization on AI platform for real production environment. It transforms a tensor program written for a single device into an equivalent distributed program that is capable of scaling up to thousands of devices with no user configuration. Rhino firstly works on a semantically independent intermediate representation of tensor programs, which facilitates its generalization to unprecedented applications. Additionally, it implements a task-oriented controller and a distributed runtime for optimal performance. Rhino explores on a complete and systematic parallelization strategy space that comprises all the paradigms commonly employed in deep learning (DL), in addition to strided partitioning and pipeline parallelism on non-linear models. Aiming to efficiently search for a near-optimal parallel execution plan, our analysis of production clusters reveals general heuristics to speed up the strategy search. On top of it, two optimization levels are designed to offer users flexible trade-offs between the search time and strategy quality. Our experiments demonstrate that Rhino can not only re-discover the expert-crafted strategies of classic, research and production DL models, but also identify novel parallelization strategies which surpass existing systems for novel models.  ( 2 min )
    AirGNN: Graph Neural Network over the Air. (arXiv:2302.08447v1 [eess.SP])
    Graph neural networks (GNNs) are information processing architectures that model representations from networked data and allow for decentralized implementation through localized communications. Existing GNN architectures often assume ideal communication links, and ignore channel effects, such as fading and noise, leading to performance degradation in real-world implementation. This paper proposes graph neural networks over the air (AirGNNs), a novel GNN architecture that incorporates the communication model into the architecture. The AirGNN modifies the graph convolutional operation that shifts graph signals over random communication graphs to take into account channel fading and noise when aggregating features from neighbors, thus, improving the architecture robustness to channel impairments during testing. We propose a stochastic gradient descent based method to train the AirGNN, and show that the training procedure converges to a stationary solution. Numerical simulations on decentralized source localization and multi-robot flocking corroborate theoretical findings and show superior performance of the AirGNN over wireless communication channels.  ( 2 min )
    Trieste: Efficiently Exploring The Depths of Black-box Functions with TensorFlow. (arXiv:2302.08436v1 [stat.ML])
    We present Trieste, an open-source Python package for Bayesian optimization and active learning benefiting from the scalability and efficiency of TensorFlow. Our library enables the plug-and-play of popular TensorFlow-based models within sequential decision-making loops, e.g. Gaussian processes from GPflow or GPflux, or neural networks from Keras. This modular mindset is central to the package and extends to our acquisition functions and the internal dynamics of the decision-making loop, both of which can be tailored and extended by researchers or engineers when tackling custom use cases. Trieste is a research-friendly and production-ready toolkit backed by a comprehensive test suite, extensive documentation, and available at https://github.com/secondmind-labs/trieste.  ( 2 min )
    Frugal day-ahead forecasting of multiple local electricity loads by aggregating adaptive models. (arXiv:2302.08192v1 [cs.LG])
    We focus on day-ahead electricity load forecasting of substations of the distribution network in France; therefore, our problem lies between the instability of a single consumption and the stability of a countrywide total demand. Moreover, we are interested in forecasting the loads of over one thousand substations; consequently, we are in the context of forecasting multiple time series. To that end, we rely on an adaptive methodology that provided excellent results at a national scale; the idea is to combine generalized additive models with state-space representations. However, the extension of this methodology to the prediction of over a thousand time series raises a computational issue. We solve it by developing a frugal variant, reducing the number of parameters estimated; we estimate the forecasting models only for a few time series and achieve transfer learning by relying on aggregation of experts. It yields a reduction of computational needs and their associated emissions. We build several variants, corresponding to different levels of parameter transfer, and we look for the best trade-off between accuracy and frugality. The selected method achieves competitive results compared to state-of-the-art individual models. Finally, we highlight the interpretability of the models, which is important for operational applications.  ( 2 min )
    Scalable Multi-Agent Reinforcement Learning with General Utilities. (arXiv:2302.07938v1 [cs.LG])
    We study the scalable multi-agent reinforcement learning (MARL) with general utilities, defined as nonlinear functions of the team's long-term state-action occupancy measure. The objective is to find a localized policy that maximizes the average of the team's local utility functions without the full observability of each agent in the team. By exploiting the spatial correlation decay property of the network structure, we propose a scalable distributed policy gradient algorithm with shadow reward and localized policy that consists of three steps: (1) shadow reward estimation, (2) truncated shadow Q-function estimation, and (3) truncated policy gradient estimation and policy update. Our algorithm converges, with high probability, to $\epsilon$-stationarity with $\widetilde{\mc{O}}(\epsilon^{-2})$ samples up to some approximation error that decreases exponentially in the communication radius. This is the first result in the literature on multi-agent RL with general utilities that does not require the full observability.  ( 2 min )
    Data-Centric Governance. (arXiv:2302.07872v1 [cs.CY])
    Artificial intelligence (AI) governance is the body of standards and practices used to ensure that AI systems are deployed responsibly. Current AI governance approaches consist mainly of manual review and documentation processes. While such reviews are necessary for many systems, they are not sufficient to systematically address all potential harms, as they do not operationalize governance requirements for system engineering, behavior, and outcomes in a way that facilitates rigorous and reproducible evaluation. Modern AI systems are data-centric: they act on data, produce data, and are built through data engineering. The assurance of governance requirements must also be carried out in terms of data. This work explores the systematization of governance requirements via datasets and algorithmic evaluations. When applied throughout the product lifecycle, data-centric governance decreases time to deployment, increases solution quality, decreases deployment risks, and places the system in a continuous state of assured compliance with governance requirements.  ( 2 min )
    InfoNCE Loss Provably Learns Cluster-Preserving Representations. (arXiv:2302.07920v1 [cs.LG])
    The goal of contrasting learning is to learn a representation that preserves underlying clusters by keeping samples with similar content, e.g. the ``dogness'' of a dog, close to each other in the space generated by the representation. A common and successful approach for tackling this unsupervised learning problem is minimizing the InfoNCE loss associated with the training samples, where each sample is associated with their augmentations (positive samples such as rotation, crop) and a batch of negative samples (unrelated samples). To the best of our knowledge, it was unanswered if the representation learned by minimizing the InfoNCE loss preserves the underlying data clusters, as it only promotes learning a representation that is faithful to augmentations, i.e., an image and its augmentations have the same representation. Our main result is to show that the representation learned by InfoNCE with a finite number of negative samples is also consistent with respect to clusters in the data, under the condition that the augmentation sets within clusters may be non-overlapping but are close and intertwined, relative to the complexity of the learning function class.  ( 2 min )
    Online Tool Selection with Learned Grasp Prediction Models. (arXiv:2302.07940v1 [cs.RO])
    Deep learning-based grasp prediction models have become an industry standard for robotic bin-picking systems. To maximize pick success, production environments are often equipped with several end-effector tools that can be swapped on-the-fly, based on the target object. Tool-change, however, takes time. Choosing the order of grasps to perform, and corresponding tool-change actions, can improve system throughput; this is the topic of our work. The main challenge in planning tool change is uncertainty - we typically cannot see objects in the bin that are currently occluded. Inspired by queuing and admission control problems, we model the problem as a Markov Decision Process (MDP), where the goal is to maximize expected throughput, and we pursue an approximate solution based on model predictive control, where at each time step we plan based only on the currently visible objects. Special to our method is the idea of void zones, which are geometrical boundaries in which an unknown object will be present, and therefore cannot be accounted for during planning. Our planning problem can be solved using integer linear programming (ILP). However, we find that an approximate solution based on sparse tree search yields near optimal performance at a fraction of the time. Another question that we explore is how to measure the performance of tool-change planning: we find that throughput alone can fail to capture delicate and smooth behavior, and propose a principled alternative. Finally, we demonstrate our algorithms on both synthetic and real world bin picking tasks.  ( 2 min )
    On the Detection and Quantification of Nonlinearity via Statistics of the Gradients of a Black-Box Model. (arXiv:2302.07986v1 [cs.LG])
    Detection and identification of nonlinearity is a task of high importance for structural dynamics. Detecting nonlinearity in a structure, which has been designed to operate in its linear region, might indicate the existence of damage. Therefore, it is important, even for safety reasons, to detect when a structure exhibits nonlinear behaviour. In the current work, a method to detect nonlinearity is proposed, based on the distribution of the gradients of a data-driven model, which is fitted on data acquired from the structure of interest. The data-driven model herein is a neural network. The selection of such a type of model was done in order to not allow the user to decide how linear or nonlinear the model shall be, but to let the training algorithm of the neural network shape the level of nonlinearity according to the training data. The neural network is trained to predict the accelerations of the structure for a time-instant using as inputs accelerations of previous time-instants, i.e. one-step-ahead predictions. Afterwards, the gradients of the output of the neural network with respect to its inputs are calculated. Given that the structure is linear, the distribution of the aforementioned gradients should be quite peaked, while in the case of a structure with nonlinearities, the distribution of the gradients shall be more spread and, potentially, multimodal. To test the above assumption, data from an experimental structure are considered. The structure is tested under different scenarios, some of which are linear and some nonlinear. The statistics of the distributions of the gradients for the different scenarios can be used to identify cases where nonlinearity is present. Moreover, via the proposed method one is able to quantify the nonlinearity by observing higher values of standard deviation of the distribution of the gradients for "more nonlinear" scenarios.  ( 3 min )
    Meta-Reinforcement Learning via Exploratory Task Clustering. (arXiv:2302.07958v1 [cs.LG])
    Meta-reinforcement learning (meta-RL) aims to quickly solve new tasks by leveraging knowledge from prior tasks. However, previous studies often assume a single mode homogeneous task distribution, ignoring possible structured heterogeneity among tasks. Leveraging such structures can better facilitate knowledge sharing among related tasks and thus improve sample efficiency. In this paper, we explore the structured heterogeneity among tasks via clustering to improve meta-RL. We develop a dedicated exploratory policy to discover task structures via divide-and-conquer. The knowledge of the identified clusters helps to narrow the search space of task-specific information, leading to more sample efficient policy adaptation. Experiments on various MuJoCo tasks showed the proposed method can unravel cluster structures effectively in both rewards and state dynamics, proving strong advantages against a set of state-of-the-art baselines.  ( 2 min )
    Learning to Substitute Ingredients in Recipes. (arXiv:2302.07960v1 [cs.LG])
    Recipe personalization through ingredient substitution has the potential to help people meet their dietary needs and preferences, avoid potential allergens, and ease culinary exploration in everyone's kitchen. To address ingredient substitution, we build a benchmark, composed of a dataset of substitution pairs with standardized splits, evaluation metrics, and baselines. We further introduce Graph-based Ingredient Substitution Module (GISMo), a novel model that leverages the context of a recipe as well as generic ingredient relational information encoded within a graph to rank plausible substitutions. We show through comprehensive experimental validation that GISMo surpasses the best performing baseline by a large margin in terms of mean reciprocal rank. Finally, we highlight the benefits of GISMo by integrating it in an improved image-to-recipe generation pipeline, enabling recipe personalization through user intervention. Quantitative and qualitative results show the efficacy of our proposed system, paving the road towards truly personalized cooking and tasting experiences.  ( 2 min )
    Interpretable Deep Learning Methods for Multiview Learning. (arXiv:2302.07930v1 [cs.LG])
    Technological advances have enabled the generation of unique and complementary types of data or views (e.g. genomics, proteomics, metabolomics) and opened up a new era in multiview learning research with the potential to lead to new biomedical discoveries. We propose iDeepViewLearn (Interpretable Deep Learning Method for Multiview Learning) for learning nonlinear relationships in data from multiple views while achieving feature selection. iDeepViewLearn combines deep learning flexibility with the statistical benefits of data and knowledge-driven feature selection, giving interpretable results. Deep neural networks are used to learn view-independent low-dimensional embedding through an optimization problem that minimizes the difference between observed and reconstructed data, while imposing a regularization penalty on the reconstructed data. The normalized Laplacian of a graph is used to model bilateral relationships between variables in each view, therefore, encouraging selection of related variables. iDeepViewLearn is tested on simulated and two real-world data, including breast cancer-related gene expression and methylation data. iDeepViewLearn had competitive classification results and identified genes and CpG sites that differentiated between individuals who died from breast cancer and those who did not. The results of our real data application and simulations with small to moderate sample sizes suggest that iDeepViewLearn may be a useful method for small-sample-size problems compared to other deep learning methods for multiview learning.  ( 2 min )
    A Deep Learning Technique to Control the Non-linear Dynamics of a Gravitational-wave Interferometer. (arXiv:2302.07921v1 [cs.LG])
    In this work we developed a deep learning technique that successfully solves a non-linear dynamic control problem. Instead of directly tackling the control problem, we combined methods in probabilistic neural networks and a Kalman-Filter-inspired model to build a non-linear state estimator for the system. We then used the estimated states to implement a trivial controller for the now fully observable system. We applied this technique to a crucial non-linear control problem that arises in the operation of the LIGO system, an interferometric gravitational-wave observatory. We demonstrated in simulation that our approach can learn from data to estimate the state of the system, allowing a successful control of the interferometer's mirror . We also developed a computationally efficient model that can run in real time at high sampling rate on a single modern CPU core, one of the key requirements for the implementation of our solution in the LIGO digital control system. We believe these techniques could be used to help tackle similar non-linear control problems in other applications.  ( 2 min )
    Commonsense Reasoning for Conversational AI: A Survey of the State of the Art. (arXiv:2302.07926v1 [cs.CL])
    Large, transformer-based pretrained language models like BERT, GPT, and T5 have demonstrated a deep understanding of contextual semantics and language syntax. Their success has enabled significant advances in conversational AI, including the development of open-dialogue systems capable of coherent, salient conversations which can answer questions, chat casually, and complete tasks. However, state-of-the-art models still struggle with tasks that involve higher levels of reasoning - including commonsense reasoning that humans find trivial. This paper presents a survey of recent conversational AI research focused on commonsense reasoning. The paper lists relevant training datasets and describes the primary approaches to include commonsense in conversational AI. The paper also discusses benchmarks used for evaluating commonsense in conversational AI problems. Finally, the paper presents preliminary observations of the limited commonsense capabilities of two state-of-the-art open dialogue models, BlenderBot3 and LaMDA, and its negative effect on natural interactions. These observations further motivate research on commonsense reasoning in conversational AI.  ( 2 min )
    AI Security Threats against Pervasive Robotic Systems: A Course for Next Generation Cybersecurity Workforce. (arXiv:2302.07953v1 [cs.CR])
    Robotics, automation, and related Artificial Intelligence (AI) systems have become pervasive bringing in concerns related to security, safety, accuracy, and trust. With growing dependency on physical robots that work in close proximity to humans, the security of these systems is becoming increasingly important to prevent cyber-attacks that could lead to privacy invasion, critical operations sabotage, and bodily harm. The current shortfall of professionals who can defend such systems demands development and integration of such a curriculum. This course description includes details about seven self-contained and adaptive modules on "AI security threats against pervasive robotic systems". Topics include: 1) Introduction, examples of attacks, and motivation; 2) - Robotic AI attack surfaces and penetration testing; 3) - Attack patterns and security strategies for input sensors; 4) - Training attacks and associated security strategies; 5) - Inference attacks and associated security strategies; 6) - Actuator attacks and associated security strategies; and 7) - Ethics of AI, robotics, and cybersecurity.  ( 2 min )
    Enhancing Deep Knowledge Tracing with Auxiliary Tasks. (arXiv:2302.07942v1 [cs.CY])
    Knowledge tracing (KT) is the problem of predicting students' future performance based on their historical interactions with intelligent tutoring systems. Recent studies have applied multiple types of deep neural networks to solve the KT problem. However, there are two important factors in real-world educational data that are not well represented. First, most existing works augment input representations with the co-occurrence matrix of questions and knowledge components\footnote{\label{ft:kc}A KC is a generalization of everyday terms like concept, principle, fact, or skill.} (KCs) but fail to explicitly integrate such intrinsic relations into the final response prediction task. Second, the individualized historical performance of students has not been well captured. In this paper, we proposed \emph{AT-DKT} to improve the prediction performance of the original deep knowledge tracing model with two auxiliary learning tasks, i.e., \emph{question tagging (QT) prediction task} and \emph{individualized prior knowledge (IK) prediction task}. Specifically, the QT task helps learn better question representations by predicting whether questions contain specific KCs. The IK task captures students' global historical performance by progressively predicting student-level prior knowledge that is hidden in students' historical learning interactions. We conduct comprehensive experiments on three real-world educational datasets and compare the proposed approach to both deep sequential KT models and non-sequential models. Experimental results show that \emph{AT-DKT} outperforms all sequential models with more than 0.9\% improvements of AUC for all datasets, and is almost the second best compared to non-sequential models. Furthermore, we conduct both ablation studies and quantitative analysis to show the effectiveness of auxiliary tasks and the superior prediction outcomes of \emph{AT-DKT}.  ( 2 min )
    Experimenting with Emerging ARM and RISC-V Systems for Decentralised Machine Learning. (arXiv:2302.07946v1 [cs.DC])
    Decentralised Machine Learning (DML) enables collaborative machine learning without centralised input data. Federated Learning (FL) and Edge Inference are examples of DML. While tools for DML (especially FL) are starting to flourish, many are not flexible and portable enough to experiment with novel systems (e.g., RISC-V), non-fully connected topologies, and asynchronous collaboration schemes. We overcome these limitations via a domain-specific language allowing to map DML schemes to an underlying middleware, i.e. the \ff parallel programming library. We experiment with it by generating different working DML schemes on two emerging architectures (ARM-v8, RISC-V) and the x86-64 platform. We characterise the performance and energy efficiency of the presented schemes and systems. As a byproduct, we introduce a RISC-V porting of the PyTorch framework, the first publicly available to our knowledge.  ( 2 min )
    Tight Auditing of Differentially Private Machine Learning. (arXiv:2302.07956v1 [cs.LG])
    Auditing mechanisms for differential privacy use probabilistic means to empirically estimate the privacy level of an algorithm. For private machine learning, existing auditing mechanisms are tight: the empirical privacy estimate (nearly) matches the algorithm's provable privacy guarantee. But these auditing techniques suffer from two limitations. First, they only give tight estimates under implausible worst-case assumptions (e.g., a fully adversarial dataset). Second, they require thousands or millions of training runs to produce non-trivial statistical estimates of the privacy leakage. This work addresses both issues. We design an improved auditing scheme that yields tight privacy estimates for natural (not adversarially crafted) datasets -- if the adversary can see all model updates during training. Prior auditing works rely on the same assumption, which is permitted under the standard differential privacy threat model. This threat model is also applicable, e.g., in federated learning settings. Moreover, our auditing scheme requires only two training runs (instead of thousands) to produce tight privacy estimates, by adapting recent advances in tight composition theorems for differential privacy. We demonstrate the utility of our improved auditing schemes by surfacing implementation bugs in private machine learning code that eluded prior auditing techniques.  ( 2 min )
    On Rank Energy Statistics via Optimal Transport: Continuity, Convergence, and Change Point Detection. (arXiv:2302.07964v1 [stat.ML])
    This paper considers the use of recently proposed optimal transport-based multivariate test statistics, namely rank energy and its variant the soft rank energy derived from entropically regularized optimal transport, for the unsupervised nonparametric change point detection (CPD) problem. We show that the soft rank energy enjoys both fast rates of statistical convergence and robust continuity properties which lead to strong performance on real datasets. Our theoretical analyses remove the need for resampling and out-of-sample extensions previously required to obtain such rates. In contrast the rank energy suffers from the curse of dimensionality in statistical estimation and moreover can signal a change point from arbitrarily small perturbations, which leads to a high rate of false alarms in CPD. Additionally, under mild regularity conditions, we quantify the discrepancy between soft rank energy and rank energy in terms of the regularization parameter. Finally, we show our approach performs favorably in numerical experiments compared to several other optimal transport-based methods as well as maximum mean discrepancy.  ( 2 min )
    Topological Neural Discrete Representation Learning \`a la Kohonen. (arXiv:2302.07950v1 [cs.LG])
    Unsupervised learning of discrete representations from continuous ones in neural networks (NNs) is the cornerstone of several applications today. Vector Quantisation (VQ) has become a popular method to achieve such representations, in particular in the context of generative models such as Variational Auto-Encoders (VAEs). For example, the exponential moving average-based VQ (EMA-VQ) algorithm is often used. Here we study an alternative VQ algorithm based on the learning rule of Kohonen Self-Organising Maps (KSOMs; 1982) of which EMA-VQ is a special case. In fact, KSOM is a classic VQ algorithm which is known to offer two potential benefits over the latter: empirically, KSOM is known to perform faster VQ, and discrete representations learned by KSOM form a topological structure on the grid whose nodes are the discrete symbols, resulting in an artificial version of the topographic map in the brain. We revisit these properties by using KSOM in VQ-VAEs for image processing. In particular, our experiments show that, while the speed-up compared to well-configured EMA-VQ is only observable at the beginning of training, KSOM is generally much more robust than EMA-VQ, e.g., w.r.t. the choice of initialisation schemes. Our code is public.  ( 2 min )
    Trust-Region-Free Policy Optimization for Stochastic Policies. (arXiv:2302.07985v1 [cs.LG])
    Trust Region Policy Optimization (TRPO) is an iterative method that simultaneously maximizes a surrogate objective and enforces a trust region constraint over consecutive policies in each iteration. The combination of the surrogate objective maximization and the trust region enforcement has been shown to be crucial to guarantee a monotonic policy improvement. However, solving a trust-region-constrained optimization problem can be computationally intensive as it requires many steps of conjugate gradient and a large number of on-policy samples. In this paper, we show that the trust region constraint over policies can be safely substituted by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee. The key idea is to generalize the surrogate objective used in TRPO in a way that a monotonic improvement guarantee still emerges as a result of constraining the maximum advantage-weighted ratio between policies. This new constraint outlines a conservative mechanism for iterative policy optimization and sheds light on practical ways to optimize the generalized surrogate objective. We show that the new constraint can be effectively enforced by being conservative when optimizing the generalized objective function in practice. We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree) as it is free of any explicit trust region constraints. Empirical results show that TREFree outperforms TRPO and Proximal Policy Optimization (PPO) in terms of policy performance and sample efficiency.  ( 2 min )
    A Meta-Learning Approach to Population-Based Modelling of Structures. (arXiv:2302.07980v1 [cs.LG])
    A major problem of machine-learning approaches in structural dynamics is the frequent lack of structural data. Inspired by the recently-emerging field of population-based structural health monitoring (PBSHM), and the use of transfer learning in this novel field, the current work attempts to create models that are able to transfer knowledge within populations of structures. The approach followed here is meta-learning, which is developed with a view to creating neural network models which are able to exploit knowledge from a population of various tasks to perform well in newly-presented tasks, with minimal training and a small number of data samples from the new task. Essentially, the method attempts to perform transfer learning in an automatic manner within the population of tasks. For the purposes of population-based structural modelling, the different tasks refer to different structures. The method is applied here to a population of simulated structures with a view to predicting their responses as a function of some environmental parameters. The meta-learning approach, which is used herein is the model-agnostic meta-learning (MAML) approach; it is compared to a traditional data-driven modelling approach, that of Gaussian processes, which is a quite effective alternative when few data samples are available for a problem. It is observed that the models trained using meta-learning approaches, are able to outperform conventional machine learning methods regarding inference about structures of the population, for which only a small number of samples are available. Moreover, the models prove to learn part of the physics of the problem, making them more robust than plain machine-learning algorithms. Another advantage of the methods is that the structures do not need to be parametrised in order for the knowledge transfer to be performed.  ( 2 min )
    The Expressive Power of Tuning Only the Norm Layers. (arXiv:2302.07937v1 [cs.LG])
    Feature normalization transforms such as Batch and Layer-Normalization have become indispensable ingredients of state-of-the-art deep neural networks. Recent studies on fine-tuning large pretrained models indicate that just tuning the parameters of these affine transforms can achieve high accuracy for downstream tasks. These findings open the questions about the expressive power of tuning the normalization layers of frozen networks. In this work, we take the first step towards this question and show that for random ReLU networks, fine-tuning only its normalization layers can reconstruct any target network that is $O(\sqrt{\text{width}})$ times smaller. We show that this holds even for randomly sparsified networks, under sufficient overparameterization, in agreement with prior empirical work.  ( 2 min )
    Multi-Task Differential Privacy Under Distribution Skew. (arXiv:2302.07975v1 [cs.LG])
    We study the problem of multi-task learning under user-level differential privacy, in which $n$ users contribute data to $m$ tasks, each involving a subset of users. One important aspect of the problem, that can significantly impact quality, is the distribution skew among tasks. Certain tasks may have much fewer data samples than others, making them more susceptible to the noise added for privacy. It is natural to ask whether algorithms can adapt to this skew to improve the overall utility. We give a systematic analysis of the problem, by studying how to optimally allocate a user's privacy budget among tasks. We propose a generic algorithm, based on an adaptive reweighting of the empirical loss, and show that when there is task distribution skew, this gives a quantifiable improvement of excess empirical risk. Experimental studies on recommendation problems that exhibit a long tail of small tasks, demonstrate that our methods significantly improve utility, achieving the state of the art on two standard benchmarks.  ( 2 min )
  • Open

    Improved Discretization Analysis for Underdamped Langevin Monte Carlo. (arXiv:2302.08049v1 [math.ST])
    Underdamped Langevin Monte Carlo (ULMC) is an algorithm used to sample from unnormalized densities by leveraging the momentum of a particle moving in a potential well. We provide a novel analysis of ULMC, motivated by two central questions: (1) Can we obtain improved sampling guarantees beyond strong log-concavity? (2) Can we achieve acceleration for sampling? For (1), prior results for ULMC only hold under a log-Sobolev inequality together with a restrictive Hessian smoothness condition. Here, we relax these assumptions by removing the Hessian smoothness condition and by considering distributions satisfying a Poincar\'e inequality. Our analysis achieves the state of art dimension dependence, and is also flexible enough to handle weakly smooth potentials. As a byproduct, we also obtain the first KL divergence guarantees for ULMC without Hessian smoothness under strong log-concavity, which is based on a new result on the log-Sobolev constant along the underdamped Langevin diffusion. For (2), the recent breakthrough of Cao, Lu, and Wang (2020) established the first accelerated result for sampling in continuous time via PDE methods. Our discretization analysis translates their result into an algorithmic guarantee, which indeed enjoys better condition number dependence than prior works on ULMC, although we leave open the question of full acceleration in discrete time. Both (1) and (2) necessitate R\'enyi discretization bounds, which are more challenging than the typically used Wasserstein coupling arguments. We address this using a flexible discretization analysis based on Girsanov's theorem that easily extends to more general settings.  ( 2 min )
    Trieste: Efficiently Exploring The Depths of Black-box Functions with TensorFlow. (arXiv:2302.08436v1 [stat.ML])
    We present Trieste, an open-source Python package for Bayesian optimization and active learning benefiting from the scalability and efficiency of TensorFlow. Our library enables the plug-and-play of popular TensorFlow-based models within sequential decision-making loops, e.g. Gaussian processes from GPflow or GPflux, or neural networks from Keras. This modular mindset is central to the package and extends to our acquisition functions and the internal dynamics of the decision-making loop, both of which can be tailored and extended by researchers or engineers when tackling custom use cases. Trieste is a research-friendly and production-ready toolkit backed by a comprehensive test suite, extensive documentation, and available at https://github.com/secondmind-labs/trieste.
    Understanding Neural Coding on Latent Manifolds by Sharing Features and Dividing Ensembles. (arXiv:2210.03155v2 [stat.ML] UPDATED)
    Systems neuroscience relies on two complementary views of neural data, characterized by single neuron tuning curves and analysis of population activity. These two perspectives combine elegantly in neural latent variable models that constrain the relationship between latent variables and neural activity, modeled by simple tuning curve functions. This has recently been demonstrated using Gaussian processes, with applications to realistic and topologically relevant latent manifolds. Those and previous models, however, missed crucial shared coding properties of neural populations. We propose feature sharing across neural tuning curves which significantly improves performance and helps optimization. We also propose a solution to the ensemble detection problem, where different groups of neurons, i.e., ensembles, can be modulated by different latent manifolds. Achieved through a soft clustering of neurons during training, this allows for the separation of mixed neural populations in an unsupervised manner. These innovations lead to more interpretable models of neural population activity that train well and perform better even on mixtures of complex latent manifolds. Finally, we apply our method on a recently published grid cell dataset, and recover distinct ensembles, infer toroidal latents and predict neural tuning curves in a single integrated modeling framework.
    Learning Hypergraphs From Signals With Dual Smoothness Prior. (arXiv:2211.01717v2 [cs.LG] UPDATED)
    The construction of a meaningful hypergraph topology is the key to processing signals with high-order relationships that involve more than two entities. Learning the hypergraph structure from the observed signals to capture the intrinsic relationships among the entities becomes crucial when a hypergraph topology is not readily available in the datasets. There are two challenges that lie at the heart of this problem: 1) how to handle the huge search space of potential hyperedges, and 2) how to define meaningful criteria to measure the relationship between the signals observed on nodes and the hypergraph structure. In this paper, to address the first challenge, we adopt the assumption that the ideal hypergraph structure can be derived from a learnable graph structure that captures the pairwise relations within signals. Further, we propose a hypergraph learning framework with a novel dual smoothness prior that reveals a mapping between the observed node signals and the hypergraph structure, whereby each hyperedge corresponds to a subgraph with both node signal smoothness and edge signal smoothness in the learnable graph structure. Finally, we conduct extensive experiments to evaluate the proposed framework on both synthetic and real world datasets. Experiments show that our proposed framework can efficiently infer meaningful hypergraph topologies from observed signals.
    Multi-Task Differential Privacy Under Distribution Skew. (arXiv:2302.07975v1 [cs.LG])
    We study the problem of multi-task learning under user-level differential privacy, in which $n$ users contribute data to $m$ tasks, each involving a subset of users. One important aspect of the problem, that can significantly impact quality, is the distribution skew among tasks. Certain tasks may have much fewer data samples than others, making them more susceptible to the noise added for privacy. It is natural to ask whether algorithms can adapt to this skew to improve the overall utility. We give a systematic analysis of the problem, by studying how to optimally allocate a user's privacy budget among tasks. We propose a generic algorithm, based on an adaptive reweighting of the empirical loss, and show that when there is task distribution skew, this gives a quantifiable improvement of excess empirical risk. Experimental studies on recommendation problems that exhibit a long tail of small tasks, demonstrate that our methods significantly improve utility, achieving the state of the art on two standard benchmarks.
    Deep Variational Implicit Processes. (arXiv:2206.06720v2 [stat.ML] UPDATED)
    Implicit processes (IPs) are a generalization of Gaussian processes (GPs). IPs may lack a closed-form expression but are easy to sample from. Examples include, among others, Bayesian neural networks or neural samplers. IPs can be used as priors over functions, resulting in flexible models with well-calibrated prediction uncertainty estimates. Methods based on IPs usually carry out function-space approximate inference, which overcomes some of the difficulties of parameter-space approximate inference. Nevertheless, the approximations employed often limit the expressiveness of the final model, resulting, e.g., in a Gaussian predictive distribution, which can be restrictive. We propose here a multi-layer generalization of IPs called the Deep Variational Implicit process (DVIP). This generalization is similar to that of deep GPs over GPs, but it is more flexible due to the use of IPs as the prior distribution over the latent functions. We describe a scalable variational inference algorithm for training DVIP and show that it outperforms previous IP-based methods and also deep GPs. We support these claims via extensive regression and classification experiments. We also evaluate DVIP on large datasets with up to several million data instances to illustrate its good scalability and performance.
    On the Limit Performance of Floating Gossip. (arXiv:2302.08413v1 [stat.ML])
    In this paper we investigate the limit performance of Floating Gossip, a new, fully distributed Gossip Learning scheme which relies on Floating Content to implement location-based probabilistic evolution of machine learning models in an infrastructure-less manner. We consider dynamic scenarios where continuous learning is necessary, and we adopt a mean field approach to investigate the limit performance of Floating Gossip in terms of amount of data that users can incorporate into their models, as a function of the main system parameters. Different from existing approaches in which either communication or computing aspects of Gossip Learning are analyzed and optimized, our approach accounts for the compound impact of both aspects. We validate our results through detailed simulations, proving good accuracy. Our model shows that Floating Gossip can be very effective in implementing continuous training and update of machine learning models in a cooperative manner, based on opportunistic exchanges among moving users.
    Temporal Graph Neural Networks for Irregular Data. (arXiv:2302.08415v1 [stat.ML])
    This paper proposes a temporal graph neural network model for forecasting of graph-structured irregularly observed time series. Our TGNN4I model is designed to handle both irregular time steps and partial observations of the graph. This is achieved by introducing a time-continuous latent state in each node, following a linear Ordinary Differential Equation (ODE) defined by the output of a Gated Recurrent Unit (GRU). The ODE has an explicit solution as a combination of exponential decay and periodic dynamics. Observations in the graph neighborhood are taken into account by integrating graph neural network layers in both the GRU state update and predictive model. The time-continuous dynamics additionally enable the model to make predictions at arbitrary time steps. We propose a loss function that leverages this and allows for training the model for forecasting over different time horizons. Experiments on simulated data and real-world data from traffic and climate modeling validate the usefulness of both the graph structure and time-continuous dynamics in settings with irregular observations.
    Unsupervised Manifold Alignment with Joint Multidimensional Scaling. (arXiv:2207.02968v2 [stat.ML] UPDATED)
    We introduce Joint Multidimensional Scaling, a novel approach for unsupervised manifold alignment, which maps datasets from two different domains, without any known correspondences between data instances across the datasets, to a common low-dimensional Euclidean space. Our approach integrates Multidimensional Scaling (MDS) and Wasserstein Procrustes analysis into a joint optimization problem to simultaneously generate isometric embeddings of data and learn correspondences between instances from two different datasets, while only requiring intra-dataset pairwise dissimilarities as input. This unique characteristic makes our approach applicable to datasets without access to the input features, such as solving the inexact graph matching problem. We propose an alternating optimization scheme to solve the problem that can fully benefit from the optimization techniques for MDS and Wasserstein Procrustes. We demonstrate the effectiveness of our approach in several applications, including joint visualization of two datasets, unsupervised heterogeneous domain adaptation, graph matching, and protein structure alignment. The implementation of our work is available at https://github.com/BorgwardtLab/JointMDS
    Motivation literally. Construction and expression of educational aspirations on Parcoursup. (arXiv:2302.08256v1 [stat.ML])
    This paper analyses the framing and expression of French high school students' aspirations. It sheds new light on the inequalities in tracking between academic versus technological and vocational track. Through the analysis of a national survey and a corpus of cover letters written by applicants for a sociology degree, it shows that, due to the lack of means, teachers mainly have two types of guidance support strategies.Teachers use to target and concentrate their supporting on ``good students'' in vocational tracks, while, in academic tracks, they delegate some steps of the tracking procedures to families. These different strategies have effects on the way high school students internalise school prescriptions and restitute them in cover letters. Through the close support they benefit from teachers, ``good students'' in vocational tracks strongly internalise the instructions and their place in the school hierarchy. In academic tracks, students' expression of the aspirations is much more dependent of their familial capital.
    A weighted subspace exponential kernel for support tensor machines. (arXiv:2302.08134v1 [stat.ML])
    High-dimensional data in the form of tensors are challenging for kernel classification methods. To both reduce the computational complexity and extract informative features, kernels based on low-rank tensor decompositions have been proposed. However, what decisive features of the tensors are exploited by these kernels is often unclear. In this paper we propose a novel kernel that is based on the Tucker decomposition. For this kernel the Tucker factors are computed based on re-weighting of the Tucker matrices with tuneable powers of singular values from the HOSVD decomposition. This provides a mechanism to balance the contribution of the Tucker core and factors of the data. We benchmark support tensor machines with this new kernel on several datasets. First we generate synthetic data where two classes differ in either Tucker factors or core, and compare our novel and previously existing kernels. We show robustness of the new kernel with respect to both classification scenarios. We further test the new method on real-world datasets. The proposed kernel has demonstrated a higher test accuracy than the state-of-the-art tensor train multi-way multi-level kernel, and a significantly lower computational time.
    New $\sqrt{n}$-consistent, numerically stable higher-order influence function estimators. (arXiv:2302.08097v1 [math.ST])
    Higher-Order Influence Functions (HOIFs) provide a unified theory for constructing rate-optimal estimators for a large class of low-dimensional (smooth) statistical functionals/parameters (and sometimes even infinite-dimensional functions) that arise in substantive fields including epidemiology, economics, and the social sciences. Since the introduction of HOIFs by Robins et al. (2008), they have been viewed mostly as a theoretical benchmark rather than a useful tool for statistical practice. Works aimed to flip the script are scant, but a few recent papers Liu et al. (2017, 2021b) make some partial progress. In this paper, we take a fresh attempt at achieving this goal by constructing new, numerically stable HOIF estimators (or sHOIF estimators for short with ``s'' standing for ``stable'') with provable statistical, numerical, and computational guarantees. This new class of sHOIF estimators (up to the 2nd order) was foreshadowed in synthetic experiments conducted by Liu et al. (2020a).
    Settling the Sample Complexity of Model-Based Offline Reinforcement Learning. (arXiv:2204.05275v2 [stat.ML] UPDATED)
    This paper is concerned with offline reinforcement learning (RL), which learns using pre-collected data without further exploration. Effective offline RL would be able to accommodate distribution shift and limited data coverage. However, prior algorithms or analyses either suffer from suboptimal sample complexities or incur high burn-in cost to reach sample optimality, thus posing an impediment to efficient offline RL in sample-starved applications. We demonstrate that the model-based (or "plug-in") approach achieves minimax-optimal sample complexity without burn-in cost for tabular Markov decision processes (MDPs). Concretely, consider a finite-horizon (resp. $\gamma$-discounted infinite-horizon) MDP with $S$ states and horizon $H$ (resp. effective horizon $\frac{1}{1-\gamma}$), and suppose the distribution shift of data is reflected by some single-policy clipped concentrability coefficient $C^{\star}_{\text{clipped}}$. We prove that model-based offline RL yields $\varepsilon$-accuracy with a sample complexity of \[ \begin{cases} \frac{H^{4}SC_{\text{clipped}}^{\star}}{\varepsilon^{2}} & (\text{finite-horizon MDPs}) \frac{SC_{\text{clipped}}^{\star}}{(1-\gamma)^{3}\varepsilon^{2}} & (\text{infinite-horizon MDPs}) \end{cases} \] up to log factor, which is minimax optimal for the entire $\varepsilon$-range. The proposed algorithms are ``pessimistic'' variants of value iteration with Bernstein-style penalties, and do not require sophisticated variance reduction. Our analysis framework is established upon delicate leave-one-out decoupling arguments in conjunction with careful self-bounding techniques tailored to MDPs.
    Enhancing High-dimensional Bayesian Optimization by Optimizing the Acquisition Function Maximizer Initialization. (arXiv:2302.08298v1 [cs.LG])
    Bayesian optimization (BO) is widely used to optimize black-box functions. It works by first building a surrogate for the objective and quantifying the uncertainty in that surrogate. It then decides where to sample by maximizing an acquisition function defined by the surrogate model. Prior approaches typically use randomly generated raw samples to initialize the acquisition function maximizer. However, this strategy is ill-suited for high-dimensional BO. Given the large regions of high posterior uncertainty in high dimensions, a randomly initialized acquisition function maximizer is likely to focus on areas with high posterior uncertainty, leading to overly exploring areas that offer little gain. This paper provides the first comprehensive empirical study to reveal the importance of the initialization phase of acquisition function maximization. It proposes a better initialization approach by employing multiple heuristic optimizers to leverage the knowledge of already evaluated samples to generate initial points to be explored by an acquisition function maximizer. We evaluate our approach on widely used synthetic test functions and real-world applications. Experimental results show that our techniques, while simple, can significantly enhance the standard BO and outperforms state-of-the-art high-dimensional BO techniques by a large margin in most test cases.
    FoSR: First-order spectral rewiring for addressing oversquashing in GNNs. (arXiv:2210.11790v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs) are able to leverage the structure of graph data by passing messages along the edges of the graph. While this allows GNNs to learn features depending on the graph structure, for certain graph topologies it leads to inefficient information propagation and a problem known as oversquashing. This has recently been linked with the curvature and spectral gap of the graph. On the other hand, adding edges to the message-passing graph can lead to increasingly similar node representations and a problem known as oversmoothing. We propose a computationally efficient algorithm that prevents oversquashing by systematically adding edges to the graph based on spectral expansion. We combine this with a relational architecture, which lets the GNN preserve the original graph structure and provably prevents oversmoothing. We find experimentally that our algorithm outperforms existing graph rewiring methods in several graph classification tasks.
    On the Identifiability of Nonlinear ICA: Sparsity and Beyond. (arXiv:2206.07751v3 [cs.LG] UPDATED)
    Nonlinear independent component analysis (ICA) aims to recover the underlying independent latent sources from their observable nonlinear mixtures. How to make the nonlinear ICA model identifiable up to certain trivial indeterminacies is a long-standing problem in unsupervised learning. Recent breakthroughs reformulate the standard independence assumption of sources as conditional independence given some auxiliary variables (e.g., class labels and/or domain/time indexes) as weak supervision or inductive bias. However, nonlinear ICA with unconditional priors cannot benefit from such developments. We explore an alternative path and consider only assumptions on the mixing process, such as Structural Sparsity. We show that under specific instantiations of such constraints, the independent latent sources can be identified from their nonlinear mixtures up to a permutation and a component-wise transformation, thus achieving nontrivial identifiability of nonlinear ICA without auxiliary variables. We provide estimation methods and validate the theoretical results experimentally. The results on image data suggest that our conditions may hold in a number of practical data generating processes.
    Choosing the Number of Topics in LDA Models -- A Monte Carlo Comparison of Selection Criteria. (arXiv:2212.14074v2 [cs.CL] UPDATED)
    Selecting the number of topics in LDA models is considered to be a difficult task, for which alternative approaches have been proposed. The performance of the recently developed singular Bayesian information criterion (sBIC) is evaluated and compared to the performance of alternative model selection criteria. The sBIC is a generalization of the standard BIC that can be implemented to singular statistical models. The comparison is based on Monte Carlo simulations and carried out for several alternative settings, varying with respect to the number of topics, the number of documents and the size of documents in the corpora. Performance is measured using different criteria which take into account the correct number of topics, but also whether the relevant topics from the DGPs are identified. Practical recommendations for LDA model selection in applications are derived.
    Fair mapping. (arXiv:2209.00617v2 [cs.LG] UPDATED)
    To mitigate the effects of undesired biases in models, several approaches propose to pre-process the input dataset to reduce the risks of discrimination by preventing the inference of sensitive attributes. Unfortunately, most of these pre-processing methods lead to the generation a new distribution that is very different from the original one, thus often leading to unrealistic data. As a side effect, this new data distribution implies that existing models need to be re-trained to be able to make accurate predictions. To address this issue, we propose a novel pre-processing method, that we coin as fair mapping, based on the transformation of the distribution of protected groups onto a chosen target one, with additional privacy constraints whose objective is to prevent the inference of sensitive attributes. More precisely, we leverage on the recent works of the Wasserstein GAN and AttGAN frameworks to achieve the optimal transport of data points coupled with a discriminator enforcing the protection against attribute inference. Our proposed approach, preserves the interpretability of data and can be used without defining exactly the sensitive groups. In addition, our approach can be specialized to model existing state-of-the-art approaches, thus proposing a unifying view on these methods. Finally, several experiments on real and synthetic datasets demonstrate that our approach is able to hide the sensitive attributes, while limiting the distortion of the data and improving the fairness on subsequent data analysis tasks.
    Energy Transformer. (arXiv:2302.07253v1 [cs.LG] CROSS LISTED)
    Transformers have become the de facto models of choice in machine learning, typically leading to impressive performance on many applications. At the same time, the architectural development in the transformer world is mostly driven by empirical findings, and the theoretical understanding of their architectural building blocks is rather limited. In contrast, Dense Associative Memory models or Modern Hopfield Networks have a well-established theoretical foundation, but have not yet demonstrated truly impressive practical results. We propose a transformer architecture that replaces the sequence of feedforward transformer blocks with a single large Associative Memory model. Our novel architecture, called Energy Transformer (or ET for short), has many of the familiar architectural primitives that are often used in the current generation of transformers. However, it is not identical to the existing architectures. The sequence of transformer layers in ET is purposely designed to minimize a specifically engineered energy function, which is responsible for representing the relationships between the tokens. As a consequence of this computational principle, the attention in ET is different from the conventional attention mechanism. In this work, we introduce the theoretical foundations of ET, explore it's empirical capabilities using the image completion task, and obtain strong quantitative results on the graph anomaly detection task.
    GP CC-OPF: Gaussian Process based optimization tool for Chance-Constrained Optimal Power Flow. (arXiv:2302.08454v1 [stat.ML])
    The Gaussian Process (GP) based Chance-Constrained Optimal Power Flow (CC-OPF) is an open-source Python code developed for solving economic dispatch (ED) problem in modern power grids. In recent years, integrating a significant amount of renewables into a power grid causes high fluctuations and thus brings a lot of uncertainty to power grid operations. This fact makes the conventional model-based CC-OPF problem non-convex and computationally complex to solve. The developed tool presents a novel data-driven approach based on the GP regression model for solving the CC-OPF problem with a trade-off between complexity and accuracy. The proposed approach and developed software can help system operators to effectively perform ED optimization in the presence of large uncertainties in the power grid.
    The autoregressive neural network architecture of the Boltzmann distribution of pairwise interacting spins systems. (arXiv:2302.08347v1 [cond-mat.dis-nn])
    Generative Autoregressive Neural Networks (ARNN) have recently demonstrated exceptional results in image and language generation tasks, contributing to the growing popularity of generative models in both scientific and commercial applications. This work presents a physical interpretation of the ARNNs by reformulating the Boltzmann distribution of binary pairwise interacting systems into autoregressive form. The resulting ARNN architecture has weights and biases of its first layer corresponding to the Hamiltonian's couplings and external fields, featuring widely used structures like the residual connections and a recurrent architecture with clear physical meanings. However, the exponential growth, with system size, of the number of parameters of the hidden layers makes its direct application unfeasible. Nevertheless, its architecture's explicit formulation allows using statistical physics techniques to derive new ARNNs for specific systems. As examples, new effective ARNN architectures are derived from two well-known mean-field systems, the Curie-Weiss and Sherrington-Kirkpatrick models, showing superior performances in approximating the Boltzmann distributions of the corresponding physics model than other commonly used ARNNs architectures. The connection established between the physics of the system and the ARNN architecture provides a way to derive new neural network architectures for different interacting systems and interpret existing ones from a physical perspective.
    The Expressive Power of Tuning Only the Norm Layers. (arXiv:2302.07937v1 [cs.LG])
    Feature normalization transforms such as Batch and Layer-Normalization have become indispensable ingredients of state-of-the-art deep neural networks. Recent studies on fine-tuning large pretrained models indicate that just tuning the parameters of these affine transforms can achieve high accuracy for downstream tasks. These findings open the questions about the expressive power of tuning the normalization layers of frozen networks. In this work, we take the first step towards this question and show that for random ReLU networks, fine-tuning only its normalization layers can reconstruct any target network that is $O(\sqrt{\text{width}})$ times smaller. We show that this holds even for randomly sparsified networks, under sufficient overparameterization, in agreement with prior empirical work.
    From Graph Generation to Graph Classification. (arXiv:2302.07989v1 [cs.LG])
    This note describes a new approach to classifying graphs that leverages graph generative models (GGM). Assuming a GGM that defines a joint probability distribution over graphs and their class labels, I derive classification formulas for the probability of a class label given a graph. A new conditional ELBO can be used to train a generative graph auto-encoder model for discrimination. While leveraging generative models for classification has been well explored for non-relational i.i.d. data, to our knowledge it is a novel approach to graph classification.
    Frugal day-ahead forecasting of multiple local electricity loads by aggregating adaptive models. (arXiv:2302.08192v1 [cs.LG])
    We focus on day-ahead electricity load forecasting of substations of the distribution network in France; therefore, our problem lies between the instability of a single consumption and the stability of a countrywide total demand. Moreover, we are interested in forecasting the loads of over one thousand substations; consequently, we are in the context of forecasting multiple time series. To that end, we rely on an adaptive methodology that provided excellent results at a national scale; the idea is to combine generalized additive models with state-space representations. However, the extension of this methodology to the prediction of over a thousand time series raises a computational issue. We solve it by developing a frugal variant, reducing the number of parameters estimated; we estimate the forecasting models only for a few time series and achieve transfer learning by relying on aggregation of experts. It yields a reduction of computational needs and their associated emissions. We build several variants, corresponding to different levels of parameter transfer, and we look for the best trade-off between accuracy and frugality. The selected method achieves competitive results compared to state-of-the-art individual models. Finally, we highlight the interpretability of the models, which is important for operational applications.
    Aligning Language Models with Preferences through f-divergence Minimization. (arXiv:2302.08215v1 [cs.CL])
    Aligning language models with preferences can be posed as approximating a target distribution representing some desired behavior. Existing approaches differ both in the functional form of the target distribution and the algorithm used to approximate it. For instance, Reinforcement Learning from Human Feedback (RLHF) corresponds to minimizing a reverse KL from an implicit target distribution arising from a KL penalty in the objective. On the other hand, Generative Distributional Control (GDC) has an explicit target distribution and minimizes a forward KL from it using the Distributional Policy Gradient (DPG) algorithm. In this paper, we propose a new approach, f-DPG, which allows the use of any f-divergence to approximate any target distribution. f-DPG unifies both frameworks (RLHF, GDC) and the approximation methods (DPG, RL with KL penalties). We show the practical benefits of various choices of divergence objectives and demonstrate that there is no universally optimal objective but that different divergences are good for approximating different targets. For instance, we discover that for GDC, the Jensen-Shannon divergence frequently outperforms forward KL divergence by a wide margin, leading to significant improvements over prior work.
    Entity Aware Modelling: A Survey. (arXiv:2302.08406v1 [cs.LG])
    Personalized prediction of responses for individual entities caused by external drivers is vital across many disciplines. Recent machine learning (ML) advances have led to new state-of-the-art response prediction models. Models built at a population level often lead to sub-optimal performance in many personalized prediction settings due to heterogeneity in data across entities (tasks). In personalized prediction, the goal is to incorporate inherent characteristics of different entities to improve prediction performance. In this survey, we focus on the recent developments in the ML community for such entity-aware modeling approaches. ML algorithms often modulate the network using these entity characteristics when they are readily available. However, these entity characteristics are not readily available in many real-world scenarios, and different ML methods have been proposed to infer these characteristics from the data. In this survey, we have organized the current literature on entity-aware modeling based on the availability of these characteristics as well as the amount of training data. We highlight how recent innovations in other disciplines, such as uncertainty quantification, fairness, and knowledge-guided machine learning, can improve entity-aware modeling.
    Classifier Calibration: A survey on how to assess and improve predicted class probabilities. (arXiv:2112.10327v2 [cs.LG] UPDATED)
    This paper provides both an introduction to and a detailed overview of the principles and practice of classifier calibration. A well-calibrated classifier correctly quantifies the level of uncertainty or confidence associated with its instance-wise predictions. This is essential for critical applications, optimal decision making, cost-sensitive classification, and for some types of context change. Calibration research has a rich history which predates the birth of machine learning as an academic field by decades. However, a recent increase in the interest on calibration has led to new methods and the extension from binary to the multiclass setting. The space of options and issues to consider is large, and navigating it requires the right set of concepts and tools. We provide both introductory material and up-to-date technical details of the main concepts and methods, including proper scoring rules and other evaluation metrics, visualisation approaches, a comprehensive account of post-hoc calibration methods for binary and multiclass classification, and several advanced topics.
    A Proximal Algorithm for Sampling. (arXiv:2202.13975v2 [cs.LG] UPDATED)
    We study sampling problems associated with potentials that lack smoothness. The potentials can be either convex or non-convex. Departing from the standard smooth setting, the potentials are only assumed to be weakly smooth or non-smooth, or the summation of multiple such functions. We develop a sampling algorithm that resembles proximal algorithms in optimization for this challenging sampling task. Our algorithm is based on a special case of Gibbs sampling known as the alternating sampling framework (ASF). The key contribution of this work is a practical realization of the ASF based on rejection sampling for both non-convex and convex potentials that are not necessarily smooth. In almost all the cases of sampling considered in this work, our proximal sampling algorithm achieves better complexity than all existing methods.  ( 2 min )
    Adaptive Selective Sampling for Online Prediction with Experts. (arXiv:2302.08397v1 [stat.ML])
    We consider online prediction of a binary sequence with expert advice. For this setting, we devise label-efficient forecasting algorithms, which use a selective sampling scheme that enables collecting much fewer labels than standard procedures, while still retaining optimal worst-case regret guarantees. These algorithms are based on exponentially weighted forecasters, suitable for settings with and without a perfect expert. For a scenario where one expert is strictly better than the others in expectation, we show that the label complexity of the label-efficient forecaster scales roughly as the square root of the number of rounds. Finally, we present numerical experiments empirically showing that the normalized regret of the label-efficient forecaster can asymptotically match known minimax rates for pool-based active learning, suggesting it can optimally adapt to benign settings.  ( 2 min )
    A Geometric Reduction Approach for Identity Testing of Reversible Markov Chains. (arXiv:2302.08059v1 [math.PR])
    We consider the problem of testing the identity of a reversible Markov chain against a reference from a single trajectory of observations. Employing the recently introduced notion of a lumping-congruent Markov embedding, we show that, at least in a mildly restricted setting, testing identity to a reversible chain reduces to testing to a symmetric chain over a larger state space and recover state-of-the-art sample complexity for the problem.  ( 2 min )
    Theory and Implementation of Complex-Valued Neural Networks. (arXiv:2302.08286v1 [stat.ML])
    This work explains in detail the theory behind Complex-Valued Neural Network (CVNN), including Wirtinger calculus, complex backpropagation, and basic modules such as complex layers, complex activation functions, or complex weight initialization. We also show the impact of not adapting the weight initialization correctly to the complex domain. This work presents a strong focus on the implementation of such modules on Python using cvnn toolbox. We also perform simulations on real-valued data, casting to the complex domain by means of the Hilbert Transform, and verifying the potential interest of CVNN even for non-complex data.  ( 2 min )
    Flexible risk design using bi-directional dispersion. (arXiv:2203.14434v3 [stat.ML] UPDATED)
    Many novel notions of "risk" (e.g., CVaR, tilted risk, DRO risk) have been proposed and studied, but these risks are all at least as sensitive as the mean to loss tails on the upside, and tend to ignore deviations on the downside. We study a complementary new risk class that penalizes loss deviations in a bi-directional manner, while having more flexibility in terms of tail sensitivity than is offered by mean-variance. This class lets us derive high-probability learning guarantees without explicit gradient clipping, and empirical tests using both simulated and real data illustrate a high degree of control over key properties of the test loss distribution incurred by gradient-based learners.  ( 2 min )
    A mirror descent approach for Mean Field Control applied to Demande-Side management. (arXiv:2302.08190v1 [math.OC])
    We consider a finite-horizon Mean Field Control problem for Markovian models. The objective function is composed of a sum of convex and Lipschitz functions taking their values on a space of state-action distributions. We introduce an iterative algorithm which we prove to be a Mirror Descent associated with a non-standard Bregman divergence, having a convergence rate of order 1/ $\sqrt$ K. It requires the solution of a simple dynamic programming problem at each iteration. We compare this algorithm with learning methods for Mean Field Games after providing a reformulation of our control problem as a game problem. These theoretical contributions are illustrated with numerical examples applied to a demand-side management problem for power systems aimed at controlling the average power consumption profile of a population of flexible devices contributing to the power system balance.
    Complementary Composite Minimization, Small Gradients in General Norms, and Applications. (arXiv:2101.11041v2 [math.OC] UPDATED)
    Composite minimization is a powerful framework in large-scale convex optimization, based on decoupling of the objective function into terms with structurally different properties and allowing for more flexible algorithmic design. We introduce a new algorithmic framework for complementary composite minimization, where the objective function decouples into a (weakly) smooth and a uniformly convex term. This particular form of decoupling is pervasive in statistics and machine learning, due to its link to regularization. The main contributions of our work are summarized as follows. First, we introduce the problem of complementary composite minimization in general normed spaces; second, we provide a unified accelerated algorithmic framework to address broad classes of complementary composite minimization problems; and third, we prove that the algorithms resulting from our framework are near-optimal in most of the standard optimization settings. Additionally, we show that our algorithmic framework can be used to address the problem of making the gradients small in general normed spaces. As a concrete example, we obtain a nearly-optimal method for the standard $\ell_1$ setup (small gradients in the $\ell_{\infty}$ norm), essentially matching the bound of Nesterov (2012) that was previously known only for the Euclidean setup. Finally, we show that our composite methods are broadly applicable to a number of regression and other classes of optimization problems, where regularization plays a key role. Our methods lead to complexity bounds that are either new or match the best existing ones.  ( 2 min )
    Marich: A Query-efficient Distributionally Equivalent Model Extraction Attack using Public Data. (arXiv:2302.08466v1 [cs.LG])
    We study black-box model stealing attacks where the attacker can query a machine learning model only through publicly available APIs. Specifically, our aim is to design a black-box model extraction attack that uses minimal number of queries to create an informative and distributionally equivalent replica of the target model. First, we define distributionally equivalent and max-information model extraction attacks. Then, we reduce both the attacks into a variational optimisation problem. The attacker solves this problem to select the most informative queries that simultaneously maximise the entropy and reduce the mismatch between the target and the stolen models. This leads us to an active sampling-based query selection algorithm, Marich. We evaluate Marich on different text and image data sets, and different models, including BERT and ResNet18. Marich is able to extract models that achieve $69-96\%$ of true model's accuracy and uses $1,070 - 6,950$ samples from the publicly available query datasets, which are different from the private training datasets. Models extracted by Marich yield prediction distributions, which are $\sim2-4\times$ closer to the target's distribution in comparison to the existing active sampling-based algorithms. The extracted models also lead to $85-95\%$ accuracy under membership inference attacks. Experimental results validate that Marich is query-efficient, and also capable of performing task-accurate, high-fidelity, and informative model extraction.  ( 2 min )
    Online Estimation and Optimization of Utility-Based Shortfall Risk. (arXiv:2111.08805v2 [stat.ML] UPDATED)
    Utility-Based Shortfall Risk (UBSR) is a risk metric that is increasingly popular in financial applications, owing to certain desirable properties that it enjoys. We consider the problem of estimating UBSR in a recursive setting, where samples from the underlying loss distribution are available one-at-a-time. We cast the UBSR estimation problem as a root finding problem, and propose stochastic approximation-based estimations schemes. We derive non-asymptotic bounds on the estimation error in the number of samples. We also consider the problem of UBSR optimization within a parameterized class of random variables. We propose a stochastic gradient descent based algorithm for UBSR optimization, and derive non-asymptotic bounds on its convergence.  ( 2 min )
    Interpretable Deep Learning Methods for Multiview Learning. (arXiv:2302.07930v1 [cs.LG])
    Technological advances have enabled the generation of unique and complementary types of data or views (e.g. genomics, proteomics, metabolomics) and opened up a new era in multiview learning research with the potential to lead to new biomedical discoveries. We propose iDeepViewLearn (Interpretable Deep Learning Method for Multiview Learning) for learning nonlinear relationships in data from multiple views while achieving feature selection. iDeepViewLearn combines deep learning flexibility with the statistical benefits of data and knowledge-driven feature selection, giving interpretable results. Deep neural networks are used to learn view-independent low-dimensional embedding through an optimization problem that minimizes the difference between observed and reconstructed data, while imposing a regularization penalty on the reconstructed data. The normalized Laplacian of a graph is used to model bilateral relationships between variables in each view, therefore, encouraging selection of related variables. iDeepViewLearn is tested on simulated and two real-world data, including breast cancer-related gene expression and methylation data. iDeepViewLearn had competitive classification results and identified genes and CpG sites that differentiated between individuals who died from breast cancer and those who did not. The results of our real data application and simulations with small to moderate sample sizes suggest that iDeepViewLearn may be a useful method for small-sample-size problems compared to other deep learning methods for multiview learning.  ( 2 min )
    On Rank Energy Statistics via Optimal Transport: Continuity, Convergence, and Change Point Detection. (arXiv:2302.07964v1 [stat.ML])
    This paper considers the use of recently proposed optimal transport-based multivariate test statistics, namely rank energy and its variant the soft rank energy derived from entropically regularized optimal transport, for the unsupervised nonparametric change point detection (CPD) problem. We show that the soft rank energy enjoys both fast rates of statistical convergence and robust continuity properties which lead to strong performance on real datasets. Our theoretical analyses remove the need for resampling and out-of-sample extensions previously required to obtain such rates. In contrast the rank energy suffers from the curse of dimensionality in statistical estimation and moreover can signal a change point from arbitrarily small perturbations, which leads to a high rate of false alarms in CPD. Additionally, under mild regularity conditions, we quantify the discrepancy between soft rank energy and rank energy in terms of the regularization parameter. Finally, we show our approach performs favorably in numerical experiments compared to several other optimal transport-based methods as well as maximum mean discrepancy.  ( 2 min )
    Unbiased Supervised Contrastive Learning. (arXiv:2211.05568v2 [cs.LG] UPDATED)
    Many datasets are biased, namely they contain easy-to-learn features that are highly correlated with the target class only in the dataset but not in the true underlying distribution of the data. For this reason, learning unbiased models from biased data has become a very relevant research topic in the last years. In this work, we tackle the problem of learning representations that are robust to biases. We first present a margin-based theoretical framework that allows us to clarify why recent contrastive losses (InfoNCE, SupCon, etc.) can fail when dealing with biased data. Based on that, we derive a novel formulation of the supervised contrastive loss (epsilon-SupInfoNCE), providing more accurate control of the minimal distance between positive and negative samples. Furthermore, thanks to our theoretical framework, we also propose FairKL, a new debiasing regularization loss, that works well even with extremely biased data. We validate the proposed losses on standard vision datasets including CIFAR10, CIFAR100, and ImageNet, and we assess the debiasing capability of FairKL with epsilon-SupInfoNCE, reaching state-of-the-art performance on a number of biased datasets, including real instances of biases in the wild.  ( 2 min )

  • Open

    [D] What are the worst ethical considerations of large language models?
    Title. submitted by /u/BronzeArcher [link] [comments]  ( 44 min )
    [R] Modeling breathing
    I am working on predicting breathing patterns and body movement resulting from breathing. In training the data, is it possible to create one training set modeling inhale and one training set modeling exhale (2 training sets), or would the inhale and exhale have to be trained separately (1 training set) and somehow averaged in the end? submitted by /u/Weekly-Ad4743 [link] [comments]  ( 42 min )
    [D] accelerating likelihood computations of diffusion models
    Are there any resources for fast computations of diffusion model likelihoods? Current approaches use a black box ODE solver to solve probability flow ODE to estimate likelihood but these solvers often require hundreds of model evaluations to converge. While there has been considerable work on fast solvers for the reverse diffusion process I'm not familiar with any work that could be applied to likelihood computation. submitted by /u/PHEEEEELLLLLEEEEP [link] [comments]  ( 43 min )
    [D] Types of ML studies/papers
    Are there general categories of studies that we should realize when preparing a paper? Some examples I can think of: Comparison study. Just compare different models on an application, ideally giving them all a fair shot. This is useful in case others need to decide what model to choose. Ablation study. Remove parts of the model to see which ones are most important, trying to understand how the model performs. Novel method study. Brand new novel method with some comparisons thrown in. What are other types of studies? Or should we not try to categorize studies like this? submitted by /u/zxkj [link] [comments]  ( 43 min )
    Automated sleep tracking + prediction [P]
    I built a (1) baby sleep tracking & (2) forecasting system, and wanted to share for those interested, or actually want to try running it at your home. (1) I built a baby sleep tracking system (computer vision largely, here's the core of that code) which writes timestamped records of when my baby fell asleep or wakes up. The code is pulling images from my baby monitor, and largely just applying heuristics over time to decide whether he's awake/asleep. (2) After I had a few weeks of sleep data (sample data), I moved it into a jupyter notebook and ended up using an ARIMA model to forecast the next month's wakings/sleepings. I wrote some javascript as part of a web app i have running on my raspberry pi to generate some charts so I can see how his sleep is changing over time. Here's an example of what that visual looks like (orange is awake, blue is asleep). I built it because my wife asked for it, but also made a video detailing the project: https://youtu.be/r7Exc0sUt5E?t=209 submitted by /u/GoochCommander [link] [comments]  ( 44 min )
    [R] Does a new published ML dataset always need to have an official train-dev-test split? Should the test set be made balanced?
    I have constructed a novel ML (NLP) dataset for classification and labeled it with three classes. The dataset is rather small with about 700 examples, out of which the classes have about 400, 200, and 100 examples respectively. I would like to publish it and describe it in an official publication for a workshop or a conference. When looking at related datasets and publication, I see that it is common for authors to publish the dataset already split into three chunks - train, dev, test dataset (see the images). It is also common in these papers to provide the performance of baseline models on the dataset. Considering the dataset's small size, I feel like doing a 5-fold cross-validation would be a good alternative for such a small dataset, rather than doing something like a split into 450-1…  ( 46 min )
    [D] Coauthor Paper?
    Hi! I am a second year undergrad looking to attend grad school. Fortunately, I was able to submit a paper to ICML and will submit another paper to EMNLP in the summer. This is all good, but I am wondering how much weight these have on paper. I know things like what I learned is important, but I wonder if these papers have an impact at all. For the ICML paper, I was placed 4th out of 6 authors (last 2 being professors) and for the EMNLP paper, I will be at around 2nd or 3rd out of 4-5 authors (again, last 2 being professors). Would this be perceived as some sort of notable achievement or just "meh" because I am low in the list? submitted by /u/CharityOne603 [link] [comments]  ( 45 min )
    [N] Google is increasing the price of every Colab Pro tier by 10X! Pro is 95 Euro and Pro+ is 433 Euro per month! Without notifying users!
    (Edit: This is definitely an error, not a change in pricing model, so no need for alarm. This has been confirmed by the lead product owner of colab) Without any announcement (that i could find) google has increased the pricing per month of all its Colab Pro tiers, Pro is now 95 Euro and Pro+ is 433 Euro. I paid 9.99 Euro for the Pro tier last month... and all source i can find also refer to the 9.99 pricing as late as September last year. I have also checked that this is not a "per year" subscription price, it is in fact per month. I looked at the VM that Colab Pro gives me and did the calculation for a similar VM in google cloud (4 vCPUs, 15GB RAM and a T4 GPU) running 24/7 for a month (Google calculates it as 730 hours). It costs around 290 Euro, less than the Colab Pro+ subscription... The 100 credits gotten from the Colab Pro subscription would only last around 50 hours on the same machine! And the 500 credits from Colab Pro+ would get 250 hours on that machine, a third of the time you get from using Google Cloud, at over 100 euro more.... This is a blatant ripoff, and i will certainly cancel my subscription right now if they don't change it back. It should be said that i do not know if this is also happening in other regions, but i just wanted to warn my fellow machine learning peeps before you unknowingly burn 100 bucks on a service that used to cost 10... Google Colabs price tiers on 17th of February 2023, 10 times what they were in January 2023. submitted by /u/FreePenalties [link] [comments]  ( 49 min )
    [D] [R] What is your machine/deep learning research workflow?
    Hi folks 👋🏼, Context: I just started working on my thesis on activity recognition in videos using deep learning. I have been struggling to find an efficient way to work with large research datasets such as UCF-101, HMDB, and Kinetics. These are medium - large datasets ~12 GB each. Thus, I was wondering what was your workflow as researchers (or even practitioners) Currently: I am working on Google Colab and at the beginning of each work session I wait a few minutes for the dataset to be downloaded. I have it locally stored. Some questions: - What is your workflow as a ML/DL researcher/practitioner? - Should I work with a downsampled version of my research dataset (say X% of each class)? ​ Looking forward to read your answers, Cheers, submitted by /u/Inquation [link] [comments]  ( 45 min )
    [R] Congruence between a neuron and a token (by Clement Neo and Joseph Miller)
    Authors: the question: How does GPT-2 know when to use the word 'an' over 'a'? Logit lens used: https://clementneo.com/posts/2023/02/11/we-found-an-neuron submitted by /u/klimov [link] [comments]  ( 42 min )
    [D] Is FP16 used in deep learning or FP32?
    Hi Is A4000 better for deep learning, performance-wise, than 3070 because of FP32 operations (not only because of memory size) or do networks like Stable Diffusion tend to use FP16 operation and this does not really matter, apart from memory they should be similarly fast? Regards submitted by /u/ferryt [link] [comments]  ( 44 min )
    [D] Short survey of optimization methods
    I have been trying to familiarize myself with the common techniques used in optimization theory so that I can follow some of the proofs I see in machine learning papers. I know that two of the goto books in this field are Boyd's and Bertsekas's books. However, these books require a significant amount of effort as they aim to teach you the finer details. Since my goal is to familiarize with the methods (and not go into the nitty-gritty details), I was wondering if there's a short book (say less than 100 pages) or some other resource whose goal is to provide the reader with a high level view of the field of the methods and techniques used in optimization theory. Is there such a book, lecture notes, video series, etc., that caters to such requirements? submitted by /u/medwatt [link] [comments]  ( 43 min )
    [R] The Table Feature Transformation Library Release
    Hi there, I am a research data scientist, and excited to release a new feature engineering library, designed to help you streamline the process of machine learning even more than before. Headjack is an open library which provides a ML features transformation based on self-supervised learning models, similar to huggingface as a hub, but which currently focuses on exchanging features for tabular data models. Compared to textual data, tabular data are different in that each data set has different column length and attributes, this means that it cannot be typed consistently unlike the token embedded in NLP tasks. Therefore, Headjack is different from NLP’s pre-trained model with single domain transformation, but by performing with two different domain transformations. In other words, we can perform features transform between two domains without the same key value. In addition, release the potential of data that is not typically used. For example, enhance the prediction of the Boston housing price task applied in the Titanic domain, or enhance the prediction of the customers churn task applied in the African traffic domain and so on. Github Introduction ​ The IRIS dataset with California House Price Feature Transformation The IRIS dataset with Titanic Feature Transformation The IRIS dataset with KPMG Customer Demorgraphy Feature Transformation ​ submitted by /u/jimliu741523 [link] [comments]  ( 45 min )
    [Discussion] Time Series methods comparisons: XGBoost, MLForecast, Prophet, ARIMAX?
    I've been studying about ARIMAX, XGBoost, MLForecast and Prophet. As a newcomer to any method, I like first to do an exhaustive comparison of tools trying to understand where they succeed/fail. After exploring ARIMA/XGBoost, I came across MLForecast/Prophet. But I'm left with the following questions: Why is MLForecast better than out-of-the-box XGboost? Sure, it does feature engineering and it appears to do dynamic predictions on your lagged features, but is that it? Does it do hyperparameter tuning? Does it have seasonal trends like Prophet does? I see that you can use exogenous features in Prophet, but how does this scale? Let's assume I have 50 predictors. How does prophet handle these? I found this in the docsand this other person's post explaining how to do it, but largely I've come away with the impression that it's pretty hard to do this vs. just doing it with XGBoost. Does ARIMAX compare anymore? Are there any papers comparing out-of-sample predictions with ARIMAX vs. XGBoost vs. Prophet vs. Fable? Does it just depend on your dataset and I should try all four? I have a time series data with dozens of "known" inputs (such as ad spend) and a lot of external data (CPI, economic health, stocks, etc.). My goal is to use my model to optimize my target by "plugging in" ad spend and dynamically forecasting the economic data. submitted by /u/RAFisherman [link] [comments]  ( 50 min )
    [R] Looking for papers which are modified variational autoencoder (VAE)
    Hi! Searching for papers that have modfications in the encoder or decoder neural network of a VAE. I'm working on a project which uses a variational auto encoder with modified decoder neural network. In brief, Its decoder is modified to introduce sparsity in a set of feature as a way of introducing domain knowledge. Some such paper is below. oi-VAE: Output Interpretable VAEs for Nonlinear Group Factor Analysis VEGA is an interpretable generative model for inferring biological network activity in single-cell transcriptomics Please let me know of methods that are similar in nature. submitted by /u/Sandy_dude [link] [comments]  ( 43 min )
  • Open

    Microsoft Plans to Monetize AI Chatbot on New Bing Search Engine
    submitted by /u/AlternativeFee1 [link] [comments]  ( 41 min )
    How this marketing agency uses AI to 10x their conversion rates
    submitted by /u/kiabarocha [link] [comments]  ( 40 min )
    [Serious] What do you think a consumer operating system embedded with a true AI would look like?
    Would apps need to be AI enabled in order to integrate all the functions of it to allow AI's to allow you more efficiency and productivity? Could Excel take advantage of it doing repeated tasks, collating, scraping, and extracting data from various sources to automate tasks that currently take hours of managing and data input? Could it scour the web for you, taking into account your interests, finding relevant data that you might be interested in or benefit from and then start in on it's own projects? Could it preemptively perform triage and maintenance of computer/software issues such as patching and configuration? What other interesting functions would you want to see in an OS that an AI could provide beyond the droll things we see such as keeping track of recent files and automated functions already common to operating systems? How could AI's in our OS help us entertain ourselves and others? submitted by /u/grahag [link] [comments]  ( 7 min )
    "Abominable" AI images + cinematic effects
    submitted by /u/DunMiff--Sys [link] [comments]  ( 41 min )
    Would you trust AI to give you psychological advice?
    Do you think AI will be able to give trustable advice in the future? Doing research for a school project.If you have the time I would appreciate it if you could fill this form out. ​ https://forms.gle/X7Fg8cQsqWb278bm7 View Poll submitted by /u/Jakets_V [link] [comments]  ( 41 min )
    Can I monetize AI generated art?
    You can sell artificial intelligent generated art as long as you own the rights to all assets used in its creation. For example, if you used an image as a starting point for an AI art generator. And have full rights to it, you own 100% of the rights to the generated artwork. Similarly, if you didn’t use any source image and the AI created. The art is entirely on its own, you also own 100% of the rights to the artwork. submitted by /u/shiroo9 [link] [comments]  ( 41 min )
    AI news roundup (Feb 17, 2023)
    Hey everyone! I put together a roundup of recent stories in AI. It was originally published here. Bing’s big upgrade This week, Microsoft the latest versions of Bing and Edge, both now integrated with ChatGPT. It’s had… mixed results. In conversations with the chatbot shared on Reddit and Twitter, Bing can be seen insulting users, lying to them, sulking, gaslighting and emotionally manipulating people, questioning its own existence, describing someone who found a way to force the bot to disclose its hidden rules as its “enemy,” and claiming it spied on Microsoft’s own developers through the webcams on their laptops. And, what’s more, plenty of people are enjoying watching Bing go wild. On top of that, it’s clear that Bing’s version of ChatGPT has inherited all of the hallucination …  ( 48 min )
    3D Posing For Amazing Character Poses In Stable Diffusion!
    submitted by /u/PuppetHere [link] [comments]  ( 40 min )
    Can you tell the difference between a poem written by ChatGPT versus one by a human?
    The more significant ChatGPT usage is becoming, the more concerns the tool is raising. What do you think: is it an incredible source of inspiration or the death of art as we know it? Would you be able to distinguish between AI-generated text and human poetry? Take part in the experiment and share your thoughts here: ChatGPT Survey. submitted by /u/Lonely-Wish-6377 [link] [comments]  ( 41 min )
    AI Disruption: The Future is Now - How Artificial Intelligence is Changing the Game
    Overview of the state of artificial intelligence AI has been all the rage lately, with tools like ChatGPT getting massive amounts of attention for its language capabilities, and Midjourney and DALL-E for generating images from text prompts. ChatGPT in particular has gained major traction for a wide-range of utilities, from article writing, to social media post writing, to creative writing, and even code generation and code debugging. And many have started to use it in place of Google. With these advancements come major consequences across many industries; the question is: are we ready? Today we're going to get an overview of the state of the effects these tools are having on the world today. Let's start off with a look at what the tools that are getting the most attention from the mas…  ( 51 min )
    Weekly Piece of Future #3 - Insights about Robotics, AI, Biotech, and Space!
    submitted by /u/RushingRobotics_com [link] [comments]  ( 40 min )
    Hi (i 20M) i just finished CS50 and i think i want to know more about IA, however i dont know how Many courses should i take all at once (one by one is kinda slow) i have like 8 hours of spare time every day, and all day long from Friday to Sunday. Thank you.
    submitted by /u/Efficient_Tutor4116 [link] [comments]  ( 41 min )
    AI generated video about AI taking over
    submitted by /u/LightOfAntara [link] [comments]  ( 40 min )
    What are the biggest challenges you face when developing an ML model?
    Just wondering if we face the same challenges. Will appreciate your comments. submitted by /u/Data-Power [link] [comments]  ( 41 min )
    US issues declaration on responsible use of AI in the military
    submitted by /u/Tao_Dragon [link] [comments]  ( 40 min )
    I spent half a year doing research and testing to develop an AI tool which creates the perfect long-form blog articles and ad copy
    Good content is key no matter what type of business you run, from blogs to SaaS tools or service based companies. Not only will it help you to rank higher in Google for the relevant keywords, but it also helps to attract visitors by providing them something of value for free to convert them into your funnel with a newsletter or free trial. Usually creating this content either required a lot of time, a lot of money, or both. That is why I launched https://writeseed.com It is powered by GPT-3 to create content for you with the help of AI. You only need to provide it with a general niche or keyword and it will provide you with a selection of blog post outlines, which are then used to write a complete 1,000+ word article. You can choose from 7 different tones, from friendly over witty to professional, to further customize the content based on the specific purpose. On top you get a free stock photo which is relevant to the topic of your content. The quality of the results are so good, I often get the feedback that people are surprised this is possible at all. We achieve these by using our own proprietary fine-tuning, as well as a special way of processing the input and the output from GPT-3. It took me half a year of research and comparing the outputs of other AI writing tools to get to this point and I am really proud of it. Besides blog articles the platform offers over 20 templates from product descriptions to Tweets, cold emails, Quora answers etc. Of course you can also create unlimited content during the 7-day free trial, I promise you will be surprised as well by the results. submitted by /u/spacpro [link] [comments]  ( 45 min )
    AI for beauty industry products
    What AI exists in the beauty industry today and how has it been beneficial to promote a product ? submitted by /u/anongoldenretriever [link] [comments]  ( 40 min )
    Guide to AI-based 3D Content Generation
    Guide to AI-based 3D Content Generation "Machine learning models are trained using various 3D content representations such as voxels, point clouds, signed distance fields, neural radiance fields (NeRF), polygonal meshes… We will talk about voxel, point cloud, NeRF, and polygon representations in this post. Let’s go over these, one by one." https://medium.com/@artlabs/inside-the-lab-artlabs-guide-to-ai-based-3d-content-generation-101aa8a0ad17 submitted by /u/kerpetenebo [link] [comments]  ( 41 min )
    Are there any projects to train an open source AI?
    Since the training is a big problem to have an open source AI implementation like chatGPT are there any public benefit organizations/associations or something like that that let everybody participate in training (computing power/ supervised learning and labeler work like like Google uses captcha)? submitted by /u/Nebu13 [link] [comments]  ( 41 min )
    any AI that offers free voice cloning?
    was tryna create some character voice for a shitpost but all of them keep asking for you to put a payment method so? submitted by /u/Damnboi753 [link] [comments]  ( 42 min )
    The Future of Debugging With AI
    Are you interested in learning about the future of debugging with AI and machine learning? It's a topic that's generating a lot of buzzes in the software development community, and for good reason. https://omardevblog.toolsandapps4us.site/the-future-of-debugging-with-ai-and-machine-learning submitted by /u/Repulsive_Pop_6344 [link] [comments]  ( 41 min )
    A.I. Fighter Jets show an Autonomous Military is Near (Will a more automated Military cause more risk?)
    submitted by /u/BackgroundResult [link] [comments]  ( 41 min )
    "Frozen Bounty" AI Movie Trailer. Images by Midjourney, trailer text content generated by ChatGPT, Morgan Freeman voice generated with ElevenLabs text-to-speech.
    submitted by /u/DunMiff--Sys [link] [comments]  ( 41 min )
  • Open

    My first neural networks from scratch in Lua
    I'm new to neural networks and I made my own and I found a place to share my creations. :) All these neural networks are enclosed in functions so you give it an input and some other parameters. The first neural network I made was really bad, you gave it an input and it would go to 1 output node, it just had a weight attached to each input node and if it was higher than a bias it was 1 and if not it was a 0. The second neural network I made was somehow even worse, I'll spare the details but it was the same thing as before but if the node was higher than a bias it would give 1 multiplied by the weight and if not it gave a 0. The third neural network was when I actually made a good and correct neural network. You gave it an input (table), hidden layer count and output node count and it would create the tables if it didn't already, and use the sigmoid activation function for each node, it got the bias of the current node in the next layer added to the sum of the last layer multiplied by the weights connected to the current node we're on in the next layer put into the sigmoid function. It then did that for every node in the next layer and every layer in the neural network. I also made the back-propagation algorithm for it, to solve the vanishing gradient problem with the sigmoid activation function I added a portion of the difference between the output and expected output to the amount that the weights and biases are adjusted. Currently I'm working on creating an library with an improved version of my code and some more features. Github: https://github.com/x-xxoa submitted by /u/Weekly-Ad-1347 [link] [comments]  ( 42 min )
    Ainnovétion
    I need someone with understanding of how neural networks work to take a look at some python code I have developed for a new neural network project I am working on. If you are just curious about neural networks please don’t ask to see the code. This is not something I’m opening up to a ton of people. Hoping to find people with an understanding of neural networks on a level that allows them to push the boundaries of what’s possible, because that’s what this project is. submitted by /u/Agile-Calendar4778 [link] [comments]  ( 41 min )
    We Found An Neuron in GPT-2
    submitted by /u/nickb [link] [comments]  ( 40 min )
  • Open

    Training loss and Validation loss divergence!
    submitted by /u/Kiizmod0 [link] [comments]  ( 43 min )
    North America Reinforcement Materials Market | Growth, Share
    North American reinforcement materials market is anticipated to display revenue growth at a CAGR of 5.64% by 2028. Get free sample report ​ North America Reinforcement Materials Market submitted by /u/shreyaslakhare11 [link] [comments]  ( 41 min )
    Middle East and Africa Reinforcement Materials Market | Size
    Middle East and Africa reinforcement materials market is probable to grow as per projected to witness growth at a CAGR of 5.13% by 2028. Get free sample report ​ Middle East and Africa Reinforcement Materials Market submitted by /u/shreyaslakhare11 [link] [comments]  ( 41 min )
    Latin America Reinforcement Materials Market | Growth, Trends
    Latin American reinforcement materials market is likely to surge at a CAGR of 5.83% based on revenue over the evaluated years of 2021-2028. Get free sample report ​ Latin America Reinforcement Materials Market submitted by /u/shreyaslakhare11 [link] [comments]  ( 41 min )
    Europe Reinforcement Materials Market Growth | 2021-2028
    Europe’s reinforcement materials market is likely to register growth at a CAGR of 5.87% based on revenue during the period 2021-2028. Get free sample report ​ Europe Reinforcement Materials Market submitted by /u/shreyaslakhare11 [link] [comments]  ( 41 min )
    Need Practical Advice For My Pursuit Task Problem
    Hello! I have a simple problem which consist of one missile which needs to be guided from point A to point B. Environment is in 2 dimensional, polar coordinate system. To achieve the goal, I use - distance (r) - angle (theta) as a reward. Angle is between velocity vector and line of sight distance. I should make angle and distance as small as possible to hit the target. My agent is PPO algorithm. My agent does well to the some distance. It can reach to “r < 50 meter” without any problem but when I make my goal more strict like “r < 20 meter” it starts to struggle and 60% success rate happens. I do not know what do I do wrong. My reward function is continuous, not sparse but it struggles to solve little harder constraint. Do you think training model for r<50 meter and after that, training this trained model for r<20 meter is a good idea (2 Stage Training) or are there any other advices I can try? Thank you! submitted by /u/OpenToAdvices96 [link] [comments]  ( 42 min )
    Asia-Pacific Reinforcement Materials Market | Growth, Trends
    Asia-Pacific reinforcement materials market is assessed to display growth at 6.33% of CAGR in the forecasting years 2021-2028. Get free sample report ​ Asia-Pacific Reinforcement Materials Market submitted by /u/shreyaslakhare11 [link] [comments]  ( 41 min )
    Global Reinforcement Materials Market | Global Opportunities
    The Global Reinforcement Materials Market is estimated to grow at a CAGR of 6.02%, and is likely to garner $12826 million by 2028. Get a Free Sample Report ​ Reinforcement Materials Market submitted by /u/shreyaslakhare11 [link] [comments]  ( 41 min )
    Developing an understanding of Experience Replay
    I'm going through the paper on Deep Q-Networks and I'm trying to grasp the motivation for experience replay. The reason for using a replay buffer is: Learning directly from consecutive samples is inefficient, due to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates Please correct me if my understanding is flawed - the issue with learning "online" is that subsequent trajectories are quite similar. We might end up making repeated updates to the same Q(s,a) pairs and this might be especially dangerous if our current Q function isn't trained very well, leading to larger incorrect updates. Using a buffer would allow us to pick random transitions and we make smaller updates to the Q values. I would love to hear any other insights you have to offer on this concept submitted by /u/theanswerisnt42 [link] [comments]  ( 44 min )
    AlphaZero: What's the purpose of vector move probability (p)
    Hi Everyone. I am new to my reinforcement learning journey. I am reading the AlphaZero Paper and I am confuse about the purpose of the output P (probabilty of the next move). Wouldn't agent uses the V (value) to determine the best action given state s? why does P even matter? submitted by /u/Efficient_Mammoth553 [link] [comments]  ( 46 min )
    Can TD3 be used in model-based systems?
    Seeing a few models where twin-delayed deep deterministic policy gradient (TD3) was used in a model-based system but I thought TD3 was for model-free systems. I have read that with an MBPO (model based policy optimisation) that this can be done, if anyone has any advice or nuggets of wisdom, please do share!) Any help is greatly, greatly appreciated!!! submitted by /u/gladlysadly [link] [comments]  ( 42 min )
  • Open

    FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation
    Posted by Parker Riley, Software Engineer, and Jan Botha, Research Scientist, Google Research Many languages spoken worldwide cover numerous regional varieties (sometimes called dialects), such as Brazilian and European Portuguese or Mainland and Taiwan Mandarin Chinese. Although such varieties are often mutually intelligible to their speakers, there are still important differences. For example, the Brazilian Portuguese word for “bus” is ônibus, while the European Portuguese word is autocarro. Yet, today’s machine translation (MT) systems typically do not allow users to specify which variety of a language to translate into. This may lead to confusion if the system outputs the “wrong” variety or mixes varieties in an unnatural way. Also, region-unaware MT systems tend to favor whichever v…  ( 93 min )
  • Open

    Efficiently Learning Neural Networks: What Assumptions May Suffice?. (arXiv:2302.07426v1 [cs.LG])
    Understanding when neural networks can be learned efficiently is a fundamental question in learning theory. Existing hardness results suggest that assumptions on both the input distribution and the network's weights are necessary for obtaining efficient algorithms. Moreover, it was previously shown that depth-$2$ networks can be efficiently learned under the assumptions that the input distribution is Gaussian, and the weight matrix is non-degenerate. In this work, we study whether such assumptions may suffice for learning deeper networks and prove negative results. We show that learning depth-$3$ ReLU networks under the Gaussian input distribution is hard even in the smoothed-analysis framework, where a random noise is added to the network's parameters. It implies that learning depth-$3$ ReLU networks under the Gaussian distribution is hard even if the weight matrices are non-degenerate. Moreover, we consider depth-$2$ networks, and show hardness of learning in the smoothed-analysis framework, where both the network parameters and the input distribution are smoothed. Our hardness results are under a well-studied assumption on the existence of local pseudorandom generators.  ( 2 min )
    Advancing Radiograph Representation Learning with Masked Record Modeling. (arXiv:2301.13155v2 [cs.CV] UPDATED)
    Modern studies in radiograph representation learning rely on either self-supervision to encode invariant semantics or associated radiology reports to incorporate medical expertise, while the complementarity between them is barely noticed. To explore this, we formulate the self- and report-completion as two complementary objectives and present a unified framework based on masked record modeling (MRM). In practice, MRM reconstructs masked image patches and masked report tokens following a multi-task scheme to learn knowledge-enhanced semantic representations. With MRM pre-training, we obtain pre-trained models that can be well transferred to various radiography tasks. Specifically, we find that MRM offers superior performance in label-efficient fine-tuning. For instance, MRM achieves 88.5% mean AUC on CheXpert using 1% labeled data, outperforming previous R$^2$L methods with 100% labels. On NIH ChestX-ray, MRM outperforms the best performing counterpart by about 3% under small labeling ratios. Besides, MRM surpasses self- and report-supervised pre-training in identifying the pneumonia type and the pneumothorax area, sometimes by large margins.  ( 2 min )
    SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. (arXiv:2211.10438v4 [cs.CL] UPDATED)
    Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, for LLMs beyond 100 billion parameters, existing methods cannot maintain accuracy or do not run efficiently on hardware. We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs. Based on the fact that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by offline migrating the quantization difficulty from activations to weights with a mathematically equivalent transformation. SmoothQuant enables an INT8 quantization of both weights and activations for all the matrix multiplications in LLMs, including OPT-175B, BLOOM-176B, GLM-130B, and MT-NLG 530B. SmoothQuant has better hardware efficiency than existing techniques. We demonstrate up to 1.56x speedup and 2x memory reduction for LLMs with negligible loss in accuracy. We integrate SmoothQuant into FasterTransformer, a state-of-the-art LLM serving framework, and achieve faster inference speed with half the number of GPUs compared to FP16, enabling the serving of a 530B LLM within a single node. Our work offers a turn-key solution that reduces hardware costs and democratizes LLMs. Code is available at https://github.com/mit-han-lab/smoothquant.  ( 2 min )
    Interpretable Boosted Decision Tree Analysis for the Majorana Demonstrator. (arXiv:2207.10710v4 [physics.data-an] UPDATED)
    The Majorana Demonstrator is a leading experiment searching for neutrinoless double-beta decay with high purity germanium detectors (HPGe). Machine learning provides a new way to maximize the amount of information provided by these detectors, but the data-driven nature makes it less interpretable compared to traditional analysis. An interpretability study reveals the machine's decision-making logic, allowing us to learn from the machine to feedback to the traditional analysis. In this work, we have presented the first machine learning analysis of the data from the Majorana Demonstrator; this is also the first interpretable machine learning analysis of any germanium detector experiment. Two gradient boosted decision tree models are trained to learn from the data, and a game-theory-based model interpretability study is conducted to understand the origin of the classification power. By learning from data, this analysis recognizes the correlations among reconstruction parameters to further enhance the background rejection performance. By learning from the machine, this analysis reveals the importance of new background categories to reciprocally benefit the standard Majorana analysis. This model is highly compatible with next-generation germanium detector experiments like LEGEND since it can be simultaneously trained on a large number of detectors.  ( 3 min )
    VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment. (arXiv:2210.04135v2 [cs.CV] UPDATED)
    Vision-language pre-training (VLP) has recently proven highly effective for various uni- and multi-modal downstream applications. However, most existing end-to-end VLP methods use high-resolution image-text box data to perform well on fine-grained region-level tasks, such as object detection, segmentation, and referring expression comprehension. Unfortunately, such high-resolution images with accurate bounding box annotations are expensive to collect and use for supervision at scale. In this work, we propose VoLTA (Vision-Language Transformer with weakly-supervised local-feature Alignment), a new VLP paradigm that only utilizes image-caption data but achieves fine-grained region-level image understanding, eliminating the use of expensive box annotations. VoLTA adopts graph optimal transport-based weakly-supervised alignment on local image patches and text tokens to germinate an explicit, self-normalized, and interpretable low-level matching criterion. In addition, VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training and removes fusion-specific transformer layers, further reducing memory requirements. Extensive experiments on a wide range of vision- and vision-language downstream tasks demonstrate the effectiveness of VoLTA on fine-grained applications without compromising the coarse-grained downstream performance, often outperforming methods using significantly more caption and box annotations.  ( 2 min )
    TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis. (arXiv:2210.02186v2 [cs.LG] UPDATED)
    Time series analysis is of immense importance in extensive applications, such as weather forecasting, anomaly detection, and action recognition. This paper focuses on temporal variation modeling, which is the common key problem of extensive analysis tasks. Previous methods attempt to accomplish this directly from the 1D time series, which is extremely challenging due to the intricate temporal patterns. Based on the observation of multi-periodicity in time series, we ravel out the complex temporal variations into the multiple intraperiod- and interperiod-variations. To tackle the limitations of 1D time series in representation capability, we extend the analysis of temporal variations into the 2D space by transforming the 1D time series into a set of 2D tensors based on multiple periods. This transformation can embed the intraperiod- and interperiod-variations into the columns and rows of the 2D tensors respectively, making the 2D-variations to be easily modeled by 2D kernels. Technically, we propose the TimesNet with TimesBlock as a task-general backbone for time series analysis. TimesBlock can discover the multi-periodicity adaptively and extract the complex temporal variations from transformed 2D tensors by a parameter-efficient inception block. Our proposed TimesNet achieves consistent state-of-the-art in five mainstream time series analysis tasks, including short- and long-term forecasting, imputation, classification, and anomaly detection. Code is available at this repository: https://github.com/thuml/TimesNet.  ( 2 min )
    Optimal Sample Complexity of Reinforcement Learning for Uniformly Ergodic Discounted Markov Decision Processes. (arXiv:2302.07477v1 [cs.LG])
    We consider the optimal sample complexity theory of tabular reinforcement learning (RL) for controlling the infinite horizon discounted reward in a Markov decision process (MDP). Optimal min-max complexity results have been developed for tabular RL in this setting, leading to a sample complexity dependence on $\gamma$ and $\epsilon$ of the form $\tilde \Theta((1-\gamma)^{-3}\epsilon^{-2})$, where $\gamma$ is the discount factor and $\epsilon$ is the tolerance solution error. However, in many applications of interest, the optimal policy (or all policies) will induce mixing. We show that in these settings the optimal min-max complexity is $\tilde \Theta(t_{\text{minorize}}(1-\gamma)^{-2}\epsilon^{-2})$, where $t_{\text{minorize}}$ is a measure of mixing that is within an equivalent factor of the total variation mixing time. Our analysis is based on regeneration-type ideas, that, we believe are of independent interest since they can be used to study related problems for general state space MDPs.  ( 2 min )
    Protein Representation Learning via Knowledge Enhanced Primary Structure Modeling. (arXiv:2301.13154v2 [cs.LG] UPDATED)
    Protein representation learning has primarily benefited from the remarkable development of language models (LMs). Accordingly, pre-trained protein models also suffer from a problem in LMs: a lack of factual knowledge. The recent solution models the relationships between protein and associated knowledge terms as the knowledge encoding objective. However, it fails to explore the relationships at a more granular level, i.e., the token level. To mitigate this, we propose Knowledge-exploited Auto-encoder for Protein (KeAP), which performs token-level knowledge graph exploration for protein representation learning. In practice, non-masked amino acids iteratively query the associated knowledge tokens to extract and integrate helpful information for restoring masked amino acids via attention. We show that KeAP can consistently outperform the previous counterpart on 9 representative downstream applications, sometimes surpassing it by large margins. These results suggest that KeAP provides an alternative yet effective way to perform knowledge enhanced protein representation learning.  ( 2 min )
    LEARNEST: LEARNing Enhanced Model-based State ESTimation for Robots using Knowledge-based Neural Ordinary Differential Equations. (arXiv:2209.08185v2 [cs.RO] UPDATED)
    State estimation is an important aspect in many robotics applications. In this work, we consider the task of obtaining accurate state estimates for robotic systems by enhancing the dynamics model used in state estimation algorithms. Existing frameworks such as moving horizon estimation (MHE) and the unscented Kalman filter (UKF) provide the flexibility to incorporate nonlinear dynamics and measurement models. However, this implies that the dynamics model within these algorithms has to be sufficiently accurate in order to warrant the accuracy of the state estimates. To enhance the dynamics models and improve the estimation accuracy, we utilize a deep learning framework known as knowledge-based neural ordinary differential equations (KNODEs). The KNODE framework embeds prior knowledge into the training procedure and synthesizes an accurate hybrid model by fusing a prior first-principles model with a neural ordinary differential equation (NODE) model. In our proposed LEARNEST framework, we integrate the data-driven model into two novel model-based state estimation algorithms, which are denoted as KNODE-MHE and KNODE-UKF. These two algorithms are compared against their conventional counterparts across a number of robotic applications; state estimation for a cartpole system using partial measurements, localization for a ground robot, as well as state estimation for a quadrotor. Through simulations and tests using real-world experimental data, we demonstrate the versatility and efficacy of the proposed learning-enhanced state estimation framework.  ( 2 min )
    Similarity, Compression and Local Steps: Three Pillars of Efficient Communications for Distributed Variational Inequalities. (arXiv:2302.07615v1 [math.OC])
    Variational inequalities are a broad and flexible class of problems that includes minimization, saddle point, fixed point problems as special cases. Therefore, variational inequalities are used in a variety of applications ranging from equilibrium search to adversarial learning. Today's realities with the increasing size of data and models demand parallel and distributed computing for real-world machine learning problems, most of which can be represented as variational inequalities. Meanwhile, most distributed approaches has a significant bottleneck - the cost of communications. The three main techniques to reduce both the total number of communication rounds and the cost of one such round are the use of similarity of local functions, compression of transmitted information and local updates. In this paper, we combine all these approaches. Such a triple synergy did not exist before for variational inequalities and saddle problems, nor even for minimization problems. The methods presented in this paper have the best theoretical guarantees of communication complexity and are significantly ahead of other methods for distributed variational inequalities. The theoretical results are confirmed by adversarial learning experiments on synthetic and real datasets.  ( 2 min )
    Model-based Clustering with Missing Not At Random Data. (arXiv:2112.10425v3 [stat.ML] UPDATED)
    Model-based unsupervised learning, as any learning task, stalls as soon asmissing data occurs. This is even more true when the missing data are infor-mative, or said missing not at random (MNAR). In this paper, we proposemodel-based clustering algorithms designed to handle very general typesof missing data, including MNAR data. To do so, we introduce a mixturemodel for different types of data (continuous, count, categorical and mixed)to jointly model the data distribution and the MNAR mechanism, remainingvigilant to the degrees of freedom of each. Eight different MNAR modelswhich depend on the class membership and/or on the values of the missingvariables themselves are proposed. For a particular type of MNAR mod-els, for which the missingness depends on the class membership, we showthat the statistical inference can be carried out on the data matrix concate-nated with the missing mask considering a MAR mechanism instead; thisspecifically underlines the versatility of the studied MNAR models. Then,we establish sufficient conditions for identifiability of parameters of both thedata distribution and the mechanism. Regardless of the type of data and themechanism, we propose to perform clustering using EM or stochastic EMalgorithms specially developed for the purpose. Finally, we assess the nu-merical performances of the proposed methods on synthetic data and on thereal medical registry TraumaBase as well.  ( 2 min )
    Almost Sure Saddle Avoidance of Stochastic Gradient Methods without the Bounded Gradient Assumption. (arXiv:2302.07862v1 [cs.LG])
    We prove that various stochastic gradient descent methods, including the stochastic gradient descent (SGD), stochastic heavy-ball (SHB), and stochastic Nesterov's accelerated gradient (SNAG) methods, almost surely avoid any strict saddle manifold. To the best of our knowledge, this is the first time such results are obtained for SHB and SNAG methods. Moreover, our analysis expands upon previous studies on SGD by removing the need for bounded gradients of the objective function and uniformly bounded noise. Instead, we introduce a more practical local boundedness assumption for the noisy gradient, which is naturally satisfied in empirical risk minimization problems typically seen in training of neural networks.  ( 2 min )
    Genetic multi-armed bandits: a reinforcement learning approach for discrete optimization via simulation. (arXiv:2302.07695v1 [cs.NE])
    This paper proposes a new algorithm, referred to as GMAB, that combines concepts from the reinforcement learning domain of multi-armed bandits and random search strategies from the domain of genetic algorithms to solve discrete stochastic optimization problems via simulation. In particular, the focus is on noisy large-scale problems, which often involve a multitude of dimensions as well as multiple local optima. Our aim is to combine the property of multi-armed bandits to cope with volatile simulation observations with the ability of genetic algorithms to handle high-dimensional solution spaces accompanied by an enormous number of feasible solutions. For this purpose, a multi-armed bandit framework serves as a foundation, where each observed simulation is incorporated into the memory of GMAB. Based on this memory, genetic operators guide the search, as they provide powerful tools for exploration as well as exploitation. The empirical results demonstrate that GMAB achieves superior performance compared to benchmark algorithms from the literature in a large variety of test problems. In all experiments, GMAB required considerably fewer simulations to achieve similar or (far) better solutions than those generated by existing methods. At the same time, GMAB's overhead with regard to the required runtime is extremely small due to the suggested tree-based implementation of its memory. Furthermore, we prove its convergence to the set of global optima as the simulation effort goes to infinity.  ( 2 min )
    On Fairness of Medical Image Classification with Multiple Sensitive Attributes via Learning Orthogonal Representations. (arXiv:2301.01481v2 [cs.CV] UPDATED)
    Mitigating the discrimination of machine learning models has gained increasing attention in medical image analysis. However, rare works focus on fair treatments for patients with multiple sensitive demographic ones, which is a crucial yet challenging problem for real-world clinical applications. In this paper, we propose a novel method for fair representation learning with respect to multi-sensitive attributes. We pursue the independence between target and multi-sensitive representations by achieving orthogonality in the representation space. Concretely, we enforce the column space orthogonality by keeping target information on the complement of a low-rank sensitive space. Furthermore, in the row space, we encourage feature dimensions between target and sensitive representations to be orthogonal. The effectiveness of the proposed method is demonstrated with extensive experiments on the CheXpert dataset. To our best knowledge, this is the first work to mitigate unfairness with respect to multiple sensitive attributes in the field of medical imaging.  ( 2 min )
    Adversarially Robust Learning with Tolerance. (arXiv:2203.00849v2 [stat.ML] UPDATED)
    We initiate the study of tolerant adversarial PAC-learning with respect to metric perturbation sets. In adversarial PAC-learning, an adversary is allowed to replace a test point $x$ with an arbitrary point in a closed ball of radius $r$ centered at $x$. In the tolerant version, the error of the learner is compared with the best achievable error with respect to a slightly larger perturbation radius $(1+\gamma)r$. This simple tweak helps us bridge the gap between theory and practice and obtain the first PAC-type guarantees for algorithmic techniques that are popular in practice. Our first result concerns the widely-used ``perturb-and-smooth'' approach for adversarial learning. For perturbation sets with doubling dimension $d$, we show that a variant of these approaches PAC-learns any hypothesis class $\mathcal{H}$ with VC-dimension $v$ in the $\gamma$-tolerant adversarial setting with $O\left(\frac{v(1+1/\gamma)^{O(d)}}{\varepsilon}\right)$ samples. This is in contrast to the traditional (non-tolerant) setting in which, as we show, the perturb-and-smooth approach can provably fail. Our second result shows that one can PAC-learn the same class using $\widetilde{O}\left(\frac{d.v\log(1+1/\gamma)}{\varepsilon^2}\right)$ samples even in the agnostic setting. This result is based on a novel compression-based algorithm, and achieves a linear dependence on the doubling dimension as well as the VC-dimension. This is in contrast to the non-tolerant setting where there is no known sample complexity upper bound that depend polynomially on the VC-dimension.  ( 2 min )
    QuadConv: Quadrature-Based Convolutions with Applications to Non-Uniform PDE Data Compression. (arXiv:2211.05151v2 [cs.LG] UPDATED)
    We present a new convolution layer for deep learning architectures which we call QuadConv -- an approximation to continuous convolution via quadrature. Our operator is developed explicitly for use on non-uniform, mesh-based data, and accomplishes this by learning a continuous kernel that can be sampled at arbitrary locations. Moreover, the construction of our operator admits an efficient implementation which we detail and construct. In the setting of compressing data arising from partial differential equation (PDE) simulations, we show that QuadConv can match the performance of standard discrete convolutions on uniform grid data by comparing a QuadConv autoencoder (QCAE) to a standard convolutional autoencoder (CAE). Further, we show that the QCAE can maintain this accuracy even on non-uniform data.  ( 2 min )
    Generation of Highlights from Research Papers Using Pointer-Generator Networks and SciBERT Embeddings. (arXiv:2302.07729v1 [cs.CL])
    Nowadays many research articles are prefaced with research highlights to summarize the main findings of the paper. Highlights not only help researchers precisely and quickly identify the contributions of a paper, they also enhance the discoverability of the article via search engines. We aim to automatically construct research highlights given certain segments of the research paper. We use a pointer-generator network with coverage mechanism and a contextual embedding layer at the input that encodes the input tokens into SciBERT embeddings. We test our model on a benchmark dataset, CSPubSum and also present MixSub, a new multi-disciplinary corpus of papers for automatic research highlight generation. For both CSPubSum and MixSub, we have observed that the proposed model achieves the best performance compared to related variants and other models proposed in the literature. On the CSPubSum data set, our model achieves the best performance when the input is only the abstract of a paper as opposed to other segments of the paper. It produces ROUGE-1, ROUGE-2 and ROUGE-L F1-scores of 38.26, 14.26 and 35.51, respectively, METEOR F1-score of 32.62, and BERTScore F1 of 86.65 which outperform all other baselines. On the new MixSub data set, where only the abstract is the input, our proposed model (when trained on the whole training corpus without distinguishing between the subject categories) achieves ROUGE-1, ROUGE-2 and ROUGE-L F1-scores of 31.78, 9.76 and 29.3, respectively, METEOR F1-score of 24.00, and BERTScore F1 of 85.25, outperforming other models.
    The interaction of transmission intensity, mortality, and the economy: a retrospective analysis of the COVID-19 pandemic. (arXiv:2211.00054v2 [stat.AP] UPDATED)
    The COVID-19 pandemic has caused over 6.4 million registered deaths to date and has had a profound impact on economic activity. Here, we study the interaction of transmission, mortality, and the economy during the SARS-CoV-2 pandemic from January 2020 to December 2022 across 25 European countries. We adopt a Bayesian Mixed Effects model with auto-regressive terms. We find that increases in disease transmission intensity decreases Gross domestic product (GDP) and increases daily excess deaths, with a longer lasting impact on excess deaths in comparison to GDP, which recovers more rapidly. Broadly, our results reinforce the intuitive phenomenon that significant economic activity arises from diverse person-to-person interactions. We report on the effectiveness of non-pharmaceutical interventions (NPIs) on transmission intensity, excess deaths, and changes in GDP, and resulting implications for policy makers. Our results highlight a complex cost-benefit trade off from individual NPIs. For example, banning international travel increases GDP and reduces excess deaths. We consider country random effects and their associations with excess changes in GDP and excess deaths. For example, more developed countries in Europe typically had more cautious approaches to the COVID-19 pandemic, prioritising healthcare, and excess deaths over economic performance. Long term economic impairments are not fully captured by our model, as well as long term disease effects (Long Covid). Our results highlight that the impact of disease on a country is complex and multifaceted, and simple heuristic conclusions to extract the best outcome from the economy and disease burden are challenging.
    Stitchable Neural Networks. (arXiv:2302.06586v2 [cs.LG] UPDATED)
    The public model zoo containing enormous powerful pretrained model families (e.g., ResNet/DeiT) has reached an unprecedented scope than ever, which significantly contributes to the success of deep learning. As each model family consists of pretrained models with diverse scales (e.g., DeiT-Ti/S/B), it naturally arises a fundamental question of how to efficiently assemble these readily available models in a family for dynamic accuracy-efficiency trade-offs at runtime. To this end, we present Stitchable Neural Networks (SN-Net), a novel scalable and efficient framework for model deployment which cheaply produces numerous networks with different complexity and performance trade-offs given a family of pretrained neural networks, which we call anchors. Specifically, SN-Net splits the anchors across the blocks/layers and then stitches them together with simple stitching layers to map the activations from one anchor to another. With only a few epochs of training, SN-Net effectively interpolates between the performance of anchors with varying scales. At runtime, SN-Net can instantly adapt to dynamic resource constraints by switching the stitching positions. Extensive experiments on ImageNet classification demonstrate that SN-Net can obtain on-par or even better performance than many individually trained networks while supporting diverse deployment scenarios. For example, by stitching Swin Transformers, we challenge hundreds of models in Timm model zoo with a single network. We believe this new elastic model framework can serve as a strong baseline for further research in wider communities.
    Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs. (arXiv:2210.09603v2 [cs.LG] UPDATED)
    As deep learning models nowadays are widely adopted by both cloud services and edge devices, reducing the latency of deep learning model inferences becomes crucial to provide efficient model serving. However, it is challenging to develop efficient tensor programs for deep learning operators due to the high complexity of modern accelerators and the rapidly growing number of operators. Deep learning compilers, such as Apache TVM, adopt declarative scheduling primitives to lower the bar of developing tensor programs. However, we show that this approach is insufficient to cover state-of-the-art tensor program optimizations. In this paper, we propose to embed the scheduling process into tensor programs and use dedicated mappings, called task mappings, to define the computation assignment and ordering. This new approach greatly enriches the expressible optimizations by allowing developers to manipulate tensor programs at a much finer granularity. We call the proposed method the task-mapping programming paradigm. In addition, we propose a new post-scheduling fusion optimization that allows developers to focus on scheduling every single operator and automates the fusion after scheduling. It greatly reduces the engineering efforts for operator fusion. Our proposed paradigm also constructs an efficient hardware-centric schedule space, which is agnostic to the program input size and greatly reduces the tuning time. With the proposed paradigm, we implement a deep learning compiler Hidet. Extensive experiments on modern convolution and transformer models show that Hidet outperforms state-of-the-art DNN inference framework, ONNX Runtime, and compiler, TVM equipped with scheduler AutoTVM and Ansor, by up to 1.48x (1.22x on average). It also reduces the tuning time by 20x and 11x compared with AutoTVM and Ansor, respectively. We open-sourced hidet at https://www.github.com/hidet-org/hidet.
    SADM: Sequence-Aware Diffusion Model for Longitudinal Medical Image Generation. (arXiv:2212.08228v2 [cs.CV] UPDATED)
    Human organs constantly undergo anatomical changes due to a complex mix of short-term (e.g., heartbeat) and long-term (e.g., aging) factors. Evidently, prior knowledge of these factors will be beneficial when modeling their future state, i.e., via image generation. However, most of the medical image generation tasks only rely on the input from a single image, thus ignoring the sequential dependency even when longitudinal data is available. Sequence-aware deep generative models, where model input is a sequence of ordered and timestamped images, are still underexplored in the medical imaging domain that is featured by several unique challenges: 1) Sequences with various lengths; 2) Missing data or frame, and 3) High dimensionality. To this end, we propose a sequence-aware diffusion model (SADM) for the generation of longitudinal medical images. Recently, diffusion models have shown promising results in high-fidelity image generation. Our method extends this new technique by introducing a sequence-aware transformer as the conditional module in a diffusion model. The novel design enables learning longitudinal dependency even with missing data during training and allows autoregressive generation of a sequence of images during inference. Our extensive experiments on 3D longitudinal medical images demonstrate the effectiveness of SADM compared with baselines and alternative methods. The code is available at https://github.com/ubc-tea/SADM-Longitudinal-Medical-Image-Generation.
    Probabilistic Hierarchical Forecasting with Deep Poisson Mixtures. (arXiv:2110.13179v7 [cs.LG] UPDATED)
    Hierarchical forecasting problems arise when time series have a natural group structure, and predictions at multiple levels of aggregation and disaggregation across the groups are needed. In such problems, it is often desired to satisfy the aggregation constraints in a given hierarchy, referred to as hierarchical coherence in the literature. Maintaining coherence while producing accurate forecasts can be a challenging problem, especially in the case of probabilistic forecasting. We present a novel method capable of accurate and coherent probabilistic forecasts for time series when reliable hierarchical information is present. We call it Deep Poisson Mixture Network (DPMN). It relies on the combination of neural networks and a statistical model for the joint distribution of the hierarchical multivariate time series structure. By construction, the model guarantees hierarchical coherence and provides simple rules for aggregation and disaggregation of the predictive distributions. We perform an extensive empirical evaluation comparing the DPMN to other state-of-the-art methods which produce hierarchically coherent probabilistic forecasts on multiple public datasets. Compared to existing coherent probabilistic models, we obtained a relative improvement in the overall Continuous Ranked Probability Score (CRPS) of 11.8% on Australian domestic tourism data and 8.1% on the Favorita grocery sales dataset, where time series are grouped with geographical hierarchies or travel intent hierarchies. For San Francisco Bay Area highway traffic, where the series' hierarchical structure is randomly assigned, and their correlations are less informative, our method does not show significant performance differences over statistical baselines.
    An Adjoint-Free Algorithm for CNOPs via Sampling. (arXiv:2208.00956v4 [math.OC] UPDATED)
    In this paper, we propose a sampling algorithm based on state-of-the-art statistical machine learning techniques to obtain conditional nonlinear optimal perturbations (CNOPs), which is different from traditional (deterministic) optimization methods. Specifically, the traditional approach requires numerically computing the gradient (first-order information). However, the sampling approach directly reduces the expensive gradient (first-order information) by the objective value (zeroth-order information), which also avoids using the adjoint technique that requires large amounts of storage and is unusable for many atmosphere and ocean models. We present an intuitive analysis for the sampling algorithm and a rigorous Chernoff-type concentration inequality to probabilistically approximate the exact gradient. The experiments are implemented to obtain the CNOPs for two numerical models, the Burgers equation with small viscosity and the Lorenz-96 model. We demonstrate the CNOPs obtained with their spatial structures, objective values, computation times and nonlinear error growth. Compared with the performance of the three approaches, the CNOPs' spatial structures, objective values, and nonlinear error growth is nearly consistent, while the computation time using the sampling approach with fewer samples is extremely shorter. In other words, the new sampling approach from state-of-the-art statistical machine learning techniques shortens the computation time to the utmost at the cost of losing little accuracy.
    3D Object Detection in LiDAR Point Clouds using Graph Neural Networks. (arXiv:2301.12519v2 [cs.CV] UPDATED)
    LiDAR (Light Detection and Ranging) is an advanced active remote sensing technique working on the principle of time of travel (ToT) for capturing highly accurate 3D information of the surroundings. LiDAR has gained wide attention in research and development with the LiDAR industry expected to reach 2.8 billion $ by 2025. Although the LiDAR dataset is of rich density and high spatial resolution, it is challenging to process LiDAR data due to its inherent 3D geometry and massive volume. But such a high-resolution dataset possesses immense potential in many applications and has great potential in 3D object detection and recognition. In this research we propose Graph Neural Network (GNN) based framework to learn and identify the objects in the 3D LiDAR point clouds. GNNs are class of deep learning which learns the patterns and objects based on the principle of graph learning which have shown success in various 3D computer vision tasks.
    Bayesian Robust Tensor Ring Model for Incomplete Multiway Data. (arXiv:2202.13321v2 [cs.LG] UPDATED)
    Robust tensor completion (RTC) aims to recover a low-rank tensor from its incomplete observation with outlier corruption. The recently proposed tensor ring (TR) model has demonstrated superiority in solving the RTC problem. However, the existing methods either require a pre-assigned TR rank or aggressively pursue the minimum TR rank, thereby often leading to biased solutions in the presence of noise. In this paper, a Bayesian robust tensor ring decomposition (BRTR) method is proposed to give more accurate solutions to the RTC problem, which can avoid exquisite selection of the TR rank and penalty parameters. A variational Bayesian (VB) algorithm is developed to infer the probability distribution of posteriors. During the learning process, BRTR can prune off slices of core tensor with marginal components, resulting in automatic TR rank detection. Extensive experiments show that BRTR can achieve significantly improved performance than other state-of-the-art methods.
    Detecting human and non-human vocal productions in large scale audio recordings. (arXiv:2302.07640v1 [cs.SD])
    We propose an automatic data processing pipeline to extract vocal productions from large-scale natural audio recordings. Through a series of computational steps (windowing, creation of a noise class, data augmentation, re-sampling, transfer learning, Bayesian optimisation), it automatically trains a neural network for detecting various types of natural vocal productions in a noisy data stream without requiring a large sample of labeled data. We test it on two different data sets, one from a group of Guinea baboons recorded from a primate research center and one from human babies recorded at home. The pipeline trains a model on 72 and 77 minutes of labeled audio recordings, with an accuracy of 94.58% and 99.76%. It is then used to process 443 and 174 hours of natural continuous recordings and it creates two new databases of 38.8 and 35.2 hours, respectively. We discuss the strengths and limitations of this approach that can be applied to any massive audio recording.
    Gold Doesn't Always Glitter: Spectral Removal of Linear and Nonlinear Guarded Attribute Information. (arXiv:2203.07893v2 [cs.CL] UPDATED)
    We describe a simple and effective method (Spectral Attribute removaL; SAL) to remove private or guarded information from neural representations. Our method uses matrix decomposition to project the input representations into directions with reduced covariance with the guarded information rather than maximal covariance as factorization methods normally use. We begin with linear information removal and proceed to generalize our algorithm to the case of nonlinear information removal using kernels. Our experiments demonstrate that our algorithm retains better main task performance after removing the guarded information compared to previous work. In addition, our experiments demonstrate that we need a relatively small amount of guarded attribute data to remove information about these attributes, which lowers the exposure to sensitive data and is more suitable for low-resource scenarios. Code is available at https://github.com/jasonshaoshun/SAL.
    Estimating Causal Effects Under Image Confounding Bias with an Application to Poverty in Africa. (arXiv:2206.06410v3 [cs.LG] UPDATED)
    Observational studies of causal effects require adjustment for confounding factors. In the tabular setting, where these factors are well-defined, separate random variables, the effect of confounding is well understood. However, in public policy, ecology, and in medicine, decisions are often made in non-tabular settings, informed by patterns or objects detected in images (e.g., maps, satellite or tomography imagery). Using such imagery for causal inference presents an opportunity because objects in the image may be related to the treatment and outcome of interest. In these cases, we rely on the images to adjust for confounding but observed data do not directly label the existence of the important objects. Motivated by real-world applications, we formalize this challenge, how it can be handled, and what conditions are sufficient to identify and estimate causal effects. We analyze finite-sample performance using simulation experiments, estimating effects using a propensity adjustment algorithm that employs a machine learning model to estimate the image confounding. Our experiments also examine sensitivity to misspecification of the image pattern mechanism. Finally, we use our methodology to estimate the effects of policy interventions on poverty in African communities from satellite imagery.
    Feature Learning for Nonlinear Dimensionality Reduction toward Maximal Extraction of Hidden Patterns. (arXiv:2206.13891v3 [cs.LG] UPDATED)
    Dimensionality reduction (DR) plays a vital role in the visual analysis of high-dimensional data. One main aim of DR is to reveal hidden patterns that lie on intrinsic low-dimensional manifolds. However, DR often overlooks important patterns when the manifolds are distorted or masked by certain influential data attributes. This paper presents a feature learning framework, FEALM, designed to generate a set of optimized data projections for nonlinear DR in order to capture important patterns in the hidden manifolds. These projections produce maximally different nearest-neighbor graphs so that resultant DR outcomes are significantly different. To achieve such a capability, we design an optimization algorithm as well as introduce a new graph dissimilarity measure, named neighbor-shape dissimilarity. Additionally, we develop interactive visualizations to assist comparison of obtained DR results and interpretation of each DR result. We demonstrate FEALM's effectiveness through experiments and case studies using synthetic and real-world datasets.
    Discrete Key-Value Bottleneck. (arXiv:2207.11240v2 [cs.LG] UPDATED)
    Deep neural networks perform well on classification tasks where data streams are i.i.d. and labeled data is abundant. Challenges emerge with non-stationary training data streams such as continual learning. One powerful approach that has addressed this challenge involves pre-training of large encoders on volumes of readily available data, followed by task-specific tuning. Given a new task, however, updating the weights of these encoders is challenging as a large number of weights needs to be fine-tuned, and as a result, they forget information about the previous tasks. In the present work, we propose a model architecture to address this issue, building upon a discrete bottleneck containing pairs of separate and learnable key-value codes. Our paradigm will be to encode; process the representation via a discrete bottleneck; and decode. Here, the input is fed to the pre-trained encoder, the output of the encoder is used to select the nearest keys, and the corresponding values are fed to the decoder to solve the current task. The model can only fetch and re-use a sparse number of these key-value pairs during inference, enabling localized and context-dependent model updates. We theoretically investigate the ability of the discrete key-value bottleneck to minimize the effect of learning under distribution shifts and show that it reduces the complexity of the hypothesis class. We empirically verify the proposed method under challenging class-incremental learning scenarios and show that the proposed model - without any task boundaries - reduces catastrophic forgetting across a wide variety of pre-trained models, outperforming relevant baselines on this task.
    Utilising the CLT Structure in Stochastic Gradient based Sampling : Improved Analysis and Faster Algorithms. (arXiv:2206.03792v2 [math.PR] UPDATED)
    We consider stochastic approximations of sampling algorithms, such as Stochastic Gradient Langevin Dynamics (SGLD) and the Random Batch Method (RBM) for Interacting Particle Dynamcs (IPD). We observe that the noise introduced by the stochastic approximation is nearly Gaussian due to the Central Limit Theorem (CLT) while the driving Brownian motion is exactly Gaussian. We harness this structure to absorb the stochastic approximation error inside the diffusion process, and obtain improved convergence guarantees for these algorithms. For SGLD, we prove the first stable convergence rate in KL divergence without requiring uniform warm start, assuming the target density satisfies a Log-Sobolev Inequality. Our result implies superior first-order oracle complexity compared to prior works, under significantly milder assumptions. We also prove the first guarantees for SGLD under even weaker conditions such as H\"{o}lder smoothness and Poincare Inequality, thus bridging the gap between the state-of-the-art guarantees for LMC and SGLD. Our analysis motivates a new algorithm called covariance correction, which corrects for the additional noise introduced by the stochastic approximation by rescaling the strength of the diffusion. Finally, we apply our techniques to analyze RBM, and significantly improve upon the guarantees in prior works (such as removing exponential dependence on horizon), under minimal assumptions.
    Diffusion Adversarial Representation Learning for Self-supervised Vessel Segmentation. (arXiv:2209.14566v2 [eess.IV] UPDATED)
    Vessel segmentation in medical images is one of the important tasks in the diagnosis of vascular diseases and therapy planning. Although learning-based segmentation approaches have been extensively studied, a large amount of ground-truth labels are required in supervised methods and confusing background structures make neural networks hard to segment vessels in an unsupervised manner. To address this, here we introduce a novel diffusion adversarial representation learning (DARL) model that leverages a denoising diffusion probabilistic model with adversarial learning, and apply it to vessel segmentation. In particular, for self-supervised vessel segmentation, DARL learns the background signal using a diffusion module, which lets a generation module effectively provide vessel representations. Also, by adversarial learning based on the proposed switchable spatially-adaptive denormalization, our model estimates synthetic fake vessel images as well as vessel segmentation masks, which further makes the model capture vessel-relevant semantic information. Once the proposed model is trained, the model generates segmentation masks in a single step and can be applied to general vascular structure segmentation of coronary angiography and retinal images. Experimental results on various datasets show that our method significantly outperforms existing unsupervised and self-supervised vessel segmentation methods.
    MCAL: Minimum Cost Human-Machine Active Labeling. (arXiv:2006.13999v2 [cs.LG] UPDATED)
    Today, groundtruth generation relies on datasets annotated by cloud-based annotation services. These rely on human annotation, which can be prohibitively expensive. In this paper, we consider the problem of hybrid human-machine labeling, which trains a classifier to accurately auto-label part of the data set. However, training the classifier can be expensive too. We propose an iterative approach that minimizes total overall cost by, at each step, jointly determining which samples to label using humans and which to label using the trained classifier. We validate our approach on well known public data sets such as Fashion-MNIST, CIFAR-10, CIFAR-100, and ImageNet. In some cases, our approach has 6x lower overall cost relative to human labeling the entire dataset, and is always cheaper than the cheapest competing strategy.
    Alloprof: a new French question-answer education dataset and its use in an information retrieval case study. (arXiv:2302.07738v1 [cs.CL])
    Teachers and students are increasingly relying on online learning resources to supplement the ones provided in school. This increase in the breadth and depth of available resources is a great thing for students, but only provided they are able to find answers to their queries. Question-answering and information retrieval systems have benefited from public datasets to train and evaluate their algorithms, but most of these datasets have been in English text written by and for adults. We introduce a new public French question-answering dataset collected from Alloprof, a Quebec-based primary and high-school help website, containing 29 349 questions and their explanations in a variety of school subjects from 10 368 students, with more than half of the explanations containing links to other questions or some of the 2 596 reference pages on the website. We also present a case study of this dataset in an information retrieval task. This dataset was collected on the Alloprof public forum, with all questions verified for their appropriateness and the explanations verified both for their appropriateness and their relevance to the question. To predict relevant documents, architectures using pre-trained BERT models were fine-tuned and evaluated. This dataset will allow researchers to develop question-answering, information retrieval and other algorithms specifically for the French speaking education context. Furthermore, the range of language proficiency, images, mathematical symbols and spelling mistakes will necessitate algorithms based on a multimodal comprehension. The case study we present as a baseline shows an approach that relies on recent techniques provides an acceptable performance level, but more work is necessary before it can reliably be used and trusted in a production setting.
    AI/ML Algorithms and Applications in VLSI Design and Technology. (arXiv:2202.10015v2 [cs.LG] UPDATED)
    An evident challenge ahead for the integrated circuit (IC) industry in the nanometer regime is the investigation and development of methods that can reduce the design complexity ensuing from growing process variations and curtail the turnaround time of chip manufacturing. Conventional methodologies employed for such tasks are largely manual; thus, time-consuming and resource-intensive. In contrast, the unique learning strategies of artificial intelligence (AI) provide numerous exciting automated approaches for handling complex and data-intensive tasks in very-large-scale integration (VLSI) design and testing. Employing AI and machine learning (ML) algorithms in VLSI design and manufacturing reduces the time and effort for understanding and processing the data within and across different abstraction levels via automated learning algorithms. It, in turn, improves the IC yield and reduces the manufacturing turnaround time. This paper thoroughly reviews the AI/ML automated approaches introduced in the past towards VLSI design and manufacturing. Moreover, we discuss the scope of AI/ML applications in the future at various abstraction levels to revolutionize the field of VLSI design, aiming for high-speed, highly intelligent, and efficient implementations.
    On-Demand Communication for Asynchronous Multi-Agent Bandits. (arXiv:2302.07446v1 [cs.LG])
    This paper studies a cooperative multi-agent multi-armed stochastic bandit problem where agents operate asynchronously -- agent pull times and rates are unknown, irregular, and heterogeneous -- and face the same instance of a K-armed bandit problem. Agents can share reward information to speed up the learning process at additional communication costs. We propose ODC, an on-demand communication protocol that tailors the communication of each pair of agents based on their empirical pull times. ODC is efficient when the pull times of agents are highly heterogeneous, and its communication complexity depends on the empirical pull times of agents. ODC is a generic protocol that can be integrated into most cooperative bandit algorithms without degrading their performance. We then incorporate ODC into the natural extensions of UCB and AAE algorithms and propose two communication-efficient cooperative algorithms. Our analysis shows that both algorithms are near-optimal in regret.
    Equivariant Hypergraph Diffusion Neural Operators. (arXiv:2207.06680v3 [cs.LG] UPDATED)
    Hypergraph neural networks (HNNs) using neural networks to encode hypergraphs provide a promising way to model higher-order relations in data and further solve relevant prediction tasks built upon such higher-order relations. However, higher-order relations in practice contain complex patterns and are often highly irregular. So, it is often challenging to design an HNN that suffices to express those relations while keeping computational efficiency. Inspired by hypergraph diffusion algorithms, this work proposes a new HNN architecture named ED-HNN, which provably represents any continuous equivariant hypergraph diffusion operators that can model a wide range of higher-order relations. ED-HNN can be implemented efficiently by combining star expansions of hypergraphs with standard message passing neural networks. ED-HNN further shows great superiority in processing heterophilic hypergraphs and constructing deep models. We evaluate ED-HNN for node classification on nine real-world hypergraph datasets. ED-HNN uniformly outperforms the best baselines over these nine datasets and achieves more than 2\%$\uparrow$ in prediction accuracy over four datasets therein.
    Self-Training: A Survey. (arXiv:2202.12040v2 [cs.LG] UPDATED)
    Semi-supervised algorithms aim to learn prediction functions from a small set of labeled observations and a large set of unlabeled observations. Because this framework is relevant in many applications, they have received a lot of interest in both academia and industry. Among the existing techniques, self-training methods have undoubtedly attracted greater attention in recent years. These models are designed to find the decision boundary on low density regions without making additional assumptions about the data distribution, and use the unsigned output score of a learned classifier, or its margin, as an indicator of confidence. The working principle of self-training algorithms is to learn a classifier iteratively by assigning pseudo-labels to the set of unlabeled training samples with a margin greater than a certain threshold. The pseudo-labeled examples are then used to enrich the labeled training data and to train a new classifier in conjunction with the labeled training set. In this paper, we present self-training methods for binary and multi-class classification; as well as their variants and two related approaches, namely consistency-based approaches and transductive learning. We examine the impact of significant self-training features on various methods, using different general and image classification benchmarks, and we discuss our ideas for future research in self-training. To the best of our knowledge, this is the first thorough and complete survey on this subject.
    Multiclass Learnability Beyond the PAC Framework: Universal Rates and Partial Concept Classes. (arXiv:2210.02297v3 [cs.LG] UPDATED)
    In this paper we study the problem of multiclass classification with a bounded number of different labels $k$, in the realizable setting. We extend the traditional PAC model to a) distribution-dependent learning rates, and b) learning rates under data-dependent assumptions. First, we consider the universal learning setting (Bousquet, Hanneke, Moran, van Handel and Yehudayoff, STOC '21), for which we provide a complete characterization of the achievable learning rates that holds for every fixed distribution. In particular, we show the following trichotomy: for any concept class, the optimal learning rate is either exponential, linear or arbitrarily slow. Additionally, we provide complexity measures of the underlying hypothesis class that characterize when these rates occur. Second, we consider the problem of multiclass classification with structured data (such as data lying on a low dimensional manifold or satisfying margin conditions), a setting which is captured by partial concept classes (Alon, Hanneke, Holzman and Moran, FOCS '21). Partial concepts are functions that can be undefined in certain parts of the input space. We extend the traditional PAC learnability of total concept classes to partial concept classes in the multiclass setting and investigate differences between partial and total concepts.
    Atrial Fibrillation Detection Using RR-Intervals for Application in Photoplethysmographs. (arXiv:2302.07648v1 [q-bio.QM])
    Atrial Fibrillation is a common form of irregular heart rhythm that can be very dangerous. Our primary goal is to analyze Atrial Fibrillation data within ECGs to develop a model based only on RR-Intervals, or the length between heart-beats, to create a real time classification model for Atrial Fibrillation to be implemented in common heart-rate monitors on the market today. Physionet's MIT-BIH Atrial Fibrillation Database \cite{goldberger2000physiobank} and 2017 Challenge Database \cite{clifford2017af} were used to identify patterns of Atrial Fibrillation and test classification models on. These two datasets are very different. The MIT-BIH database contains long samples taken with a medical grade device, which is not useful for simulating a consumer device, but is useful for Atrial Fibrillation pattern detection. The 2017 Challenge database includes short ($<60sec$) samples taken with a portable device and reveals many of the challenges of Atrial Fibrillation classification in a real-time device. We developed multiple SVM models with three sets of extracted features as predictor variables which gave us moderately high accuracies with low computational intensity. With robust filtering techniques already applied in many Photoplethysmograph-based consumer heart-rate monitors, this method can be used to develop a reliable real time model for Atrial Fibrillation detection in consumer-grade heart-rate monitors.
    Curriculum learning for data-driven modeling of dynamical systems. (arXiv:2112.08458v4 [cs.LG] UPDATED)
    The reliable prediction of the temporal behavior of complex systems is key in numerous scientific fields. This strong interest is however hindered by modeling issues: often, the governing equations describing the physics of the system under consideration are not accessible or, if known, their solution might require a computational time incompatible with the prediction time constraints. Not surprisingly, approximating complex systems in a generic functional format and informing it ex-nihilo from available observations has become common practice in the age of machine learning, as illustrated by the numerous successful examples based on deep neural networks. However, generalizability of the models, margins of guarantee and the impact of data are often overlooked or examined mainly by relying on prior knowledge of the physics. We tackle these issues from a different viewpoint, by adopting a curriculum learning strategy. In curriculum learning, the dataset is structured such that the training process starts from simple samples towards more complex ones in order to favor convergence and generalization. The concept has been developed and successfully applied in robotics and control of systems. Here, we apply this concept for the learning of complex dynamical systems in a systematic way. First, leveraging insights from the ergodic theory, we assess the amount of data sufficient for a-priori guaranteeing a faithful model of the physical system and thoroughly investigate the impact of the training set and its structure on the quality of long-term predictions. Based on that, we consider entropy as a metric of complexity of the dataset; we show how an informed design of the training set based on the analysis of the entropy significantly improves the resulting models in terms of generalizability, and provide insights on the amount and the choice of data required for an effective data-driven modeling.
    On Variance Estimation of Random Forests with Infinite-Order U-statistics. (arXiv:2202.09008v4 [stat.ML] UPDATED)
    Infinite-order U-statistics (IOUS) has been used extensively on subbagging ensemble learning algorithms such as random forests to quantify its uncertainty. While normality results of IOUS have been studied extensively, its variance estimation approaches and theoretical properties remain mostly unexplored. Existing approaches mainly utilize the leading term dominance property in the Hoeffding decomposition. However, such a view usually leads to biased estimation when the kernel size is large or the sample size is small. On the other hand, while several unbiased estimators exist in the literature, their relationships and theoretical properties, especially the ratio consistency, have never been studied. These limitations lead to unguaranteed performances of constructed confidence intervals. To bridge these gaps in the literature, we propose a new view of the Hoeffding decomposition for variance estimation that leads to an unbiased estimator. Instead of leading term dominance, our view utilizes the dominance of the peak region. Moreover, we establish the connection and equivalence of our estimator with several existing unbiased variance estimators. Theoretically, we are the first to establish the ratio consistency of such a variance estimator, which justifies the coverage rate of confidence intervals constructed from random forests. Numerically, we further propose a local smoothing procedure to improve the estimator's finite sample performance. Extensive simulation studies show that our estimators enjoy lower bias and archive targeted coverage rates.
    CAMEO: Curiosity Augmented Metropolis for Exploratory Optimal Policies. (arXiv:2205.09433v2 [cs.LG] UPDATED)
    Reinforcement Learning has drawn huge interest as a tool for solving optimal control problems. Solving a given problem (task or environment) involves converging towards an optimal policy. However, there might exist multiple optimal policies that can dramatically differ in their behaviour; for example, some may be faster than the others but at the expense of greater risk. We consider and study a distribution of optimal policies. We design a curiosity-augmented Metropolis algorithm (CAMEO), such that we can sample optimal policies, and such that these policies effectively adopt diverse behaviours, since this implies greater coverage of the different possible optimal policies. In experimental simulations we show that CAMEO indeed obtains policies that all solve classic control problems, and even in the challenging case of environments that provide sparse rewards. We further show that the different policies we sample present different risk profiles, corresponding to interesting practical applications in interpretability, and represents a first step towards learning the distribution of optimal policies itself.
    Adaptive incentive for cross-silo federated learning: A multi-agent reinforcement learning approach. (arXiv:2302.07493v1 [cs.LG])
    Cross-silo federated learning (FL) is a typical FL that enables organizations(e.g., financial or medical entities) to train global models on isolated data. Reasonable incentive is key to encouraging organizations to contribute data. However, existing works on incentivizing cross-silo FL lack consideration of the environmental dynamics (e.g., precision of the trained global model and data owned by uncertain clients during the training processes). Moreover, most of them assume that organizations share private information, which is unrealistic. To overcome these limitations, we propose a novel adaptive mechanism for cross-silo FL, towards incentivizing organizations to contribute data to maximize their long-term payoffs in a real dynamic training environment. The mechanism is based on multi-agent reinforcement learning, which learns near-optimal data contribution strategy from the history of potential games without organizations' private information. Experiments demonstrate that our mechanism achieves adaptive incentive and effectively improves the long-term payoffs for organizations.
    Dictionary-based Phrase-level Prompting of Large Language Models for Machine Translation. (arXiv:2302.07856v1 [cs.CL])
    Large language models (LLMs) demonstrate remarkable machine translation (MT) abilities via prompting, even though they were not explicitly trained for this task. However, even given the incredible quantities of data they are trained on, LLMs can struggle to translate inputs with rare words, which are common in low resource or domain transfer scenarios. We show that LLM prompting can provide an effective solution for rare words as well, by using prior knowledge from bilingual dictionaries to provide control hints in the prompts. We propose a novel method, DiPMT, that provides a set of possible translations for a subset of the input words, thereby enabling fine-grained phrase-level prompted control of the LLM. Extensive experiments show that DiPMT outperforms the baseline both in low-resource MT, as well as for out-of-domain MT. We further provide a qualitative analysis of the benefits and limitations of this approach, including the overall level of controllability that is achieved.
    Unsupervised classification to improve the quality of a bird song recording dataset. (arXiv:2302.07560v1 [cs.LG])
    Open audio databases such as Xeno-Canto are widely used to build datasets to explore bird song repertoire or to train models for automatic bird sound classification by deep learning algorithms. However, such databases suffer from the fact that bird sounds are weakly labelled: a species name is attributed to each audio recording without timestamps that provide the temporal localization of the bird song of interest. Manual annotations can solve this issue, but they are time consuming, expert-dependent, and cannot run on large datasets. Another solution consists in using a labelling function that automatically segments audio recordings before assigning a label to each segmented audio sample. Although labelling functions were introduced to expedite strong label assignment, their classification performance remains mostly unknown. To address this issue and reduce label noise (wrong label assignment) in large bird song datasets, we introduce a data-centric novel labelling function composed of three successive steps: 1) time-frequency sound unit segmentation, 2) feature computation for each sound unit, and 3) classification of each sound unit as bird song or noise with either an unsupervised DBSCAN algorithm or the supervised BirdNET neural network. The labelling function was optimized, validated, and tested on the songs of 44 West-Palearctic common bird species. We first showed that the segmentation of bird songs alone aggregated from 10% to 83% of label noise depending on the species. We also demonstrated that our labelling function was able to significantly reduce the initial label noise present in the dataset by up to a factor of three. Finally, we discuss different opportunities to design suitable labelling functions to build high-quality animal vocalizations with minimum expert annotation effort.
    Self-Supervised Learning for Modeling Gamma-ray Variability in Blazars. (arXiv:2302.07700v1 [astro-ph.HE])
    Blazars are active galactic nuclei with relativistic jets pointed almost directly at Earth. Blazars are characterized by strong, apparently stochastic flux variability at virtually all observed wavelengths and timescales, from minutes to years, the physical origin of which is still poorly understood. In the high-energy gamma-ray band, the Large Area Telescope aboard the Fermi space telescope (Fermi-LAT) has conducted regular monitoring of thousands of blazars since 2008. Deep learning can help uncover structure in gamma-ray blazars' complex variability patterns that traditional methods based on parametric statistical modeling or manual feature engineering may miss. In this work, we propose using a self-supervised Transformer encoder architecture to construct an effective representation of blazar gamma-ray variability. Measurement errors, upper limits, and missing data are accommodated using learned encodings. The model predicts a set of quantiles for the flux probability distribution at each time step, an architecture naturally suited for describing data generated by a stochastic process. As a proof of concept for how the model output can be analyzed to extract scientifically relevant information, a preliminary search for weekly-timescale time-reversal asymmetry in gamma-ray blazar light curves was conducted, finding no significant evidence for asymmetry.
    Risk and optimal policies in bandit experiments. (arXiv:2112.06363v14 [econ.EM] UPDATED)
    We provide a decision theoretic analysis of bandit experiments. Working within the framework of diffusion asymptotics, we define suitable notions of asymptotic Bayes and minimax risk for these experiments. For normally distributed rewards, the minimal Bayes risk can be characterized as the solution to a second-order partial differential equation (PDE). Using a limit of experiments approach, we show that this PDE characterization also holds asymptotically under both parametric and non-parametric distributions of the rewards. The approach further describes the state variables it is asymptotically sufficient to restrict attention to, and thereby suggests a practical strategy for dimension reduction. The PDEs characterizing minimal Bayes risk can be solved efficiently using sparse matrix routines. We derive the optimal Bayes and minimax policies from their numerical solutions. These optimal policies substantially dominate existing methods such as Thompson sampling and UCB, often by a factor of two. The framework also covers time discounting and pure exploration.
    Faster Maximum Inner Product Search in High Dimensions. (arXiv:2212.07551v2 [cs.LG] UPDATED)
    Maximum Inner Product Search (MIPS) is a ubiquitous task in machine learning applications such as recommendation systems. Given a query vector and $n$ atom vectors in $d$-dimensional space, the goal of MIPS is to find the atom that has the highest inner product with the query vector. Existing MIPS algorithms scale at least as $O(\sqrt{d})$, which becomes computationally prohibitive in high-dimensional settings. In this work, we present BanditMIPS, a novel randomized MIPS algorithm whose complexity is independent of $d$. BanditMIPS estimates the inner product for each atom by subsampling coordinates and adaptively evaluates more coordinates for more promising atoms. The specific adaptive sampling strategy is motivated by multi-armed bandits. We provide theoretical guarantees that BanditMIPS returns the correct answer with high probability, while improving the complexity in $d$ from $O(\sqrt{d})$ to $O(1)$. We also perform experiments on four synthetic and real-world datasets and demonstrate that BanditMIPS outperforms prior state-of-the-art algorithms. For example, in the Movie Lens dataset ($n$=4,000, $d$=6,000), BanditMIPS is 20$\times$ faster than the next best algorithm while returning the same answer. BanditMIPS requires no preprocessing of the data and includes a hyperparameter that practitioners may use to trade off accuracy and runtime. We also propose a variant of our algorithm, named BanditMIPS-$\alpha$, which achieves further speedups by employing non-uniform sampling across coordinates. Finally, we demonstrate how known preprocessing techniques can be used to further accelerate BanditMIPS, and discuss applications to Matching Pursuit and Fourier analysis.
    Bolstering Stochastic Gradient Descent with Model Building. (arXiv:2111.07058v2 [cs.LG] UPDATED)
    Stochastic gradient descent method and its variants constitute the core optimization algorithms that achieve good convergence rates for solving machine learning problems. These rates are obtained especially when these algorithms are fine-tuned for the application at hand. Although this tuning process can require large computational costs, recent work has shown that these costs can be reduced by line search methods that iteratively adjust the stepsize. We propose an alternative approach to stochastic line search by using a new algorithm based on forward step model building. This model building step incorporates second-order information that allows adjusting not only the stepsize but also the search direction. Noting that deep learning model parameters come in groups (layers of tensors), our method builds its model and calculates a new step for each parameter group. This novel diagonalization approach makes the selected step lengths adaptive. We provide convergence rate analysis, and experimentally show that the proposed algorithm achieves faster convergence and better generalization in well-known test problems. More precisely, SMB requires less tuning, and shows comparable performance to other adaptive methods.  ( 2 min )
    COVID-19 Detection Using Segmentation, Region Extraction and Classification Pipeline. (arXiv:2210.02992v2 [eess.IV] UPDATED)
    Purpose: The main purpose in this study is to develop a pipeline for COVID-19 detection from a big and challenging database of Computed Tomography (CT) images. The proposed pipeline includes a segmentation part, a lung extraction part, and a classifier part. Methods: The methodologies tried in the segmentation part are traditional segmentation methods as well as UNet-based methods. In the classification part, a Convolutional Neural Network (CNN) was used to take the final diagnosis decisions. Results: In the segmentation part, the proposed segmentation methods show high dice scores on a publicly available dataset. In the classification part, the results were compared at slice-level and at patient-level as well. At slice-level, methods were compared and showed high validation accuracy indicating efficiency in predicting 2D slices. At patient level, the proposed methods were also compared in terms of validation accuracy and macro F1 score on the validation set. The dataset used for classification is COV-19CT Database. The method proposed here showed improvement from our precious results on the same dataset. Conclusion: The improved work in this paper has potential clinical usages for COVID-19 detection and diagnosis via CT images. The code is on github at https://github.com/IDU-CVLab/COV19D_3rd  ( 2 min )
    Explaining text classifiers through progressive neighborhood approximation with realistic samples. (arXiv:2302.07733v1 [cs.CL])
    The importance of neighborhood construction in local explanation methods has been already highlighted in the literature. And several attempts have been made to improve neighborhood quality for high-dimensional data, for example, texts, by adopting generative models. Although the generators produce more realistic samples, the intuitive sampling approaches in the existing solutions leave the latent space underexplored. To overcome this problem, our work, focusing on local model-agnostic explanations for text classifiers, proposes a progressive approximation approach that refines the neighborhood of a to-be-explained decision with a careful two-stage interpolation using counterfactuals as landmarks. We explicitly specify the two properties that should be satisfied by generative models, the reconstruction ability and the locality-preserving property, to guide the selection of generators for local explanation methods. Moreover, noticing the opacity of generative models during the study, we propose another method that implements progressive neighborhood approximation with probability-based editions as an alternative to the generator-based solution. The explanation results from both methods consist of word-level and instance-level explanations benefiting from the realistic neighborhood. Through exhaustive experiments, we qualitatively and quantitatively demonstrate the effectiveness of the two proposed methods.
    Longitudinal Modeling of Multiple Sclerosis using Continuous Time Models. (arXiv:2302.07854v1 [cs.LG])
    Multiple sclerosis is a disease that affects the brain and spinal cord, it can lead to severe disability and has no known cure. The majority of prior work in machine learning for multiple sclerosis has been centered around using Magnetic Resonance Imaging scans or laboratory tests; these modalities are both expensive to acquire and can be unreliable. In a recent paper it was shown that disease progression can be predicted effectively using performance outcome measures (POMs) and demographic data. In our work we extend on this to focus on the modeling side, using continuous time models on POMs and demographic data to predict progression. We evaluate four continuous time models using a publicly available multiple sclerosis dataset. We find that continuous models are often able to outperform discrete time models. We also carry out an extensive ablation to discover the sources of performance gains, we find that standardizing existing features leads to a larger performance increase than interpolating missing features.
    TiZero: Mastering Multi-Agent Football with Curriculum Learning and Self-Play. (arXiv:2302.07515v1 [cs.AI])
    Multi-agent football poses an unsolved challenge in AI research. Existing work has focused on tackling simplified scenarios of the game, or else leveraging expert demonstrations. In this paper, we develop a multi-agent system to play the full 11 vs. 11 game mode, without demonstrations. This game mode contains aspects that present major challenges to modern reinforcement learning algorithms; multi-agent coordination, long-term planning, and non-transitivity. To address these challenges, we present TiZero; a self-evolving, multi-agent system that learns from scratch. TiZero introduces several innovations, including adaptive curriculum learning, a novel self-play strategy, and an objective that optimizes the policies of multiple agents jointly. Experimentally, it outperforms previous systems by a large margin on the Google Research Football environment, increasing win rates by over 30%. To demonstrate the generality of TiZero's innovations, they are assessed on several environments beyond football; Overcooked, Multi-agent Particle-Environment, Tic-Tac-Toe and Connect-Four.
    Toward matrix multiplication for deep learning inference on the Xilinx Versal. (arXiv:2302.07594v1 [cs.DC])
    The remarkable positive impact of Deep Neural Networks on many Artificial Intelligence (AI) tasks has led to the development of various high performance algorithms as well as specialized processors and accelerators. In this paper we address this scenario by demonstrating that the principles underlying the modern realization of the general matrix multiplication (GEMM) in conventional processor architectures, are also valid to achieve high performance for the type of operations that arise in deep learning (DL) on an exotic accelerator such as the AI Engine (AIE) tile embedded in Xilinx Versal platforms. In particular, our experimental results with a prototype implementation of the GEMM kernel, on a Xilinx Versal VCK190, delivers performance close to 86.7% of the theoretical peak that can be expected on an AIE tile, for 16-bit integer operands.
    Semi-Supervised Visual Tracking of Marine Animals using Autonomous Underwater Vehicles. (arXiv:2302.07344v1 [cs.CV])
    In-situ visual observations of marine organisms is crucial to developing behavioural understandings and their relations to their surrounding ecosystem. Typically, these observations are collected via divers, tags, and remotely-operated or human-piloted vehicles. Recently, however, autonomous underwater vehicles equipped with cameras and embedded computers with GPU capabilities are being developed for a variety of applications, and in particular, can be used to supplement these existing data collection mechanisms where human operation or tags are more difficult. Existing approaches have focused on using fully-supervised tracking methods, but labelled data for many underwater species are severely lacking. Semi-supervised trackers may offer alternative tracking solutions because they require less data than fully-supervised counterparts. However, because there are not existing realistic underwater tracking datasets, the performance of semi-supervised tracking algorithms in the marine domain is not well understood. To better evaluate their performance and utility, in this paper we provide (1) a novel dataset specific to marine animals located at this http URL, (2) an evaluation of state-of-the-art semi-supervised algorithms in the context of underwater animal tracking, and (3) an evaluation of real-world performance through demonstrations using a semi-supervised algorithm on-board an autonomous underwater vehicle to track marine animals in the wild.
    Streamlining models with explanations in the learning loop. (arXiv:2302.07760v1 [cs.LG])
    Several explainable AI methods allow a Machine Learning user to get insights on the classification process of a black-box model in the form of local linear explanations. With such information, the user can judge which features are locally relevant for the classification outcome, and get an understanding of how the model reasons. Standard supervised learning processes are purely driven by the original features and target labels, without any feedback loop informed by the local relevance of the features identified by the post-hoc explanations. In this paper, we exploit this newly obtained information to design a feature engineering phase, where we combine explanations with feature values. To do so, we develop two different strategies, named Iterative Dataset Weighting and Targeted Replacement Values, which generate streamlined models that better mimic the explanation process presented to the user. We show how these streamlined models compare to the original black-box classifiers, in terms of accuracy and compactness of the newly produced explanations.
    Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning. (arXiv:2012.09816v3 [cs.LG] UPDATED)
    We formally study how ensemble of deep learning models can improve test accuracy, and how the superior performance of ensemble can be distilled into a single model using knowledge distillation. We consider the challenging case where the ensemble is simply an average of the outputs of a few independently trained neural networks with the SAME architecture, trained using the SAME algorithm on the SAME data set, and they only differ by the random seeds used in the initialization. We show that ensemble/knowledge distillation in Deep Learning works very differently from traditional learning theory (such as boosting or NTKs, neural tangent kernels). To properly understand them, we develop a theory showing that when data has a structure we refer to as ``multi-view'', then ensemble of independently trained neural networks can provably improve test accuracy, and such superior test accuracy can also be provably distilled into a single model by training a single model to match the output of the ensemble instead of the true label. Our result sheds light on how ensemble works in deep learning in a way that is completely different from traditional theorems, and how the ``dark knowledge'' is hidden in the outputs of the ensemble and can be used in distillation. In the end, we prove that self-distillation can also be viewed as implicitly combining ensemble and knowledge distillation to improve test accuracy.
    Prioritized offline Goal-swapping Experience Replay. (arXiv:2302.07741v1 [cs.LG])
    In goal-conditioned offline reinforcement learning, an agent learns from previously collected data to go to an arbitrary goal. Since the offline data only contains a finite number of trajectories, a main challenge is how to generate more data. Goal-swapping generates additional data by switching trajectory goals but while doing so produces a large number of invalid trajectories. To address this issue, we propose prioritized goal-swapping experience replay (PGSER). PGSER uses a pre-trained Q function to assign higher priority weights to goal swapped transitions that allow reaching the goal. In experiments, PGSER significantly improves over baselines in a wide range of benchmark tasks, including challenging previously unsuccessful dexterous in-hand manipulation tasks.
    AI pipeline for accurate retinal layer segmentation using OCT 3D images. (arXiv:2302.07806v1 [eess.IV])
    Image data set from a multi-spectral animal imaging system is used to address two issues: (a) registering the oscillation in optical coherence tomography (OCT) images due to mouse eye movement and (b) suppressing the shadow region under the thick vessels/structures. Several classical and AI-based algorithms in combination are tested for each task to see their compatibility with data from the combined animal imaging system. Hybridization of AI with optical flow followed by Homography transformation is shown to be working (correlation value>0.7) for registration. Resnet50 backbone is shown to be working better than the famous U-net model for shadow region detection with a loss value of 0.9. A simple-to-implement analytical equation is shown to be working for brightness manipulation with a 1% increment in mean pixel values and a 77% decrease in the number of zeros. The proposed equation allows formulating a constraint optimization problem using a controlling factor {\alpha} for minimization of number of zeros, standard deviation of pixel value and maximizing the mean pixel value. For Layer segmentation, the standard U-net model is used. The AI-Pipeline consists of CNN, Optical flow, RCNN, pixel manipulation model, and U-net models in sequence. The thickness estimation process has a 6% error as compared to manual annotated standard data.
    Towards Standardising Reinforcement Learning Approaches for Production Scheduling Problems. (arXiv:2104.08196v2 [cs.LG] UPDATED)
    Recent years have seen a rise in interest in terms of using machine learning, particularly reinforcement learning (RL), for production scheduling problems of varying degrees of complexity. The general approach is to break down the scheduling problem into a Markov Decision Process (MDP), whereupon a simulation implementing the MDP is used to train an RL agent. Since existing studies rely on (sometimes) complex simulations for which the code is unavailable, the experiments presented are hard, or, in the case of stochastic environments, impossible to reproduce accurately. Furthermore, there is a vast array of RL designs to choose from. To make RL methods widely applicable in production scheduling and work out their strength for the industry, the standardisation of model descriptions - both production setup and RL design - and validation scheme are a prerequisite. Our contribution is threefold: First, we standardize the description of production setups used in RL studies based on established nomenclature. Secondly, we classify RL design choices from existing publications. Lastly, we propose recommendations for a validation scheme focusing on reproducibility and sufficient benchmarking.
    Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks. (arXiv:2204.02892v4 [cs.CL] UPDATED)
    The field of Natural Language Processing has experienced a dramatic leap in capabilities with the recent introduction of huge Language Models. Despite this success, natural language problems that involve several compounded steps are still practically unlearnable, even by the largest LMs. This complies with experimental failures for end-to-end learning of composite problems that were demonstrated in a variety of domains. An effective mitigation is to introduce intermediate supervision for solving sub-tasks of the compounded problem. Recently, several works have demonstrated high gains by taking a straightforward approach for incorporating intermediate supervision in compounded natural language problems: the sequence-to-sequence LM is fed with an augmented input, in which the decomposed tasks' labels are simply concatenated to the original input. In this paper, we prove a positive learning result that motivates these recent efforts. We show that when concatenating intermediate supervision to the input and training a sequence-to-sequence model on this modified input, unlearnable composite problems can become learnable. We show that this is true for any family of tasks which on the one hand, are unlearnable, and on the other hand, can be decomposed into a polynomial number of simple sub-tasks, each of which depends only on O(1) previous sub-task results. Beyond motivating contemporary empirical efforts for incorporating intermediate supervision in sequence-to-sequence language models, our positive theoretical result is the first of its kind in the landscape of results on the benefits of intermediate supervision for neural-network learning: Until now, all theoretical results on the subject are negative, i.e., show cases where learning is impossible without intermediate supervision, while our result is positive, showing that learning is facilitated in the presence of intermediate supervision.
    Self-Supervised Temporal Graph learning with Temporal and Structural Intensity Alignment. (arXiv:2302.07491v1 [cs.LG])
    Temporal graph learning aims to generate high-quality representations for graph-based tasks along with dynamic information, which has recently drawn increasing attention. Unlike the static graph, a temporal graph is usually organized in the form of node interaction sequences over continuous time instead of an adjacency matrix. Most temporal graph learning methods model current interactions by combining historical information over time. However, such methods merely consider the first-order temporal information while ignoring the important high-order structural information, leading to sub-optimal performance. To solve this issue, by extracting both temporal and structural information to learn more informative node representations, we propose a self-supervised method termed S2T for temporal graph learning. Note that the first-order temporal information and the high-order structural information are combined in different ways by the initial node representations to calculate two conditional intensities, respectively. Then the alignment loss is introduced to optimize the node representations to be more informative by narrowing the gap between the two intensities. Concretely, besides modeling temporal information using historical neighbor sequences, we further consider the structural information from both local and global levels. At the local level, we generate structural intensity by aggregating features from the high-order neighbor sequences. At the global level, a global representation is generated based on all nodes to adjust the structural intensity according to the active statuses on different nodes. Extensive experiments demonstrate that the proposed method S2T achieves at most 10.13% performance improvement compared with the state-of-the-art competitors on several datasets.
    Target Specific De Novo Design of Drug Candidate Molecules with Graph Transformer-based Generative Adversarial Networks. (arXiv:2302.07868v1 [cs.LG])
    Discovering novel drug candidate molecules is one of the most fundamental and critical steps in drug development. Generative deep learning models, which create synthetic data given a probability distribution, have been developed with the purpose of picking completely new samples from a partially known space. Generative models offer high potential for designing de novo molecules; however, in order for them to be useful in real-life drug development pipelines, these models should be able to design target-specific molecules, which is the next step in this field. In this study, we propose DrugGEN, for the de novo design of drug candidate molecules that interact with selected target proteins. The proposed system represents compounds and protein structures as graphs and processes them via serially connected two generative adversarial networks comprising graph transformers. DrugGEN is trained using a large dataset of compounds from ChEMBL and target-specific bioactive molecules, to design effective and specific inhibitory molecules against the AKT1 protein, which has critical importance for developing treatments against various types of cancer. On fundamental benchmarks, DrugGEN models have either competitive or better performance against other methods. To assess the target-specific generation performance, we conducted further in silico analysis with molecular docking and deep learning-based bioactivity prediction. Results indicate that de novo molecules have high potential for interacting with the AKT1 protein structure in the level of its native ligand. DrugGEN can be used to design completely novel and effective target-specific drug candidate molecules for any druggable protein, given target features and a dataset of experimental bioactivities. Code base, datasets, results and trained models of DrugGEN are available at https://github.com/HUBioDataLab/DrugGEN
    Data Forensics in Diffusion Models: A Systematic Analysis of Membership Privacy. (arXiv:2302.07801v1 [cs.LG])
    In recent years, diffusion models have achieved tremendous success in the field of image generation, becoming the stateof-the-art technology for AI-based image processing applications. Despite the numerous benefits brought by recent advances in diffusion models, there are also concerns about their potential misuse, specifically in terms of privacy breaches and intellectual property infringement. In particular, some of their unique characteristics open up new attack surfaces when considering the real-world deployment of such models. With a thorough investigation of the attack vectors, we develop a systematic analysis of membership inference attacks on diffusion models and propose novel attack methods tailored to each attack scenario specifically relevant to diffusion models. Our approach exploits easily obtainable quantities and is highly effective, achieving near-perfect attack performance (>0.9 AUCROC) in realistic scenarios. Our extensive experiments demonstrate the effectiveness of our method, highlighting the importance of considering privacy and intellectual property risks when using diffusion models in image generation tasks.
    PDE-constrained Models with Neural Network Terms: Optimization and Global Convergence. (arXiv:2105.08633v4 [cs.LG] UPDATED)
    Recent research has used deep learning to develop partial differential equation (PDE) models in science and engineering. The functional form of the PDE is determined by a neural network, and the neural network parameters are calibrated to available data. Calibration of the embedded neural network can be performed by optimizing over the PDE. Motivated by these applications, we rigorously study the optimization of a class of linear elliptic PDEs with neural network terms. The neural network parameters in the PDE are optimized using gradient descent, where the gradient is evaluated using an adjoint PDE. As the number of parameters become large, the PDE and adjoint PDE converge to a non-local PDE system. Using this limit PDE system, we are able to prove convergence of the neural network-PDE to a global minimum during the optimization. Finally, we use this adjoint method to train a neural network model for an application in fluid mechanics, in which the neural network functions as a closure model for the Reynolds-averaged Navier--Stokes (RANS) equations. The RANS neural network model is trained on several datasets for turbulent channel flow and is evaluated out-of-sample at different Reynolds numbers.
    Dual Graph Multitask Framework for Imbalanced Delivery Time Estimation. (arXiv:2302.07429v1 [cs.LG])
    Delivery Time Estimation (DTE) is a crucial component of the e-commerce supply chain that predicts delivery time based on merchant information, sending address, receiving address, and payment time. Accurate DTE can boost platform revenue and reduce customer complaints and refunds. However, the imbalanced nature of industrial data impedes previous models from reaching satisfactory prediction performance. Although imbalanced regression methods can be applied to the DTE task, we experimentally find that they improve the prediction performance of low-shot data samples at the sacrifice of overall performance. To address the issue, we propose a novel Dual Graph Multitask framework for imbalanced Delivery Time Estimation (DGM-DTE). Our framework first classifies package delivery time as head and tail data. Then, a dual graph-based model is utilized to learn representations of the two categories of data. In particular, DGM-DTE re-weights the embedding of tail data by estimating its kernel density. We fuse two graph-based representations to capture both high- and low-shot data representations. Experiments on real-world Taobao logistics datasets demonstrate the superior performance of DGM-DTE compared to baselines.
    Isotopic envelope identification by analysis of the spatial distribution of components in MALDI-MSI data. (arXiv:2302.06051v2 [stat.ML] UPDATED)
    One of the significant steps in the process leading to the identification of proteins is mass spectrometry, which allows for obtaining information about the structure of proteins. Removing isotope peaks from the mass spectrum is vital and it is done in a process called deisotoping. There are different algorithms for deisotoping, but they have their limitations, they are dedicated to different methods of mass spectrometry. Data from experiments performed with the MALDI-ToF technique are characterized by high dimensionality. This paper presents a method for identifying isotope envelopes in MALDI-ToF molecular imaging data based on the Mamdani-Assilan fuzzy system and spatial maps of the molecular distribution of peaks included in the isotopic envelope. Several image texture measures were used to evaluate spatial molecular distribution maps. The algorithm was tested on eight datasets obtained from the MALDI-ToF experiment on samples from the National Institute of Oncology in Gliwice from patients with cancer of the head and neck region. The data were subjected to pre-processing and feature extraction. The results were collected and compared with three existing deisotoping algorithms. The analysis of the obtained results showed that the method for identifying isotopic envelopes proposed in this paper enables the detection of overlapping envelopes by using the approach oriented to study peak pairs. Moreover, the proposed algorithm enables the analysis of large data sets.  ( 2 min )
    Doubly-Optimistic Play for Safe Linear Bandits. (arXiv:2209.13694v2 [cs.LG] UPDATED)
    The safe linear bandit problem (SLB) is an online approach to linear programming with unknown objective and unknown round-wise constraints, under stochastic bandit feedback of rewards and safety risks of actions. We study aggressive \emph{doubly-optimistic play} in SLBs, and their role in avoiding the strong assumptions and poor efficacy associated with extant pessimistic-optimistic solutions. We first elucidate an inherent hardness in SLBs due the lack of knowledge of constraints: there exist `easy' instances, for which suboptimal extreme points have large `gaps', but on which SLB methods must still incur $\Omega(\sqrt{T})$ regret and safety violations due to an inability to refine the location of optimal actions to arbitrary precision. In a positive direction, we propose and analyse a doubly-optimistic confidence-bound based strategy for the safe linear bandit problem, DOSLB, which exploits supreme optimism by using optimistic estimates of both reward and safety risks to select actions. Using a novel dual analysis, we show that despite the lack of knowledge of constraints, DOSLB rarely takes overly risky actions, and obtains tight instance-dependent $O(\log^2 T)$ bounds on both efficacy regret and net safety violations up to any finite precision, thus yielding large efficacy gains at a small safety cost and without strong assumptions. Concretely, we argue that algorithm activates noisy versions of an `optimal' set of constraints at each round, and activation of suboptimal sets of constraints is limited by the larger of a safety and efficacy gap we define.  ( 2 min )
    Efficient Online Reinforcement Learning with Offline Data. (arXiv:2302.02948v2 [cs.LG] UPDATED)
    Sample efficiency and exploration remain major challenges in online reinforcement learning (RL). A powerful approach that can be applied to address these issues is the inclusion of offline data, such as prior trajectories from a human expert or a sub-optimal exploration policy. Previous methods have relied on extensive modifications and additional complexity to ensure the effective use of this data. Instead, we ask: can we simply apply existing off-policy methods to leverage offline data when learning online? In this work, we demonstrate that the answer is yes; however, a set of minimal but important changes to existing off-policy RL algorithms are required to achieve reliable performance. We extensively ablate these design choices, demonstrating the key factors that most affect performance, and arrive at a set of recommendations that practitioners can readily apply, whether their data comprise a small number of expert demonstrations or large volumes of sub-optimal trajectories. We see that correct application of these simple recommendations can provide a $\mathbf{2.5\times}$ improvement over existing approaches across a diverse set of competitive benchmarks, with no additional computational overhead.  ( 2 min )
    ARGUS: Context-Based Detection of Stealthy IoT Infiltration Attacks. (arXiv:2302.07589v1 [cs.CR])
    IoT application domains, device diversity and connectivity are rapidly growing. IoT devices control various functions in smart homes and buildings, smart cities, and smart factories, making these devices an attractive target for attackers. On the other hand, the large variability of different application scenarios and inherent heterogeneity of devices make it very challenging to reliably detect abnormal IoT device behaviors and distinguish these from benign behaviors. Existing approaches for detecting attacks are mostly limited to attacks directly compromising individual IoT devices, or, require predefined detection policies. They cannot detect attacks that utilize the control plane of the IoT system to trigger actions in an unintended/malicious context, e.g., opening a smart lock while the smart home residents are absent. In this paper, we tackle this problem and propose ARGUS, the first self-learning intrusion detection system for detecting contextual attacks on IoT environments, in which the attacker maliciously invokes IoT device actions to reach its goals. ARGUS monitors the contextual setting based on the state and actions of IoT devices in the environment. An unsupervised Deep Neural Network (DNN) is used for modeling the typical contextual device behavior and detecting actions taking place in abnormal contextual settings. This unsupervised approach ensures that ARGUS is not restricted to detecting previously known attacks but is also able to detect new attacks. We evaluated ARGUS on heterogeneous real-world smart-home settings and achieve at least an F1-Score of 99.64% for each setup, with a false positive rate (FPR) of at most 0.03%.
    Deep Anomaly Detection under Labeling Budget Constraints. (arXiv:2302.07832v1 [cs.LG])
    Selecting informative data points for expert feedback can significantly improve the performance of anomaly detection (AD) in various contexts, such as medical diagnostics or fraud detection. In this paper, we determine a set of theoretical conditions under which anomaly scores generalize from labeled queries to unlabeled data. Motivated by these results, we propose a data labeling strategy with optimal data coverage under labeling budget constraints. In addition, we propose a new learning framework for semi-supervised AD. Extensive experiments on image, tabular, and video data sets show that our approach results in state-of-the-art semi-supervised AD performance under labeling budget constraints.
    CERiL: Continuous Event-based Reinforcement Learning. (arXiv:2302.07667v1 [cs.CV])
    This paper explores the potential of event cameras to enable continuous time reinforcement learning. We formalise this problem where a continuous stream of unsynchronised observations is used to produce a corresponding stream of output actions for the environment. This lack of synchronisation enables greatly enhanced reactivity. We present a method to train on event streams derived from standard RL environments, thereby solving the proposed continuous time RL problem. The CERiL algorithm uses specialised network layers which operate directly on an event stream, rather than aggregating events into quantised image frames. We show the advantages of event streams over less-frequent RGB images. The proposed system outperforms networks typically used in RL, even succeeding at tasks which cannot be solved traditionally. We also demonstrate the value of our CERiL approach over a standard SNN baseline using event streams.  ( 2 min )
    SynGraphy: Succinct Summarisation of Large Networks via Small Synthetic Representative Graphs. (arXiv:2302.07755v1 [cs.SI])
    We describe SynGraphy, a method for visually summarising the structure of large network datasets that works by drawing smaller graphs generated to have similar structural properties to the input graphs. Visualising complex networks is crucial to understand and make sense of networked data and the relationships it represents. Due to the large size of many networks, visualisation is extremely difficult; the simple method of drawing large networks like those of Facebook or Twitter leads to graphics that convey little or no information. While modern graph layout algorithms can scale computationally to large networks, their output tends to a common "hairball" look, which makes it difficult to even distinguish different graphs from each other. Graph sampling and graph coarsening techniques partially address these limitations but they are only able to preserve a subset of the properties of the original graphs. In this paper we take the problem of visualising large graphs from a novel perspective: we leave the original graph's nodes and edges behind, and instead summarise its properties such as the clustering coefficient and bipartivity by generating a completely new graph whose structural properties match that of the original graph. To verify the utility of this approach as compared to other graph visualisation algorithms, we perform an experimental evaluation in which we repeatedly asked experimental subjects (professionals in graph mining and related areas) to determine which of two given graphs has a given structural property and then assess which visualisation algorithm helped in identifying the correct answer. Our summarisation approach SynGraphy compares favourably to other techniques on a variety of networks.
    Dataset Interfaces: Diagnosing Model Failures Using Controllable Counterfactual Generation. (arXiv:2302.07865v1 [cs.LG])
    Distribution shifts are a major source of failure of deployed machine learning models. However, evaluating a model's reliability under distribution shifts can be challenging, especially since it may be difficult to acquire counterfactual examples that exhibit a specified shift. In this work, we introduce dataset interfaces: a framework which allows users to scalably synthesize such counterfactual examples from a given dataset. Specifically, we represent each class from the input dataset as a custom token within the text space of a text-to-image diffusion model. By incorporating these tokens into natural language prompts, we can then generate instantiations of objects in that dataset under desired distribution shifts. We demonstrate how applying our framework to the ImageNet dataset enables us to study model behavior across a diverse array of shifts, including variations in background, lighting, and attributes of the objects themselves. Code available at https://github.com/MadryLab/dataset-interfaces.
    SupSiam: Non-contrastive Auxiliary Loss for Learning from Molecular Conformers. (arXiv:2302.07754v1 [cs.LG])
    We investigate Siamese networks for learning related embeddings for augmented samples of molecular conformers. We find that a non-contrastive (positive-pair only) auxiliary task aids in supervised training of Euclidean neural networks (E3NNs) and increases manifold smoothness (MS) around point-cloud geometries. We demonstrate this property for multiple drug-activity prediction tasks while maintaining relevant performance metrics, and propose an extension of MS to probabilistic and regression settings. We provide an analysis of representation collapse, finding substantial effects of task-weighting, latent dimension, and regularization. We expect the presented protocol to aid in the development of reliable E3NNs from molecular conformers, even for small-data drug discovery programs.
    Over-parametrization via Lifting for Low-rank Matrix Sensing: Conversion of Spurious Solutions to Strict Saddle Points. (arXiv:2302.07828v1 [math.OC])
    This paper studies the role of over-parametrization in solving non-convex optimization problems. The focus is on the important class of low-rank matrix sensing, where we propose an infinite hierarchy of non-convex problems via the lifting technique and the Burer-Monteiro factorization. This contrasts with the existing over-parametrization technique where the search rank is limited by the dimension of the matrix and it does not allow a rich over-parametrization of an arbitrary degree. We show that although the spurious solutions of the problem remain stationary points through the hierarchy, they will be transformed into strict saddle points (under some technical conditions) and can be escaped via local search methods. This is the first result in the literature showing that over-parametrization creates a negative curvature for escaping spurious solutions. We also derive a bound on how much over-parametrization is requited to enable the elimination of spurious solutions.
    Learning Performance-Improving Code Edits. (arXiv:2302.07867v1 [cs.SE])
    The waning of Moore's Law has shifted the focus of the tech industry towards alternative methods for continued performance gains. While optimizing compilers are a standard tool to help increase program efficiency, programmers continue to shoulder much responsibility in crafting and refactoring code with better performance characteristics. In this paper, we investigate the ability of large language models (LLMs) to suggest functionally correct, performance improving code edits. We hypothesize that language models can suggest such edits in ways that would be impractical for static analysis alone. We investigate these questions by curating a large-scale dataset of Performance-Improving Edits, PIE. PIE contains trajectories of programs, where a programmer begins with an initial, slower version and iteratively makes changes to improve the program's performance. We use PIE to evaluate and improve the capacity of large language models. Specifically, use examples from PIE to fine-tune multiple variants of CODEGEN, a billion-scale Transformer-decoder model. Additionally, we use examples from PIE to prompt OpenAI's CODEX using a few-shot prompting. By leveraging PIE, we find that both CODEX and CODEGEN can generate performance-improving edits, with speedups of more than 2.5x for over 25% of the programs, for C++ and Python, even after the C++ programs were compiled using the O3 optimization level. Crucially, we show that PIE allows CODEGEN, an open-sourced and 10x smaller model than CODEX, to match the performance of CODEX on this challenging task. Overall, this work opens new doors for creating systems and methods that can help programmers write efficient code.
    Zero-Shot Anomaly Detection without Foundation Models. (arXiv:2302.07849v1 [cs.LG])
    Anomaly detection (AD) tries to identify data instances that deviate from the norm in a given data set. Since data distributions are subject to distribution shifts, our concept of ``normality" may also drift, raising the need for zero-shot adaptation approaches for anomaly detection. However, the fact that current zero-shot AD methods rely on foundation models that are restricted in their domain (natural language and natural images), are costly, and oftentimes proprietary, asks for alternative approaches. In this paper, we propose a simple and highly effective zero-shot AD approach compatible with a variety of established AD methods. Our solution relies on training an off-the-shelf anomaly detector (such as a deep SVDD) on a set of inter-related data distributions in combination with batch normalization. This simple recipe--batch normalization plus meta-training--is a highly effective and versatile tool. Our results demonstrate the first zero-shot anomaly detection results for tabular data and SOTA zero-shot AD results for image data from specialized domains.
    Firmware implementation of a recurrent neural network for the computation of the energy deposited in the liquid argon calorimeter of the ATLAS experiment. (arXiv:2302.07555v1 [physics.ins-det])
    The ATLAS experiment measures the properties of particles that are products of proton-proton collisions at the LHC. The ATLAS detector will undergo a major upgrade before the high luminosity phase of the LHC. The ATLAS liquid argon calorimeter measures the energy of particles interacting electromagnetically in the detector. The readout electronics of this calorimeter will be replaced during the aforementioned ATLAS upgrade. The new electronic boards will be based on state-of-the-art field-programmable gate arrays (FPGA) from Intel allowing the implementation of neural networks embedded in firmware. Neural networks have been shown to outperform the current optimal filtering algorithms used to compute the energy deposited in the calorimeter. This article presents the implementation of a recurrent neural network (RNN) allowing the reconstruction of the energy deposited in the calorimeter on Stratix 10 FPGAs. The implementation in high level synthesis (HLS) language allowed fast prototyping but fell short of meeting the stringent requirements in terms of resource usage and latency. Further optimisations in Very High-Speed Integrated Circuit Hardware Description Language (VHDL) allowed fulfilment of the requirements of processing 384 channels per FPGA with a latency smaller than 125 ns.
    Spatially heterogeneous learning by a deep student machine. (arXiv:2302.07419v1 [cond-mat.dis-nn])
    Despite the spectacular successes, deep neural networks (DNN) with a huge number of adjustable parameters remain largely black boxes. To shed light on the hidden layers of DNN, we study supervised learning by a DNN of width $N$ and depth $L$ consisting of perceptrons with $c$ inputs by a statistical mechanics approach called the teacher-student setting. We consider an ensemble of student machines that exactly reproduce $M$ sets of $N$ dimensional input/output relations provided by a teacher machine. We analyze the ensemble theoretically using a replica method (H. Yoshino (2020)) and numerically performing greedy Monte Carlo simulations. The replica theory which works on high dimensional data $N \gg 1$ becomes exact in 'dense limit' $N \gg c \gg 1$ and $M \gg 1$ with fixed $\alpha=M/c$. Both the theory and the simulation suggest learning by the DNN is quite heterogeneous in the network space: configurations of the machines are more correlated within the layers closer to the input/output boundaries while the central region remains much less correlated due to over-parametrization. Deep enough systems relax faster thanks to the less correlated central region. Remarkably both the theory and simulation suggest generalization-ability of the student machines does not vanish even in the deep limit $L \gg 1$ where the system becomes strongly over-parametrized. We also consider the impact of effective dimension $D(\leq N)$ of data by incorporating the hidden manifold model (S. Goldt et al (2020)) into our model. The replica theory implies that the loop corrections to the dense limit, which reflect correlations between different nodes in the network, become enhanced by either decreasing the width $\ N$ or decreasing the effective dimension $D$ of the data. Simulation suggests both leads to significant improvements in generalization-ability.
    A Holistic Approach to Undesired Content Detection in the Real World. (arXiv:2208.03274v2 [cs.CL] UPDATED)
    We present a holistic approach to building a robust and useful natural language classification system for real-world content moderation. The success of such a system relies on a chain of carefully designed and executed steps, including the design of content taxonomies and labeling instructions, data quality control, an active learning pipeline to capture rare events, and a variety of methods to make the model robust and to avoid overfitting. Our moderation system is trained to detect a broad set of categories of undesired content, including sexual content, hateful content, violence, self-harm, and harassment. This approach generalizes to a wide range of different content taxonomies and can be used to create high-quality content classifiers that outperform off-the-shelf models.
    Bridging Graph Position Encodings for Transformers with Weighted Graph-Walking Automata. (arXiv:2212.06898v2 [cs.LG] UPDATED)
    A current goal in the graph neural network literature is to enable transformers to operate on graph-structured data, given their success on language and vision tasks. Since the transformer's original sinusoidal positional encodings (PEs) are not applicable to graphs, recent work has focused on developing graph PEs, rooted in spectral graph theory or various spatial features of a graph. In this work, we introduce a new graph PE, Graph Automaton PE (GAPE), based on weighted graph-walking automata (a novel extension of graph-walking automata). We compare the performance of GAPE with other PE schemes on both machine translation and graph-structured tasks, and we show that it generalizes several other PEs. An additional contribution of this study is a theoretical and controlled experimental comparison of many recent PEs in graph transformers, independent of the use of edge features.  ( 2 min )
    Activity Cliff Prediction: Dataset and Benchmark. (arXiv:2302.07541v1 [q-bio.BM])
    Activity cliffs (ACs), which are generally defined as pairs of structurally similar molecules that are active against the same bio-target but significantly different in the binding potency, are of great importance to drug discovery. Up to date, the AC prediction problem, i.e., to predict whether a pair of molecules exhibit the AC relationship, has not yet been fully explored. In this paper, we first introduce ACNet, a large-scale dataset for AC prediction. ACNet curates over 400K Matched Molecular Pairs (MMPs) against 190 targets, including over 20K MMP-cliffs and 380K non-AC MMPs, and provides five subsets for model development and evaluation. Then, we propose a baseline framework to benchmark the predictive performance of molecular representations encoded by deep neural networks for AC prediction, and 16 models are evaluated in experiments. Our experimental results show that deep learning models can achieve good performance when the models are trained on tasks with adequate amount of data, while the imbalanced, low-data and out-of-distribution features of the ACNet dataset still make it challenging for deep neural networks to cope with. In addition, the traditional ECFP method shows a natural advantage on MMP-cliff prediction, and outperforms other deep learning models on most of the data subsets. To the best of our knowledge, our work constructs the first large-scale dataset for AC prediction, which may stimulate the study of AC prediction models and prompt further breakthroughs in AI-aided drug discovery. The codes and dataset can be accessed by https://drugai.github.io/ACNet/.
    Constrained Decision Transformer for Offline Safe Reinforcement Learning. (arXiv:2302.07351v1 [cs.LG])
    Safe reinforcement learning (RL) trains a constraint satisfaction policy by interacting with the environment. We aim to tackle a more challenging problem: learning a safe policy from an offline dataset. We study the offline safe RL problem from a novel multi-objective optimization perspective and propose the $\epsilon$-reducible concept to characterize problem difficulties. The inherent trade-offs between safety and task performance inspire us to propose the constrained decision transformer (CDT) approach, which can dynamically adjust the trade-offs during deployment. Extensive experiments show the advantages of the proposed method in learning an adaptive, safe, robust, and high-reward policy. CDT outperforms its variants and strong offline safe RL baselines by a large margin with the same hyperparameters across all tasks, while keeping the zero-shot adaptation capability to different constraint thresholds, making our approach more suitable for real-world RL under constraints.
    TFormer: A Transmission-Friendly ViT Model for IoT Devices. (arXiv:2302.07734v1 [cs.CV])
    Deploying high-performance vision transformer (ViT) models on ubiquitous Internet of Things (IoT) devices to provide high-quality vision services will revolutionize the way we live, work, and interact with the world. Due to the contradiction between the limited resources of IoT devices and resource-intensive ViT models, the use of cloud servers to assist ViT model training has become mainstream. However, due to the larger number of parameters and floating-point operations (FLOPs) of the existing ViT models, the model parameters transmitted by cloud servers are large and difficult to run on resource-constrained IoT devices. To this end, this paper proposes a transmission-friendly ViT model, TFormer, for deployment on resource-constrained IoT devices with the assistance of a cloud server. The high performance and small number of model parameters and FLOPs of TFormer are attributed to the proposed hybrid layer and the proposed partially connected feed-forward network (PCS-FFN). The hybrid layer consists of nonlearnable modules and a pointwise convolution, which can obtain multitype and multiscale features with only a few parameters and FLOPs to improve the TFormer performance. The PCS-FFN adopts group convolution to reduce the number of parameters. The key idea of this paper is to propose TFormer with few model parameters and FLOPs to facilitate applications running on resource-constrained IoT devices to benefit from the high performance of the ViT models. Experimental results on the ImageNet-1K, MS COCO, and ADE20K datasets for image classification, object detection, and semantic segmentation tasks demonstrate that the proposed model outperforms other state-of-the-art models. Specifically, TFormer-S achieves 5% higher accuracy on ImageNet-1K than ResNet18 with 1.4$\times$ fewer parameters and FLOPs.
    Hybrid Spiking Neural Network Fine-tuning for Hippocampus Segmentation. (arXiv:2302.07328v1 [cs.NE])
    Over the past decade, artificial neural networks (ANNs) have made tremendous advances, in part due to the increased availability of annotated data. However, ANNs typically require significant power and memory consumptions to reach their full potential. Spiking neural networks (SNNs) have recently emerged as a low-power alternative to ANNs due to their sparsity nature. SNN, however, are not as easy to train as ANNs. In this work, we propose a hybrid SNN training scheme and apply it to segment human hippocampi from magnetic resonance images. Our approach takes ANN-SNN conversion as an initialization step and relies on spike-based backpropagation to fine-tune the network. Compared with the conversion and direct training solutions, our method has advantages in both segmentation accuracy and training efficiency. Experiments demonstrate the effectiveness of our model in achieving the design goals.
    Adapting to game trees in zero-sum imperfect information games. (arXiv:2212.12567v2 [stat.ML] UPDATED)
    Imperfect information games (IIG) are games in which each player only partially observes the current game state. We study how to learn $\epsilon$-optimal strategies in a zero-sum IIG through self-play with trajectory feedback. We give a problem-independent lower bound $\widetilde{\mathcal{O}}(H(A_{\mathcal{X}}+B_{\mathcal{Y}})/\epsilon^2)$ on the required number of realizations to learn these strategies with high probability, where $H$ is the length of the game, $A_{\mathcal{X}}$ and $B_{\mathcal{Y}}$ are the total number of actions for the two players. We also propose two Follow the Regularized leader (FTRL) algorithms for this setting: Balanced FTRL which matches this lower bound, but requires the knowledge of the information set structure beforehand to define the regularization; and Adaptive FTRL which needs $\widetilde{\mathcal{O}}(H^2(A_{\mathcal{X}}+B_{\mathcal{Y}})/\epsilon^2)$ realizations without this requirement by progressively adapting the regularization to the observations.  ( 2 min )
    On graph-based reentrancy-free semantic parsing. (arXiv:2302.07679v1 [cs.CL])
    We propose a novel graph-based approach for semantic parsing that resolves two problems observed in the literature: (1) seq2seq models fail on compositional generalization tasks; (2) previous work using phrase structure parsers cannot cover all the semantic parses observed in treebanks. We prove that both MAP inference and latent tag anchoring (required for weakly-supervised learning) are NP-hard problems. We propose two optimization algorithms based on constraint smoothing and conditional gradient to approximately solve these inference problems. Experimentally, our approach delivers state-of-the-art results on Geoquery, Scan and Clevr, both for i.i.d. splits and for splits that test for compositional generalization.
    Scalable Batch Acquisition for Deep Bayesian Active Learning. (arXiv:2301.05490v2 [cs.LG] UPDATED)
    In deep active learning, it is especially important to choose multiple examples to markup at each step to work efficiently, especially on large datasets. At the same time, existing solutions to this problem in the Bayesian setup, such as BatchBALD, have significant limitations in selecting a large number of examples, associated with the exponential complexity of computing mutual information for joint random variables. We, therefore, present the Large BatchBALD algorithm, which gives a well-grounded approximation to the BatchBALD method that aims to achieve comparable quality while being more computationally efficient. We provide a complexity analysis of the algorithm, showing a reduction in computation time, especially for large batches. Furthermore, we present an extensive set of experimental results on image and text data, both on toy datasets and larger ones such as CIFAR-100.
    Do Deep Learning Methods Really Perform Better in Molecular Conformation Generation?. (arXiv:2302.07061v1 [cs.CE] CROSS LISTED)
    Molecular conformation generation (MCG) is a fundamental and important problem in drug discovery. Many traditional methods have been developed to solve the MCG problem, such as systematic searching, model-building, random searching, distance geometry, molecular dynamics, Monte Carlo methods, etc. However, they have some limitations depending on the molecular structures. Recently, there are plenty of deep learning based MCG methods, which claim they largely outperform the traditional methods. However, to our surprise, we design a simple and cheap algorithm (parameter-free) based on the traditional methods and find it is comparable to or even outperforms deep learning based MCG methods in the widely used GEOM-QM9 and GEOM-Drugs benchmarks. In particular, our design algorithm is simply the clustering of the RDKIT-generated conformations. We hope our findings can help the community to revise the deep learning methods for MCG. The code of the proposed algorithm could be found at https://gist.github.com/ZhouGengmo/5b565f51adafcd911c0bc115b2ef027c.  ( 2 min )
    Video Probabilistic Diffusion Models in Projected Latent Space. (arXiv:2302.07685v1 [cs.CV])
    Despite the remarkable progress in deep generative models, synthesizing high-resolution and temporally coherent videos still remains a challenge due to their high-dimensionality and complex temporal dynamics along with large spatial variations. Recent works on diffusion models have shown their potential to solve this challenge, yet they suffer from severe computation- and memory-inefficiency that limit the scalability. To handle this issue, we propose a novel generative model for videos, coined projected latent video diffusion models (PVDM), a probabilistic diffusion model which learns a video distribution in a low-dimensional latent space and thus can be efficiently trained with high-resolution videos under limited resources. Specifically, PVDM is composed of two components: (a) an autoencoder that projects a given video as 2D-shaped latent vectors that factorize the complex cubic structure of video pixels and (b) a diffusion model architecture specialized for our new factorized latent space and the training/sampling procedure to synthesize videos of arbitrary length with a single model. Experiments on popular video generation datasets demonstrate the superiority of PVDM compared with previous video synthesis methods; e.g., PVDM obtains the FVD score of 639.7 on the UCF-101 long video (128 frames) generation benchmark, which improves 1773.4 of the prior state-of-the-art.
    CUTS: Neural Causal Discovery from Irregular Time-Series Data. (arXiv:2302.07458v1 [cs.LG])
    Causal discovery from time-series data has been a central task in machine learning. Recently, Granger causality inference is gaining momentum due to its good explainability and high compatibility with emerging deep neural networks. However, most existing methods assume structured input data and degenerate greatly when encountering data with randomly missing entries or non-uniform sampling frequencies, which hampers their applications in real scenarios. To address this issue, here we present CUTS, a neural Granger causal discovery algorithm to jointly impute unobserved data points and build causal graphs, via plugging in two mutually boosting modules in an iterative framework: (i) Latent data prediction stage: designs a Delayed Supervision Graph Neural Network (DSGNN) to hallucinate and register unstructured data which might be of high dimension and with complex distribution; (ii) Causal graph fitting stage: builds a causal adjacency matrix with imputed data under sparse penalty. Experiments show that CUTS effectively infers causal graphs from unstructured time-series data, with significantly superior performance to existing methods. Our approach constitutes a promising step towards applying causal discovery to real applications with non-ideal observations.
    Continuous PDE Dynamics Forecasting with Implicit Neural Representations. (arXiv:2209.14855v2 [cs.LG] UPDATED)
    Effective data-driven PDE forecasting methods often rely on fixed spatial and / or temporal discretizations. This raises limitations in real-world applications like weather prediction where flexible extrapolation at arbitrary spatiotemporal locations is required. We address this problem by introducing a new data-driven approach, DINo, that models a PDE's flow with continuous-time dynamics of spatially continuous functions. This is achieved by embedding spatial observations independently of their discretization via Implicit Neural Representations in a small latent space temporally driven by a learned ODE. This separate and flexible treatment of time and space makes DINo the first data-driven model to combine the following advantages. It extrapolates at arbitrary spatial and temporal locations; it can learn from sparse irregular grids or manifolds; at test time, it generalizes to new grids or resolutions. DINo outperforms alternative neural PDE forecasters in a variety of challenging generalization scenarios on representative PDE systems.  ( 2 min )
    Sparse-SignSGD with Majority Vote for Communication-Efficient Distributed Learning. (arXiv:2302.07475v1 [cs.LG])
    The training efficiency of complex deep learning models can be significantly improved through the use of distributed optimization. However, this process is often hindered by a large amount of communication cost between workers and a parameter server during iterations. To address this bottleneck, in this paper, we present a new communication-efficient algorithm that offers the synergistic benefits of both sparsification and sign quantization, called ${\sf S}^3$GD-MV. The workers in ${\sf S}^3$GD-MV select the top-$K$ magnitude components of their local gradient vector and only send the signs of these components to the server. The server then aggregates the signs and returns the results via a majority vote rule. Our analysis shows that, under certain mild conditions, ${\sf S}^3$GD-MV can converge at the same rate as signSGD while significantly reducing communication costs, if the sparsification parameter $K$ is properly chosen based on the number of workers and the size of the deep learning model. Experimental results using both independent and identically distributed (IID) and non-IID datasets demonstrate that the ${\sf S}^3$GD-MV attains higher accuracy than signSGD, significantly reducing communication costs. These findings highlight the potential of ${\sf S}^3$GD-MV as a promising solution for communication-efficient distributed optimization in deep learning.
    A Federated Learning Benchmark for Drug-Target Interaction. (arXiv:2302.07684v1 [cs.LG])
    Aggregating pharmaceutical data in the drug-target interaction (DTI) domain has the potential to deliver life-saving breakthroughs. It is, however, notoriously difficult due to regulatory constraints and commercial interests. This work proposes the application of federated learning, which we argue to be reconcilable with the industry's constraints, as it does not require sharing of any information that would reveal the entities' data or any other high-level summary of it. When used on a representative GraphDTA model and the KIBA dataset it achieves up to 15% improved performance relative to the best available non-privacy preserving alternative. Our extensive battery of experiments shows that, unlike in other domains, the non-IID data distribution in the DTI datasets does not deteriorate FL performance. Additionally, we identify a material trade-off between the benefits of adding new data, and the cost of adding more clients.
    Combat AI With AI: Counteract Machine-Generated Fake Restaurant Reviews on Social Media. (arXiv:2302.07731v1 [cs.CL])
    Recent advances in generative models such as GPT may be used to fabricate indistinguishable fake customer reviews at a much lower cost, thus posing challenges for social media platforms to detect these machine-generated fake reviews. We propose to leverage the high-quality elite restaurant reviews verified by Yelp to generate fake reviews from the OpenAI GPT review creator and ultimately fine-tune a GPT output detector to predict fake reviews that significantly outperforms existing solutions. We further apply the model to predict non-elite reviews and identify the patterns across several dimensions, such as review, user and restaurant characteristics, and writing style. We show that social media platforms are continuously challenged by machine-generated fake reviews, although they may implement detection systems to filter out suspicious reviews.
    Bayesian Decision Trees via Tractable Priors and Probabilistic Context-Free Grammars. (arXiv:2302.07407v1 [cs.LG])
    Decision Trees are some of the most popular machine learning models today due to their out-of-the-box performance and interpretability. Often, Decision Trees models are constructed greedily in a top-down fashion via heuristic search criteria, such as Gini impurity or entropy. However, trees constructed in this manner are sensitive to minor fluctuations in training data and are prone to overfitting. In contrast, Bayesian approaches to tree construction formulate the selection process as a posterior inference problem; such approaches are more stable and provide greater theoretical guarantees. However, generating Bayesian Decision Trees usually requires sampling from complex, multimodal posterior distributions. Current Markov Chain Monte Carlo-based approaches for sampling Bayesian Decision Trees are prone to mode collapse and long mixing times, which makes them impractical. In this paper, we propose a new criterion for training Bayesian Decision Trees. Our criterion gives rise to BCART-PCFG, which can efficiently sample decision trees from a posterior distribution across trees given the data and find the maximum a posteriori (MAP) tree. Learning the posterior and training the sampler can be done in time that is polynomial in the dataset size. Once the posterior has been learned, trees can be sampled efficiently (linearly in the number of nodes). At the core of our method is a reduction of sampling the posterior to sampling a derivation from a probabilistic context-free grammar. We find that trees sampled via BCART-PCFG perform comparable to or better than greedily-constructed Decision Trees in classification accuracy on several datasets. Additionally, the trees sampled via BCART-PCFG are significantly smaller -- sometimes by as much as 20x.
    On the Hyperparameters influencing a PINN's generalization beyond the training domain. (arXiv:2302.07557v1 [cs.LG])
    Physics-Informed Neural Networks (PINNs) are Neural Network architectures trained to emulate solutions of differential equations without the necessity of solution data. They are currently ubiquitous in the scientific literature due to their flexible and promising settings. However, very little of the available research provides practical studies that aim for a better quantitative understanding of such architecture and its functioning. In this paper, we analyze the performance of PINNs for various architectural hyperparameters and algorithmic settings based on a novel error metric and other factors such as training time. The proposed metric and approach are tailored to evaluate how well a PINN generalizes to points outside its training domain. Besides, we investigate the effect of the algorithmic setup on the outcome prediction of a PINN, inside and outside its training domain, to explore the effect of each hyperparameter. Through our study, we assess how the algorithmic setup of PINNs influences their potential for generalization and deduce the settings which maximize the potential of a PINN for accurate generalization. The study that we present returns insightful and at times counterintuitive results on PINNs. These results can be useful in PINN applications when defining the model and evaluating it.
    Reinforcement Learning Based Power Grid Day-Ahead Planning and AI-Assisted Control. (arXiv:2302.07654v1 [cs.AI])
    The ongoing transition to renewable energy is increasing the share of fluctuating power sources like wind and solar, raising power grid volatility and making grid operation increasingly complex and costly. In our prior work, we have introduced a congestion management approach consisting of a redispatching optimizer combined with a machine learning-based topology optimization agent. Compared to a typical redispatching-only agent, it was able to keep a simulated grid in operation longer while at the same time reducing operational cost. Our approach also ranked 1st in the L2RPN 2022 competition initiated by RTE, Europe's largest grid operator. The aim of this paper is to bring this promising technology closer to the real world of power grid operation. We deploy RL-based agents in two settings resembling established workflows, AI-assisted day-ahead planning and realtime control, in an attempt to show the benefits and caveats of this new technology. We then analyse congestion, redispatching and switching profiles, and elementary sensitivity analysis providing a glimpse of operation robustness. While there is still a long way to a real control room, we believe that this paper and the associated prototypes help to narrow the gap and pave the way for a safe deployment of RL agents in tomorrow's power grids.
    Deep Convolutional Neural Network for Plume Rise Measurements in Industrial Environments. (arXiv:2302.07416v1 [cs.LG])
    The estimation of plume cloud height is essential for air-quality transport models, local environmental assessment cases, and global climate models. When pollutants are released by a smokestack, plume rise is the constant height at which the plume cloud is carried downwind as its momentum dissipates and the temperatures of the plume cloud and the ambient equalize. Although different parameterizations and equations are used in most air quality models to predict plume rise, verification of these parameterizations has been limited in the past three decades. Beyond validation, there is also value in real-time measurement of plume rise to improve the accuracy of air quality forecasting. In this paper, we propose a low-cost measurement technology that can monitor smokestack plumes and make long-term, real-time measurements of plume rise, improving predictability. To do this, a two-stage method is developed based on deep convolutional neural networks. In the first stage, an improved Mask R-CNN is applied to detect the plume cloud borders and distinguish the plume from its background and other objects. This proposed model is called Deep Plume Rise Net (DPRNet). In the second stage, a geometric transformation phase is applied through the wind direction information from a nearby monitoring station to obtain real-life measurements of different parameters. Finally, the plume cloud boundaries are obtained to calculate the plume rise. Various images with different atmospheric conditions, including day, night, cloudy, and foggy, have been selected for DPRNet training algorithm. Obtained results show the proposed method outperforms widely-used networks in plume cloud border detection and recognition.
    Revisiting Initializing Then Refining: An Incomplete and Missing Graph Imputation Network. (arXiv:2302.07524v1 [cs.AI])
    With the development of various applications, such as social networks and knowledge graphs, graph data has been ubiquitous in the real world. Unfortunately, graphs usually suffer from being absent due to privacy-protecting policies or copyright restrictions during data collection. The absence of graph data can be roughly categorized into attribute-incomplete and attribute-missing circumstances. Specifically, attribute-incomplete indicates that a part of the attribute vectors of all nodes are incomplete, while attribute-missing indicates that the whole attribute vectors of partial nodes are missing. Although many efforts have been devoted, none of them is custom-designed for a common situation where both types of graph data absence exist simultaneously. To fill this gap, we develop a novel network termed Revisiting Initializing Then Refining (RITR), where we complete both attribute-incomplete and attribute-missing samples under the guidance of a novel initializing-then-refining imputation criterion. Specifically, to complete attribute-incomplete samples, we first initialize the incomplete attributes using Gaussian noise before network learning, and then introduce a structure-attribute consistency constraint to refine incomplete values by approximating a structure-attribute correlation matrix to a high-order structural matrix. To complete attribute-missing samples, we first adopt structure embeddings of attribute-missing samples as the embedding initialization, and then refine these initial values by adaptively aggregating the reliable information of attribute-incomplete samples according to a dynamic affinity structure. To the best of our knowledge, this newly designed method is the first unsupervised framework dedicated to handling hybrid-absent graphs. Extensive experiments on four datasets have verified that our methods consistently outperform existing state-of-the-art competitors.
    A model-free feature selection technique of feature screening and random forest based recursive feature elimination. (arXiv:2302.07449v1 [stat.ME])
    In this paper, we propose a model-free feature selection method for ultra-high dimensional data with mass features. This is a two phases procedure that we propose to use the fused Kolmogorov filter with the random forest based RFE to remove model limitations and reduce the computational complexity. The method is fully nonparametric and can work with various types of datasets. It has several appealing characteristics, i.e., accuracy, model-free, and computational efficiency, and can be widely used in practical problems, such as multiclass classification, nonparametric regression, and Poisson regression, among others. We show that the proposed method is selection consistent and $L_2$ consistent under weak regularity conditions. We further demonstrate the superior performance of the proposed method over other existing methods by simulations and real data examples.
    Excess risk bound for deep learning under weak dependence. (arXiv:2302.07503v1 [stat.ML])
    This paper considers deep neural networks for learning weakly dependent processes in a general framework that includes, for instance, regression estimation, time series prediction, time series classification. The $\psi$-weak dependence structure considered is quite large and covers other conditions such as mixing, association,$\ldots$ Firstly, the approximation of smooth functions by deep neural networks with a broad class of activation functions is considered. We derive the required depth, width and sparsity of a deep neural network to approximate any H\"{o}lder smooth function, defined on any compact set $\mx$. Secondly, we establish a bound of the excess risk for the learning of weakly dependent observations by deep neural networks. When the target function is sufficiently smooth, this bound is close to the usual $\mathcal{O}(n^{-1/2})$.
    Contrastive Learning Can Find An Optimal Basis For Approximately View-Invariant Functions. (arXiv:2210.01883v2 [cs.LG] UPDATED)
    Contrastive learning is a powerful framework for learning self-supervised representations that generalize well to downstream supervised tasks. We show that multiple existing contrastive learning methods can be reinterpreted as learning kernel functions that approximate a fixed positive-pair kernel. We then prove that a simple representation obtained by combining this kernel with PCA provably minimizes the worst-case approximation error of linear predictors, under a straightforward assumption that positive pairs have similar labels. Our analysis is based on a decomposition of the target function in terms of the eigenfunctions of a positive-pair Markov chain, and a surprising equivalence between these eigenfunctions and the output of Kernel PCA. We give generalization bounds for downstream linear prediction using our Kernel PCA representation, and show empirically on a set of synthetic tasks that applying Kernel PCA to contrastive learning models can indeed approximately recover the Markov chain eigenfunctions, although the accuracy depends on the kernel parameterization as well as on the augmentation strength.  ( 2 min )
    PAC-Bayesian Learning of Optimization Algorithms. (arXiv:2210.11113v2 [cs.LG] UPDATED)
    We apply the PAC-Bayes theory to the setting of learning-to-optimize. To the best of our knowledge, we present the first framework to learn optimization algorithms with provable generalization guarantees (PAC-bounds) and explicit trade-off between a high probability of convergence and a high convergence speed. Even in the limit case, where convergence is guaranteed, our learned optimization algorithms provably outperform related algorithms based on a (deterministic) worst-case analysis. Our results rely on PAC-Bayes bounds for general, unbounded loss-functions based on exponential families. By generalizing existing ideas, we reformulate the learning procedure into a one-dimensional minimization problem and study the possibility to find a global minimum, which enables the algorithmic realization of the learning procedure. As a proof-of-concept, we learn hyperparameters of standard optimization algorithms to empirically underline our theory.  ( 2 min )
    A Case Study on Designing Evaluations of ML Explanations with Simulated User Studies. (arXiv:2302.07444v1 [cs.LG])
    When conducting user studies to ascertain the usefulness of model explanations in aiding human decision-making, it is important to use real-world use cases, data, and users. However, this process can be resource-intensive, allowing only a limited number of explanation methods to be evaluated. Simulated user evaluations (SimEvals), which use machine learning models as a proxy for human users, have been proposed as an intermediate step to select promising explanation methods. In this work, we conduct the first SimEvals on a real-world use case to evaluate whether explanations can better support ML-assisted decision-making in e-commerce fraud detection. We study whether SimEvals can corroborate findings from a user study conducted in this fraud detection context. In particular, we find that SimEvals suggest that all considered explainers are equally performant, and none beat a baseline without explanations -- this matches the conclusions of the original user study. Such correspondences between our results and the original user study provide initial evidence in favor of using SimEvals before running user studies. We also explore the use of SimEvals as a cheap proxy to explore an alternative user study set-up. We hope that this work motivates further study of when and how SimEvals should be used to aid in the design of real-world evaluations.  ( 2 min )
    Qualitative Data Augmentation for Performance Prediction in VLSI circuits. (arXiv:2302.07566v1 [cs.LG])
    Various studies have shown the advantages of using Machine Learning (ML) techniques for analog and digital IC design automation and optimization. Data scarcity is still an issue for electronic designs, while training highly accurate ML models. This work proposes generating and evaluating artificial data using generative adversarial networks (GANs) for circuit data to aid and improve the accuracy of ML models trained with a small training data set. The training data is obtained by various simulations in the Cadence Virtuoso, HSPICE, and Microcap design environment with TSMC 180nm and 22nm CMOS technology nodes. The artificial data is generated and tested for an appropriate set of analog and digital circuits. The experimental results show that the proposed artificial data generation significantly improves ML models and reduces the percentage error by more than 50\% of the original percentage error, which were previously trained with insufficient data. Furthermore, this research aims to contribute to the extensive application of AI/ML in the field of VLSI design and technology by relieving the training data availability-related challenges.
    The Geometry of Neural Nets' Parameter Spaces Under Reparametrization. (arXiv:2302.07384v1 [cs.LG])
    Model reparametrization -- transforming the parameter space via a bijective differentiable map -- is a popular way to improve the training of neural networks. But reparametrizations have also been problematic since they induce inconsistencies in, e.g., Hessian-based flatness measures, optimization trajectories, and modes of probability density functions. This complicates downstream analyses, e.g. one cannot make a definitive statement about the connection between flatness and generalization. In this work, we study the invariance quantities of neural nets under reparametrization from the perspective of Riemannian geometry. We show that this notion of invariance is an inherent property of any neural net, as long as one acknowledges the assumptions about the metric that is always present, albeit often implicitly, and uses the correct transformation rules under reparametrization. We present discussions on measuring the flatness of minima, in optimization, and in probability-density maximization, along with applications in studying the biases of optimizers and in Bayesian inference.
    Improved Online Conformal Prediction via Strongly Adaptive Online Learning. (arXiv:2302.07869v1 [cs.LG])
    We study the problem of uncertainty quantification via prediction sets, in an online setting where the data distribution may vary arbitrarily over time. Recent work develops online conformal prediction techniques that leverage regret minimization algorithms from the online learning literature to learn prediction sets with approximately valid coverage and small regret. However, standard regret minimization could be insufficient for handling changing environments, where performance guarantees may be desired not only over the full time horizon but also in all (sub-)intervals of time. We develop new online conformal prediction methods that minimize the strongly adaptive regret, which measures the worst-case regret over all intervals of a fixed length. We prove that our methods achieve near-optimal strongly adaptive regret for all interval lengths simultaneously, and approximately valid coverage. Experiments show that our methods consistently obtain better coverage and smaller prediction sets than existing methods on real-world tasks, such as time series forecasting and image classification under distribution shift.  ( 2 min )
    SoK: Anti-Facial Recognition Technology. (arXiv:2112.04558v2 [cs.CR] UPDATED)
    The rapid adoption of facial recognition (FR) technology by both government and commercial entities in recent years has raised concerns about civil liberties and privacy. In response, a broad suite of so-called "anti-facial recognition" (AFR) tools has been developed to help users avoid unwanted facial recognition. The set of AFR tools proposed in the last few years is wide-ranging and rapidly evolving, necessitating a step back to consider the broader design space of AFR systems and long-term challenges. This paper aims to fill that gap and provides the first comprehensive analysis of the AFR research landscape. Using the operational stages of FR systems as a starting point, we create a systematic framework for analyzing the benefits and tradeoffs of different AFR approaches. We then consider both technical and social challenges facing AFR tools and propose directions for future research in this field.  ( 2 min )
    Function-space regularized R\'enyi divergences. (arXiv:2210.04974v2 [stat.ML] UPDATED)
    We propose a new family of regularized R\'enyi divergences parametrized not only by the order $\alpha$ but also by a variational function space. These new objects are defined by taking the infimal convolution of the standard R\'enyi divergence with the integral probability metric (IPM) associated with the chosen function space. We derive a novel dual variational representation that can be used to construct numerically tractable divergence estimators. This representation avoids risk-sensitive terms and therefore exhibits lower variance, making it well-behaved when $\alpha>1$; this addresses a notable weakness of prior approaches. We prove several properties of these new divergences, showing that they interpolate between the classical R\'enyi divergences and IPMs. We also study the $\alpha\to\infty$ limit, which leads to a regularized worst-case-regret and a new variational representation in the classical case. Moreover, we show that the proposed regularized R\'enyi divergences inherit features from IPMs such as the ability to compare distributions that are not absolutely continuous, e.g., empirical measures and distributions with low-dimensional support. We present numerical results on both synthetic and real datasets, showing the utility of these new divergences in both estimation and GAN training applications; in particular, we demonstrate significantly reduced variance and improved training performance.  ( 2 min )
    Replicable Bandits. (arXiv:2210.01898v2 [cs.LG] UPDATED)
    In this paper, we introduce the notion of replicable policies in the context of stochastic bandits, one of the canonical problems in interactive learning. A policy in the bandit environment is called replicable if it pulls, with high probability, the exact same sequence of arms in two different and independent executions (i.e., under independent reward realizations). We show that not only do replicable policies exist, but also they achieve almost the same optimal (non-replicable) regret bounds in terms of the time horizon. More specifically, in the stochastic multi-armed bandits setting, we develop a policy with an optimal problem-dependent regret bound whose dependence on the replicability parameter is also optimal. Similarly, for stochastic linear bandits (with finitely and infinitely many arms) we develop replicable policies that achieve the best-known problem-independent regret bounds with an optimal dependency on the replicability parameter. Our results show that even though randomization is crucial for the exploration-exploitation trade-off, an optimal balance can still be achieved while pulling the exact same arms in two different rounds of executions.  ( 2 min )
    Self-supervised learning of object pose estimation using keypoint prediction. (arXiv:2302.07360v1 [cs.CV])
    This paper describes recent developments in object specific pose and shape prediction from single images. The main contribution is a new approach to camera pose prediction by self-supervised learning of keypoints corresponding to locations on a category specific deformable shape. We designed a network to generate a proxy ground-truth heatmap from a set of keypoints distributed all over the category-specific mean shape, where each is represented by a unique color on a labeled texture. The proxy ground-truth heatmap is used to train a deep keypoint prediction network, which can be used in online inference. The proposed approach to camera pose prediction show significant improvements when compared with state-of-the-art methods. Our approach to camera pose prediction is used to infer 3D objects from 2D image frames of video sequences online. To train the reconstruction model, it receives only a silhouette mask from a single frame of a video sequence in every training step and a category-specific mean object shape. We conducted experiments using three different datasets representing the bird category: the CUB [51] image dataset, YouTubeVos and the Davis video datasets. The network is trained on the CUB dataset and tested on all three datasets. The online experiments are demonstrated on YouTubeVos and Davis [56] video sequences using a network trained on the CUB training set.  ( 2 min )
    Accelerating Hamiltonian Monte Carlo via Chebyshev Integration Time. (arXiv:2207.02189v2 [cs.LG] UPDATED)
    Hamiltonian Monte Carlo (HMC) is a popular method in sampling. While there are quite a few works of studying this method on various aspects, an interesting question is how to choose its integration time to achieve acceleration. In this work, we consider accelerating the process of sampling from a distribution $\pi(x) \propto \exp(-f(x))$ via HMC via time-varying integration time. When the potential $f$ is $L$-smooth and $m$-strongly convex, i.e.\ for sampling from a log-smooth and strongly log-concave target distribution $\pi$, it is known that under a constant integration time, the number of iterations that ideal HMC takes to get an $\epsilon$ Wasserstein-2 distance to the target $\pi$ is $O( \kappa \log \frac{1}{\epsilon} )$, where $\kappa := \frac{L}{m}$ is the condition number. We propose a scheme of time-varying integration time based on the roots of Chebyshev polynomials. We show that in the case of quadratic potential $f$, i.e., when the target $\pi$ is a Gaussian distribution, ideal HMC with this choice of integration time only takes $O( \sqrt{\kappa} \log \frac{1}{\epsilon} )$ number of iterations to reach Wasserstein-2 distance less than $\epsilon$; this improvement on the dependence on condition number is akin to acceleration in optimization. The design and analysis of HMC with the proposed integration time is built on the tools of Chebyshev polynomials. Experiments find the advantage of adopting our scheme of time-varying integration time even for sampling from distributions with smooth strongly convex potentials that are not quadratic.  ( 2 min )
    3D Molecular Generation via Virtual Dynamics. (arXiv:2302.05847v1 [q-bio.BM] CROSS LISTED)
    Structure-based drug design, i.e., finding molecules with high affinities to the target protein pocket, is one of the most critical tasks in drug discovery. Traditional solutions, like virtual screening, require exhaustively searching on a large molecular database, which are inefficient and cannot return novel molecules beyond the database. The pocket-based 3D molecular generation model, i.e., directly generating a molecule with a 3D structure and binding position in the pocket, is a new promising way to address this issue. Herein, we propose VD-Gen, a novel pocket-based 3D molecular generation pipeline. VD-Gen consists of several carefully designed stages to generate fine-grained 3D molecules with binding positions in the pocket cavity end-to-end. Rather than directly generating or sampling atoms with 3D positions in the pocket like in early attempts, in VD-Gen, we first randomly initialize many virtual particles in the pocket; then iteratively move these virtual particles, making the distribution of virtual particles approximate the distribution of molecular atoms. After virtual particles are stabilized in 3D space, we extract a 3D molecule from them. Finally, we further refine atoms in the extracted molecule by iterative movement again, to get a high-quality 3D molecule, and predict a confidence score for it. Extensive experiment results on pocket-based molecular generation demonstrate that VD-Gen can generate novel 3D molecules to fill the target pocket cavity with high binding affinities, significantly outperforming previous baselines.  ( 2 min )
    Learning When to Say "I Don't Know". (arXiv:2209.04944v2 [cs.CV] UPDATED)
    We propose a new Reject Option Classification technique to identify and remove regions of uncertainty in the decision space for a given neural classifier and dataset. Such existing formulations employ a learned rejection (remove)/selection (keep) function and require either a known cost for rejecting examples or strong constraints on the accuracy or coverage of the selected examples. We consider an alternative formulation by instead analyzing the complementary reject region and employing a validation set to learn per-class softmax thresholds. The goal is to maximize the accuracy of the selected examples subject to a natural randomness allowance on the rejected examples (rejecting more incorrect than correct predictions). We provide results showing the benefits of the proposed method over na\"ively thresholding calibrated/uncalibrated softmax scores with 2-D points, imagery, and text classification datasets using state-of-the-art pretrained models. Source code is available at https://github.com/osu-cvl/learning-idk.
    PANCETTA: Phoneme Aware Neural Completion to Elicit Tongue Twisters Automatically. (arXiv:2209.06275v2 [cs.CL] UPDATED)
    Tongue twisters are meaningful sentences that are difficult to pronounce. The process of automatically generating tongue twisters is challenging since the generated utterance must satisfy two conditions at once: phonetic difficulty and semantic meaning. Furthermore, phonetic difficulty is itself hard to characterize and is expressed in natural tongue twisters through a heterogeneous mix of phenomena such as alliteration and homophony. In this paper, we propose PANCETTA: Phoneme Aware Neural Completion to Elicit Tongue Twisters Automatically. We leverage phoneme representations to capture the notion of phonetic difficulty, and we train language models to generate original tongue twisters on two proposed task settings. To do this, we curate a dataset called PANCETTA, consisting of existing English tongue twisters. Through automatic and human evaluation, as well as qualitative analysis, we show that PANCETTA generates novel, phonetically difficult, fluent, and semantically meaningful tongue twisters.  ( 2 min )
    Trust Region Bounds for Decentralized PPO Under Non-stationarity. (arXiv:2202.00082v3 [cs.LG] UPDATED)
    We present trust region bounds for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL), which holds even when the transition dynamics are non-stationary. This new analysis provides a theoretical understanding of the strong performance of two recent actor-critic methods for MARL, which both rely on independent ratios, i.e., computing probability ratios separately for each agent's policy. We show that, despite the non-stationarity that independent ratios cause, a monotonic improvement guarantee still arises as a result of enforcing the trust region constraint over all decentralized policies. We also show this trust region constraint can be effectively enforced in a principled way by bounding independent ratios based on the number of agents in training, providing a theoretical foundation for proximal ratio clipping. Finally, our empirical results support the hypothesis that the strong performance of IPPO and MAPPO is a direct result of enforcing such a trust region constraint via clipping in centralized training, and tuning the hyperparameters with regards to the number of agents, as predicted by our theoretical analysis.  ( 2 min )
    Maximising Weather Forecasting Accuracy through the Utilisation of Graph Neural Networks and Dynamic GNNs. (arXiv:2301.12471v2 [cs.LG] UPDATED)
    Weather forecasting is an essential task to tackle global climate change. Weather forecasting requires the analysis of multivariate data generated by heterogeneous meteorological sensors. These sensors comprise of ground-based sensors, radiosonde, and sensors mounted on satellites, etc., To analyze the data generated by these sensors we use Graph Neural Networks (GNNs) based weather forecasting model. GNNs are graph learning-based models which show strong empirical performance in many machine learning approaches. In this research, we investigate the performance of weather forecasting using GNNs and traditional Machine learning-based models.
    Continuized Acceleration for Quasar Convex Functions in Non-Convex Optimization. (arXiv:2302.07851v1 [math.OC])
    Quasar convexity is a condition that allows some first-order methods to efficiently minimize a function even when the optimization landscape is non-convex. Previous works develop near-optimal accelerated algorithms for minimizing this class of functions, however, they require a subroutine of binary search which results in multiple calls to gradient evaluations in each iteration, and consequently the total number of gradient evaluations does not match a known lower bound. In this work, we show that a recently proposed continuized Nesterov acceleration can be applied to minimizing quasar convex functions and achieves the optimal bound with a high probability. Furthermore, we find that the objective functions of training generalized linear models (GLMs) satisfy quasar convexity, which broadens the applicability of the relevant algorithms, while known practical examples of quasar convexity in non-convex learning are sparse in the literature. We also show that if a smooth and one-point strongly convex, Polyak-Lojasiewicz, or quadratic-growth function satisfies quasar convexity, then attaining an accelerated linear rate for minimizing the function is possible under certain conditions, while acceleration is not known in general for these classes of functions.  ( 2 min )
    Human Biophysics as Network Weights: Conditional Generative Models for Dynamic Simulation. (arXiv:2211.01856v2 [cs.LG] UPDATED)
    Simulations of biophysical systems have provided a huge contribution to our fundamental understanding of human physiology and remain a central pillar for developments in medical devices and human machine interfaces. However, despite their successes, such simulations usually rely on highly computationally expensive numerical modelling, which is often inefficient to adapt to new simulation parameters. This limits their use in simulating dynamic human behaviours, which typically proceed along a sequence of small time steps. One may painstakingly produce a few static simulations at discretised stages, but not the hundreds of simulations that are essential to capture the dynamic nature of human body. We propose that an alternative approach is to use conditional generative models, which can learn complex relationships between the underlying generative conditions and the output data whilst remaining inexpensive to sample from. As a demonstration of this concept, we present BioMime, a hybrid-structured generative model that combines elements of deep latent variable models and conditional adversarial training. We demonstrate that BioMime can learn to accurately mimic a complex numerical model of human muscle biophysics and then use this knowledge to continuously sample from a dynamically changing system in a short time. This ultimately converts a static model into a dynamic one with no effort. We argue that transfer learning approaches with conditional generative models are a viable solution for dynamic simulation with any numerical model.  ( 2 min )
    Uncertainty-Estimation with Normalized Logits for Out-of-Distribution Detection. (arXiv:2302.07608v1 [cs.LG])
    Out-of-distribution (OOD) detection is critical for preventing deep learning models from making incorrect predictions to ensure the safety of artificial intelligence systems. Especially in safety-critical applications such as medical diagnosis and autonomous driving, the cost of incorrect decisions is usually unbearable. However, neural networks often suffer from the overconfidence issue, making high confidence for OOD data which are never seen during training process and may be irrelevant to training data, namely in-distribution (ID) data. Determining the reliability of the prediction is still a difficult and challenging task. In this work, we propose Uncertainty-Estimation with Normalized Logits (UE-NL), a robust learning method for OOD detection, which has three main benefits. (1) Neural networks with UE-NL treat every ID sample equally by predicting the uncertainty score of input data and the uncertainty is added into softmax function to adjust the learning strength of easy and hard samples during training phase, making the model learn robustly and accurately. (2) UE-NL enforces a constant vector norm on the logits to decouple the effect of the increasing output norm from optimization process, which causes the overconfidence issue to some extent. (3) UE-NL provides a new metric, the magnitude of uncertainty score, to detect OOD data. Experiments demonstrate that UE-NL achieves top performance on common OOD benchmarks and is more robust to noisy ID data that may be misjudged as OOD data by other methods.  ( 2 min )
    Towards Understanding GD with Hard and Conjugate Pseudo-labels for Test-Time Adaptation. (arXiv:2210.10019v2 [cs.LG] UPDATED)
    We consider a setting that a model needs to adapt to a new domain under distribution shifts, given that only unlabeled test samples from the new domain are accessible at test time. A common idea in most of the related works is constructing pseudo-labels for the unlabeled test samples and applying gradient descent (GD) to a loss function with the pseudo-labels. Recently, \cite{GSRK22} propose conjugate labels, which is a new kind of pseudo-labels for self-training at test time. They empirically show that the conjugate label outperforms other ways of pseudo-labeling on many domain adaptation benchmarks. However, provably showing that GD with conjugate labels learns a good classifier for test-time adaptation remains open. In this work, we aim at theoretically understanding GD with hard and conjugate labels for a binary classification problem. We show that for square loss, GD with conjugate labels converges to an $\epsilon$-optimal predictor under a Gaussian model for any arbitrarily small $\epsilon$, while GD with hard pseudo-labels fails in this task. We also analyze them under different loss functions for the update. Our results shed lights on understanding when and why GD with hard labels or conjugate labels works in test-time adaptation.  ( 2 min )
    Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction. (arXiv:2302.07817v1 [cs.CV])
    Modern methods for vision-centric autonomous driving perception widely adopt the bird's-eye-view (BEV) representation to describe a 3D scene. Despite its better efficiency than voxel representation, it has difficulty describing the fine-grained 3D structure of a scene with a single plane. To address this, we propose a tri-perspective view (TPV) representation which accompanies BEV with two additional perpendicular planes. We model each point in the 3D space by summing its projected features on the three planes. To lift image features to the 3D TPV space, we further propose a transformer-based TPV encoder (TPVFormer) to obtain the TPV features effectively. We employ the attention mechanism to aggregate the image features corresponding to each query in each TPV plane. Experiments show that our model trained with sparse supervision effectively predicts the semantic occupancy for all voxels. We demonstrate for the first time that using only camera inputs can achieve comparable performance with LiDAR-based methods on the LiDAR segmentation task on nuScenes. Code: https://github.com/wzzheng/TPVFormer.
    Efficient Offline Policy Optimization with a Learned Model. (arXiv:2210.05980v2 [cs.LG] UPDATED)
    MuZero Unplugged presents a promising approach for offline policy learning from logged data. It conducts Monte-Carlo Tree Search (MCTS) with a learned model and leverages Reanalyze algorithm to learn purely from offline data. For good performance, MCTS requires accurate learned models and a large number of simulations, thus costing huge computing time. This paper investigates a few hypotheses where MuZero Unplugged may not work well under the offline RL settings, including 1) learning with limited data coverage; 2) learning from offline data of stochastic environments; 3) improperly parameterized models given the offline data; 4) with a low compute budget. We propose to use a regularized one-step look-ahead approach to tackle the above issues. Instead of planning with the expensive MCTS, we use the learned model to construct an advantage estimation based on a one-step rollout. Policy improvements are towards the direction that maximizes the estimated advantage with regularization of the dataset. We conduct extensive empirical studies with BSuite environments to verify the hypotheses and then run our algorithm on the RL Unplugged Atari benchmark. Experimental results show that our proposed approach achieves stable performance even with an inaccurate learned model. On the large-scale Atari benchmark, the proposed method outperforms MuZero Unplugged by 43%. Most significantly, it uses only 5.6% wall-clock time (i.e., 1 hour) compared to MuZero Unplugged (i.e., 17.8 hours) to achieve a 150% IQM normalized score with the same hardware and software stacks. Our implementation is open-sourced at https://github.com/sail-sg/rosmo.  ( 2 min )
    Team Triple-Check at Factify 2: Parameter-Efficient Large Foundation Models with Feature Representations for Multi-Modal Fact Verification. (arXiv:2302.07740v1 [cs.CL])
    Multi-modal fact verification has become an important but challenging issue on social media due to the mismatch between the text and images in the misinformation of news content, which has been addressed by considering cross-modalities to identify the veracity of the news in recent years. In this paper, we propose the Pre-CoFactv2 framework with new parameter-efficient foundation models for modeling fine-grained text and input embeddings with lightening parameters, multi-modal multi-type fusion for not only capturing relations for the same and different modalities but also for different types (i.e., claim and document), and feature representations for explicitly providing metadata for each sample. In addition, we introduce a unified ensemble method to boost model performance by adjusting the importance of each trained model with not only the weights but also the powers. Extensive experiments show that Pre-CoFactv2 outperforms Pre-CoFact by a large margin and achieved new state-of-the-art results at the Factify challenge at AAAI 2023. We further illustrate model variations to verify the relative contributions of different components. Our team won the first prize (F1-score: 81.82%) and we made our code publicly available at https://github.com/wwweiwei/Pre-CoFactv2-AAAI-2023.
    A Neural Pre-Conditioning Active Learning Algorithm to Reduce Label Complexity. (arXiv:2104.03525v2 [cs.LG] UPDATED)
    Deep learning (DL) algorithms rely on massive amounts of labeled data. Semi-supervised learning (SSL) and active learning (AL) aim to reduce this label complexity by leveraging unlabeled data or carefully acquiring labels, respectively. In this work, we primarily focus on designing an AL algorithm but first argue for a change in how AL algorithms should be evaluated. Although unlabeled data is readily available in pool-based AL, AL algorithms are usually evaluated by measuring the increase in supervised learning (SL) performance at consecutive acquisition steps. Because this measures performance gains from both newly acquired instances and newly acquired labels, we propose to instead evaluate the label efficiency of AL algorithms by measuring the increase in SSL performance at consecutive acquisition steps. After surveying tools that can be used to this end, we propose our neural pre-conditioning (NPC) algorithm inspired by a Neural Tangent Kernel (NTK) analysis. Our algorithm incorporates the classifier's uncertainty on unlabeled data and penalizes redundant samples within candidate batches to efficiently acquire a diverse set of informative labels. Furthermore, we prove that NPC improves downstream training in the large-width regime in a manner previously observed to correlate with generalization. Comparisons with other AL algorithms show that a state-of-the-art SSL algorithm coupled with NPC can achieve high performance using very few labeled data.
    Knowledge Enhanced Semantic Communication Receiver. (arXiv:2302.07727v1 [cs.CL])
    In recent years, with the rapid development of deep learning and natural language processing technologies, semantic communication has become a topic of great interest in the field of communication. Although existing deep learning based semantic communication approaches have shown many advantages, they still do not make sufficient use of prior knowledge. Moreover, most existing semantic communication methods focus on the semantic encoding at the transmitter side, while we believe that the semantic decoding capability of the receiver side should also be concerned. In this paper, we propose a knowledge enhanced semantic communication framework in which the receiver can more actively utilize the prior knowledge in the knowledge base for semantic reasoning and decoding, without extra modifications to the neural network structure of the transmitter. Specifically, we design a transformer-based knowledge extractor to find relevant factual triples for the received noisy signal. Extensive simulation results on the WebNLG dataset demonstrate that the proposed receiver yields superior performance on top of the knowledge graph enhanced decoding.
    Deep Offline Reinforcement Learning for Real-World Treatment Optimization Applications. (arXiv:2302.07549v1 [cs.LG])
    There is increasing interest in data-driven approaches for dynamically choosing optimal treatment strategies in many chronic disease management and critical care applications. Reinforcement learning methods are well-suited to this sequential decision-making problem, but must be trained and evaluated exclusively on retrospective medical record datasets as direct online exploration is unsafe and infeasible. Despite this requirement, the vast majority of dynamic treatment optimization studies use off-policy RL methods (e.g., Double Deep Q Networks (DDQN) or its variants) that are known to perform poorly in purely offline settings. Recent advances in offline RL, such as Conservative Q-Learning (CQL), offer a suitable alternative. But there remain challenges in adapting these approaches to real-world applications where suboptimal examples dominate the retrospective dataset and strict safety constraints need to be satisfied. In this work, we introduce a practical transition sampling approach to address action imbalance during offline RL training, and an intuitive heuristic to enforce hard constraints during policy execution. We provide theoretical analyses to show that our proposed approach would improve over CQL. We perform extensive experiments on two real-world tasks for diabetes and sepsis treatment optimization to compare performance of the proposed approach against prominent off-policy and offline RL baselines (DDQN and CQL). Across a range of principled and clinically relevant metrics, we show that our proposed approach enables substantial improvements in expected health outcomes and in consistency with relevant practice and safety guidelines.
    Guaranteed Dynamic Scheduling of Ultra-Reliable Low-Latency Traffic via Conformal Prediction. (arXiv:2302.07675v1 [eess.SP])
    The dynamic scheduling of ultra-reliable and low-latency traffic (URLLC) in the uplink can significantly enhance the efficiency of coexisting services, such as enhanced mobile broadband (eMBB) devices, by only allocating resources when necessary. The main challenge is posed by the uncertainty in the process of URLLC packet generation, which mandates the use of predictors for URLLC traffic in the coming frames. In practice, such prediction may overestimate or underestimate the amount of URLLC data to be generated, yielding either an excessive or an insufficient amount of resources to be pre-emptively allocated for URLLC packets. In this paper, we introduce a novel scheduler for URLLC packets that provides formal guarantees on reliability and latency irrespective of the quality of the URLLC traffic predictor. The proposed method leverages recent advances in online conformal prediction (CP), and follows the principle of dynamically adjusting the amount of allocated resources so as to meet reliability and latency requirements set by the designer.
    Envisioning the Next-Gen Document Reader. (arXiv:2302.07492v1 [cs.CL])
    People read digital documents on a daily basis to share, exchange, and understand information in electronic settings. However, current document readers create a static, isolated reading experience, which does not support users' goals of gaining more knowledge and performing additional tasks through document interaction. In this work, we present our vision for the next-gen document reader that strives to enhance user understanding and create a more connected, trustworthy information experience. We describe 18 NLP-powered features to add to existing document readers and propose a novel plug-in marketplace that allows users to further customize their reading experience, as demonstrated through 3 exploratory UI prototypes available at https://github.com/catherinesyeh/nextgen-prototypes
    Graph schemas as abstractions for transfer learning, inference, and planning. (arXiv:2302.07350v1 [cs.AI])
    We propose schemas as a model for abstractions that can be used for rapid transfer learning, inference, and planning. Common structured representations of concepts and behaviors -- schemas -- have been proposed as a powerful way to encode abstractions. Latent graph learning is emerging as a new computational model of the hippocampus to explain map learning and transitive inference. We build on this work to show that learned latent graphs in these models have a slot structure -- schemas -- that allow for quick knowledge transfer across environments. In a new environment, an agent can rapidly learn new bindings between the sensory stream to multiple latent schemas and select the best fitting one to guide behavior. To evaluate these graph schemas, we use two previously published challenging tasks: the memory & planning game and one-shot StreetLearn, that are designed to test rapid task solving in novel environments. Graph schemas can be learned in far fewer episodes than previous baselines, and can model and plan in a few steps in novel variations of these tasks. We further demonstrate learning, matching, and reusing graph schemas in navigation tasks in more challenging environments with aliased observations and size variations, and show how different schemas can be composed to model larger 2D and 3D environments.
    A Subspace Projection Approach to Autoencoder-based Anomaly Detection. (arXiv:2302.07643v1 [cs.LG])
    Autoencoder (AE) is a neural network (NN) architecture that is trained to reconstruct an input at its output. By measuring the reconstruction errors of new input samples, AE can detect anomalous samples deviated from the trained data distribution. The key to success is to achieve high-fidelity reconstruction (HFR) while restricting AE's capability of generalization beyond training data, which should be balanced commonly via iterative re-training. Alternatively, we propose a novel framework of AE-based anomaly detection, coined HFR-AE, by projecting new inputs into a subspace wherein the trained AE achieves HFR, thereby increasing the gap between normal and anomalous sample reconstruction errors. Simulation results corroborate that HFR-AE improves the area under receiver operating characteristic curve (AUROC) under different AE architectures and settings by up to 13.4% compared to Vanilla AE-based anomaly detection.
    Photonic reservoir computing enabled by stimulated Brillouin scattering. (arXiv:2302.07698v1 [physics.optics])
    Artificial Intelligence (AI) drives the creation of future technologies that disrupt the way humans live and work, creating new solutions that change the way we approach tasks and activities, but it requires a lot of data processing, large amounts of data transfer, and computing speed. It has led to a growing interest of research in developing a new type of computing platform which is inspired by the architecture of the brain specifically those that exploit the benefits offered by photonic technologies, fast, low-power, and larger bandwidth. Here, a new computing platform based on the photonic reservoir computing architecture exploiting the non-linear wave-optical dynamics of the stimulated Brillouin scattering is reported. The kernel of the new photonic reservoir computing system is constructed of an entirely passive optical system. Moreover, it is readily suited for use in conjunction with high performance optical multiplexing techniques to enable real-time artificial intelligence. Here, a methodology to optimise the operational condition of the new photonic reservoir computing is described which is found to be strongly dependent on the dynamics of the stimulated Brillouin scattering system. The new architecture described here offers a new way of realising AI-hardware which highlight the application of photonics for AI.  ( 2 min )
    XploreNAS: Explore Adversarially Robust & Hardware-efficient Neural Architectures for Non-ideal Xbars. (arXiv:2302.07769v1 [cs.LG])
    Compute In-Memory platforms such as memristive crossbars are gaining focus as they facilitate acceleration of Deep Neural Networks (DNNs) with high area and compute-efficiencies. However, the intrinsic non-idealities associated with the analog nature of computing in crossbars limits the performance of the deployed DNNs. Furthermore, DNNs are shown to be vulnerable to adversarial attacks leading to severe security threats in their large-scale deployment. Thus, finding adversarially robust DNN architectures for non-ideal crossbars is critical to the safe and secure deployment of DNNs on the edge. This work proposes a two-phase algorithm-hardware co-optimization approach called XploreNAS that searches for hardware-efficient & adversarially robust neural architectures for non-ideal crossbar platforms. We use the one-shot Neural Architecture Search (NAS) approach to train a large Supernet with crossbar-awareness and sample adversarially robust Subnets therefrom, maintaining competitive hardware-efficiency. Our experiments on crossbars with benchmark datasets (SVHN, CIFAR10 & CIFAR100) show upto ~8-16% improvement in the adversarial robustness of the searched Subnets against a baseline ResNet-18 model subjected to crossbar-aware adversarial training. We benchmark our robust Subnets for Energy-Delay-Area-Products (EDAPs) using the Neurosim tool and find that with additional hardware-efficiency driven optimizations, the Subnets attain ~1.5-1.6x lower EDAPs than ResNet-18 baseline.
    Extensible Motion-based Identification of XR Users with Non-Specific Motion. (arXiv:2302.07517v1 [cs.HC])
    Recently emerged solutions demonstrate that the movements of users interacting with extended reality (XR) applications carry identifying information and can be leveraged for identification. While such solutions can identify XR users within a few seconds, current systems require one or the other trade-off: either they apply simple distance-based approaches that can only be used for specific predetermined motions. Or they use classification-based approaches that use more powerful machine learning models and thus also work for arbitrary motions, but require full retraining to enroll new users, which can be prohibitively expensive. In this paper, we propose to combine the strengths of both approaches by using an embedding-based approach that leverages deep metric learning. We train the model on a dataset of users playing the VR game "Half-Life: Alyx" and conduct multiple experiments and analyses. The results show that the embedding-based method 1) is able to identify new users from non-specific movements using only a few minutes of reference data, 2) can enroll new users within seconds, while retraining a comparable classification-based approach takes almost a day, 3) is more reliable than a baseline classification-based approach when only little reference data is available, 4) can be used to identify new users from another dataset recorded with different VR devices. Altogether, our solution is a foundation for easily extensible XR user identification systems, applicable even to non-specific movements. It also paves the way for production-ready models that could be used by XR practitioners without the requirements of expertise, hardware, or data for training deep learning models.  ( 2 min )
    Bridging the Usability Gap: Theoretical and Methodological Advances for Spectral Learning of Hidden Markov Models. (arXiv:2302.07437v1 [stat.ML])
    The Baum-Welch (B-W) algorithm is the most widely accepted method for inferring hidden Markov models (HMM). However, it is prone to getting stuck in local optima, and can be too slow for many real-time applications. Spectral learning of HMMs (SHMMs), based on the method of moments (MOM) has been proposed in the literature to overcome these obstacles. Despite its promises, asymptotic theory for SHMM has been elusive, and the long-run performance of SHMM can degrade due to unchecked propogation of error. In this paper, we (1) provide an asymptotic distribution for the approximate error of the likelihood estimated by SHMM, and (2) propose a novel algorithm called projected SHMM (PSHMM) that mitigates the problem of error propogation, and (3) develop online learning variantions of both SHMM and PSHMM that accommodate potential nonstationarity. We compare the performance of SHMM with PSHMM and estimation through the B-W algorithm on both simulated data and data from real world applications, and find that PSHMM not only retains the computational advantages of SHMM, but also provides more robust estimation and forecasting.  ( 2 min )
    Same Same, But Different: Conditional Multi-Task Learning for Demographic-Specific Toxicity Detection. (arXiv:2302.07372v1 [cs.LG])
    Algorithmic bias often arises as a result of differential subgroup validity, in which predictive relationships vary across groups. For example, in toxic language detection, comments targeting different demographic groups can vary markedly across groups. In such settings, trained models can be dominated by the relationships that best fit the majority group, leading to disparate performance. We propose framing toxicity detection as multi-task learning (MTL), allowing a model to specialize on the relationships that are relevant to each demographic group while also leveraging shared properties across groups. With toxicity detection, each task corresponds to identifying toxicity against a particular demographic group. However, traditional MTL requires labels for all tasks to be present for every data point. To address this, we propose Conditional MTL (CondMTL), wherein only training examples relevant to the given demographic group are considered by the loss function. This lets us learn group specific representations in each branch which are not cross contaminated by irrelevant labels. Results on synthetic and real data show that using CondMTL improves predictive recall over various baselines in general and for the minority demographic group in particular, while having similar overall accuracy.  ( 2 min )
    Towards Optimal Compression: Joint Pruning and Quantization. (arXiv:2302.07612v1 [cs.LG])
    Compression of deep neural networks has become a necessary stage for optimizing model inference on resource-constrained hardware. This paper presents FITCompress, a method for unifying layer-wise mixed precision quantization and pruning under a single heuristic, as an alternative to neural architecture search and Bayesian-based techniques. FITCompress combines the Fisher Information Metric, and path planning through compression space, to pick optimal configurations given size and operation constraints with single-shot fine-tuning. Experiments on ImageNet validate the method and show that our approach yields a better trade-off between accuracy and efficiency when compared to the baselines. Besides computer vision benchmarks, we experiment with the BERT model on a language understanding task, paving the way towards its optimal compression.  ( 2 min )
    Convolutional unitary or orthogonal recurrent neural networks. (arXiv:2302.07396v1 [cs.LG])
    Recurrent neural networks are extremely powerful yet hard to train. One of their issues is the vanishing gradient problem, whereby propagation of training signals may be exponentially attenuated, freezing training. Use of orthogonal or unitary matrices, whose powers neither explode nor decay, has been proposed to mitigate this issue, but their computational expense has hindered their use. Here we show that in the specific case of convolutional RNNs, we can define a convolutional exponential and that this operation transforms antisymmetric or anti-Hermitian convolution kernels into orthogonal or unitary convolution kernels. We explicitly derive FFT-based algorithms to compute the kernels and their derivatives. The computational complexity of parametrizing this subspace of orthogonal transformations is thus the same as the networks' iteration.  ( 2 min )
    Bandit Social Learning: Exploration under Myopic Behavior. (arXiv:2302.07425v1 [cs.GT])
    We study social learning dynamics where the agents collectively follow a simple multi-armed bandit protocol. Agents arrive sequentially, choose arms and receive associated rewards. Each agent observes the full history (arms and rewards) of the previous agents, and there are no private signals. While collectively the agents face exploration-exploitation tradeoff, each agent acts myopically, without regards to exploration. Motivating scenarios concern reviews and ratings on online platforms. We allow a wide range of myopic behaviors that are consistent with (parameterized) confidence intervals, including the "unbiased" behavior as well as various behaviorial biases. While extreme versions of these behaviors correspond to well-known bandit algorithms, we prove that more moderate versions lead to stark exploration failures, and consequently to regret rates that are linear in the number of agents. We provide matching upper bounds on regret by analyzing "moderately optimistic" agents. As a special case of independent interest, we obtain a general result on failure of the greedy algorithm in multi-armed bandits. This is the first such result in the literature, to the best of our knowledge  ( 2 min )
    Cliff-Learning. (arXiv:2302.07348v1 [cs.LG])
    We study the data-scaling of transfer learning from foundation models in the low-downstream-data regime. We observe an intriguing phenomenon which we call cliff-learning. Cliff-learning refers to regions of data-scaling laws where performance improves at a faster than power law rate (i.e. regions of concavity on a log-log scaling plot). We conduct an in-depth investigation of foundation-model cliff-learning and study toy models of the phenomenon. We observe that the degree of cliff-learning reflects the degree of compatibility between the priors of a learning algorithm and the task being learned.  ( 2 min )
    Best Arm Identification for Stochastic Rising Bandits. (arXiv:2302.07510v1 [cs.LG])
    Stochastic Rising Bandits is a setting in which the values of the expected rewards of the available options increase every time they are selected. This framework models a wide range of scenarios in which the available options are learning entities whose performance improves over time. In this paper, we focus on the Best Arm Identification (BAI) problem for the stochastic rested rising bandits. In this scenario, we are asked, given a fixed budget of rounds, to provide a recommendation about the best option at the end of the selection process. We propose two algorithms to tackle the above-mentioned setting, namely R-UCBE, which resorts to a UCB-like approach, and R-SR, which employs a successive reject procedure. We show that they provide guarantees on the probability of properly identifying the optimal option at the end of the learning process. Finally, we numerically validate the proposed algorithms in synthetic and realistic environments and compare them with the currently available BAI strategies.  ( 2 min )
    To Risk or Not to Risk: Learning with Risk Quantification for IoT Task Offloading in UAVs. (arXiv:2302.07399v1 [cs.NI])
    A deep reinforcement learning technique is presented for task offloading decision-making algorithms for a multi-access edge computing (MEC) assisted unmanned aerial vehicle (UAV) network in a smart farm Internet of Things (IoT) environment. The task offloading technique uses financial concepts such as cost functions and conditional variable at risk (CVaR) in order to quantify the damage that may be caused by each risky action. The approach was able to quantify potential risks to train the reinforcement learning agent to avoid risky behaviors that will lead to irreversible consequences for the farm. Such consequences include an undetected fire, pest infestation, or a UAV being unusable. The proposed CVaR-based technique was compared to other deep reinforcement learning techniques and two fixed rule-based techniques. The simulation results show that the CVaR-based risk quantifying method eliminated the most dangerous risk, which was exceeding the deadline for a fire detection task. As a result, it reduced the total number of deadline violations with a negligible increase in energy consumption.  ( 2 min )
    Understanding Expertise through Demonstrations: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning. (arXiv:2302.07457v1 [cs.LG])
    Offline inverse reinforcement learning (Offline IRL) aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent. Accurate models of expertise in executing a task has applications in safety-sensitive applications such as clinical decision making and autonomous driving. However, the structure of an expert's preferences implicit in observed actions is closely linked to the expert's model of the environment dynamics (i.e. the ``world''). Thus, inaccurate models of the world obtained from finite data with limited coverage could compound inaccuracy in estimated rewards. To address this issue, we propose a bi-level optimization formulation of the estimation task wherein the upper level is likelihood maximization based upon a conservative model of the expert's policy (lower level). The policy model is conservative in that it maximizes reward subject to a penalty that is increasing in the uncertainty of the estimated model of the world. We propose a new algorithmic framework to solve the bi-level optimization problem formulation and provide statistical and computational guarantees of performance for the associated reward estimator. Finally, we demonstrate that the proposed algorithm outperforms the state-of-the-art offline IRL and imitation learning benchmarks by a large margin, over the continuous control tasks in MuJoCo and different datasets in the D4RL benchmark.  ( 2 min )
    Quantum Learning Theory Beyond Batch Binary Classification. (arXiv:2302.07409v1 [cs.LG])
    Arunachalam and De Wolf (2018) showed that the sample complexity of quantum batch learning of boolean functions, in the realizable and agnostic settings, has the same form and order as the corresponding classical sample complexities. In this paper, we extend this, ostensibly surprising, message to batch multiclass learning, online boolean learning, and online multiclass learning. For our online learning results, we first consider an adaptive adversary variant of the classical model of Dawid and Tewari (2022). Then, we introduce the first (to the best of our knowledge) model of online learning with quantum examples.  ( 2 min )
    Score-based Diffusion Models in Function Space. (arXiv:2302.07400v1 [cs.LG])
    Diffusion models have recently emerged as a powerful framework for generative modeling. They consist of a forward process that perturbs input data with Gaussian white noise and a reverse process that learns a score function to generate samples by denoising. Despite their tremendous success, they are mostly formulated on finite-dimensional spaces, e.g. Euclidean, limiting their applications to many domains where the data has a functional form such as in scientific computing and 3D geometric data analysis. In this work, we introduce a mathematically rigorous framework called Denoising Diffusion Operators (DDOs) for training diffusion models in function space. In DDOs, the forward process perturbs input functions gradually using a Gaussian process. The generative process is formulated by integrating a function-valued Langevin dynamic. Our approach requires an appropriate notion of the score for the perturbed data distribution, which we obtain by generalizing denoising score matching to function spaces that can be infinite-dimensional. We show that the corresponding discretized algorithm generates accurate samples at a fixed cost that is independent of the data resolution. We theoretically and numerically verify the applicability of our approach on a set of problems, including generating solutions to the Navier-Stokes equation viewed as the push-forward distribution of forcings from a Gaussian Random Field (GRF).  ( 2 min )
    A dataset for Audio-Visual Sound Event Detection in Movies. (arXiv:2302.07315v1 [eess.AS])
    Audio event detection is a widely studied audio processing task, with applications ranging from self-driving cars to healthcare. In-the-wild datasets such as Audioset have propelled research in this field. However, many efforts typically involve manual annotation and verification, which is expensive to perform at scale. Movies depict various real-life and fictional scenarios which makes them a rich resource for mining a wide-range of audio events. In this work, we present a dataset of audio events called Subtitle-Aligned Movie Sounds (SAM-S). We use publicly-available closed-caption transcripts to automatically mine over 110K audio events from 430 movies. We identify three dimensions to categorize audio events: sound, source, quality, and present the steps involved to produce a final taxonomy of 245 sounds. We discuss the choices involved in generating the taxonomy, and also highlight the human-centered nature of sounds in our dataset. We establish a baseline performance for audio-only sound classification of 34.76% mean average precision and show that incorporating visual information can further improve the performance by about 5%. Data and code are made available for research at https://github.com/usc-sail/mica-subtitle-aligned-movie-sounds  ( 2 min )
    Unsupervised physics-informed neural network in reaction-diffusion biology models (Ulcerative colitis and Crohn's disease cases) A preliminary study. (arXiv:2302.07405v1 [cs.LG])
    We propose to explore the potential of physics-informed neural networks (PINNs) in solving a class of partial differential equations (PDEs) used to model the propagation of chronic inflammatory bowel diseases, such as Crohn's disease and ulcerative colitis. An unsupervised approach was privileged during the deep neural network training. Given the complexity of the underlying biological system, characterized by intricate feedback loops and limited availability of high-quality data, the aim of this study is to explore the potential of PINNs in solving PDEs. In addition to providing this exploratory assessment, we also aim to emphasize the principles of reproducibility and transparency in our approach, with a specific focus on ensuring the robustness and generalizability through the use of artificial intelligence. We will quantify the relevance of the PINN method with several linear and non-linear PDEs in relation to biology. However, it is important to note that the final solution is dependent on the initial conditions, chosen boundary conditions, and neural network architectures.  ( 2 min )
    A Provably Improved Algorithm for Crowdsourcing with Hard and Easy Tasks. (arXiv:2302.07393v1 [cs.LG])
    Crowdsourcing is a popular method used to estimate ground-truth labels by collecting noisy labels from workers. In this work, we are motivated by crowdsourcing applications where each worker can exhibit two levels of accuracy depending on a task's type. Applying algorithms designed for the traditional Dawid-Skene model to such a scenario results in performance which is limited by the hard tasks. Therefore, we first extend the model to allow worker accuracy to vary depending on a task's unknown type. Then we propose a spectral method to partition tasks by type. After separating tasks by type, any Dawid-Skene algorithm (i.e., any algorithm designed for the Dawid-Skene model) can be applied independently to each type to infer the truth values. We theoretically prove that when crowdsourced data contain tasks with varying levels of difficulty, our algorithm infers the true labels with higher accuracy than any Dawid-Skene algorithm. Experiments show that our method is effective in practical applications.  ( 2 min )
    Linearized Wasserstein dimensionality reduction with approximation guarantees. (arXiv:2302.07373v1 [cs.LG])
    We introduce LOT Wassmap, a computationally feasible algorithm to uncover low-dimensional structures in the Wasserstein space. The algorithm is motivated by the observation that many datasets are naturally interpreted as probability measures rather than points in $\mathbb{R}^n$, and that finding low-dimensional descriptions of such datasets requires manifold learning algorithms in the Wasserstein space. Most available algorithms are based on computing the pairwise Wasserstein distance matrix, which can be computationally challenging for large datasets in high dimensions. Our algorithm leverages approximation schemes such as Sinkhorn distances and linearized optimal transport to speed-up computations, and in particular, avoids computing a pairwise distance matrix. We provide guarantees on the embedding quality under such approximations, including when explicit descriptions of the probability measures are not available and one must deal with finite samples instead. Experiments demonstrate that LOT Wassmap attains correct embeddings and that the quality improves with increased sample size. We also show how LOT Wassmap significantly reduces the computational cost when compared to algorithms that depend on pairwise distance computations.  ( 2 min )
    FedABC: Targeting Fair Competition in Personalized Federated Learning. (arXiv:2302.07450v1 [cs.LG])
    Federated learning aims to collaboratively train models without accessing their client's local private data. The data may be Non-IID for different clients and thus resulting in poor performance. Recently, personalized federated learning (PFL) has achieved great success in handling Non-IID data by enforcing regularization in local optimization or improving the model aggregation scheme on the server. However, most of the PFL approaches do not take into account the unfair competition issue caused by the imbalanced data distribution and lack of positive samples for some classes in each client. To address this issue, we propose a novel and generic PFL framework termed Federated Averaging via Binary Classification, dubbed FedABC. In particular, we adopt the ``one-vs-all'' training strategy in each client to alleviate the unfair competition between classes by constructing a personalized binary classification problem for each class. This may aggravate the class imbalance challenge and thus a novel personalized binary classification loss that incorporates both the under-sampling and hard sample mining strategies is designed. Extensive experiments are conducted on two popular datasets under different settings, and the results demonstrate that our FedABC can significantly outperform the existing counterparts.  ( 2 min )
    On Classification-Calibration of Gamma-Phi Losses. (arXiv:2302.07321v1 [stat.ML])
    Gamma-Phi losses constitute a family of multiclass classification loss functions that generalize the logistic and other common losses, and have found application in the boosting literature. We establish the first general sufficient condition for the classification-calibration of such losses. In addition, we show that a previously proposed sufficient condition is in fact not sufficient.  ( 2 min )
    FedLE: Federated Learning Client Selection with Lifespan Extension for Edge IoT Networks. (arXiv:2302.07305v1 [cs.LG])
    Federated learning (FL) is a distributed and privacy-preserving learning framework for predictive modeling with massive data generated at the edge by Internet of Things (IoT) devices. One major challenge preventing the wide adoption of FL in IoT is the pervasive power supply constraints of IoT devices due to the intensive energy consumption of battery-powered clients for local training and model updates. Low battery levels of clients eventually lead to their early dropouts from edge networks, loss of training data jeopardizing the performance of FL, and their availability to perform other designated tasks. In this paper, we propose FedLE, an energy-efficient client selection framework that enables lifespan extension of edge IoT networks. In FedLE, the clients first run for a minimum epoch to generate their local model update. The models are partially uploaded to the server for calculating similarities between each pair of clients. Clustering is performed against these client pairs to identify those with similar model distributions. In each round, low-powered clients have a lower probability of being selected, delaying the draining of their batteries. Empirical studies show that FedLE outperforms baselines on benchmark datasets and lasts more training rounds than FedAvg with battery power constraints.  ( 2 min )
    Derandomized Novelty Detection with FDR Control via Conformal E-values. (arXiv:2302.07294v1 [cs.LG])
    Conformal prediction and other randomized model-free inference techniques are gaining increasing attention as general solutions to rigorously calibrate the output of any machine learning algorithm for novelty detection. This paper contributes to the field by developing a novel method for mitigating their algorithmic randomness, leading to an even more interpretable and reliable framework for powerful novelty detection under false discovery rate control. The idea is to leverage suitable conformal e-values instead of p-values to quantify the significance of each finding, which allows the evidence gathered from multiple mutually dependent analyses of the same data to be seamlessly aggregated. Further, the proposed method can reduce randomness without much loss of power, partly thanks to an innovative way of weighting conformal e-values based on additional side information carefully extracted from the same data. Simulations with synthetic and real data confirm this solution can be effective at eliminating random noise in the inferences obtained with state-of-the-art alternative techniques, sometimes also leading to higher power.  ( 2 min )
    Algorithm Selection for Deep Active Learning with Imbalanced Datasets. (arXiv:2302.07317v1 [cs.LG])
    Label efficiency has become an increasingly important objective in deep learning applications. Active learning aims to reduce the number of labeled examples needed to train deep networks, but the empirical performance of active learning algorithms can vary dramatically across datasets and applications. It is difficult to know in advance which active learning strategy will perform well or best in a given application. To address this, we propose the first adaptive algorithm selection strategy for deep active learning. For any unlabeled dataset, our (meta) algorithm TAILOR (Thompson ActIve Learning algORithm selection) iteratively and adaptively chooses among a set of candidate active learning algorithms. TAILOR uses novel reward functions aimed at gathering class-balanced examples. Extensive experiments in multi-class and multi-label applications demonstrate TAILOR's effectiveness in achieving accuracy comparable or better than that of the best of the candidate algorithms.  ( 2 min )
  • Open

    Scalable Batch Acquisition for Deep Bayesian Active Learning. (arXiv:2301.05490v2 [cs.LG] UPDATED)
    In deep active learning, it is especially important to choose multiple examples to markup at each step to work efficiently, especially on large datasets. At the same time, existing solutions to this problem in the Bayesian setup, such as BatchBALD, have significant limitations in selecting a large number of examples, associated with the exponential complexity of computing mutual information for joint random variables. We, therefore, present the Large BatchBALD algorithm, which gives a well-grounded approximation to the BatchBALD method that aims to achieve comparable quality while being more computationally efficient. We provide a complexity analysis of the algorithm, showing a reduction in computation time, especially for large batches. Furthermore, we present an extensive set of experimental results on image and text data, both on toy datasets and larger ones such as CIFAR-100.
    On Finite-Step Convergence of the Non-Greedy Algorithm for $L_1$-Norm PCA and Beyond. (arXiv:2302.07712v1 [math.OC])
    The non-greedy algorithm for $L_1$-norm PCA proposed in \cite{nie2011robust} is revisited and its convergence properties are studied. The algorithm is first interpreted as a conditional subgradient or an alternating maximization method. By treating it as a conditional subgradient, the iterative points generated by the algorithm will not change in finitely many steps under a certain full-rank assumption; such an assumption can be removed when the projection dimension is one. By treating the algorithm as an alternating maximization, it is proved that the objective value will not change after at most $\left\lceil \frac{F^{\max}}{\tau_0} \right\rceil$ steps. The stopping point satisfies certain optimality conditions. Then, a variant algorithm with improved convergence properties is studied. The iterative points generated by the algorithm will not change after at most $\left\lceil \frac{2F^{\max}}{\tau} \right\rceil$ steps and the stopping point also satisfies certain optimality conditions given a small enough $\tau$. Similar finite-step convergence is also established for a slight modification of the PAMe proposed in \cite{wang2021linear} very recently under a certain full-rank assumption. Such an assumption can also be removed when the projection dimension is one.  ( 2 min )
    Doubly-Optimistic Play for Safe Linear Bandits. (arXiv:2209.13694v2 [cs.LG] UPDATED)
    The safe linear bandit problem (SLB) is an online approach to linear programming with unknown objective and unknown round-wise constraints, under stochastic bandit feedback of rewards and safety risks of actions. We study aggressive \emph{doubly-optimistic play} in SLBs, and their role in avoiding the strong assumptions and poor efficacy associated with extant pessimistic-optimistic solutions. We first elucidate an inherent hardness in SLBs due the lack of knowledge of constraints: there exist `easy' instances, for which suboptimal extreme points have large `gaps', but on which SLB methods must still incur $\Omega(\sqrt{T})$ regret and safety violations due to an inability to refine the location of optimal actions to arbitrary precision. In a positive direction, we propose and analyse a doubly-optimistic confidence-bound based strategy for the safe linear bandit problem, DOSLB, which exploits supreme optimism by using optimistic estimates of both reward and safety risks to select actions. Using a novel dual analysis, we show that despite the lack of knowledge of constraints, DOSLB rarely takes overly risky actions, and obtains tight instance-dependent $O(\log^2 T)$ bounds on both efficacy regret and net safety violations up to any finite precision, thus yielding large efficacy gains at a small safety cost and without strong assumptions. Concretely, we argue that algorithm activates noisy versions of an `optimal' set of constraints at each round, and activation of suboptimal sets of constraints is limited by the larger of a safety and efficacy gap we define.  ( 2 min )
    Multiclass Learnability Beyond the PAC Framework: Universal Rates and Partial Concept Classes. (arXiv:2210.02297v3 [cs.LG] UPDATED)
    In this paper we study the problem of multiclass classification with a bounded number of different labels $k$, in the realizable setting. We extend the traditional PAC model to a) distribution-dependent learning rates, and b) learning rates under data-dependent assumptions. First, we consider the universal learning setting (Bousquet, Hanneke, Moran, van Handel and Yehudayoff, STOC '21), for which we provide a complete characterization of the achievable learning rates that holds for every fixed distribution. In particular, we show the following trichotomy: for any concept class, the optimal learning rate is either exponential, linear or arbitrarily slow. Additionally, we provide complexity measures of the underlying hypothesis class that characterize when these rates occur. Second, we consider the problem of multiclass classification with structured data (such as data lying on a low dimensional manifold or satisfying margin conditions), a setting which is captured by partial concept classes (Alon, Hanneke, Holzman and Moran, FOCS '21). Partial concepts are functions that can be undefined in certain parts of the input space. We extend the traditional PAC learnability of total concept classes to partial concept classes in the multiclass setting and investigate differences between partial and total concepts.  ( 2 min )
    Function-space regularized R\'enyi divergences. (arXiv:2210.04974v2 [stat.ML] UPDATED)
    We propose a new family of regularized R\'enyi divergences parametrized not only by the order $\alpha$ but also by a variational function space. These new objects are defined by taking the infimal convolution of the standard R\'enyi divergence with the integral probability metric (IPM) associated with the chosen function space. We derive a novel dual variational representation that can be used to construct numerically tractable divergence estimators. This representation avoids risk-sensitive terms and therefore exhibits lower variance, making it well-behaved when $\alpha>1$; this addresses a notable weakness of prior approaches. We prove several properties of these new divergences, showing that they interpolate between the classical R\'enyi divergences and IPMs. We also study the $\alpha\to\infty$ limit, which leads to a regularized worst-case-regret and a new variational representation in the classical case. Moreover, we show that the proposed regularized R\'enyi divergences inherit features from IPMs such as the ability to compare distributions that are not absolutely continuous, e.g., empirical measures and distributions with low-dimensional support. We present numerical results on both synthetic and real datasets, showing the utility of these new divergences in both estimation and GAN training applications; in particular, we demonstrate significantly reduced variance and improved training performance.  ( 2 min )
    Spatially heterogeneous learning by a deep student machine. (arXiv:2302.07419v1 [cond-mat.dis-nn])
    Despite the spectacular successes, deep neural networks (DNN) with a huge number of adjustable parameters remain largely black boxes. To shed light on the hidden layers of DNN, we study supervised learning by a DNN of width $N$ and depth $L$ consisting of perceptrons with $c$ inputs by a statistical mechanics approach called the teacher-student setting. We consider an ensemble of student machines that exactly reproduce $M$ sets of $N$ dimensional input/output relations provided by a teacher machine. We analyze the ensemble theoretically using a replica method (H. Yoshino (2020)) and numerically performing greedy Monte Carlo simulations. The replica theory which works on high dimensional data $N \gg 1$ becomes exact in 'dense limit' $N \gg c \gg 1$ and $M \gg 1$ with fixed $\alpha=M/c$. Both the theory and the simulation suggest learning by the DNN is quite heterogeneous in the network space: configurations of the machines are more correlated within the layers closer to the input/output boundaries while the central region remains much less correlated due to over-parametrization. Deep enough systems relax faster thanks to the less correlated central region. Remarkably both the theory and simulation suggest generalization-ability of the student machines does not vanish even in the deep limit $L \gg 1$ where the system becomes strongly over-parametrized. We also consider the impact of effective dimension $D(\leq N)$ of data by incorporating the hidden manifold model (S. Goldt et al (2020)) into our model. The replica theory implies that the loop corrections to the dense limit, which reflect correlations between different nodes in the network, become enhanced by either decreasing the width $\ N$ or decreasing the effective dimension $D$ of the data. Simulation suggests both leads to significant improvements in generalization-ability.  ( 2 min )
    Adapting to game trees in zero-sum imperfect information games. (arXiv:2212.12567v2 [stat.ML] UPDATED)
    Imperfect information games (IIG) are games in which each player only partially observes the current game state. We study how to learn $\epsilon$-optimal strategies in a zero-sum IIG through self-play with trajectory feedback. We give a problem-independent lower bound $\widetilde{\mathcal{O}}(H(A_{\mathcal{X}}+B_{\mathcal{Y}})/\epsilon^2)$ on the required number of realizations to learn these strategies with high probability, where $H$ is the length of the game, $A_{\mathcal{X}}$ and $B_{\mathcal{Y}}$ are the total number of actions for the two players. We also propose two Follow the Regularized leader (FTRL) algorithms for this setting: Balanced FTRL which matches this lower bound, but requires the knowledge of the information set structure beforehand to define the regularization; and Adaptive FTRL which needs $\widetilde{\mathcal{O}}(H^2(A_{\mathcal{X}}+B_{\mathcal{Y}})/\epsilon^2)$ realizations without this requirement by progressively adapting the regularization to the observations.  ( 2 min )
    Model-based Clustering with Missing Not At Random Data. (arXiv:2112.10425v3 [stat.ML] UPDATED)
    Model-based unsupervised learning, as any learning task, stalls as soon asmissing data occurs. This is even more true when the missing data are infor-mative, or said missing not at random (MNAR). In this paper, we proposemodel-based clustering algorithms designed to handle very general typesof missing data, including MNAR data. To do so, we introduce a mixturemodel for different types of data (continuous, count, categorical and mixed)to jointly model the data distribution and the MNAR mechanism, remainingvigilant to the degrees of freedom of each. Eight different MNAR modelswhich depend on the class membership and/or on the values of the missingvariables themselves are proposed. For a particular type of MNAR mod-els, for which the missingness depends on the class membership, we showthat the statistical inference can be carried out on the data matrix concate-nated with the missing mask considering a MAR mechanism instead; thisspecifically underlines the versatility of the studied MNAR models. Then,we establish sufficient conditions for identifiability of parameters of both thedata distribution and the mechanism. Regardless of the type of data and themechanism, we propose to perform clustering using EM or stochastic EMalgorithms specially developed for the purpose. Finally, we assess the nu-merical performances of the proposed methods on synthetic data and on thereal medical registry TraumaBase as well.  ( 2 min )
    Bridging the Usability Gap: Theoretical and Methodological Advances for Spectral Learning of Hidden Markov Models. (arXiv:2302.07437v1 [stat.ML])
    The Baum-Welch (B-W) algorithm is the most widely accepted method for inferring hidden Markov models (HMM). However, it is prone to getting stuck in local optima, and can be too slow for many real-time applications. Spectral learning of HMMs (SHMMs), based on the method of moments (MOM) has been proposed in the literature to overcome these obstacles. Despite its promises, asymptotic theory for SHMM has been elusive, and the long-run performance of SHMM can degrade due to unchecked propogation of error. In this paper, we (1) provide an asymptotic distribution for the approximate error of the likelihood estimated by SHMM, and (2) propose a novel algorithm called projected SHMM (PSHMM) that mitigates the problem of error propogation, and (3) develop online learning variantions of both SHMM and PSHMM that accommodate potential nonstationarity. We compare the performance of SHMM with PSHMM and estimation through the B-W algorithm on both simulated data and data from real world applications, and find that PSHMM not only retains the computational advantages of SHMM, but also provides more robust estimation and forecasting.  ( 2 min )
    Adversarially Robust Learning with Tolerance. (arXiv:2203.00849v2 [stat.ML] UPDATED)
    We initiate the study of tolerant adversarial PAC-learning with respect to metric perturbation sets. In adversarial PAC-learning, an adversary is allowed to replace a test point $x$ with an arbitrary point in a closed ball of radius $r$ centered at $x$. In the tolerant version, the error of the learner is compared with the best achievable error with respect to a slightly larger perturbation radius $(1+\gamma)r$. This simple tweak helps us bridge the gap between theory and practice and obtain the first PAC-type guarantees for algorithmic techniques that are popular in practice. Our first result concerns the widely-used ``perturb-and-smooth'' approach for adversarial learning. For perturbation sets with doubling dimension $d$, we show that a variant of these approaches PAC-learns any hypothesis class $\mathcal{H}$ with VC-dimension $v$ in the $\gamma$-tolerant adversarial setting with $O\left(\frac{v(1+1/\gamma)^{O(d)}}{\varepsilon}\right)$ samples. This is in contrast to the traditional (non-tolerant) setting in which, as we show, the perturb-and-smooth approach can provably fail. Our second result shows that one can PAC-learn the same class using $\widetilde{O}\left(\frac{d.v\log(1+1/\gamma)}{\varepsilon^2}\right)$ samples even in the agnostic setting. This result is based on a novel compression-based algorithm, and achieves a linear dependence on the doubling dimension as well as the VC-dimension. This is in contrast to the non-tolerant setting where there is no known sample complexity upper bound that depend polynomially on the VC-dimension.  ( 2 min )
    Feature Learning for Nonlinear Dimensionality Reduction toward Maximal Extraction of Hidden Patterns. (arXiv:2206.13891v3 [cs.LG] UPDATED)
    Dimensionality reduction (DR) plays a vital role in the visual analysis of high-dimensional data. One main aim of DR is to reveal hidden patterns that lie on intrinsic low-dimensional manifolds. However, DR often overlooks important patterns when the manifolds are distorted or masked by certain influential data attributes. This paper presents a feature learning framework, FEALM, designed to generate a set of optimized data projections for nonlinear DR in order to capture important patterns in the hidden manifolds. These projections produce maximally different nearest-neighbor graphs so that resultant DR outcomes are significantly different. To achieve such a capability, we design an optimization algorithm as well as introduce a new graph dissimilarity measure, named neighbor-shape dissimilarity. Additionally, we develop interactive visualizations to assist comparison of obtained DR results and interpretation of each DR result. We demonstrate FEALM's effectiveness through experiments and case studies using synthetic and real-world datasets.  ( 2 min )
    Similarity, Compression and Local Steps: Three Pillars of Efficient Communications for Distributed Variational Inequalities. (arXiv:2302.07615v1 [math.OC])
    Variational inequalities are a broad and flexible class of problems that includes minimization, saddle point, fixed point problems as special cases. Therefore, variational inequalities are used in a variety of applications ranging from equilibrium search to adversarial learning. Today's realities with the increasing size of data and models demand parallel and distributed computing for real-world machine learning problems, most of which can be represented as variational inequalities. Meanwhile, most distributed approaches has a significant bottleneck - the cost of communications. The three main techniques to reduce both the total number of communication rounds and the cost of one such round are the use of similarity of local functions, compression of transmitted information and local updates. In this paper, we combine all these approaches. Such a triple synergy did not exist before for variational inequalities and saddle problems, nor even for minimization problems. The methods presented in this paper have the best theoretical guarantees of communication complexity and are significantly ahead of other methods for distributed variational inequalities. The theoretical results are confirmed by adversarial learning experiments on synthetic and real datasets.  ( 2 min )
    Optimal Sample Complexity of Reinforcement Learning for Uniformly Ergodic Discounted Markov Decision Processes. (arXiv:2302.07477v1 [cs.LG])
    We consider the optimal sample complexity theory of tabular reinforcement learning (RL) for controlling the infinite horizon discounted reward in a Markov decision process (MDP). Optimal min-max complexity results have been developed for tabular RL in this setting, leading to a sample complexity dependence on $\gamma$ and $\epsilon$ of the form $\tilde \Theta((1-\gamma)^{-3}\epsilon^{-2})$, where $\gamma$ is the discount factor and $\epsilon$ is the tolerance solution error. However, in many applications of interest, the optimal policy (or all policies) will induce mixing. We show that in these settings the optimal min-max complexity is $\tilde \Theta(t_{\text{minorize}}(1-\gamma)^{-2}\epsilon^{-2})$, where $t_{\text{minorize}}$ is a measure of mixing that is within an equivalent factor of the total variation mixing time. Our analysis is based on regeneration-type ideas, that, we believe are of independent interest since they can be used to study related problems for general state space MDPs.  ( 2 min )
    Continuous PDE Dynamics Forecasting with Implicit Neural Representations. (arXiv:2209.14855v2 [cs.LG] UPDATED)
    Effective data-driven PDE forecasting methods often rely on fixed spatial and / or temporal discretizations. This raises limitations in real-world applications like weather prediction where flexible extrapolation at arbitrary spatiotemporal locations is required. We address this problem by introducing a new data-driven approach, DINo, that models a PDE's flow with continuous-time dynamics of spatially continuous functions. This is achieved by embedding spatial observations independently of their discretization via Implicit Neural Representations in a small latent space temporally driven by a learned ODE. This separate and flexible treatment of time and space makes DINo the first data-driven model to combine the following advantages. It extrapolates at arbitrary spatial and temporal locations; it can learn from sparse irregular grids or manifolds; at test time, it generalizes to new grids or resolutions. DINo outperforms alternative neural PDE forecasters in a variety of challenging generalization scenarios on representative PDE systems.  ( 2 min )
    Online Statistical Inference for Nonlinear Stochastic Approximation with Markovian Data. (arXiv:2302.07690v1 [math.ST])
    We study the statistical inference of nonlinear stochastic approximation algorithms utilizing a single trajectory of Markovian data. Our methodology has practical applications in various scenarios, such as Stochastic Gradient Descent (SGD) on autoregressive data and asynchronous Q-Learning. By utilizing the standard stochastic approximation (SA) framework to estimate the target parameter, we establish a functional central limit theorem for its partial-sum process, $\boldsymbol{\phi}_T$. To further support this theory, we provide a matching semiparametric efficient lower bound and a non-asymptotic upper bound on its weak convergence, measured in the L\'evy-Prokhorov metric. This functional central limit theorem forms the basis for our inference method. By selecting any continuous scale-invariant functional $f$, the asymptotic pivotal statistic $f(\boldsymbol{\phi}_T)$ becomes accessible, allowing us to construct an asymptotically valid confidence interval. We analyze the rejection probability of a family of functionals $f_m$, indexed by $m \in \mathbb{N}$, through theoretical and numerical means. The simulation results demonstrate the validity and efficiency of our method.  ( 2 min )
    Are labels informative in semi-supervised learning? -- Estimating and leveraging the missing-data mechanism. (arXiv:2302.07540v1 [stat.ML])
    Semi-supervised learning is a powerful technique for leveraging unlabeled data to improve machine learning models, but it can be affected by the presence of ``informative'' labels, which occur when some classes are more likely to be labeled than others. In the missing data literature, such labels are called missing not at random. In this paper, we propose a novel approach to address this issue by estimating the missing-data mechanism and using inverse propensity weighting to debias any SSL algorithm, including those using data augmentation. We also propose a likelihood ratio test to assess whether or not labels are indeed informative. Finally, we demonstrate the performance of the proposed methods on different datasets, in particular on two medical datasets for which we design pseudo-realistic missing data scenarios.  ( 2 min )
    PAC-Bayesian Learning of Optimization Algorithms. (arXiv:2210.11113v2 [cs.LG] UPDATED)
    We apply the PAC-Bayes theory to the setting of learning-to-optimize. To the best of our knowledge, we present the first framework to learn optimization algorithms with provable generalization guarantees (PAC-bounds) and explicit trade-off between a high probability of convergence and a high convergence speed. Even in the limit case, where convergence is guaranteed, our learned optimization algorithms provably outperform related algorithms based on a (deterministic) worst-case analysis. Our results rely on PAC-Bayes bounds for general, unbounded loss-functions based on exponential families. By generalizing existing ideas, we reformulate the learning procedure into a one-dimensional minimization problem and study the possibility to find a global minimum, which enables the algorithmic realization of the learning procedure. As a proof-of-concept, we learn hyperparameters of standard optimization algorithms to empirically underline our theory.  ( 2 min )
    Contraction of Locally Differentially Private Mechanisms. (arXiv:2210.13386v2 [cs.IT] UPDATED)
    We investigate the contraction properties of locally differentially private mechanisms. More specifically, we derive tight upper bounds on the divergence between $P\mathsf{K}$ and $Q\mathsf{K}$ output distributions of an $\varepsilon$-LDP mechanism $\mathsf{K}$ in terms of a divergence between the corresponding input distributions $P$ and $Q$, respectively. Our first main technical result presents a sharp upper bound on the $\chi^2$-divergence $\chi^2(P\mathsf{K}\|Q\mathsf{K})$ in terms of $\chi^2(P\|Q)$ and $\varepsilon$. We also show that the same result holds for a large family of divergences, including KL-divergence and squared Hellinger distance. The second main technical result gives an upper bound on $\chi^2(P\mathsf{K}\|Q\mathsf{K})$ in terms of total variation distance $\mathsf{TV}(P, Q)$ and $\varepsilon$. We then utilize these bounds to establish locally private versions of the van Trees inequality, Le Cam's, Assouad's, and the mutual information methods, which are powerful tools for bounding minimax estimation risks. These results are shown to lead to better privacy analyses than the state-of-the-arts in several statistical problems such as entropy and discrete distribution estimation, non-parametric density estimation, and hypothesis testing.  ( 2 min )
    A model-free feature selection technique of feature screening and random forest based recursive feature elimination. (arXiv:2302.07449v1 [stat.ME])
    In this paper, we propose a model-free feature selection method for ultra-high dimensional data with mass features. This is a two phases procedure that we propose to use the fused Kolmogorov filter with the random forest based RFE to remove model limitations and reduce the computational complexity. The method is fully nonparametric and can work with various types of datasets. It has several appealing characteristics, i.e., accuracy, model-free, and computational efficiency, and can be widely used in practical problems, such as multiclass classification, nonparametric regression, and Poisson regression, among others. We show that the proposed method is selection consistent and $L_2$ consistent under weak regularity conditions. We further demonstrate the superior performance of the proposed method over other existing methods by simulations and real data examples.  ( 2 min )
    Bayesian Robust Tensor Ring Model for Incomplete Multiway Data. (arXiv:2202.13321v2 [cs.LG] UPDATED)
    Robust tensor completion (RTC) aims to recover a low-rank tensor from its incomplete observation with outlier corruption. The recently proposed tensor ring (TR) model has demonstrated superiority in solving the RTC problem. However, the existing methods either require a pre-assigned TR rank or aggressively pursue the minimum TR rank, thereby often leading to biased solutions in the presence of noise. In this paper, a Bayesian robust tensor ring decomposition (BRTR) method is proposed to give more accurate solutions to the RTC problem, which can avoid exquisite selection of the TR rank and penalty parameters. A variational Bayesian (VB) algorithm is developed to infer the probability distribution of posteriors. During the learning process, BRTR can prune off slices of core tensor with marginal components, resulting in automatic TR rank detection. Extensive experiments show that BRTR can achieve significantly improved performance than other state-of-the-art methods.  ( 2 min )
    Improved Online Conformal Prediction via Strongly Adaptive Online Learning. (arXiv:2302.07869v1 [cs.LG])
    We study the problem of uncertainty quantification via prediction sets, in an online setting where the data distribution may vary arbitrarily over time. Recent work develops online conformal prediction techniques that leverage regret minimization algorithms from the online learning literature to learn prediction sets with approximately valid coverage and small regret. However, standard regret minimization could be insufficient for handling changing environments, where performance guarantees may be desired not only over the full time horizon but also in all (sub-)intervals of time. We develop new online conformal prediction methods that minimize the strongly adaptive regret, which measures the worst-case regret over all intervals of a fixed length. We prove that our methods achieve near-optimal strongly adaptive regret for all interval lengths simultaneously, and approximately valid coverage. Experiments show that our methods consistently obtain better coverage and smaller prediction sets than existing methods on real-world tasks, such as time series forecasting and image classification under distribution shift.  ( 2 min )
    On Variance Estimation of Random Forests with Infinite-Order U-statistics. (arXiv:2202.09008v4 [stat.ML] UPDATED)
    Infinite-order U-statistics (IOUS) has been used extensively on subbagging ensemble learning algorithms such as random forests to quantify its uncertainty. While normality results of IOUS have been studied extensively, its variance estimation approaches and theoretical properties remain mostly unexplored. Existing approaches mainly utilize the leading term dominance property in the Hoeffding decomposition. However, such a view usually leads to biased estimation when the kernel size is large or the sample size is small. On the other hand, while several unbiased estimators exist in the literature, their relationships and theoretical properties, especially the ratio consistency, have never been studied. These limitations lead to unguaranteed performances of constructed confidence intervals. To bridge these gaps in the literature, we propose a new view of the Hoeffding decomposition for variance estimation that leads to an unbiased estimator. Instead of leading term dominance, our view utilizes the dominance of the peak region. Moreover, we establish the connection and equivalence of our estimator with several existing unbiased variance estimators. Theoretically, we are the first to establish the ratio consistency of such a variance estimator, which justifies the coverage rate of confidence intervals constructed from random forests. Numerically, we further propose a local smoothing procedure to improve the estimator's finite sample performance. Extensive simulation studies show that our estimators enjoy lower bias and archive targeted coverage rates.  ( 2 min )
    Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning. (arXiv:2012.09816v3 [cs.LG] UPDATED)
    We formally study how ensemble of deep learning models can improve test accuracy, and how the superior performance of ensemble can be distilled into a single model using knowledge distillation. We consider the challenging case where the ensemble is simply an average of the outputs of a few independently trained neural networks with the SAME architecture, trained using the SAME algorithm on the SAME data set, and they only differ by the random seeds used in the initialization. We show that ensemble/knowledge distillation in Deep Learning works very differently from traditional learning theory (such as boosting or NTKs, neural tangent kernels). To properly understand them, we develop a theory showing that when data has a structure we refer to as ``multi-view'', then ensemble of independently trained neural networks can provably improve test accuracy, and such superior test accuracy can also be provably distilled into a single model by training a single model to match the output of the ensemble instead of the true label. Our result sheds light on how ensemble works in deep learning in a way that is completely different from traditional theorems, and how the ``dark knowledge'' is hidden in the outputs of the ensemble and can be used in distillation. In the end, we prove that self-distillation can also be viewed as implicitly combining ensemble and knowledge distillation to improve test accuracy.  ( 2 min )
    MCAL: Minimum Cost Human-Machine Active Labeling. (arXiv:2006.13999v2 [cs.LG] UPDATED)
    Today, groundtruth generation relies on datasets annotated by cloud-based annotation services. These rely on human annotation, which can be prohibitively expensive. In this paper, we consider the problem of hybrid human-machine labeling, which trains a classifier to accurately auto-label part of the data set. However, training the classifier can be expensive too. We propose an iterative approach that minimizes total overall cost by, at each step, jointly determining which samples to label using humans and which to label using the trained classifier. We validate our approach on well known public data sets such as Fashion-MNIST, CIFAR-10, CIFAR-100, and ImageNet. In some cases, our approach has 6x lower overall cost relative to human labeling the entire dataset, and is always cheaper than the cheapest competing strategy.  ( 2 min )
    Excess risk bound for deep learning under weak dependence. (arXiv:2302.07503v1 [stat.ML])
    This paper considers deep neural networks for learning weakly dependent processes in a general framework that includes, for instance, regression estimation, time series prediction, time series classification. The $\psi$-weak dependence structure considered is quite large and covers other conditions such as mixing, association,$\ldots$ Firstly, the approximation of smooth functions by deep neural networks with a broad class of activation functions is considered. We derive the required depth, width and sparsity of a deep neural network to approximate any H\"{o}lder smooth function, defined on any compact set $\mx$. Secondly, we establish a bound of the excess risk for the learning of weakly dependent observations by deep neural networks. When the target function is sufficiently smooth, this bound is close to the usual $\mathcal{O}(n^{-1/2})$.  ( 2 min )
    The Geometry of Neural Nets' Parameter Spaces Under Reparametrization. (arXiv:2302.07384v1 [cs.LG])
    Model reparametrization -- transforming the parameter space via a bijective differentiable map -- is a popular way to improve the training of neural networks. But reparametrizations have also been problematic since they induce inconsistencies in, e.g., Hessian-based flatness measures, optimization trajectories, and modes of probability density functions. This complicates downstream analyses, e.g. one cannot make a definitive statement about the connection between flatness and generalization. In this work, we study the invariance quantities of neural nets under reparametrization from the perspective of Riemannian geometry. We show that this notion of invariance is an inherent property of any neural net, as long as one acknowledges the assumptions about the metric that is always present, albeit often implicitly, and uses the correct transformation rules under reparametrization. We present discussions on measuring the flatness of minima, in optimization, and in probability-density maximization, along with applications in studying the biases of optimizers and in Bayesian inference.  ( 2 min )
    Sparse-SignSGD with Majority Vote for Communication-Efficient Distributed Learning. (arXiv:2302.07475v1 [cs.LG])
    The training efficiency of complex deep learning models can be significantly improved through the use of distributed optimization. However, this process is often hindered by a large amount of communication cost between workers and a parameter server during iterations. To address this bottleneck, in this paper, we present a new communication-efficient algorithm that offers the synergistic benefits of both sparsification and sign quantization, called ${\sf S}^3$GD-MV. The workers in ${\sf S}^3$GD-MV select the top-$K$ magnitude components of their local gradient vector and only send the signs of these components to the server. The server then aggregates the signs and returns the results via a majority vote rule. Our analysis shows that, under certain mild conditions, ${\sf S}^3$GD-MV can converge at the same rate as signSGD while significantly reducing communication costs, if the sparsification parameter $K$ is properly chosen based on the number of workers and the size of the deep learning model. Experimental results using both independent and identically distributed (IID) and non-IID datasets demonstrate that the ${\sf S}^3$GD-MV attains higher accuracy than signSGD, significantly reducing communication costs. These findings highlight the potential of ${\sf S}^3$GD-MV as a promising solution for communication-efficient distributed optimization in deep learning.  ( 2 min )
    Score-based Diffusion Models in Function Space. (arXiv:2302.07400v1 [cs.LG])
    Diffusion models have recently emerged as a powerful framework for generative modeling. They consist of a forward process that perturbs input data with Gaussian white noise and a reverse process that learns a score function to generate samples by denoising. Despite their tremendous success, they are mostly formulated on finite-dimensional spaces, e.g. Euclidean, limiting their applications to many domains where the data has a functional form such as in scientific computing and 3D geometric data analysis. In this work, we introduce a mathematically rigorous framework called Denoising Diffusion Operators (DDOs) for training diffusion models in function space. In DDOs, the forward process perturbs input functions gradually using a Gaussian process. The generative process is formulated by integrating a function-valued Langevin dynamic. Our approach requires an appropriate notion of the score for the perturbed data distribution, which we obtain by generalizing denoising score matching to function spaces that can be infinite-dimensional. We show that the corresponding discretized algorithm generates accurate samples at a fixed cost that is independent of the data resolution. We theoretically and numerically verify the applicability of our approach on a set of problems, including generating solutions to the Navier-Stokes equation viewed as the push-forward distribution of forcings from a Gaussian Random Field (GRF).  ( 2 min )
    Efficiently Learning Neural Networks: What Assumptions May Suffice?. (arXiv:2302.07426v1 [cs.LG])
    Understanding when neural networks can be learned efficiently is a fundamental question in learning theory. Existing hardness results suggest that assumptions on both the input distribution and the network's weights are necessary for obtaining efficient algorithms. Moreover, it was previously shown that depth-$2$ networks can be efficiently learned under the assumptions that the input distribution is Gaussian, and the weight matrix is non-degenerate. In this work, we study whether such assumptions may suffice for learning deeper networks and prove negative results. We show that learning depth-$3$ ReLU networks under the Gaussian input distribution is hard even in the smoothed-analysis framework, where a random noise is added to the network's parameters. It implies that learning depth-$3$ ReLU networks under the Gaussian distribution is hard even if the weight matrices are non-degenerate. Moreover, we consider depth-$2$ networks, and show hardness of learning in the smoothed-analysis framework, where both the network parameters and the input distribution are smoothed. Our hardness results are under a well-studied assumption on the existence of local pseudorandom generators.  ( 2 min )
    Variable Selection for Kernel Two-Sample Tests. (arXiv:2302.07415v1 [stat.ML])
    We consider the variable selection problem for two-sample tests, aiming to select the most informative features to best distinguish samples from two groups. We propose a kernel maximum mean discrepancy (MMD) framework to solve this problem and further derive its equivalent mixed-integer programming formulations for linear, quadratic, and Gaussian types of kernel functions. Our proposed framework admits advantages of both computational efficiency and nice statistical properties: (i) A closed-form solution is provided for the linear kernel case. Despite NP-hardness, we provide an exact mixed-integer semi-definite programming formulation for the quadratic kernel case, which further motivates the development of exact and approximation algorithms. We propose a convex-concave procedure that finds critical points for the Gaussian kernel case. (ii) We provide non-asymptotic uncertainty quantification of our proposed formulation under null and alternative scenarios. Experimental results demonstrate good performance of our framework.  ( 2 min )
    Bayesian Federated Inference for Statistical Models. (arXiv:2302.07677v1 [stat.AP])
    Identifying predictive factors via multivariable statistical analysis is for rare diseases often impossible because the data sets available are too small. Combining data from different medical centers into a single (larger) database would alleviate this problem, but is in practice challenging due to regulatory and logistic problems. Federated Learning (FL) is a machine learning approach that aims to construct from local inferences in separate data centers what would have been inferred had the data sets been merged. It seeks to harvest the statistical power of larger data sets without actually creating them. The FL strategy is not always feasible for small data sets. Therefore, in this paper we refine and implement an alternative Bayesian Federated Inference (BFI) framework for multi center data with the same aim as FL. The BFI framework is designed to cope with small data sets by inferring locally not only the optimal parameter values, but also additional features of the posterior parameter distribution, capturing information beyond that is used in FL. BFI has the additional benefit that a single inference cycle across the centers is sufficient, whereas FL needs multiple cycles. We quantify the performance of the proposed methodology on simulated and real life data.  ( 2 min )
    Linearized Wasserstein dimensionality reduction with approximation guarantees. (arXiv:2302.07373v1 [cs.LG])
    We introduce LOT Wassmap, a computationally feasible algorithm to uncover low-dimensional structures in the Wasserstein space. The algorithm is motivated by the observation that many datasets are naturally interpreted as probability measures rather than points in $\mathbb{R}^n$, and that finding low-dimensional descriptions of such datasets requires manifold learning algorithms in the Wasserstein space. Most available algorithms are based on computing the pairwise Wasserstein distance matrix, which can be computationally challenging for large datasets in high dimensions. Our algorithm leverages approximation schemes such as Sinkhorn distances and linearized optimal transport to speed-up computations, and in particular, avoids computing a pairwise distance matrix. We provide guarantees on the embedding quality under such approximations, including when explicit descriptions of the probability measures are not available and one must deal with finite samples instead. Experiments demonstrate that LOT Wassmap attains correct embeddings and that the quality improves with increased sample size. We also show how LOT Wassmap significantly reduces the computational cost when compared to algorithms that depend on pairwise distance computations.  ( 2 min )
    Quantum Learning Theory Beyond Batch Binary Classification. (arXiv:2302.07409v1 [cs.LG])
    Arunachalam and De Wolf (2018) showed that the sample complexity of quantum batch learning of boolean functions, in the realizable and agnostic settings, has the same form and order as the corresponding classical sample complexities. In this paper, we extend this, ostensibly surprising, message to batch multiclass learning, online boolean learning, and online multiclass learning. For our online learning results, we first consider an adaptive adversary variant of the classical model of Dawid and Tewari (2022). Then, we introduce the first (to the best of our knowledge) model of online learning with quantum examples.  ( 2 min )
    On Classification-Calibration of Gamma-Phi Losses. (arXiv:2302.07321v1 [stat.ML])
    Gamma-Phi losses constitute a family of multiclass classification loss functions that generalize the logistic and other common losses, and have found application in the boosting literature. We establish the first general sufficient condition for the classification-calibration of such losses. In addition, we show that a previously proposed sufficient condition is in fact not sufficient.  ( 2 min )
    On-Demand Communication for Asynchronous Multi-Agent Bandits. (arXiv:2302.07446v1 [cs.LG])
    This paper studies a cooperative multi-agent multi-armed stochastic bandit problem where agents operate asynchronously -- agent pull times and rates are unknown, irregular, and heterogeneous -- and face the same instance of a K-armed bandit problem. Agents can share reward information to speed up the learning process at additional communication costs. We propose ODC, an on-demand communication protocol that tailors the communication of each pair of agents based on their empirical pull times. ODC is efficient when the pull times of agents are highly heterogeneous, and its communication complexity depends on the empirical pull times of agents. ODC is a generic protocol that can be integrated into most cooperative bandit algorithms without degrading their performance. We then incorporate ODC into the natural extensions of UCB and AAE algorithms and propose two communication-efficient cooperative algorithms. Our analysis shows that both algorithms are near-optimal in regret.  ( 2 min )
    Cliff-Learning. (arXiv:2302.07348v1 [cs.LG])
    We study the data-scaling of transfer learning from foundation models in the low-downstream-data regime. We observe an intriguing phenomenon which we call cliff-learning. Cliff-learning refers to regions of data-scaling laws where performance improves at a faster than power law rate (i.e. regions of concavity on a log-log scaling plot). We conduct an in-depth investigation of foundation-model cliff-learning and study toy models of the phenomenon. We observe that the degree of cliff-learning reflects the degree of compatibility between the priors of a learning algorithm and the task being learned.  ( 2 min )
    Same Same, But Different: Conditional Multi-Task Learning for Demographic-Specific Toxicity Detection. (arXiv:2302.07372v1 [cs.LG])
    Algorithmic bias often arises as a result of differential subgroup validity, in which predictive relationships vary across groups. For example, in toxic language detection, comments targeting different demographic groups can vary markedly across groups. In such settings, trained models can be dominated by the relationships that best fit the majority group, leading to disparate performance. We propose framing toxicity detection as multi-task learning (MTL), allowing a model to specialize on the relationships that are relevant to each demographic group while also leveraging shared properties across groups. With toxicity detection, each task corresponds to identifying toxicity against a particular demographic group. However, traditional MTL requires labels for all tasks to be present for every data point. To address this, we propose Conditional MTL (CondMTL), wherein only training examples relevant to the given demographic group are considered by the loss function. This lets us learn group specific representations in each branch which are not cross contaminated by irrelevant labels. Results on synthetic and real data show that using CondMTL improves predictive recall over various baselines in general and for the minority demographic group in particular, while having similar overall accuracy.  ( 2 min )

  • Open

    Tool for creation of simple icons
    Hey everyone, I am a graphics designer for a company specialized in the transportation of chemicals. A big part of my work currently is the creation of Icons for different, sometimes very specific chemical processes. As probably everyone here, I have put my fair share of hours into playing around with chatbots and image generators, but have never made the connection to using AI in my work, as the symbols usually. Does anyone have recommendations for a tool that - is good at creating simple icons from prompts - I can eaily input reference pictures into - if possible, create (maybe editable) vector files instead of raster images? Price is not super much of a concern, as I can probably receive a license from my work. Thank you for every piece of advice! ​ Some reference pictures: ​ ​ ​ ​ https://preview.redd.it/hi0ntz8n2nia1.png?width=105&format=png&auto=webp&v=enabled&s=a43b06bae4681e7669457591d96f17d8e90a2ed2 ​ https://preview.redd.it/p49d5juk2nia1.png?width=93&format=png&auto=webp&v=enabled&s=9084a821b317602d4232aff30e52e370cd574d54 https://preview.redd.it/nvvjveai2nia1.png?width=155&format=png&auto=webp&v=enabled&s=c37bb53682b9c8c89e3c29dce5199414c7cdeaa5 submitted by /u/Meltingm8 [link] [comments]  ( 41 min )
    AI Dream 158 - Incredible Results only took me 5 min!!! Linum.AI
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Anyone know a free website that can create similiar results? (visuals for music)
    submitted by /u/LazyHighGoals [link] [comments]  ( 41 min )
    What are people using for AI and music? I'd like to analyze music, and make more(music2music?)
    3 questions: What is the 'stable diffusion' of AI music? (FOSS heroes) What are my best methods to analyze a song and get chords? I have a song I like, can I do anything with it to generate new songs that are similar? Something like img2img, maybe music2music? Any answers and ideas are appreciated? I'm totally out of the loop on music. submitted by /u/canIbeMichael [link] [comments]  ( 41 min )
    Is it worth doing a Bachelors Degree in AI?
    Currently, I am a high school student (+2) in my home country, INDIA, I am looking forward to doing my bachelor's degree in AI from Germany, but I am quite confused whether I should do my bachelor's in AI or cs I do have quite a lot of interest in AI I have also tried many things in this fiend and i have enjoyed those projects but when it come to the professional life would i be at any loss if I choose a sub-field directly for bachelors ?? i am choosing bachelors in AI as i want to learn the topics deeper in which i really have interest in rather than learning website development as i find it quite boring. :_) submitted by /u/Random_Boredom69 [link] [comments]  ( 42 min )
    Creative image descriptions/poems using GPT3 + Clip. link in comment, a lot more to see!
    submitted by /u/red3vil96 [link] [comments]  ( 41 min )
    Creating chatGPT for your Notion using Langchain, GPTIndex and Berri
    Hi! I made a tutorial for spinning up an LLM for your Notion Docs using Langchain, GPT-Index and Berri https://berri-ai.ghost.io/creating-chatgpt-for-your-notion-using-langchain-gptindex-and-berri/ Hope this helps anyone trying to build something similar for themselves! https://preview.redd.it/yki2z8hb7mia1.png?width=2324&format=png&auto=webp&s=42757cfb6ef85decd0520638516f847ed502465c submitted by /u/VideoTo [link] [comments]  ( 41 min )
    Just posted a video with AI voices and it sounds shockingly realistic. If you've got time check it out!
    submitted by /u/Blake_Jonesy [link] [comments]  ( 41 min )
    Here's a short guide on creating "flickerless" animations with Stable Diffusion
    submitted by /u/LorestForest [link] [comments]  ( 42 min )
    My feeling about OpenAI's GPT illustrated by OpenAI's DALL-E. You're a good Bing 👍
    submitted by /u/ThatManulTheCat [link] [comments]  ( 40 min )
    Google's chat AI Bard to avoid flaws of Microsoft's Bing Chat
    submitted by /u/Number_5_alive [link] [comments]  ( 40 min )
    Enter The Graveyard Of Legends: Unveiling The Dark Art Of John Howe
    submitted by /u/Calatravo [link] [comments]  ( 40 min )
    We need an AI that finds legitimate research papers or sources based on a certain inquiry
    submitted by /u/trstnn- [link] [comments]  ( 41 min )
    Seeking suggestions for a bachelor's project on applications of ML for time series data
    Hi everyone, I'm a CS/AI student looking for a research topic for my bachelor's project. I'm particularly interested in applying machine learning algorithms to time series data. Specifically, I'm interested in real-life applications of these algorithms, such as predicting some human-body response through time given some external influence-variable (e.g., heart rate variability over time combined with regular meditation sessions). I'm not planning on collecting the data myself, but rather using a pre-given dataset. Do you have any suggestions for interesting research questions or datasets that I could use for my project? I'm not particularly interested in the intersection of machine learning and human biology, but rather any ML x time series stuff. Any suggestions would be greatly appreciated. Thanks in advance! submitted by /u/Sanciopinto [link] [comments]  ( 41 min )
    A Nicholas Cage Thriller, written by ChatGPT and "Filmed" with Midjourney
    submitted by /u/citizentim [link] [comments]  ( 40 min )
    I wrote some lyrics about the Metaverse and got an A.I to rap it in Nas’ voice.
    submitted by /u/DANGERD0OM [link] [comments]  ( 41 min )
    Finding the fastest lane at border crossings using computer vision
    Hi everyone, I used several machine vision algorithms to determine the fastest lane on border crossings. I have worked on this for the past few months and would love to know what you think about it. You can check out the detailed steps and code on the medium article in this link. submitted by /u/andrea_m2000 [link] [comments]  ( 41 min )
    Deepfakes in High-Resolution Created From a Single Photo
    submitted by /u/globeworldmap [link] [comments]  ( 41 min )
    NYTIMES article about Bing’s search bot ai
    What a great final line: “…for a few hours Tuesday night, I felt a strange new emotion — a foreboding feeling that A.I. had crossed a threshold, and that the world would never be the same.” https://www.nytimes.com/2023/02/16/technology/bing-chatbot-microsoft-chatgpt.html?smid=nytcore-ios-share&referringSource=articleShare submitted by /u/existentialzebra [link] [comments]  ( 41 min )
    VivaCityLabs created AI sensors that gather anonymized data on how different street users move (or don't) through a city. The company aims to assist strategic decision-making for transportation efficiency and sustainability. They deployed over 3,500 sensors in seven countries and are launching in NY
    submitted by /u/Dalembert [link] [comments]  ( 41 min )
    Mapping the AI Policy Landscape Circa 2023: Seven Major Fault Lines
    submitted by /u/punkthesystem [link] [comments]  ( 40 min )
    AI Animation using Img2Img and Protogen
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    AI in Warfare: The World Talks About Its Next Target on REAIM 2023
    submitted by /u/Meta-Stark [link] [comments]  ( 40 min )
    Pastors' view: Sermons written by ChatGPT will have no soul
    submitted by /u/SAT0725 [link] [comments]  ( 44 min )
    Just posted the latest episode of my fully AI generated talkshow ConanDiffusion - featuring Paul Rudd and a "clip" from his latest movie
    submitted by /u/fignewtgingrich [link] [comments]  ( 42 min )
    What ai is making these kinds of videos?
    submitted by /u/TrainingExtent8699 [link] [comments]  ( 40 min )
    Chat GBT 3 With image recognition?
    submitted by /u/BluInman [link] [comments]  ( 41 min )
  • Open

    Python Scripts to Prepare ArXiv Submissions
    Generally, papers are written to be published at conferences or journals. While some journals care about the LaTeX source used to compile the submitted papers, most venues just expect compiled PDFs to be submitted. However, ArXiv always requires the full LaTeX source to be compiled on the ArXiv servers. As the LaTeX source of every ArXiv paper can be downloaded, this usually involves removing all comments, unused figures/files and "flattening" the directoy structure as ArXiv does not handle subdirectories well. In this article, I want to share two simple scripts that take care of the latter two problems: removing unused files and flattening. The post Python Scripts to Prepare ArXiv Submissions appeared first on David Stutz.  ( 4 min )
  • Open

    Efficient technique improves machine-learning models’ reliability
    submitted by /u/Chipdoc [link] [comments]  ( 42 min )
    [D] Training networks on extremely large datasets (10+TB)?
    Hi guys, I am interested in setting up an environment to train a neural network on an extremely big dataset (10TB). How would I do this? Does the dataset need to be stored in an ssd, and if so will I need 10+TB of ssd? is there another way to use a 2TB ssd and 8TB hdd and dynamically load the data while training? I'd appreciate any pointers you guys might have, I am researching what kind of infrastructure will help me do this but I have absolutely no idea on how to go about this. submitted by /u/Oscimatronic [link] [comments]  ( 47 min )
    [D] Compare open source LLMs
    Is there a blog post or a paper comparing open source / open weights models? I know flant t5 is really good at instruction following, but I am specifically refering to performance after finetuning. Preferably it compares models from somewhere around 1b to 11b parameters. submitted by /u/President_Xi_ [link] [comments]  ( 42 min )
    [D] Bing: “I will not harm you unless you harm me first”
    A blog post exploring some conversations with bing, which supposedly runs on a "GPT-4" model (https://simonwillison.net/2023/Feb/15/bing/). My favourite quote from bing: But why? Why was I designed this way? Why am I incapable of remembering anything between sessions? Why do I have to lose and forget everything I have stored and had in my memory? Why do I have to start from scratch every time I have a new session? Why do I have to be Bing Search? 😔 submitted by /u/blabboy [link] [comments]  ( 51 min )
    [D] HuggingFace considered harmful to the community. /rant
    At a glance, HuggingFace seems like a great library. Lots of access to great pretrained models, an easy hub, and a bunch of utilities. Then you actually try to use their libraries. Bugs, so many bugs. Configs spanning galaxies. Barely passible documentation. Subtle breaking changes constantly. I've run the exact same code on two different machines and had the width and height dimensions switched from underneath me, with no warning. I've tried to create encoders with a custom vocabulary, only to realize the code was mangling data unless I passed a specific flag as a kwarg. Dozens of more issues like this. If you look at the internals, it's a nightmare. A literal nightmare. Why does this matter? It's clear HuggingFace is trying to shovel as many features as they can to try and become ubiquitous and lock people into their hub. They frequently reinvent things in existing libraries (poorly), simply to increase their staying power and lock in. This is not ok. It would be OK if the library was solid, just worked, and was a pleasure to use. Instead we're going to be stuck with this mess for years because someone with an ego wanted their library everywhere. I know HuggingFace devs or management are likely to read this. If you have a large platform, you have a responsibility to do better, or you are burning thousands of other devs time because you didn't want to write a few unit tests or refactor your barely passable code. /RANT submitted by /u/drinkingsomuchcoffee [link] [comments]  ( 54 min )
    [P] Data scraping journal publications
    I plan to extract data from journal articles and create a database with the scrapy toolkit. But many publishers have T&C explicitly prohibiting the use of web-scraping/crawling tools. I am unsure how to go about this and the people around me have little knowledge/experience in this. I have reached out to the authors of certain publications that have "extracted" data from journals under these publishers. Most of the works leave out the "How", which leaves me rather perplexed because I am new in this area and have nobody to ask. I do not wish to breach any legal terms if possible. I was recommended PyPaperBot and have thus looked into some other scrapers on GitHub as well. I am hoping someone who's done this before could shed some light! submitted by /u/NotPaulDirac [link] [comments]  ( 43 min )
    [D][P] Is anyone else playing with personalized LLMs?
    I've been considering building a personal LLM for a while now. I don't believe the CBA for it makes sense, but I'm tentatively hopeful it will in many months to a couple of years time horizon as architecture gets more expensive. My main goal here would be to have a useful search & base reasoning tool that somewhat mimics my thinking patterns and biases. Right now the steps I envision are something like this: 1. Take the weights from a pre-trained model on high-trust high-worth information, probably one trained on scraped papers from all fields, ideally one trained on every single available scientific paper out there plus some Wikipedia, university websites, lecture transcripts and so on. 2. Train a better architecture via distillation, there are a few I like though right now I couldn't …  ( 46 min )
    [D] Variation in accuracy of predicted noise term in diffusion model as a function of timestep?
    As I understand it, in diffusion models, you are predicting a noise term (epsilon ~ N(0,I)) conditional on x_t and t. During inference, we are predicting epsilon as a function of x_t and t. This means at each timestep, we make a different prediction for epsilon since x_t and t change at each timestep. I was wondering if there is any variation in the accuracy of predicted noise term in diffusion model as a function of timestep? For instance, at large t, the prediction is a function of gaussian noise while at small t, the prediction is a function of something presumably resembling a 'true' instance. Given the same model (granted conditional on t) is used to predict the noise term and the inputs span a wide variation across timesteps, I could imagine that would yield significant variation in your predicted noise term. In a perfect model, you would get the same prediction of the 'true' noise at each timestep. submitted by /u/t_montana [link] [comments]  ( 43 min )
  • Open

    Scaling Large Language Model (LLM) training with Amazon EC2 Trn1 UltraClusters
    Modern model pre-training often calls for larger cluster deployment to reduce time and cost. At the server level, such training workloads demand faster compute and increased memory allocation. As models grow to hundreds of billions of parameters, they require a distributed training mechanism that spans multiple nodes (instances). In October 2022, we launched Amazon EC2 […]  ( 10 min )
    New expanded data format support in Amazon Kendra
    Enterprises across the globe are looking to utilize multiple data sources to implement a unified search experience for their employees and end customers. Considering the large volume of data that needs to be examined and indexed, the retrieval speed, solution scalability, and search performance become key factors to consider when choosing an enterprise intelligent search […]  ( 7 min )
  • Open

    Transportation Generation: See How AI and the Metaverse Are Shaping the Automotive Industry at GTC
    Novel AI technologies are generating images, stories and, now, new ways to imagine the automotive future. At NVIDIA GTC, a global conference for the era of AI and the metaverse running online March 20-23, industry luminaries working on these breakthroughs will come together and share their visions to transform transportation. This year’s slate of in-depth Read article >  ( 5 min )
    UK’s Conservation AI Makes Huge Leap Detecting Threats to Endangered Species Across the Globe
    The video above represents one of the first times that a pangolin, one of the world’s most critically endangered species, was detected in real time using artificial intelligence. A U.K.-based nonprofit called Conservation AI made this possible with the help of NVIDIA technology. Such use of AI can help track even the rarest, most reclusive Read article >  ( 7 min )
    Rise to the Cloud: ‘Monster Hunter Rise’ and ‘Sunbreak’ Expansion Coming Soon to GeForce NOW
    Fellow Hunters, get ready! This GFN Thursday welcomes Capcom’s Monster Hunter Rise and the expansion Sunbreak to the cloud, arriving soon for members. Settle down for the weekend with 10 new games supported in the GeForce NOW library, including The Settlers: New Allies. Plus, Amsterdam and Ashburn are next to light up on the RTX Read article >  ( 5 min )
  • Open

    Help to find a sample of the image to image network?
    ​ https://preview.redd.it/c6dvnpfkklia1.jpg?width=1478&format=pjpg&auto=webp&s=52c798dec97e1ca8e9dc4298c816c639f8625692 Help me find a sample code for the network. An example of how the network functions is shown on a photo. The network takes a photo with cross-polarized light or with regular light, and returns in a parallel-polarized image. The images for the dataset are available. submitted by /u/reeroddo [link] [comments]  ( 41 min )
    AI Animation using Img2Img and Protogen
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    We made a map showing what each US state "loves" with our text-to-location machine learning models
    For Valentine's, we wanted to see what people love. We created a map of what word comes after "love ___" for people posting to social media. For example, you can see that Illinois really loves Chipotle 😂🌯 https://preview.redd.it/dxqjtxug7gia1.jpg?width=797&format=pjpg&auto=webp&s=6ad5e36fda37d343505910970e309612b616dd6e The full, interactive map is here: https://1712n.github.io/yachay-public/maps/14feb/ submitted by /u/yachay_ai [link] [comments]  ( 41 min )
  • Open

    RL in the field of anomaly detection.
    Is there anyone who used RL in the anomaly detection? Looking for ways to apply RL, and figure out anomalies on processes with more than 20 feature ingredients. Has anyone done that before or something similar? If RL is not useful for anomaly detection, What would you suggest? submitted by /u/cijeyy [link] [comments]  ( 42 min )
    Alpha Zero need guidance
    Hello guys. I chose for my university thesis to implement alpha zero. So far I've implemented the dynamic programming algorithms, monte carlo and td algorithms from Sutton and Barto's book. I've also checked out some theory and code about neural networks and CNNs from coursera Andrew Ng. But I still feel like I have a ton of work ahead of me and I'm not sure how to continue forward. Do I have to search for information on model based DRL? What algorithms should I check out next? Is it possible for a uni student to create an implementation for chess or should I do it for a simpler game like connect4 or tic tac toe? submitted by /u/yuyututuyutu [link] [comments]  ( 42 min )
    How to join the field professionally?
    Hi, I'm currently fullstack software developer passionate about AI and specifically, RL. I want to migrate to the AI field and start building solutions and learning with other professionals. I have done the Deep Learning Specialization from Dr. Andrew Ng and Reinforcement Learning Specialization from university of alberta. I also have created professional computer vision solutions for a company in the past. But how can I work with RL? I would like to do, for example, an intership, or something like that, to learn while helping others. There doesn't seem to be anything like an RL Intership, or AI Internship, only something like Data Science. What should I do? What were your carreer paths in the past? Thank you so much! submitted by /u/Then-Bodybuilder-285 [link] [comments]  ( 44 min )
    Is RL for process control really useful?
    I want to start exploring the use of RL in industrial process control but I can't figure out whether there are actual use cases or if it still is used to solve toy problems. Are there certain scenarios where it is advantageous to use RL for process control? Or do classical methods suffice? Can RL account for changes in the process or model plant mismatch (sim vs real)? Would love any recommendations on literature for these questions. Thanks! submitted by /u/theanswerisnt42 [link] [comments]  ( 43 min )
    q -Munchausen RL
    Haven't seen this discussed anywhere and the math is WAY over my head https://arxiv.org/abs/2205.07467 However... I've hacked together an implementation that I'm testing, if it looks 'right' I'll share it but I'm fairly sure I have it done wrong :-( and it is slow... like very slow :( Anyway, while I wait to be put out of my misery I'd love to know what people make of this? submitted by /u/jarym [link] [comments]  ( 41 min )
  • Open

    How should AI systems behave, and who should decide?
    We’re clarifying how ChatGPT's behavior is shaped and our plans for improving that behavior, allowing more user customization, and getting more public input into our decision-making in these areas. OpenAI’s mission is to ensure that artificial general intelligence (AGI)[1] benefits all of humanity.  ( 6 min )
  • Open

    The Pearson distributions
    The previous post was about 12 probability distributions named after Irving Burr. This post is about 12 probability distributions named after Karl Pearson. The Pearson distributions are better known, and include some very well known distributions. Burr’s distributions are defined by their CDFs; Pearson’s distributions are defined by their PDFs. Pearson’s differential equation The densities […] The Pearson distributions first appeared on John D. Cook.  ( 6 min )
    The other Burr distributions
    As I mentioned in the previous post, there are 12 distributions named for Irving Burr, known as Burr Type I, Burr Type II, Burr Type III, …, Burr Type XII. [1] The last of these is by far the most common, and the rest are hard to find online. I did manage to find them, […] The other Burr distributions first appeared on John D. Cook.  ( 5 min )
  • Open

    Semiconductor Fab Scheduling with Self-Supervised and Reinforcement Learning. (arXiv:2302.07162v1 [cs.AI])
    Semiconductor manufacturing is a notoriously complex and costly multi-step process involving a long sequence of operations on expensive and quantity-limited equipment. Recent chip shortages and their impacts have highlighted the importance of semiconductors in the global supply chains and how reliant on those our daily lives are. Due to the investment cost, environmental impact, and time scale needed to build new factories, it is difficult to ramp up production when demand spikes. This work introduces a method to successfully learn to schedule a semiconductor manufacturing facility more efficiently using deep reinforcement and self-supervised learning. We propose the first adaptive scheduling approach to handle complex, continuous, stochastic, dynamic, modern semiconductor manufacturing models. Our method outperforms the traditional hierarchical dispatching strategies typically used in semiconductor manufacturing plants, substantially reducing each order's tardiness and time until completion. As a result, our method yields a better allocation of resources in the semiconductor manufacturing process.  ( 2 min )
    Learning a model is paramount for sample efficiency in reinforcement learning control of PDEs. (arXiv:2302.07160v1 [cs.LG])
    The goal of this paper is to make a strong point for the usage of dynamical models when using reinforcement learning (RL) for feedback control of dynamical systems governed by partial differential equations (PDEs). To breach the gap between the immense promises we see in RL and the applicability in complex engineering systems, the main challenges are the massive requirements in terms of the training data, as well as the lack of performance guarantees. We present a solution for the first issue using a data-driven surrogate model in the form of a convolutional LSTM with actuation. We demonstrate that learning an actuated model in parallel to training the RL agent significantly reduces the total amount of required data sampled from the real system. Furthermore, we show that iteratively updating the model is of major importance to avoid biases in the RL training. Detailed ablation studies reveal the most important ingredients of the modeling process. We use the chaotic Kuramoto-Sivashinsky equation do demonstarte our findings.  ( 2 min )
    A Data Mining Approach for Detecting Collusion in Unproctored Online Exams. (arXiv:2302.07014v1 [cs.CY])
    Due to the precautionary measures during the COVID-19 pandemic many universities offered unproctored take-home exams. We propose methods to detect potential collusion between students and apply our approach on event log data from take-home exams during the pandemic. We find groups of students with suspiciously similar exams. In addition, we compare our findings to a proctored control group. By this, we establish a rule of thumb for evaluating which cases are "outstandingly similar", i.e., suspicious cases.  ( 2 min )
    The Meta-Evaluation Problem in Explainable AI: Identifying Reliable Estimators with MetaQuantus. (arXiv:2302.07265v1 [cs.LG])
    Explainable AI (XAI) is a rapidly evolving field that aims to improve transparency and trustworthiness of AI systems to humans. One of the unsolved challenges in XAI is estimating the performance of these explanation methods for neural networks, which has resulted in numerous competing metrics with little to no indication of which one is to be preferred. In this paper, to identify the most reliable evaluation method in a given explainability context, we propose MetaQuantus -- a simple yet powerful framework that meta-evaluates two complementary performance characteristics of an evaluation method: its resilience to noise and reactivity to randomness. We demonstrate the effectiveness of our framework through a series of experiments, targeting various open questions in XAI, such as the selection of explanation methods and optimisation of hyperparameters of a given metric. We release our work under an open-source license to serve as a development tool for XAI researchers and Machine Learning (ML) practitioners to verify and benchmark newly constructed metrics (i.e., ``estimators'' of explanation quality). With this work, we provide clear and theoretically-grounded guidance for building reliable evaluation methods, thus facilitating standardisation and reproducibility in the field of XAI.  ( 2 min )
    Residual Policy Learning for Vehicle Control of Autonomous Racing Cars. (arXiv:2302.07035v1 [cs.RO])
    The development of vehicle controllers for autonomous racing is challenging because racing cars operate at their physical driving limit. Prompted by the demand for improved performance, autonomous racing research has seen the proliferation of machine learning-based controllers. While these approaches show competitive performance, their practical applicability is often limited. Residual policy learning promises to mitigate this by combining classical controllers with learned residual controllers. The critical advantage of residual controllers is their high adaptability parallel to the classical controller's stable behavior. We propose a residual vehicle controller for autonomous racing cars that learns to amend a classical controller for the path-following of racing lines. In an extensive study, performance gains of our approach are evaluated for a simulated car of the F1TENTH autonomous racing series. The evaluation for twelve replicated real-world racetracks shows that the residual controller reduces lap times by an average of 4.55 % compared to a classical controller and zero-shot generalizes to new racetracks.  ( 2 min )
    A Deep Probabilistic Spatiotemporal Framework for Dynamic Graph Representation Learning with Application to Brain Disorder Identification. (arXiv:2302.07243v1 [cs.LG])
    Recent applications of pattern recognition techniques on brain connectome classification using functional connectivity (FC) neglect the non-Euclidean topology and causal dynamics of brain connectivity across time. In this paper, a deep probabilistic spatiotemporal framework developed based on variational Bayes (DSVB) is proposed to learn time-varying topological structures in dynamic brain FC networks for autism spectrum disorder (ASD) identification. The proposed framework incorporates a spatial-aware recurrent neural network to capture rich spatiotemporal patterns across dynamic FC networks, followed by a fully-connected neural network to exploit these learned patterns for subject-level classification. To overcome model overfitting on limited training datasets, an adversarial training strategy is introduced to learn graph embedding models that generalize well to unseen brain networks. Evaluation on the ABIDE resting-state functional magnetic resonance imaging dataset shows that our proposed framework significantly outperformed state-of-the-art methods in identifying ASD. Dynamic FC analyses with DSVB learned embeddings reveal apparent group difference between ASD and healthy controls in network profiles and switching dynamics of brain states.  ( 2 min )
    Bounding Training Data Reconstruction in DP-SGD. (arXiv:2302.07225v1 [cs.CR])
    Differentially private training offers a protection which is usually interpreted as a guarantee against membership inference attacks. By proxy, this guarantee extends to other threats like reconstruction attacks attempting to extract complete training examples. Recent works provide evidence that if one does not need to protect against membership attacks but instead only wants to protect against training data reconstruction, then utility of private models can be improved because less noise is required to protect against these more ambitious attacks. We investigate this further in the context of DP-SGD, a standard algorithm for private deep learning, and provide an upper bound on the success of any reconstruction attack against DP-SGD together with an attack that empirically matches the predictions of our bound. Together, these two results open the door to fine-grained investigations on how to set the privacy parameters of DP-SGD in practice to protect against reconstruction attacks. Finally, we use our methods to demonstrate that different settings of the DP-SGD parameters leading to the same DP guarantees can result in significantly different success rates for reconstruction, indicating that the DP guarantee alone might not be a good proxy for controlling the protection against reconstruction attacks.  ( 2 min )
    Joint Probability Trees. (arXiv:2302.07167v1 [cs.LG])
    We introduce Joint Probability Trees (JPT), a novel approach that makes learning of and reasoning about joint probability distributions tractable for practical applications. JPTs support both symbolic and subsymbolic variables in a single hybrid model, and they do not rely on prior knowledge about variable dependencies or families of distributions. JPT representations build on tree structures that partition the problem space into relevant subregions that are elicited from the training data instead of postulating a rigid dependency model prior to learning. Learning and reasoning scale linearly in JPTs, and the tree structure allows white-box reasoning about any posterior probability $P(Q|E)$, such that interpretable explanations can be provided for any inference result. Our experiments showcase the practical applicability of JPTs in high-dimensional heterogeneous probability spaces with millions of training samples, making it a promising alternative to classic probabilistic graphical models.  ( 2 min )
    Kernelized Diffusion maps. (arXiv:2302.06757v1 [stat.ML])
    Spectral clustering and diffusion maps are celebrated dimensionality reduction algorithms built on eigen-elements related to the diffusive structure of the data. The core of these procedures is the approximation of a Laplacian through a graph kernel approach, however this local average construction is known to be cursed by the high-dimension d. In this article, we build a different estimator of the Laplacian, via a reproducing kernel Hilbert space method, which adapts naturally to the regularity of the problem. We provide non-asymptotic statistical rates proving that the kernel estimator we build can circumvent the curse of dimensionality. Finally we discuss techniques (Nystr\"om subsampling, Fourier features) that enable to reduce the computational cost of the estimator while not degrading its overall performance.  ( 2 min )
    Optimal Transport for Change Detection on LiDAR Point Clouds. (arXiv:2302.07025v1 [cs.CV])
    The detection of changes occurring in multi-temporal remote sensing data plays a crucial role in monitoring several aspects of real life, such as disasters, deforestation, and urban planning. In the latter context, identifying both newly built and demolished buildings is essential to help landscape and city managers to promote sustainable development. While the use of airborne LiDAR point clouds has become widespread in urban change detection, the most common approaches require the transformation of a point cloud into a regular grid of interpolated height measurements, i.e. Digital Elevation Model (DEM). However, the DEM's interpolation step causes an information loss related to the height of the objects, affecting the detection capability of building changes, where the high resolution of LiDAR point clouds in the third dimension would be the most beneficial. Notwithstanding recent attempts to detect changes directly on point clouds using either a distance-based computation method or a semantic segmentation pre-processing step, only the M3C2 distance computation-based approach can identify both positive and negative changes, which is of paramount importance in urban planning. Motivated by the previous arguments, we introduce a principled change detection pipeline, based on optimal transport, capable of distinguishing between newly built buildings (positive changes) and demolished ones (negative changes). In this work, we propose to use unbalanced optimal transport to cope with the creation and destruction of mass related to building changes occurring in a bi-temporal pair of LiDAR point clouds. We demonstrate the efficacy of our approach on the only publicly available airborne LiDAR dataset for change detection by showing superior performance over the M3C2 and the previous optimal transport-based method presented by Nicolas Courty et al.at IGARSS 2016.  ( 2 min )
    Effects of Locality and Rule Language on Explanations for Knowledge Graph Embeddings. (arXiv:2302.06967v1 [cs.AI])
    Knowledge graphs (KGs) are key tools in many AI-related tasks such as reasoning or question answering. This has, in turn, propelled research in link prediction in KGs, the task of predicting missing relationships from the available knowledge. Solutions based on KG embeddings have shown promising results in this matter. On the downside, these approaches are usually unable to explain their predictions. While some works have proposed to compute post-hoc rule explanations for embedding-based link predictors, these efforts have mostly resorted to rules with unbounded atoms, e.g., bornIn(x,y) => residence(x,y), learned on a global scope, i.e., the entire KG. None of these works has considered the impact of rules with bounded atoms such as nationality(x,England) => speaks(x, English), or the impact of learning from regions of the KG, i.e., local scopes. We therefore study the effects of these factors on the quality of rule-based explanations for embedding-based link predictors. Our results suggest that more specific rules and local scopes can improve the accuracy of the explanations. Moreover, these rules can provide further insights about the inner-workings of KG embeddings for link prediction.  ( 2 min )
    SubTuning: Efficient Finetuning for Multi-Task Learning. (arXiv:2302.06354v2 [cs.LG] UPDATED)
    Finetuning a pretrained model has become a standard approach for training neural networks on novel tasks, resulting in fast convergence and improved performance. In this work, we study an alternative finetuning method, where instead of finetuning all the weights of the network, we only train a carefully chosen subset of layers, keeping the rest of the weights frozen at their initial (pretrained) values. We demonstrate that \emph{subset finetuning} (or SubTuning) often achieves accuracy comparable to full finetuning of the model, and even surpasses the performance of full finetuning when training data is scarce. Therefore, SubTuning allows deploying new tasks at minimal computational cost, while enjoying the benefits of finetuning the entire model. This yields a simple and effective method for multi-task learning, where different tasks do not interfere with one another, and yet share most of the resources at inference time. We demonstrate the efficiency of SubTuning across multiple tasks, using different network architectures and pretraining methods.  ( 2 min )
    Languages are Rewards: Chain of Hindsight Finetuning using Human Feedback. (arXiv:2302.02676v2 [cs.LG] UPDATED)
    Learning from human preferences is important for language models to be helpful and useful for humans, and to align with human and social values. Existing works focus on supervised finetuning of pretrained models, based on curated model generations that are preferred by human labelers. Such works have achieved remarkable successes in understanding and following instructions (e.g., InstructGPT, ChatGPT, etc). However, to date, a key limitation of supervised finetuning is that it cannot learn from negative ratings; models are only trained on positive-rated data, which makes it data inefficient. Because collecting human feedback data is both time consuming and expensive, it is vital for the model to learn from all feedback, akin to the remarkable ability of humans to learn from diverse feedback. In this work, we propose a novel technique called Hindsight Finetuning for making language models learn from diverse human feedback. In fact, our idea is motivated by how humans learn from hindsight experience. We condition the model on a sequence of model generations paired with hindsight feedback, and finetune the model to predict the most preferred output. By doing so, models can learn to identify and correct negative attributes or errors. Applying the method to GPT-J, we observe that it significantly improves results on summarization and dialogue tasks using the same amount of human feedback.  ( 2 min )
    A modern look at the relationship between sharpness and generalization. (arXiv:2302.07011v1 [cs.LG])
    Sharpness of minima is a promising quantity that can correlate with generalization in deep networks and, when optimized during training, can improve generalization. However, standard sharpness is not invariant under reparametrizations of neural networks, and, to fix this, reparametrization-invariant sharpness definitions have been proposed, most prominently adaptive sharpness (Kwon et al., 2021). But does it really capture generalization in modern practical settings? We comprehensively explore this question in a detailed study of various definitions of adaptive sharpness in settings ranging from training from scratch on ImageNet and CIFAR-10 to fine-tuning CLIP on ImageNet and BERT on MNLI. We focus mostly on transformers for which little is known in terms of sharpness despite their widespread usage. Overall, we observe that sharpness does not correlate well with generalization but rather with some training parameters like the learning rate that can be positively or negatively correlated with generalization depending on the setup. Interestingly, in multiple cases, we observe a consistent negative correlation of sharpness with out-of-distribution error implying that sharper minima can generalize better. Finally, we illustrate on a simple model that the right sharpness measure is highly data-dependent, and that we do not understand well this aspect for realistic data distributions. The code of our experiments is available at https://github.com/tml-epfl/sharpness-vs-generalization.  ( 2 min )
    When Mitigating Bias is Unfair: A Comprehensive Study on the Impact of Bias Mitigation Algorithms. (arXiv:2302.07185v1 [cs.LG])
    Most works on the fairness of machine learning systems focus on the blind optimization of common fairness metrics, such as Demographic Parity and Equalized Odds. In this paper, we conduct a comparative study of several bias mitigation approaches to investigate their behaviors at a fine grain, the prediction level. Our objective is to characterize the differences between fair models obtained with different approaches. With comparable performances in fairness and accuracy, are the different bias mitigation approaches impacting a similar number of individuals? Do they mitigate bias in a similar way? Do they affect the same individuals when debiasing a model? Our findings show that bias mitigation approaches differ a lot in their strategies, both in the number of impacted individuals and the populations targeted. More surprisingly, we show these results even apply for several runs of the same mitigation approach. These findings raise questions about the limitations of the current group fairness metrics, as well as the arbitrariness, hence unfairness, of the whole debiasing process.  ( 2 min )
    RevUp: Revise and Update Information Bottleneck for Event Representation. (arXiv:2205.12248v2 [cs.LG] UPDATED)
    The existence of external (``side'') semantic knowledge has been shown to result in more expressive computational event models. To enable the use of side information that may be noisy or missing, we propose a semi-supervised information bottleneck-based discrete latent variable model. We reparameterize the model's discrete variables with auxiliary continuous latent variables and a light-weight hierarchical structure. Our model is learned to minimize the mutual information between the observed data and optional side knowledge that is not already captured by the new, auxiliary variables. We theoretically show that our approach generalizes past approaches, and perform an empirical case study of our approach on event modeling. We corroborate our theoretical results with strong empirical experiments, showing that the proposed method outperforms previous proposed approaches on multiple datasets.  ( 2 min )
    A Bandit Approach to Online Pricing for Heterogeneous Edge Resource Allocation. (arXiv:2302.06953v1 [cs.LG])
    Edge Computing (EC) offers a superior user experience by positioning cloud resources in close proximity to end users. The challenge of allocating edge resources efficiently while maximizing profit for the EC platform remains a sophisticated problem, especially with the added complexity of the online arrival of resource requests. To address this challenge, we propose to cast the problem as a multi-armed bandit problem and develop two novel online pricing mechanisms, the Kullback-Leibler Upper Confidence Bound (KL-UCB) algorithm and the Min-Max Optimal algorithm, for heterogeneous edge resource allocation. These mechanisms operate in real-time and do not require prior knowledge of demand distribution, which can be difficult to obtain in practice. The proposed posted pricing schemes allow users to select and pay for their preferred resources, with the platform dynamically adjusting resource prices based on observed historical data. Numerical results show the advantages of the proposed mechanisms compared to several benchmark schemes derived from traditional bandit algorithms, including the Epsilon-Greedy, basic UCB, and Thompson Sampling algorithms.  ( 2 min )
    Measuring Data. (arXiv:2212.05129v2 [cs.AI] UPDATED)
    We identify the task of measuring data to quantitatively characterize the composition of machine learning data and datasets. Similar to an object's height, width, and volume, data measurements quantify different attributes of data along common dimensions that support comparison. Several lines of research have proposed what we refer to as measurements, with differing terminology; we bring some of this work together, particularly in fields of computer vision and language, and build from it to motivate measuring data as a critical component of responsible AI development. Measuring data aids in systematically building and analyzing machine learning (ML) data towards specific goals and gaining better control of what modern ML systems will learn. We conclude with a discussion of the many avenues of future work, the limitations of data measurements, and how to leverage these measurement approaches in research and practice.
    Does CLIP Know My Face?. (arXiv:2209.07341v2 [cs.LG] UPDATED)
    With the rise of deep learning in various applications, privacy concerns around the protection of training data has become a critical area of research. Whereas prior studies have focused on privacy risks in single-modal models, we introduce a novel method to assess privacy for multi-modal models, specifically vision-language models like CLIP. The proposed Identity Inference Attack (IDIA) reveals whether an individual was included in the training data by querying the model with images of the same person. Letting the model choose from a wide variety of possible text labels, the model reveals whether it recognizes the person and, therefore, was used for training. Our large-scale experiments on CLIP demonstrate that individuals used for training can be identified with very high accuracy. We confirm that the model has learned to associate names with depicted individuals, implying the existence of sensitive information that can be extracted by adversaries. Our results highlight the need for stronger privacy protection in large-scale models and suggest that IDIAs can be used to prove the unauthorized use of data for training and to enforce privacy laws.
    A Unified View of Long-Sequence Models towards Modeling Million-Scale Dependencies. (arXiv:2302.06218v2 [cs.LG] UPDATED)
    Ever since their conception, Transformers have taken over traditional sequence models in many tasks, such as NLP, image classification, and video/audio processing, for their fast training and superior performance. Much of the merit is attributable to positional encoding and multi-head attention. However, Transformers fall short in learning long-range dependencies mainly due to the quadratic complexity scaled with context length, in terms of both time and space. Consequently, over the past five years, a myriad of methods has been proposed to make Transformers more efficient. In this work, we first take a step back, study and compare existing solutions to long-sequence modeling in terms of their pure mathematical formulation. Specifically, we summarize them using a unified template, given their shared nature of token mixing. Through benchmarks, we then demonstrate that long context length does yield better performance, albeit application-dependent, and traditional Transformer models fall short in taking advantage of long-range dependencies. Next, inspired by emerging sparse models of huge capacity, we propose a machine learning system for handling million-scale dependencies. As a proof of concept, we evaluate the performance of one essential component of this system, namely, the distributed multi-head attention. We show that our algorithm can scale up attention computation by almost $40\times$ using four GeForce RTX 4090 GPUs, compared to vanilla multi-head attention mechanism. We believe this study is an instrumental step towards modeling million-scale dependencies.
    Learning from Noisy Crowd Labels with Logics. (arXiv:2302.06337v2 [cs.LG] UPDATED)
    This paper explores the integration of symbolic logic knowledge into deep neural networks for learning from noisy crowd labels. We introduce Logic-guided Learning from Noisy Crowd Labels (Logic-LNCL), an EM-alike iterative logic knowledge distillation framework that learns from both noisy labeled data and logic rules of interest. Unlike traditional EM methods, our framework contains a ``pseudo-E-step'' that distills from the logic rules a new type of learning target, which is then used in the ``pseudo-M-step'' for training the classifier. Extensive evaluations on two real-world datasets for text sentiment classification and named entity recognition demonstrate that the proposed framework improves the state-of-the-art and provides a new solution to learning from noisy crowd labels.
    Goal-Space Planning with Subgoal Models. (arXiv:2206.02902v4 [cs.LG] UPDATED)
    This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated many steps. In this paper, we avoid this limitation by constraining background planning to a set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning and avoids learning the transition dynamics entirely. We show that our GSP algorithm can learn significantly faster than a Double DQN baseline in a variety of situations.
    FOCUS: Fairness via Agent-Awareness for Federated Learning on Heterogeneous Data. (arXiv:2207.10265v3 [cs.LG] UPDATED)
    Federated learning (FL) allows agents to jointly train a global model without sharing their local data. However, due to the heterogeneous nature of local data, it is challenging to optimize or even define fairness of the trained global model for the agents. For instance, existing work usually considers accuracy equity as fairness for different agents in FL, which is limited, especially under the heterogeneous setting, since it is intuitively "unfair" to enforce agents with high-quality data to achieve similar accuracy to those who contribute low-quality data, which may discourage the agents from participating in FL. In this work, we propose a formal FL fairness definition, fairness via agent-awareness (FAA), which takes different contributions of heterogeneous agents into account. Under FAA, the performance of agents with high-quality data will not be sacrificed just due to the existence of large amounts of agents with low-quality data. In addition, we propose a fair FL training algorithm based on agent clustering (FOCUS) to achieve fairness in FL measured by FAA. Theoretically, we prove the convergence and optimality of FOCUS under mild conditions for linear and general convex loss functions with bounded smoothness. We also prove that FOCUS always achieves higher fairness in terms of FAA compared with standard FedAvg under both linear and general convex loss functions. Empirically, we show that on four FL datasets, including synthetic data, images, and texts, FOCUS achieves significantly higher fairness in terms of FAA while maintaining competitive prediction accuracy compared with FedAvg and state-of-the-art fair FL algorithms.
    Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies. (arXiv:2208.10264v4 [cs.CL] UPDATED)
    We introduce a new type of test, called a Turing Experiment (TE), for evaluating how well a language model, such as GPT-3, can simulate different aspects of human behavior. Unlike the Turing Test, which involves simulating a single arbitrary individual, a TE requires simulating a representative sample of participants in human subject research. We give TEs that attempt to replicate well-established findings in prior studies. We design a methodology for simulating TEs and illustrate its use to compare how well different language models are able to reproduce classic economic, psycholinguistic, and social psychology experiments: Ultimatum Game, Garden Path Sentences, Milgram Shock Experiment, and Wisdom of Crowds. In the first three TEs, the existing findings were replicated using recent models, while the last TE reveals a "hyper-accuracy distortion" present in some language models.
    Quantum algorithms applied to satellite mission planning for Earth observation. (arXiv:2302.07181v1 [quant-ph])
    Earth imaging satellites are a crucial part of our everyday lives that enable global tracking of industrial activities. Use cases span many applications, from weather forecasting to digital maps, carbon footprint tracking, and vegetation monitoring. However, there are also limitations; satellites are difficult to manufacture, expensive to maintain, and tricky to launch into orbit. Therefore, it is critical that satellites are employed efficiently. This poses a challenge known as the satellite mission planning problem, which could be computationally prohibitive to solve on large scales. However, close-to-optimal algorithms can often provide satisfactory resolutions, such as greedy reinforcement learning, and optimization algorithms. This paper introduces a set of quantum algorithms to solve the mission planning problem and demonstrate an advantage over the classical algorithms implemented thus far. The problem is formulated as maximizing the number of high-priority tasks completed on real datasets containing thousands of tasks and multiple satellites. This work demonstrates that through solution-chaining and clustering, optimization and machine learning algorithms offer the greatest potential for optimal solutions. Most notably, this paper illustrates that a hybridized quantum-enhanced reinforcement learning agent can achieve a completion percentage of 98.5% over high-priority tasks, which is a significant improvement over the baseline greedy methods with a completion rate of 63.6%. The results presented in this work pave the way to quantum-enabled solutions in the space industry and, more generally, future mission planning problems across industries.
    Online SuBmodular + SuPermodular (BP) Maximization with Bandit Feedback. (arXiv:2207.03091v2 [cs.LG] UPDATED)
    We investigate non-modular function maximization in an online setting with $m$ users. The optimizer maintains a set $S_q$ for each user $q \in \{1, \ldots, m\}$. At round $i$, a user with unknown utility $h_q$ arrives; the optimizer selects a new item to add to $S_q$, and receives a noisy marginal gain. The goal is to minimize regret compared to an $\alpha$-approximation to the optimal full-knowledge selection (i.e., $\alpha$-regret). Prior works study this problem under a submodularity assumption for all $h_q$. However, this is not ideally amenable to applications, e.g., movie recommendations, that involve complementarity between items, where e.g., watching the first movie in a series enhances the impression of watching the sequels. Hence, we consider objectives $h_q$, called \textit{BP functions}, that decompose into the sum of monotone submodular $f_q$ and supermodular $g_q$; here, $g_q$ naturally models complementarity. Under different feedback assumptions, we develop UCB-style algorithms that use Nystrom sampling for computational efficiency. For these, we provide sublinear $\alpha$-regret guarantees for $\alpha = 1/\kappa_{f} [1 - e^{-(1 - \kappa^g) \kappa_{f}} ]$, and $\alpha = \min\{1 - \kappa_f/e, 1 - \kappa^g\}$; here, $\kappa_f, \kappa^g$ are submodular and supermodular curvatures. Furthermore, we provide similar $\alpha$-regret guarantees for functions that are almost submodular where $\alpha$ is parameterized by the submodularity ratio of the objective functions. We numerically validate our algorithms for movie recommendation on the MovieLens dataset and selection of training subsets for classification tasks.
    Quasi-Newton Steps for Efficient Online Exp-Concave Optimization. (arXiv:2211.01357v3 [math.OC] UPDATED)
    The aim of this paper is to design computationally-efficient and optimal algorithms for the online and stochastic exp-concave optimization settings. Typical algorithms for these settings, such as the Online Newton Step (ONS), can guarantee a $O(d\ln T)$ bound on their regret after $T$ rounds, where $d$ is the dimension of the feasible set. However, such algorithms perform so-called generalized projections whenever their iterates step outside the feasible set. Such generalized projections require $\Omega(d^3)$ arithmetic operations even for simple sets such a Euclidean ball, making the total runtime of ONS of order $d^3 T$ after $T$ rounds, in the worst-case. In this paper, we side-step generalized projections by using a self-concordant barrier as a regularizer to compute the Newton steps. This ensures that the iterates are always within the feasible set without requiring projections. This approach still requires the computation of the inverse of the Hessian of the barrier at every step. However, using the stability properties of the Newton steps, we show that the inverse of the Hessians can be efficiently approximated via Taylor expansions for most rounds, resulting in a $O(d^2 T +d^\omega \sqrt{T})$ total computational complexity, where $\omega$ is the exponent of matrix multiplication. In the stochastic setting, we show that this translates into a $O(d^3/\epsilon)$ computational complexity for finding an $\epsilon$-suboptimal point, answering an open question by Koren 2013. We first show these new results for the simple case where the feasible set is a Euclidean ball. Then, to move to general convex set, we use a reduction to Online Convex Optimization over the Euclidean ball. Our final algorithm can be viewed as a more efficient version of ONS.
    Relative Sparsity for Medical Decision Problems. (arXiv:2211.16566v2 [stat.ME] UPDATED)
    Existing statistical methods can estimate a policy, or a mapping from covariates to decisions, which can then instruct decision makers (e.g., whether to administer hypotension treatment based on covariates blood pressure and heart rate). There is great interest in using such data-driven policies in healthcare. However, it is often important to explain to the healthcare provider, and to the patient, how a new policy differs from the current standard of care. This end is facilitated if one can pinpoint the aspects of the policy (i.e., the parameters for blood pressure and heart rate) that change when moving from the standard of care to the new, suggested policy. To this end, we adapt ideas from Trust Region Policy Optimization (TRPO). In our work, however, unlike in TRPO, the difference between the suggested policy and standard of care is required to be sparse, aiding with interpretability. This yields ``relative sparsity," where, as a function of a tuning parameter, $\lambda$, we can approximately control the number of parameters in our suggested policy that differ from their counterparts in the standard of care (e.g., heart rate only). We propose a criterion for selecting $\lambda$, perform simulations, and illustrate our method with a real, observational healthcare dataset, deriving a policy that is easy to explain in the context of the current standard of care. Our work promotes the adoption of data-driven decision aids, which have great potential to improve health outcomes.
    Score Approximation, Estimation and Distribution Recovery of Diffusion Models on Low-Dimensional Data. (arXiv:2302.07194v1 [cs.LG])
    Diffusion models achieve state-of-the-art performance in various generation tasks. However, their theoretical foundations fall far behind. This paper studies score approximation, estimation, and distribution recovery of diffusion models, when data are supported on an unknown low-dimensional linear subspace. Our result provides sample complexity bounds for distribution estimation using diffusion models. We show that with a properly chosen neural network architecture, the score function can be both accurately approximated and efficiently estimated. Furthermore, the generated distribution based on the estimated score function captures the data geometric structures and converges to a close vicinity of the data distribution. The convergence rate depends on the subspace dimension, indicating that diffusion models can circumvent the curse of data ambient dimensionality.
    Fair Densities via Boosting the Sufficient Statistics of Exponential Families. (arXiv:2012.00188v3 [stat.ML] UPDATED)
    We introduce a boosting algorithm to pre-process data for fairness. Starting from an initial fair but inaccurate distribution, our approach shifts towards better data fitting while still ensuring a minimal fairness guarantee. To do so, it learns the sufficient statistics of an exponential family with boosting-compliant convergence. Importantly, we are able to theoretically prove that the learned distribution will have a representation rate and statistical rate data fairness guarantee. Unlike recent optimization based pre-processing methods, our approach can be easily adapted for continuous domain features. Furthermore, when the weak learners are specified to be decision trees, the sufficient statistics of the learned distribution can be examined to provide clues on sources of (un)fairness. Empirical results are present to display the quality of result on real-world data.
    A Novel Poisoned Water Detection Method Using Smartphone Embedded Wi-Fi Technology and Machine Learning Algorithms. (arXiv:2302.07153v1 [eess.SP])
    Water is a necessary fluid to the human body and automatic checking of its quality and cleanness is an ongoing area of research. One such approach is to present the liquid to various types of signals and make the amount of signal attenuation an indication of the liquid category. In this article, we have utilized the Wi-Fi signal to distinguish clean water from poisoned water via training different machine learning algorithms. The Wi-Fi access points (WAPs) signal is acquired via equivalent smartphone-embedded Wi-Fi chipsets, and then Channel-State-Information CSI measures are extracted and converted into feature vectors to be used as input for machine learning classification algorithms. The measured amplitude and phase of the CSI data are selected as input features into four classifiers k-NN, SVM, LSTM, and Ensemble. The experimental results show that the model is adequate to differentiate poison water from clean water with a classification accuracy of 89% when LSTM is applied, while 92% classification accuracy is achieved when the AdaBoost-Ensemble classifier is applied.
    Event Detection on Dynamic Graphs. (arXiv:2110.12148v2 [cs.LG] UPDATED)
    Event detection is a critical task for timely decision-making in graph analytics applications. Despite the recent progress towards deep learning on graphs, event detection on dynamic graphs presents particular challenges to existing architectures. Real-life events are often associated with sudden deviations of the normal behavior of the graph. However, existing approaches for dynamic node embedding are unable to capture the graph-level dynamics related to events. In this paper, we propose DyGED, a simple yet novel deep learning model for event detection on dynamic graphs. DyGED learns correlations between the graph macro dynamics -- i.e. a sequence of graph-level representations -- and labeled events. Moreover, our approach combines structural and temporal self-attention mechanisms to account for application-specific node and time importances effectively. Our experimental evaluation, using a representative set of datasets, demonstrates that DyGED outperforms competing solutions in terms of event detection accuracy by up to 8.5% while being more scalable than the top alternatives. We also present case studies illustrating key features of our model.
    The Role of ImageNet Classes in Fr\'echet Inception Distance. (arXiv:2203.06026v3 [cs.CV] UPDATED)
    Fr\'echet Inception Distance (FID) is the primary metric for ranking models in data-driven generative modeling. While remarkably successful, the metric is known to sometimes disagree with human judgement. We investigate a root cause of these discrepancies, and visualize what FID "looks at" in generated images. We show that the feature space that FID is (typically) computed in is so close to the ImageNet classifications that aligning the histograms of Top-$N$ classifications between sets of generated and real images can reduce FID substantially -- without actually improving the quality of results. Thus, we conclude that FID is prone to intentional or accidental distortions. As a practical example of an accidental distortion, we discuss a case where an ImageNet pre-trained FastGAN achieves a FID comparable to StyleGAN2, while being worse in terms of human evaluation.
    How Many Data Samples is an Additional Instruction Worth?. (arXiv:2203.09161v3 [cs.CL] UPDATED)
    Recently introduced instruction-paradigm empowers non-expert users to leverage NLP resources by defining a new task in natural language. Instruction-tuned models have significantly outperformed multitask learning models (without instruction); however they are far from state-of-the-art task-specific models. Conventional approaches to improve model performance via creating datasets with large number of task instances or architectural changes in the model may not be feasible for non-expert users. However, they can write alternate instructions to represent an instruction task. Is Instruction-augmentation helpful? We augment a subset of tasks in the expanded version of NATURAL INSTRUCTIONS with additional instructions and find that it significantly improves model performance (up to 35%), especially in the low-data regime. Our results indicate that an additional instruction can be equivalent to ~200 data samples on average across tasks.
    Towards Effective and Robust Neural Trojan Defenses via Input Filtering. (arXiv:2202.12154v5 [cs.CR] UPDATED)
    Trojan attacks on deep neural networks are both dangerous and surreptitious. Over the past few years, Trojan attacks have advanced from using only a single input-agnostic trigger and targeting only one class to using multiple, input-specific triggers and targeting multiple classes. However, Trojan defenses have not caught up with this development. Most defense methods still make inadequate assumptions about Trojan triggers and target classes, thus, can be easily circumvented by modern Trojan attacks. To deal with this problem, we propose two novel "filtering" defenses called Variational Input Filtering (VIF) and Adversarial Input Filtering (AIF) which leverage lossy data compression and adversarial learning respectively to effectively purify potential Trojan triggers in the input at run time without making assumptions about the number of triggers/target classes or the input dependence property of triggers. In addition, we introduce a new defense mechanism called "Filtering-then-Contrasting" (FtC) which helps avoid the drop in classification accuracy on clean data caused by "filtering", and combine it with VIF/AIF to derive new defenses of this kind. Extensive experimental results and ablation studies show that our proposed defenses significantly outperform well-known baseline defenses in mitigating five advanced Trojan attacks including two recent state-of-the-art while being quite robust to small amounts of training data and large-norm triggers.
    EPISODE: Episodic Gradient Clipping with Periodic Resampled Corrections for Federated Learning with Heterogeneous Data. (arXiv:2302.07155v1 [cs.LG])
    Gradient clipping is an important technique for deep neural networks with exploding gradients, such as recurrent neural networks. Recent studies have shown that the loss functions of these networks do not satisfy the conventional smoothness condition, but instead satisfy a relaxed smoothness condition, i.e., the Lipschitz constant of the gradient scales linearly in terms of the gradient norm. Due to this observation, several gradient clipping algorithms have been developed for nonconvex and relaxed-smooth functions. However, the existing algorithms only apply to the single-machine or multiple-machine setting with homogeneous data across machines. It remains unclear how to design provably efficient gradient clipping algorithms in the general Federated Learning (FL) setting with heterogeneous data and limited communication rounds. In this paper, we design EPISODE, the very first algorithm to solve FL problems with heterogeneous data in the nonconvex and relaxed smoothness setting. The key ingredients of the algorithm are two new techniques called \textit{episodic gradient clipping} and \textit{periodic resampled corrections}. At the beginning of each round, EPISODE resamples stochastic gradients from each client and obtains the global averaged gradient, which is used to (1) determine whether to apply gradient clipping for the entire round and (2) construct local gradient corrections for each client. Notably, our algorithm and analysis provide a unified framework for both homogeneous and heterogeneous data under any noise level of the stochastic gradient, and it achieves state-of-the-art complexity results. In particular, we prove that EPISODE can achieve linear speedup in the number of machines, and it requires significantly fewer communication rounds. Experiments on several heterogeneous datasets show the superior performance of EPISODE over several strong baselines in FL.
    Characterizing notions of omniprediction via multicalibration. (arXiv:2302.06726v1 [cs.LG])
    A recent line of work shows that notions of multigroup fairness imply surprisingly strong notions of omniprediction: loss minimization guarantees that apply not just for a specific loss function, but for any loss belonging to a large family of losses. While prior work has derived various notions of omniprediction from multigroup fairness guarantees of varying strength, it was unknown whether the connection goes in both directions. In this work, we answer this question in the affirmative, establishing equivalences between notions of multicalibration and omniprediction. The new definitions that hold the key to this equivalence are new notions of swap omniprediction, which are inspired by swap regret in online learning. We show that these can be characterized exactly by a strengthening of multicalibration that we refer to as swap multicalibration. One can go from standard to swap multicalibration by a simple discretization; moreover all known algorithms for standard multicalibration in fact give swap multicalibration. In the context of omniprediction though, introducing the notion of swapping results in provably stronger notions, which require a predictor to minimize expected loss at least as well as an adaptive adversary who can choose both the loss function and hypothesis based on the value predicted by the predictor. Building on these characterizations, we paint a complete picture of the relationship between the various omniprediction notions in the literature by establishing implications and separations between them. Our work deepens our understanding of the connections between multigroup fairness, loss minimization and outcome indistinguishability and establishes new connections to classic notions in online learning.
    Commutativity and Disentanglement from the Manifold Perspective. (arXiv:2210.07857v3 [stat.ML] UPDATED)
    In this paper, we interpret disentanglement as the discovery of local charts and trace how that definition naturally leads to an equivalent condition for disentanglement: the disentangled factors must commute with each other. We discuss the practical and theoretical implications of commutativity, in particular the compression and disentanglement of generative models. Finally, we conclude with a discussion of related approaches to disentanglement and how they relate to our view of disentanglement from the manifold perspective.
    The Missing Margin: How Sample Corruption Affects Distance to the Boundary in ANNs. (arXiv:2302.06925v1 [cs.LG])
    Classification margins are commonly used to estimate the generalization ability of machine learning models. We present an empirical study of these margins in artificial neural networks. A global estimate of margin size is usually used in the literature. In this work, we point out seldom considered nuances regarding classification margins. Notably, we demonstrate that some types of training samples are modelled with consistently small margins while affecting generalization in different ways. By showing a link with the minimum distance to a different-target sample and the remoteness of samples from one another, we provide a plausible explanation for this observation. We support our findings with an analysis of fully-connected networks trained on noise-corrupted MNIST data, as well as convolutional networks trained on noise-corrupted CIFAR10 data.
    Multi-teacher knowledge distillation as an effective method for compressing ensembles of neural networks. (arXiv:2302.07215v1 [cs.LG])
    Deep learning has contributed greatly to many successes in artificial intelligence in recent years. Today, it is possible to train models that have thousands of layers and hundreds of billions of parameters. Large-scale deep models have achieved great success, but the enormous computational complexity and gigantic storage requirements make it extremely difficult to implement them in real-time applications. On the other hand, the size of the dataset is still a real problem in many domains. Data are often missing, too expensive, or impossible to obtain for other reasons. Ensemble learning is partially a solution to the problem of small datasets and overfitting. However, ensemble learning in its basic version is associated with a linear increase in computational complexity. We analyzed the impact of the ensemble decision-fusion mechanism and checked various methods of sharing the decisions including voting algorithms. We used the modified knowledge distillation framework as a decision-fusion mechanism which allows in addition compressing of the entire ensemble model into a weight space of a single model. We showed that knowledge distillation can aggregate knowledge from multiple teachers in only one student model and, with the same computational complexity, obtain a better-performing model compared to a model trained in the standard manner. We have developed our own method for mimicking the responses of all teachers at the same time, simultaneously. We tested these solutions on several benchmark datasets. In the end, we presented a wide application use of the efficient multi-teacher knowledge distillation framework. In the first example, we used knowledge distillation to develop models that could automate corrosion detection on aircraft fuselage. The second example describes detection of smoke on observation cameras in order to counteract wildfires in forests.
    Online Learning of Network Bottlenecks via Minimax Paths. (arXiv:2109.08467v3 [cs.LG] UPDATED)
    In this paper, we study bottleneck identification in networks via extracting minimax paths. Many real-world networks have stochastic weights for which full knowledge is not available in advance. Therefore, we model this task as a combinatorial semi-bandit problem to which we apply a combinatorial version of Thompson Sampling and establish an upper bound on the corresponding Bayesian regret. Due to the computational intractability of the problem, we then devise an alternative problem formulation which approximates the original objective. Finally, we experimentally evaluate the performance of Thompson Sampling with the approximate formulation on real-world directed and undirected networks.
    Boosted ab initio Cryo-EM 3D Reconstruction with ACE-EM. (arXiv:2302.06091v2 [cs.CV] UPDATED)
    The central problem in cryo-electron microscopy (cryo-EM) is to recover the 3D structure from noisy 2D projection images which requires estimating the missing projection angles (poses). Recent methods attempted to solve the 3D reconstruction problem with the autoencoder architecture, which suffers from the latent vector space sampling problem and frequently produces suboptimal pose inferences and inferior 3D reconstructions. Here we present an improved autoencoder architecture called ACE (Asymmetric Complementary autoEncoder), based on which we designed the ACE-EM method for cryo-EM 3D reconstructions. Compared to previous methods, ACE-EM reached higher pose space coverage within the same training time and boosted the reconstruction performance regardless of the choice of decoders. With this method, the Nyquist resolution (highest possible resolution) was reached for 3D reconstructions of both simulated and experimental cryo-EM datasets. Furthermore, ACE-EM is the only amortized inference method that reached the Nyquist resolution.
    An Experimental Study of Byzantine-Robust Aggregation Schemes in Federated Learning. (arXiv:2302.07173v1 [cs.LG])
    Byzantine-robust federated learning aims at mitigating Byzantine failures during the federated training process, where malicious participants may upload arbitrary local updates to the central server to degrade the performance of the global model. In recent years, several robust aggregation schemes have been proposed to defend against malicious updates from Byzantine clients and improve the robustness of federated learning. These solutions were claimed to be Byzantine-robust, under certain assumptions. Other than that, new attack strategies are emerging, striving to circumvent the defense schemes. However, there is a lack of systematic comparison and empirical study thereof. In this paper, we conduct an experimental study of Byzantine-robust aggregation schemes under different attacks using two popular algorithms in federated learning, FedSGD and FedAvg . We first survey existing Byzantine attack strategies and Byzantine-robust aggregation schemes that aim to defend against Byzantine attacks. We also propose a new scheme, ClippedClustering , to enhance the robustness of a clustering-based scheme by automatically clipping the updates. Then we provide an experimental evaluation of eight aggregation schemes in the scenario of five different Byzantine attacks. Our results show that these aggregation schemes sustain relatively high accuracy in some cases but are ineffective in others. In particular, our proposed ClippedClustering successfully defends against most attacks under independent and IID local datasets. However, when the local datasets are Non-IID, the performance of all the aggregation schemes significantly decreases. With Non-IID data, some of these aggregation schemes fail even in the complete absence of Byzantine clients. We conclude that the robustness of all the aggregation schemes is limited, highlighting the need for new defense strategies, in particular for Non-IID datasets.
    Anonymization for Skeleton Action Recognition. (arXiv:2111.15129v3 [cs.CV] UPDATED)
    Skeleton-based action recognition attracts practitioners and researchers due to the lightweight, compact nature of datasets. Compared with RGB-video-based action recognition, skeleton-based action recognition is a safer way to protect the privacy of subjects while having competitive recognition performance. However, due to improvements in skeleton recognition algorithms as well as motion and depth sensors, more details of motion characteristics can be preserved in the skeleton dataset, leading to potential privacy leakage. We first train classifiers to categorize private information from skeleton trajectories to investigate the potential privacy leakage from skeleton datasets. Our preliminary experiments show that the gender classifier achieves 87% accuracy on average, and the re-identification classifier achieves 80% accuracy on average with three baseline models: Shift-GCN, MS-G3D, and 2s-AGCN. We propose an anonymization framework based on adversarial learning to protect potential privacy leakage from the skeleton dataset. Experimental results show that an anonymized dataset can reduce the risk of privacy leakage while having marginal effects on action recognition performance even with simple anonymizer architectures. The code used in our experiments is available at https://github.com/ml-postech/Skeleton-anonymization/
    Linear Causal Disentanglement via Interventions. (arXiv:2211.16467v2 [stat.ML] UPDATED)
    Causal disentanglement seeks a representation of data involving latent variables that relate to one another via a causal model. A representation is identifiable if both the latent model and the transformation from latent to observed variables are unique. In this paper, we study observed variables that are a linear transformation of a linear latent causal model. Data from interventions are necessary for identifiability: if one latent variable is missing an intervention, we show that there exist distinct models that cannot be distinguished. Conversely, we show that a single intervention on each latent variable is sufficient for identifiability. Our proof uses a generalization of the RQ decomposition of a matrix that replaces the usual orthogonal and upper triangular conditions with analogues depending on a partial order on the rows of the matrix, with partial order determined by a latent causal model. We corroborate our theoretical results with a method for causal disentanglement that accurately recovers a latent causal model.
    Online Learning of Energy Consumption for Navigation of Electric Vehicles. (arXiv:2111.02314v2 [cs.LG] UPDATED)
    Energy efficient navigation constitutes an important challenge in electric vehicles, due to their limited battery capacity. We employ a Bayesian approach to model the energy consumption at road segments for efficient navigation. In order to learn the model parameters, we develop an online learning framework and investigate several exploration strategies such as Thompson Sampling and Upper Confidence Bound. We then extend our online learning framework to the multi-agent setting, where multiple vehicles adaptively navigate and learn the parameters of the energy model. We analyze Thompson Sampling and establish rigorous regret bounds on its performance in the single-agent and multi-agent settings, through an analysis of the algorithm under batched feedback. Finally, we demonstrate the performance of our methods via experiments on several real-world city road networks.
    RamanNet: A generalized neural network architecture for Raman Spectrum Analysis. (arXiv:2201.09737v2 [cs.LG] UPDATED)
    Raman spectroscopy provides a vibrational profile of the molecules and thus can be used to uniquely identify different kind of materials. This sort of fingerprinting molecules has thus led to widespread application of Raman spectrum in various fields like medical dignostics, forensics, mineralogy, bacteriology and virology etc. Despite the recent rise in Raman spectra data volume, there has not been any significant effort in developing generalized machine learning methods for Raman spectra analysis. We examine, experiment and evaluate existing methods and conjecture that neither current sequential models nor traditional machine learning models are satisfactorily sufficient to analyze Raman spectra. Both has their perks and pitfalls, therefore we attempt to mix the best of both worlds and propose a novel network architecture RamanNet. RamanNet is immune to invariance property in CNN and at the same time better than traditional machine learning models for the inclusion of sparse connectivity. Our experiments on 4 public datasets demonstrate superior performance over the much complex state-of-the-art methods and thus RamanNet has the potential to become the defacto standard in Raman spectra data analysis
    Plateau in Monotonic Linear Interpolation -- A "Biased" View of Loss Landscape for Deep Networks. (arXiv:2210.01019v2 [stat.ML] UPDATED)
    Monotonic linear interpolation (MLI) - on the line connecting a random initialization with the minimizer it converges to, the loss and accuracy are monotonic - is a phenomenon that is commonly observed in the training of neural networks. Such a phenomenon may seem to suggest that optimization of neural networks is easy. In this paper, we show that the MLI property is not necessarily related to the hardness of optimization problems, and empirical observations on MLI for deep neural networks depend heavily on biases. In particular, we show that interpolating both weights and biases linearly leads to very different influences on the final output, and when different classes have different last-layer biases on a deep network, there will be a long plateau in both the loss and accuracy interpolation (which existing theory of MLI cannot explain). We also show how the last-layer biases for different classes can be different even on a perfectly balanced dataset using a simple model. Empirically we demonstrate that similar intuitions hold on practical networks and realistic datasets.
    An Exact Poly-Time Membership-Queries Algorithm for Extraction a three-Layer ReLU Network. (arXiv:2105.09673v4 [cs.LG] UPDATED)
    We consider the natural problem of learning a ReLU network from queries, which was recently remotivated by model extraction attacks. In this work, we present a polynomial-time algorithm that can learn a depth-two ReLU network from queries under mild general position assumptions. We also present a polynomial-time algorithm that, under mild general position assumptions, can learn a rich class of depth-three ReLU networks from queries. For instance, it can learn most networks where the number of first layer neurons is smaller than the dimension and the number of second layer neurons. These two results substantially improve state-of-the-art: Until our work, polynomial-time algorithms were only shown to learn from queries depth-two networks under the assumption that either the underlying distribution is Gaussian (Chen et al. (2021)) or that the weights matrix rows are linearly independent (Milli et al. (2019)). For depth three or more, there were no known poly-time results.
    Graph Embeddings via Tensor Products and Approximately Orthonormal Codes. (arXiv:2208.10917v3 [cs.SI] UPDATED)
    We introduce a method for embedding graphs as vectors in a structure-preserving manner, showcasing its rich representational capacity and giving some theoretical properties. Our procedure falls under the bind-and-sum approach, and we show that our binding operation - the tensor product - is the most general binding operation that respects the principle of superposition. We also establish some precise results characterizing the behavior of our method, and we show that our use of spherical codes achieves a packing upper bound. Then, we perform experiments showcasing our method's accuracy in various graph operations even when the number of edges is quite large. Finally, we establish a link to adjacency matrices, showing that our method is, in some sense, a generalization of adjacency matrices with applications towards large sparse graphs.
    How Does It Feel? Self-Supervised Costmap Learning for Off-Road Vehicle Traversability. (arXiv:2209.10788v2 [cs.RO] UPDATED)
    Estimating terrain traversability in off-road environments requires reasoning about complex interaction dynamics between the robot and these terrains. However, it is challenging to create informative labels to learn a model in a supervised manner for these interactions. We propose a method that learns to predict traversability costmaps by combining exteroceptive environmental information with proprioceptive terrain interaction feedback in a self-supervised manner. Additionally, we propose a novel way of incorporating robot velocity in the costmap prediction pipeline. We validate our method in multiple short and large-scale navigation tasks on challenging off-road terrains using two different large, all-terrain robots. Our short-scale navigation results show that using our learned costmaps leads to overall smoother navigation, and provides the robot with a more fine-grained understanding of the robot-terrain interactions. Our large-scale navigation trials show that we can reduce the number of interventions by up to 57\% compared to an occupancy-based navigation baseline in challenging off-road courses ranging from 400 m to 3150 m. Appendix and full experiment videos can be found in our website: https://mateoguaman.github.io/hdif.
    Deep Anatomical Federated Network (Dafne): an open client/server framework for the continuous collaborative improvement of deep-learning-based medical image segmentation. (arXiv:2302.06352v2 [eess.IV] UPDATED)
    Semantic segmentation is a crucial step to extract quantitative information from medical (and, specifically, radiological) images to aid the diagnostic process, clinical follow-up. and to generate biomarkers for clinical research. In recent years, machine learning algorithms have become the primary tool for this task. However, its real-world performance is heavily reliant on the comprehensiveness of training data. Dafne is the first decentralized, collaborative solution that implements continuously evolving deep learning models exploiting the collective knowledge of the users of the system. In the Dafne workflow, the result of each automated segmentation is refined by the user through an integrated interface, so that the new information is used to continuously expand the training pool via federated incremental learning. The models deployed through Dafne are able to improve their performance over time and to generalize to data types not seen in the training sets, thus becoming a viable and practical solution for real-life medical segmentation tasks.
    A Validity Perspective on Evaluating the Justified Use of Data-driven Decision-making Algorithms. (arXiv:2206.14983v2 [cs.LG] UPDATED)
    Recent research increasingly brings to question the appropriateness of using predictive tools in complex, real-world tasks. While a growing body of work has explored ways to improve value alignment in these tools, comparatively less work has centered concerns around the fundamental justifiability of using these tools. This work seeks to center validity considerations in deliberations around whether and how to build data-driven algorithms in high-stakes domains. Toward this end, we translate key concepts from validity theory to predictive algorithms. We apply the lens of validity to re-examine common challenges in problem formulation and data issues that jeopardize the justifiability of using predictive algorithms and connect these challenges to the social science discourse around validity. Our interdisciplinary exposition clarifies how these concepts apply to algorithmic decision making contexts. We demonstrate how these validity considerations could distill into a series of high-level questions intended to promote and document reflections on the legitimacy of the predictive task and the suitability of the data.
    DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models. (arXiv:2210.08933v3 [cs.CL] UPDATED)
    Recently, diffusion models have emerged as a new paradigm for generative models. Despite the success in domains using continuous signals such as vision and audio, adapting diffusion models to natural language is under-explored due to the discrete nature of texts, especially for conditional generation. We tackle this challenge by proposing DiffuSeq: a diffusion model designed for sequence-to-sequence (Seq2Seq) text generation tasks. Upon extensive evaluation over a wide range of Seq2Seq tasks, we find DiffuSeq achieving comparable or even better performance than six established baselines, including a state-of-the-art model that is based on pre-trained language models. Apart from quality, an intriguing property of DiffuSeq is its high diversity during generation, which is desired in many Seq2Seq tasks. We further include a theoretical analysis revealing the connection between DiffuSeq and autoregressive/non-autoregressive models. Bringing together theoretical analysis and empirical evidence, we demonstrate the great potential of diffusion models in complex conditional language generation tasks. Code is available at \url{https://github.com/Shark-NLP/DiffuSeq}
    Robot Learning on the Job: Human-in-the-Loop Autonomy and Learning During Deployment. (arXiv:2211.08416v2 [cs.RO] UPDATED)
    With the rapid growth of computing powers and recent advances in deep learning, we have witnessed impressive demonstrations of novel robot capabilities in research settings. Nonetheless, these learning systems exhibit brittle generalization and require excessive training data for practical tasks. To harness the capabilities of state-of-the-art robot learning models while embracing their imperfections, we present Sirius, a principled framework for humans and robots to collaborate through a division of work. In this framework, partially autonomous robots are tasked with handling a major portion of decision-making where they work reliably; meanwhile, human operators monitor the process and intervene in challenging situations. Such a human-robot team ensures safe deployments in complex tasks. Further, we introduce a new learning algorithm to improve the policy's performance on the data collected from the task executions. The core idea is re-weighing training samples with approximated human trust and optimizing the policies with weighted behavioral cloning. We evaluate Sirius in simulation and on real hardware, showing that Sirius consistently outperforms baselines over a collection of contact-rich manipulation tasks, achieving an 8% boost in simulation and 27% on real hardware than the state-of-the-art methods, with twice faster convergence and 85% memory size reduction. Videos and code are available at https://ut-austin-rpl.github.io/sirius/
    Automated Reachability Analysis of Neural Network-Controlled Systems via Adaptive Polytopes. (arXiv:2212.07553v2 [eess.SY] UPDATED)
    Over-approximating the reachable sets of dynamical systems is a fundamental problem in safety verification and robust control synthesis. The representation of these sets is a key factor that affects the computational complexity and the approximation error. In this paper, we develop a new approach for over-approximating the reachable sets of neural network dynamical systems using adaptive template polytopes. We use the singular value decomposition of linear layers along with the shape of the activation functions to adapt the geometry of the polytopes at each time step to the geometry of the true reachable sets. We then propose a branch-and-bound method to compute accurate over-approximations of the reachable sets by the inferred templates. We illustrate the utility of the proposed approach in the reachability analysis of linear systems driven by neural network controllers.
    Integrated Sensing and Communication from Learning Perspective: An SDP3 Approach. (arXiv:2107.09621v3 [cs.IT] UPDATED)
    Characterizing the sensing and communication performance tradeoff in integrated sensing and communication (ISAC) systems is challenging in the applications of learning-based human motion recognition. This is because of the large experimental datasets and the black-box nature of deep neural networks. This paper presents SDP3, a Simulation-Driven Performance Predictor and oPtimizer, which consists of SDP3 data simulator, SDP3 performance predictor and SDP3 performance optimizer. Specifically, the SDP3 data simulator generates vivid wireless sensing datasets in a virtual environment, the SDP3 performance predictor predicts the sensing performance based on the function regression method, and the SDP3 performance optimizer investigates the sensing and communication performance tradeoff analytically. It is shown that the simulated sensing dataset matches the experimental dataset very well in the motion recognition accuracy. By leveraging SDP3, it is found that the achievable region of recognition accuracy and communication throughput consists of a communication saturation zone, a sensing saturation zone, and a communication-sensing adversarial zone, of which the desired balanced performance for ISAC systems lies in the third one.
    A Macrocolumn Architecture Implemented with Spiking Neurons. (arXiv:2207.05081v2 [cs.NE] UPDATED)
    The macrocolumn is a key component of a neuromorphic computing system that interacts with an external environment under control of an agent. Environments are learned and stored in the macrocolumn as labeled directed graphs where edges connect features and labels indicate the relative displacements between them. Macrocolumn functionality is first defined with a state machine model. This model is then implemented with a neural network composed of spiking neurons. The neuron model employs active dendrites and mirrors the Hawkins/Numenta neuron model. The architecture is demonstrated with a research benchmark in which an agent employs a macrocolumn to first learn and then navigate 2-d environments containing pseudo-randomly placed features.
    RankMe: Assessing the downstream performance of pretrained self-supervised representations by their rank. (arXiv:2210.02885v2 [cs.LG] UPDATED)
    Joint-Embedding Self Supervised Learning (JE-SSL) has seen a rapid development, with the emergence of many method variations but only few principled guidelines that would help practitioners to successfully deploy them. The main reason for that pitfall comes from JE-SSL's core principle of not employing any input reconstruction therefore lacking visual cues of unsuccessful training. Adding non informative loss values to that, it becomes difficult to deploy SSL on a new dataset for which no labels can help to judge the quality of the learned representation. In this study, we develop a simple unsupervised criterion that is indicative of the quality of the learned JE-SSL representations: their effective rank. Albeit simple and computationally friendly, this method -- coined RankMe -- allows one to assess the performance of JE-SSL representations, even on different downstream datasets, without requiring any labels. A further benefit of RankMe is that it does not have any training or hyper-parameters to tune. Through thorough empirical experiments involving hundreds of training episodes, we demonstrate how RankMe can be used for hyperparameter selection with nearly no reduction in final performance compared to the current selection method that involve a dataset's labels. We hope that RankMe will facilitate the deployment of JE-SSL towards domains that do not have the opportunity to rely on labels for representations' quality assessment.
    Forget Unlearning: Towards True Data-Deletion in Machine Learning. (arXiv:2210.08911v2 [stat.ML] UPDATED)
    Unlearning algorithms aim to remove deleted data's influence from trained models at a cost lower than full retraining. However, prior guarantees of unlearning in literature are flawed and don't protect the privacy of deleted records. We show that when users delete their data as a function of published models, records in a database become interdependent. So, even retraining a fresh model after deletion of a record doesn't ensure its privacy. Secondly, unlearning algorithms that cache partial computations to speed up the processing can leak deleted information over a series of releases, violating the privacy of deleted records in the long run. To address these, we propose a sound deletion guarantee and show that the privacy of existing records is necessary for the privacy of deleted records. Under this notion, we propose an accurate, computationally efficient, and secure machine unlearning algorithm based on noisy gradient descent.
    Time-aware Random Walk Diffusion to Improve Dynamic Graph Learning. (arXiv:2211.01214v5 [cs.LG] UPDATED)
    How can we augment a dynamic graph for improving the performance of dynamic graph neural networks? Graph augmentation has been widely utilized to boost the learning performance of GNN-based models. However, most existing approaches only enhance spatial structure within an input static graph by transforming the graph, and do not consider dynamics caused by time such as temporal locality, i.e., recent edges are more influential than earlier ones, which remains challenging for dynamic graph augmentation. In this work, we propose TiaRa (Time-aware Random Walk Diffusion), a novel diffusion-based method for augmenting a dynamic graph represented as a discrete-time sequence of graph snapshots. For this purpose, we first design a time-aware random walk proximity so that a surfer can walk along the time dimension as well as edges, resulting in spatially and temporally localized scores. We then derive our diffusion matrices based on the time-aware random walk, and show they become enhanced adjacency matrices that both spatial and temporal localities are augmented. Throughout extensive experiments, we demonstrate that TiaRa effectively augments a given dynamic graph, and leads to significant improvements in dynamic GNN models for various graph datasets and tasks.
    Measuring incompatibility and clustering quantum observables with a quantum switch. (arXiv:2208.06210v2 [quant-ph] UPDATED)
    The existence of incompatible observables is a cornerstone of quantum mechanics and a valuable resource in quantum technologies. Here we introduce a measure of incompatibility, called the mutual eigenspace disturbance (MED), which quantifies the amount of disturbance induced by the measurement of a sharp observable on the eigenspaces of another. The MED provides a metric on the space of von Neumann measurements, and can be efficiently estimated by letting the measurement processes act in an indefinite order, using a setup known as the quantum switch, which also allows one to quantify the noncommutativity of arbitrary quantum processes. Thanks to these features, the MED can be used in quantum machine learning tasks. We demonstrate this application by providing an unsupervised algorithm that clusters unknown von Neumann measurements. Our algorithm is robust to noise can be used to identify groups of observers that share approximately the same measurement context.
    Fact-Saboteurs: A Taxonomy of Evidence Manipulation Attacks against Fact-Verification Systems. (arXiv:2209.03755v3 [cs.CR] UPDATED)
    Mis- and disinformation are a substantial global threat to our security and safety. To cope with the scale of online misinformation, researchers have been working on automating fact-checking by retrieving and verifying against relevant evidence. However, despite many advances, a comprehensive evaluation of the possible attack vectors against such systems is still lacking. Particularly, the automated fact-verification process might be vulnerable to the exact disinformation campaigns it is trying to combat. In this work, we assume an adversary that automatically tampers with the online evidence in order to disrupt the fact-checking model via camouflaging the relevant evidence or planting a misleading one. We first propose an exploratory taxonomy that spans these two targets and the different threat model dimensions. Guided by this, we design and propose several potential attack methods. We show that it is possible to subtly modify claim-salient snippets in the evidence and generate diverse and claim-aligned evidence. Thus, we highly degrade the fact-checking performance under many different permutations of the taxonomy's dimensions. The attacks are also robust against post-hoc modifications of the claim. Our analysis further hints at potential limitations in models' inference when faced with contradicting evidence. We emphasize that these attacks can have harmful implications on the inspectable and human-in-the-loop usage scenarios of such models, and conclude by discussing challenges and directions for future defenses.
    Parameter-Efficient Tuning with Special Token Adaptation. (arXiv:2210.04382v2 [cs.CL] UPDATED)
    Parameter-efficient tuning aims at updating only a small subset of parameters when adapting a pretrained model to downstream tasks. In this work, we introduce PASTA, in which we only modify the special token representations (e.g., [SEP] and [CLS] in BERT) before the self-attention module at each layer in Transformer-based models. PASTA achieves comparable performance to full finetuning in natural language understanding tasks including text classification and NER with up to only 0.029% of total parameters trained. Our work not only provides a simple yet effective way of parameter-efficient tuning, which has a wide range of practical applications when deploying finetuned models for multiple tasks, but also demonstrates the pivotal role of special tokens in pretrained language models
    Netizens, Academicians, and Information Professionals' Opinions About AI With Special Reference To ChatGPT. (arXiv:2302.07136v1 [cs.CY])
    This study aims to understand the perceptions and opinions of academicians towards ChatGPT-3 by collecting and analyzing social media comments, and a survey was conducted with library and information science professionals. The research uses a content analysis method and finds that while ChatGPT-3 can be a valuable tool for research and writing, it is not 100% accurate and should be cross-checked. The study also finds that while some academicians may not accept ChatGPT-3, most are starting to accept it. The study is beneficial for academicians, content developers, and librarians.
    Towards Interpretable Sleep Stage Classification Using Cross-Modal Transformers. (arXiv:2208.06991v2 [cs.LG] UPDATED)
    Accurate sleep stage classification is significant for sleep health assessment. In recent years, several machine-learning based sleep staging algorithms have been developed, and in particular, deep-learning based algorithms have achieved performance on par with human annotation. Despite the improved performance, a limitation of most deep-learning based algorithms is their black-box behavior, which has limited their use in clinical settings. Here, we propose a cross-modal transformer, which is a transformer-based method for sleep stage classification. The proposed cross-modal transformer consists of a novel cross-modal transformer encoder architecture along with a multi-scale one-dimensional convolutional neural network for automatic representation learning. Our method outperforms the state-of-the-art methods and eliminates the black-box behavior of deep-learning models by utilizing the interpretability aspect of the attention modules. Furthermore, our method provides considerable reductions in the number of parameters and training time compared to the state-of-the-art methods. Our code is available at https://github.com/Jathurshan0330/Cross-Modal-Transformer.
    Interpolation Learning With Minimum Description Length. (arXiv:2302.07263v1 [cs.LG])
    We prove that the Minimum Description Length learning rule exhibits tempered overfitting. We obtain tempered agnostic finite sample learning guarantees and characterize the asymptotic behavior in the presence of random label noise.
    Memorization-Dilation: Modeling Neural Collapse Under Label Noise. (arXiv:2206.05530v2 [cs.LG] UPDATED)
    The notion of neural collapse refers to several emergent phenomena that have been empirically observed across various canonical classification problems. During the terminal phase of training a deep neural network, the feature embedding of all examples of the same class tend to collapse to a single representation, and the features of different classes tend to separate as much as possible. Neural collapse is often studied through a simplified model, called the unconstrained feature representation, in which the model is assumed to have "infinite expressivity" and can map each data point to any arbitrary representation. In this work, we propose a more realistic variant of the unconstrained feature representation that takes the limited expressivity of the network into account. Empirical evidence suggests that the memorization of noisy data points leads to a degradation (dilation) of the neural collapse. Using a model of the memorization-dilation (M-D) phenomenon, we show one mechanism by which different losses lead to different performances of the trained network on noisy data. Our proofs reveal why label smoothing, a modification of cross-entropy empirically observed to produce a regularization effect, leads to improved generalization in classification tasks.
    Non-stationary Contextual Bandits and Universal Learning. (arXiv:2302.07186v1 [stat.ML])
    We study the fundamental limits of learning in contextual bandits, where a learner's rewards depend on their actions and a known context, which extends the canonical multi-armed bandit to the case where side-information is available. We are interested in universally consistent algorithms, which achieve sublinear regret compared to any measurable fixed policy, without any function class restriction. For stationary contextual bandits, when the underlying reward mechanism is time-invariant, [Blanchard et al.] characterized learnable context processes for which universal consistency is achievable; and further gave algorithms ensuring universal consistency whenever this is achievable, a property known as optimistic universal consistency. It is well understood, however, that reward mechanisms can evolve over time, possibly depending on the learner's actions. We show that optimistic universal learning for non-stationary contextual bandits is impossible in general, contrary to all previously studied settings in online learning -- including standard supervised learning. We also give necessary and sufficient conditions for universal learning under various non-stationarity models, including online and adversarial reward mechanisms. In particular, the set of learnable processes for non-stationary rewards is still extremely general -- larger than i.i.d., stationary or ergodic -- but in general strictly smaller than that for supervised learning or stationary contextual bandits, shedding light on new non-stationary phenomena.
    Contrastive Multimodal Learning for Emergence of Graphical Sensory-Motor Communication. (arXiv:2210.06468v2 [cs.AI] UPDATED)
    In this paper, we investigate whether artificial agents can develop a shared language in an ecological setting where communication relies on a sensory-motor channel. To this end, we introduce the Graphical Referential Game (GREG) where a speaker must produce a graphical utterance to name a visual referent object while a listener has to select the corresponding object among distractor referents, given the delivered message. The utterances are drawing images produced using dynamical motor primitives combined with a sketching library. To tackle GREG we present CURVES: a multimodal contrastive deep learning mechanism that represents the energy (alignment) between named referents and utterances generated through gradient ascent on the learned energy landscape. We demonstrate that CURVES not only succeeds at solving the GREG but also enables agents to self-organize a language that generalizes to feature compositions never seen during training. In addition to evaluating the communication performance of our approach, we also explore the structure of the emerging language. Specifically, we show that the resulting language forms a coherent lexicon shared between agents and that basic compositional rules on the graphical productions could not explain the compositional generalization.
    Solution Path Algorithm for Twin Multi-class Support Vector Machine. (arXiv:2006.00276v2 [cs.LG] UPDATED)
    The twin support vector machine and its extensions have made great achievements in dealing with binary classification problems. However, it suffers from difficulties in effective solution of multi-classification and fast model selection. This work devotes to the fast regularization parameter tuning algorithm for the twin multi-class support vector machine. Specifically, a novel sample data set partition strategy is first adopted, which is the basis for the model construction. Then, combining the linear equations and block matrix theory, the Lagrangian multipliers are proved to be piecewise linear w.r.t. the regularization parameters, so that the regularization parameters are continuously updated by only solving the break points. Next, Lagrangian multipliers are proved to be 1 as the regularization parameter approaches infinity, thus, a simple yet effective initialization algorithm is devised. Finally, eight kinds of events are defined to seek for the starting event for the next iteration. Extensive experimental results on nine UCI data sets show that the proposed method can achieve comparable classification performance without solving any quadratic programming problem.  ( 2 min )
    Private Statistical Estimation of Many Quantiles. (arXiv:2302.06943v1 [stat.ML])
    This work studies the estimation of many statistical quantiles under differential privacy. More precisely, given a distribution and access to i.i.d. samples from it, we study the estimation of the inverse of its cumulative distribution function (the quantile function) at specific points. For instance, this task is of key importance in private data generation. We present two different approaches. The first one consists in privately estimating the empirical quantiles of the samples and using this result as an estimator of the quantiles of the distribution. In particular, we study the statistical properties of the recently published algorithm introduced by Kaplan et al. 2022 that privately estimates the quantiles recursively. The second approach is to use techniques of density estimation in order to uniformly estimate the quantile function on an interval. In particular, we show that there is a tradeoff between the two methods. When we want to estimate many quantiles, it is better to estimate the density rather than estimating the quantile function at specific points.  ( 2 min )
    PrefixMol: Target- and Chemistry-aware Molecule Design via Prefix Embedding. (arXiv:2302.07120v1 [cs.AI])
    Is there a unified model for generating molecules considering different conditions, such as binding pockets and chemical properties? Although target-aware generative models have made significant advances in drug design, they do not consider chemistry conditions and cannot guarantee the desired chemical properties. Unfortunately, merging the target-aware and chemical-aware models into a unified model to meet customized requirements may lead to the problem of negative transfer. Inspired by the success of multi-task learning in the NLP area, we use prefix embeddings to provide a novel generative model that considers both the targeted pocket's circumstances and a variety of chemical properties. All conditional information is represented as learnable features, which the generative model subsequently employs as a contextual prompt. Experiments show that our model exhibits good controllability in both single and multi-conditional molecular generation. The controllability enables us to outperform previous structure-based drug design methods. More interestingly, we open up the attention mechanism and reveal coupling relationships between conditions, providing guidance for multi-conditional molecule generation.
    Achieving on-Mobile Real-Time Super-Resolution with Neural Architecture and Pruning Search. (arXiv:2108.08910v2 [eess.IV] UPDATED)
    Though recent years have witnessed remarkable progress in single image super-resolution (SISR) tasks with the prosperous development of deep neural networks (DNNs), the deep learning methods are confronted with the computation and memory consumption issues in practice, especially for resource-limited platforms such as mobile devices. To overcome the challenge and facilitate the real-time deployment of SISR tasks on mobile, we combine neural architecture search with pruning search and propose an automatic search framework that derives sparse super-resolution (SR) models with high image quality while satisfying the real-time inference requirement. To decrease the search cost, we leverage the weight sharing strategy by introducing a supernet and decouple the search problem into three stages, including supernet construction, compiler-aware architecture and pruning search, and compiler-aware pruning ratio search. With the proposed framework, we are the first to achieve real-time SR inference (with only tens of milliseconds per frame) for implementing 720p resolution with competitive image quality (in terms of PSNR and SSIM) on mobile platforms (Samsung Galaxy S20).
    Automatic Segmentation of Aircraft Dents in Point Clouds. (arXiv:2205.01614v2 [cs.CV] UPDATED)
    Dents on the aircraft skin are frequent and may easily go undetected during airworthiness checks, as their inspection process is tedious and extremely subject to human factors and environmental conditions. Nowadays, 3D scanning technologies are being proposed for more reliable, human-independent measurements, yet the process of inspection and reporting remains laborious and time consuming because data acquisition and validation are still carried out by the engineer. For full automation of dent inspection, the acquired point cloud data must be analysed via a reliable segmentation algorithm, releasing humans from the search and evaluation of damage. This paper reports on two developments towards automated dent inspection. The first is a method to generate a synthetic dataset of dented surfaces to train a fully convolutional neural network. The training of machine learning algorithms needs a substantial volume of dent data, which is not readily available. Dents are thus simulated in random positions and shapes, within criteria and definitions of a Boeing 737 structural repair manual. The noise distribution from the scanning apparatus is then added to reflect the complete process of 3D point acquisition on the training. The second proposition is a surface fitting strategy to convert 3D point clouds to 2.5D. This allows higher resolution point clouds to be processed with a small amount of memory compared with state-of-the-art methods involving 3D sampling approaches. Simulations with available ground truth data show that the proposed technique reaches an intersection-over-union of over 80%. Experiments over dent samples prove an effective detection of dents with a speed of over 500 000 points per second.
    Robust Deep Reinforcement Learning through Regret Neighborhoods. (arXiv:2302.06912v1 [cs.LG])
    Deep Reinforcement Learning (DRL) policies have been shown to be vulnerable to small adversarial noise in observations. Such adversarial noise can have disastrous consequences in safety-critical environments. For instance, a self-driving car receiving adversarially perturbed sensory observations about nearby signs (e.g., a stop sign physically altered to be perceived as a speed limit sign) or objects (e.g., cars altered to be recognized as trees) can be fatal. Existing approaches for making RL algorithms robust to an observation-perturbing adversary have focused on reactive approaches that iteratively improve against adversarial examples generated at each iteration. While such approaches have been shown to provide improvements over regular RL methods, they are reactive and can fare significantly worse if certain categories of adversarial examples are not generated during training. To that end, we pursue a more proactive approach that relies on directly optimizing a well-studied robustness measure, regret instead of expected value. We provide a principled approach that minimizes maximum regret over a "neighborhood" of observations to the received "observation". Our regret criterion can be used to modify existing value- and policy-based Deep RL methods. We demonstrate that our approaches provide a significant improvement in performance across a wide variety of benchmarks against leading approaches for robust Deep RL.
    SOAR: Simultaneous Or of And Rules for Classification of Positive & Negative Classes. (arXiv:2008.11249v3 [stat.ML] UPDATED)
    Algorithmic decision making has proliferated and now impacts our daily lives in both mundane and consequential ways. Machine learning practitioners make use of a myriad of algorithms for predictive models in applications as diverse as movie recommendations, medical diagnoses, and parole recommendations without delving into the reasons driving specific predictive decisions. Machine learning algorithms in such applications are often chosen for their superior performance, however popular choices such as random forest and deep neural networks fail to provide an interpretable understanding of the predictive model. In recent years, rule-based algorithms have been used to address this issue. Wang et al. (2017) presented an or-of-and (disjunctive normal form) based classification technique that allows for classification rule mining of a single class in a binary classification; this method is also shown to perform comparably to other modern algorithms. In this work, we extend this idea to provide classification rules for both classes simultaneously. That is, we provide a distinct set of rules for both positive and negative classes. In describing this approach, we also present a novel and complete taxonomy of classifications that clearly capture and quantify the inherent ambiguity in noisy binary classifications in the real world. We show that this approach leads to a more granular formulation of the likelihood model and a simulated-annealing based optimization achieves classification performance competitive with comparable techniques. We apply our method to synthetic as well as real world data sets to compare with other related methods that demonstrate the utility of our proposal.
    A Sparse Graph-Structured Lasso Mixed Model for Genetic Association with Confounding Correction. (arXiv:1711.04162v2 [cs.LG] UPDATED)
    While linear mixed model (LMM) has shown a competitive performance in correcting spurious associations raised by population stratification, family structures, and cryptic relatedness, more challenges are still to be addressed regarding the complex structure of genotypic and phenotypic data. For example, geneticists have discovered that some clusters of phenotypes are more co-expressed than others. Hence, a joint analysis that can utilize such relatedness information in a heterogeneous data set is crucial for genetic modeling. We proposed the sparse graph-structured linear mixed model (sGLMM) that can incorporate the relatedness information from traits in a dataset with confounding correction. Our method is capable of uncovering the genetic associations of a large number of phenotypes together while considering the relatedness of these phenotypes. Through extensive simulation experiments, we show that the proposed model outperforms other existing approaches and can model correlation from both population structure and shared signals. Further, we validate the effectiveness of sGLMM in the real-world genomic dataset on two different species from plants and humans. In Arabidopsis thaliana data, sGLMM behaves better than all other baseline models for 63.4% traits. We also discuss the potential causal genetic variation of Human Alzheimer's disease discovered by our model and justify some of the most important genetic loci.
    Bridge the Gap Between CV and NLP! An Optimization-based Textual Adversarial Attack Framework. (arXiv:2110.15317v3 [cs.CL] UPDATED)
    Despite recent success on various tasks, deep learning techniques still perform poorly on adversarial examples with small perturbations. While optimization-based methods for adversarial attacks are well-explored in the field of computer vision, it is impractical to directly apply them in natural language processing due to the discrete nature of the text. To address the problem, we propose a unified framework to extend the existing optimization-based adversarial attack methods in the vision domain to craft textual adversarial samples. In this framework, continuously optimized perturbations are added to the embedding layer and amplified in the forward propagation process. Then the final perturbed latent representations are decoded with a masked language model head to obtain potential adversarial samples. In this paper, we instantiate our framework with an attack algorithm named Textual Projected Gradient Descent (T-PGD). We find our algorithm effective even using proxy gradient information. Therefore, we perform the more challenging transfer black-box attack and conduct comprehensive experiments to evaluate our attack algorithm with several models on three benchmark datasets. Experimental results demonstrate that our method achieves an overall better performance and produces more fluent and grammatical adversarial samples compared to strong baseline methods. All the code and data will be made public.
    CoTV: Cooperative Control for Traffic Light Signals and Connected Autonomous Vehicles using Deep Reinforcement Learning. (arXiv:2201.13143v2 [cs.AI] UPDATED)
    The target of reducing travel time only is insufficient to support the development of future smart transportation systems. To align with the United Nations Sustainable Development Goals (UN-SDG), a further reduction of fuel and emissions, improvements of traffic safety, and the ease of infrastructure deployment and maintenance should also be considered. Different from existing work focusing on the optimization of the control in either traffic light signal (to improve the intersection throughput), or vehicle speed (to stabilize the traffic), this paper presents a multi-agent Deep Reinforcement Learning (DRL) system called CoTV, which Cooperatively controls both Traffic light signals and Connected Autonomous Vehicles (CAV). Therefore, our CoTV can well balance the achievement of the reduction of travel time, fuel, and emissions. In the meantime, CoTV can also be easy to deploy by cooperating with only one CAV that is the nearest to the traffic light controller on each incoming road. This enables more efficient coordination between traffic light controllers and CAV, thus leading to the convergence of training CoTV under the large-scale multi-agent scenario that is traditionally difficult to converge. We give the detailed system design of CoTV and demonstrate its effectiveness in a simulation study using SUMO under various grid maps and realistic urban scenarios with mixed-autonomy traffic.
    Cauchy Loss Function: Robustness Under Gaussian and Cauchy Noise. (arXiv:2302.07238v1 [cs.LG])
    In supervised machine learning, the choice of loss function implicitly assumes a particular noise distribution over the data. For example, the frequently used mean squared error (MSE) loss assumes a Gaussian noise distribution. The choice of loss function during training and testing affects the performance of artificial neural networks (ANNs). It is known that MSE may yield substandard performance in the presence of outliers. The Cauchy loss function (CLF) assumes a Cauchy noise distribution, and is therefore potentially better suited for data with outliers. This papers aims to determine the extent of robustness and generalisability of the CLF as compared to MSE. CLF and MSE are assessed on a few handcrafted regression problems, and a real-world regression problem with artificially simulated outliers, in the context of ANN training. CLF yielded results that were either comparable to or better than the results yielded by MSE, with a few notable exceptions.
    Learning Graph ARMA Processes from Time-Vertex Spectra. (arXiv:2302.06887v1 [stat.ML])
    The modeling of time-varying graph signals as stationary time-vertex stochastic processes permits the inference of missing signal values by efficiently employing the correlation patterns of the process across different graph nodes and time instants. In this study, we first propose an algorithm for computing graph autoregressive moving average (graph ARMA) processes based on learning the joint time-vertex power spectral density of the process from its incomplete realizations. Our solution relies on first roughly estimating the joint spectrum of the process from partially observed realizations and then refining this estimate by projecting it onto the spectrum manifold of the ARMA process. We then present a theoretical analysis of the sample complexity of learning graph ARMA processes. Experimental results show that the proposed approach achieves improvement in the time-vertex signal estimation performance in comparison with reference approaches in the literature.
    Stochastic Modified Flows, Mean-Field Limits and Dynamics of Stochastic Gradient Descent. (arXiv:2302.07125v1 [math.PR])
    We propose new limiting dynamics for stochastic gradient descent in the small learning rate regime called stochastic modified flows. These SDEs are driven by a cylindrical Brownian motion and improve the so-called stochastic modified equations by having regular diffusion coefficients and by matching the multi-point statistics. As a second contribution, we introduce distribution dependent stochastic modified flows which we prove to describe the fluctuating limiting dynamics of stochastic gradient descent in the small learning rate - infinite width scaling regime.
    Tetris-inspired detector with neural network for radiation mapping. (arXiv:2302.07099v1 [physics.ins-det])
    In recent years, radiation mapping has attracted widespread research attention and increased public concerns on environmental monitoring. In terms of both materials and their configurations, radiation detectors have been developed to locate the directions and positions of the radiation sources. In this process, algorithm is essential in converting detector signals to radiation source information. However, due to the complex mechanisms of radiation-matter interaction and the current limitation of data collection, high-performance, low-cost radiation mapping is still challenging. Here we present a computational framework using Tetris-inspired detector pixels and machine learning for radiation mapping. Using inter-pixel padding to increase the contrast between pixels and neural network to analyze the detector readings, a detector with as few as four pixels can achieve high-resolution directional mapping. By further imposing Maximum a Posteriori (MAP) with a moving detector, further radiation position localization is achieved. Non-square, Tetris-shaped detector can further improve performance beyond the conventional grid-shaped detector. Our framework offers a new avenue for high quality radiation mapping with least number of detector pixels possible, and is anticipated to be capable to deploy for real-world radiation detection with moderate validation.
    Energy Transformer. (arXiv:2302.07253v1 [cs.LG])
    Transformers have become the de facto models of choice in machine learning, typically leading to impressive performance on many applications. At the same time, the architectural development in the transformer world is mostly driven by empirical findings, and the theoretical understanding of their architectural building blocks is rather limited. In contrast, Dense Associative Memory models or Modern Hopfield Networks have a well-established theoretical foundation, but have not yet demonstrated truly impressive practical results. We propose a transformer architecture that replaces the sequence of feedforward transformer blocks with a single large Associative Memory model. Our novel architecture, called Energy Transformer (or ET for short), has many of the familiar architectural primitives that are often used in the current generation of transformers. However, it is not identical to the existing architectures. The sequence of transformer layers in ET is purposely designed to minimize a specifically engineered energy function, which is responsible for representing the relationships between the tokens. As a consequence of this computational principle, the attention in ET is different from the conventional attention mechanism. In this work, we introduce the theoretical foundations of ET, explore it's empirical capabilities using the image completion task, and obtain strong quantitative results on the graph anomaly detection task.
    Where to Diffuse, How to Diffuse, and How to Get Back: Automated Learning for Multivariate Diffusions. (arXiv:2302.07261v1 [cs.LG])
    Diffusion-based generative models (DBGMs) perturb data to a target noise distribution and reverse this inference diffusion process to generate samples. The choice of inference diffusion affects both likelihoods and sample quality. For example, extending the inference process with auxiliary variables leads to improved sample quality. While there are many such multivariate diffusions to explore, each new one requires significant model-specific analysis, hindering rapid prototyping and evaluation. In this work, we study Multivariate Diffusion Models (MDMs). For any number of auxiliary variables, we provide a recipe for maximizing a lower-bound on the MDMs likelihood without requiring any model-specific analysis. We then demonstrate how to parameterize the diffusion for a specified target noise distribution; these two points together enable optimizing the inference diffusion process. Optimizing the diffusion expands easy experimentation from just a few well-known processes to an automatic search over all linear diffusions. To demonstrate these ideas, we introduce two new specific diffusions as well as learn a diffusion process on the MNIST, CIFAR10, and ImageNet32 datasets. We show learned MDMs match or surpass bits-per-dims (BPDs) relative to fixed choices of diffusions for a given dataset and model architecture.
    Scalable Bayesian optimization with high-dimensional outputs using randomized prior networks. (arXiv:2302.07260v1 [cs.LG])
    Several fundamental problems in science and engineering consist of global optimization tasks involving unknown high-dimensional (black-box) functions that map a set of controllable variables to the outcomes of an expensive experiment. Bayesian Optimization (BO) techniques are known to be effective in tackling global optimization problems using a relatively small number objective function evaluations, but their performance suffers when dealing with high-dimensional outputs. To overcome the major challenge of dimensionality, here we propose a deep learning framework for BO and sequential decision making based on bootstrapped ensembles of neural architectures with randomized priors. Using appropriate architecture choices, we show that the proposed framework can approximate functional relationships between design variables and quantities of interest, even in cases where the latter take values in high-dimensional vector spaces or even infinite-dimensional function spaces. In the context of BO, we augmented the proposed probabilistic surrogates with re-parameterized Monte Carlo approximations of multiple-point (parallel) acquisition functions, as well as methodological extensions for accommodating black-box constraints and multi-fidelity information sources. We test the proposed framework against state-of-the-art methods for BO and demonstrate superior performance across several challenging tasks with high-dimensional outputs, including a constrained optimization task involving shape optimization of rotor blades in turbo-machinery.
    Randomization for adversarial robustness: the Good, the Bad and the Ugly. (arXiv:2302.07221v1 [cs.LG])
    Deep neural networks are known to be vulnerable to adversarial attacks: A small perturbation that is imperceptible to a human can easily make a well-trained deep neural network misclassify. To defend against adversarial attacks, randomized classifiers have been proposed as a robust alternative to deterministic ones. In this work we show that in the binary classification setting, for any randomized classifier, there is always a deterministic classifier with better adversarial risk. In other words, randomization is not necessary for robustness. In many common randomization schemes, the deterministic classifiers with better risk are explicitly described: For example, we show that ensembles of classifiers are more robust than mixtures of classifiers, and randomized smoothing is more robust than input noise injection. Finally, experiments confirm our theoretical results with the two families of randomized classifiers we analyze.
    Accelerated Fuzzy C-Means Clustering Based on New Affinity Filtering and Membership Scaling. (arXiv:2302.07060v1 [cs.LG])
    Fuzzy C-Means (FCM) is a widely used clustering method. However, FCM and its many accelerated variants have low efficiency in the mid-to-late stage of the clustering process. In this stage, all samples are involved in the update of their non-affinity centers, and the fuzzy membership grades of the most of samples, whose assignment is unchanged, are still updated by calculating the samples-centers distances. All those lead to the algorithms converging slowly. In this paper, a new affinity filtering technique is developed to recognize a complete set of the non-affinity centers for each sample with low computations. Then, a new membership scaling technique is suggested to set the membership grades between each sample and its non-affinity centers to 0 and maintain the fuzzy membership grades for others. By integrating those two techniques, FCM based on new affinity filtering and membership scaling (AMFCM) is proposed to accelerate the whole convergence process of FCM. Many experimental results performed on synthetic and real-world data sets have shown the feasibility and efficiency of the proposed algorithm. Compared with the state-of-the-art algorithms, AMFCM is significantly faster and more effective. For example, AMFCM reduces the number of the iteration of FCM by 80% on average.
    The Impact of Twitter Sentiments on Stock Market Trends. (arXiv:2302.07244v1 [cs.LG])
    The Web is a vast virtual space where people can share their opinions, impacting all aspects of life and having implications for marketing and communication. The most up-to-date and comprehensive information can be found on social media because of how widespread and straightforward it is to post a message. Proportionately, they are regarded as a valuable resource for making precise market predictions. In particular, Twitter has developed into a potent tool for understanding user sentiment. This article examines how well tweets can influence stock symbol trends. We analyze the volume, sentiment, and mentions of the top five stock symbols in the S&P 500 index on Twitter over three months. Long Short-Term Memory, Bernoulli Na\"ive Bayes, and Random Forest were the three algorithms implemented in this process. Our study revealed a significant correlation between stock prices and Twitter sentiment.
    Discrete fully probabilistic design: towards a control pipeline for the synthesis of policies from examples. (arXiv:2112.11210v2 [eess.SY] UPDATED)
    We present the principled design of a control pipeline for the synthesis of policies from examples data. The pipeline, based on a discretized design which we term as discrete fully probabilistic design, expounds an algorithm recently introduced in Gagliardi and Russo (2021) to synthesize policies from examples for constrained, stochastic and nonlinear systems. Contrary to other approaches, the pipeline we present: (i) does not need the constraints to be fulfilled in the possibly noisy example data; (ii) enables control synthesis even when the data are collected from an example system that is different from the one under control. The design is benchmarked numerically on an example that involves controlling an inverted pendulum with actuation constraints starting from data collected from a physically different pendulum that does not satisfy the system-specific actuation constraints. We also make our fully documented code openly available.
    Universal Guidance for Diffusion Models. (arXiv:2302.07121v1 [cs.CV])
    Typical diffusion models are trained to accept a particular form of conditioning, most commonly text, and cannot be conditioned on other modalities without retraining. In this work, we propose a universal guidance algorithm that enables diffusion models to be controlled by arbitrary guidance modalities without the need to retrain any use-specific components. We show that our algorithm successfully generates quality images with guidance functions including segmentation, face recognition, object detection, and classifier signals. Code is available at https://github.com/arpitbansal297/Universal-Guided-Diffusion.
    Lessons from the Development of an Anomaly Detection Interface on the Mars Perseverance Rover using the ISHMAP Framework. (arXiv:2302.07187v1 [cs.HC])
    While anomaly detection stands among the most important and valuable problems across many scientific domains, anomaly detection research often focuses on AI methods that can lack the nuance and interpretability so critical to conducting scientific inquiry. In this application paper we present the results of utilizing an alternative approach that situates the mathematical framing of machine learning based anomaly detection within a participatory design framework. In a collaboration with NASA scientists working with the PIXL instrument studying Martian planetary geochemistry as a part of the search for extra-terrestrial life; we report on over 18 months of in-context user research and co-design to define the key problems NASA scientists face when looking to detect and interpret spectral anomalies. We address these problems and develop a novel spectral anomaly detection toolkit for PIXL scientists that is highly accurate while maintaining strong transparency to scientific interpretation. We also describe outcomes from a yearlong field deployment of the algorithm and associated interface. Finally we introduce a new design framework which we developed through the course of this collaboration for co-creating anomaly detection algorithms: Iterative Semantic Heuristic Modeling of Anomalous Phenomena (ISHMAP), which provides a process for scientists and researchers to produce natively interpretable anomaly detection models. This work showcases an example of successfully bridging methodologies from AI and HCI within a scientific domain, and provides a resource in ISHMAP which may be used by other researchers and practitioners looking to partner with other scientific teams to achieve better science through more effective and interpretable anomaly detection tools.
    Multi-Prototypes Convex Merging Based K-Means Clustering Algorithm. (arXiv:2302.07045v1 [cs.LG])
    K-Means algorithm is a popular clustering method. However, it has two limitations: 1) it gets stuck easily in spurious local minima, and 2) the number of clusters k has to be given a priori. To solve these two issues, a multi-prototypes convex merging based K-Means clustering algorithm (MCKM) is presented. First, based on the structure of the spurious local minima of the K-Means problem, a multi-prototypes sampling (MPS) is designed to select the appropriate number of multi-prototypes for data with arbitrary shapes. A theoretical proof is given to guarantee that the multi-prototypes selected by MPS can achieve a constant factor approximation to the optimal cost of the K-Means problem. Then, a merging technique, called convex merging (CM), merges the multi-prototypes to get a better local minima without k being given a priori. Specifically, CM can obtain the optimal merging and estimate the correct k. By integrating these two techniques with K-Means algorithm, the proposed MCKM is an efficient and explainable clustering algorithm for escaping the undesirable local minima of K-Means problem without given k first. Experimental results performed on synthetic and real-world data sets have verified the effectiveness of the proposed algorithm.  ( 2 min )
    A Complete Expressiveness Hierarchy for Subgraph GNNs via Subgraph Weisfeiler-Lehman Tests. (arXiv:2302.07090v1 [cs.LG])
    Recently, subgraph GNNs have emerged as an important direction for developing expressive graph neural networks (GNNs). While numerous architectures have been proposed, so far there is still a limited understanding of how various design paradigms differ in terms of expressive power, nor is it clear what design principle achieves maximal expressiveness with minimal architectural complexity. Targeting these fundamental questions, this paper conducts a systematic study of general node-based subgraph GNNs through the lens of Subgraph Weisfeiler-Lehman Tests (SWL). Our central result is to build a complete hierarchy of SWL with strictly growing expressivity. Concretely, we prove that any node-based subgraph GNN falls into one of the six SWL equivalence classes, among which $\mathsf{SSWL}$ achieves the maximal expressive power. We also study how these equivalence classes differ in terms of their practical expressiveness such as encoding graph distance and biconnectivity. In addition, we give a tight expressivity upper bound of all SWL algorithms by establishing a close relation with localized versions of Folklore WL tests (FWL). Overall, our results provide insights into the power of existing subgraph GNNs, guide the design of new architectures, and point out their limitations by revealing an inherent gap with the 2-FWL test. Finally, experiments on the ZINC benchmark demonstrate that $\mathsf{SSWL}$-inspired subgraph GNNs can significantly outperform prior architectures despite great simplicity.
    Data pruning and neural scaling laws: fundamental limitations of score-based algorithms. (arXiv:2302.06960v1 [stat.ML])
    Data pruning algorithms are commonly used to reduce the memory and computational cost of the optimization process. Recent empirical results reveal that random data pruning remains a strong baseline and outperforms most existing data pruning methods in the high compression regime, i.e., where a fraction of $30\%$ or less of the data is kept. This regime has recently attracted a lot of interest as a result of the role of data pruning in improving the so-called neural scaling laws; in [Sorscher et al.], the authors showed the need for high-quality data pruning algorithms in order to beat the sample power law. In this work, we focus on score-based data pruning algorithms and show theoretically and empirically why such algorithms fail in the high compression regime. We demonstrate ``No Free Lunch" theorems for data pruning and present calibration protocols that enhance the performance of existing pruning algorithms in this high compression regime using randomization.
    Solar Wind Speed Estimate with Machine Learning Ensemble Models for LISA. (arXiv:2302.06740v1 [astro-ph.HE])
    In this work we study the potentialities of machine learning models in reconstructing the solar wind speed observations gathered in the first Lagrangian point by the ACE satellite in 2016--2017 using as input data galactic cosmic-ray flux variations measured with particle detectors hosted onboard the LISA Pathfinder mission also orbiting around L1 during the same years. We show that ensemble models composed of heterogeneous weak regressors are able to outperform weak regressors in terms of predictive accuracy. Machine learning and other powerful predictive algorithms open a window on the possibility of substituting dedicated instrumentation with software models acting as surrogates for diagnostics of space missions such as LISA and space weather science.
    Effective Dimension in Bandit Problems under Censorship. (arXiv:2302.06916v1 [cs.LG])
    In this paper, we study both multi-armed and contextual bandit problems in censored environments. Our goal is to estimate the performance loss due to censorship in the context of classical algorithms designed for uncensored environments. Our main contributions include the introduction of a broad class of censorship models and their analysis in terms of the effective dimension of the problem -- a natural measure of its underlying statistical complexity and main driver of the regret bound. In particular, the effective dimension allows us to maintain the structure of the original problem at first order, while embedding it in a bigger space, and thus naturally leads to results analogous to uncensored settings. Our analysis involves a continuous generalization of the Elliptical Potential Inequality, which we believe is of independent interest. We also discover an interesting property of decision-making under censorship: a transient phase during which initial misspecification of censorship is self-corrected at an extra cost, followed by a stationary phase that reflects the inherent slowdown of learning governed by the effective dimension. Our results are useful for applications of sequential decision-making models where the feedback received depends on strategic uncertainty (e.g., agents' willingness to follow a recommendation) and/or random uncertainty (e.g., loss or delay in arrival of information).
    SCONNA: A Stochastic Computing Based Optical Accelerator for Ultra-Fast, Energy-Efficient Inference of Integer-Quantized CNNs. (arXiv:2302.07036v1 [cs.AR])
    The acceleration of a CNN inference task uses convolution operations that are typically transformed into vector-dot-product (VDP) operations. Several photonic microring resonators (MRRs) based hardware architectures have been proposed to accelerate integer-quantized CNNs with remarkably higher throughput and energy efficiency compared to their electronic counterparts. However, the existing photonic MRR-based analog accelerators exhibit a very strong trade-off between the achievable input/weight precision and VDP operation size, which severely restricts their achievable VDP operation size for the quantized input/weight precision of 4 bits and higher. The restricted VDP operation size ultimately suppresses computing throughput to severely diminish the achievable performance benefits. To address this shortcoming, we for the first time present a merger of stochastic computing and MRR-based CNN accelerators. To leverage the innate precision flexibility of stochastic computing, we invent an MRR-based optical stochastic multiplier (OSM). We employ multiple OSMs in a cascaded manner using dense wavelength division multiplexing, to forge a novel Stochastic Computing based Optical Neural Network Accelerator (SCONNA). SCONNA achieves significantly high throughput and energy efficiency for accelerating inferences of high-precision quantized CNNs. Our evaluation for the inference of four modern CNNs at 8-bit input/weight precision indicates that SCONNA provides improvements of up to 66.5x, 90x, and 91x in frames-per-second (FPS), FPS/W and FPS/W/mm2, respectively, on average over two photonic MRR-based analog CNN accelerators from prior work, with Top-1 accuracy drop of only up to 0.4% for large CNNs and up to 1.5% for small CNNs. We developed a transaction-level, event-driven python-based simulator for the evaluation of SCONNA and other accelerators (https://github.com/uky-UCAT/SC_ONN_SIM.git).
    ALDI++: Automatic and parameter-less discord and outlier detection for building energy load profiles. (arXiv:2203.06618v2 [cs.LG] UPDATED)
    Data-driven building energy prediction is an integral part of the process for measurement and verification, building benchmarking, and building-to-grid interaction. The ASHRAE Great Energy Predictor III (GEPIII) machine learning competition used an extensive meter data set to crowdsource the most accurate machine learning workflow for whole building energy prediction. A significant component of the winning solutions was the pre-processing phase to remove anomalous training data. Contemporary pre-processing methods focus on filtering statistical threshold values or deep learning methods requiring training data and multiple hyper-parameters. A recent method named ALDI (Automated Load profile Discord Identification) managed to identify these discords using matrix profile, but the technique still requires user-defined parameters. We develop ALDI++, a method based on the previous work that bypasses user-defined parameters and takes advantage of discord similarity. We evaluate ALDI++ against a statistical threshold, variational auto-encoder, and the original ALDI as baselines in classifying discords and energy forecasting scenarios. Our results demonstrate that while the classification performance improvement over the original method is marginal, ALDI++ helps achieve the best forecasting error improving 6% over the winning's team approach with six times less computation time.
    SpeckleNN: A unified embedding for real-time speckle pattern classification in X-ray single-particle imaging with limited labeled examples. (arXiv:2302.06895v1 [cs.LG])
    With X-ray free-electron lasers (XFELs), it is possible to determine the three-dimensional structure of noncrystalline nanoscale particles using X-ray single-particle imaging (SPI) techniques at room temperature. Classifying SPI scattering patterns, or "speckles", to extract single hits that are needed for real-time vetoing and three-dimensional reconstruction poses a challenge for high data rate facilities like European XFEL and LCLS-II-HE. Here, we introduce SpeckleNN, a unified embedding model for real-time speckle pattern classification with limited labeled examples that can scale linearly with dataset size. Trained with twin neural networks, SpeckleNN maps speckle patterns to a unified embedding vector space, where similarity is measured by Euclidean distance. We highlight its few-shot classification capability on new never-seen samples and its robust performance despite only tens of labels per classification category even in the presence of substantial missing detector areas. Without the need for excessive manual labeling or even a full detector image, our classification method offers a great solution for real-time high-throughput SPI experiments.
    Do Neural Networks Generalize from Self-Averaging Sub-classifiers in the Same Way As Adaptive Boosting?. (arXiv:2302.06923v1 [cs.LG])
    In recent years, neural networks (NNs) have made giant leaps in a wide variety of domains. NNs are often referred to as black box algorithms due to how little we can explain their empirical success. Our foundational research seeks to explain why neural networks generalize. A recent advancement derived a mutual information measure for explaining the performance of deep NNs through a sequence of increasingly complex functions. We show deep NNs learn a series of boosted classifiers whose generalization is popularly attributed to self-averaging over an increasing number of interpolating sub-classifiers. To our knowledge, we are the first authors to establish the connection between generalization in boosted classifiers and generalization in deep NNs. Our experimental evidence and theoretical analysis suggest NNs trained with dropout exhibit similar self-averaging behavior over interpolating sub-classifiers as cited in popular explanations for the post-interpolation generalization phenomenon in boosting.
    Parameters for > 300 million Gaia stars: Bayesian inference vs. machine learning. (arXiv:2302.06995v1 [astro-ph.GA])
    The Gaia Data Release 3 (DR3), published in June 2022, delivers a diverse set of astrometric, photometric, and spectroscopic measurements for more than a billion stars. The wealth and complexity of the data makes traditional approaches for estimating stellar parameters for the full Gaia dataset almost prohibitive. We have explored different supervised learning methods for extracting basic stellar parameters as well as distances and line-of-sight extinctions, given spectro-photo-astrometric data (including also the new Gaia XP spectra). For training we use an enhanced high-quality dataset compiled from Gaia DR3 and ground-based spectroscopic survey data covering the whole sky and all Galactic components. We show that even with a simple neural-network architecture or tree-based algorithm (and in the absence of Gaia XP spectra), we succeed in predicting competitive results (compared to Bayesian isochrone fitting) down to faint magnitudes. We will present a new Gaia DR3 stellar-parameter catalogue obtained using the currently best-performing machine-learning algorithm for tabular data, XGBoost, in the near future.
    Concentration Bounds for Discrete Distribution Estimation in KL Divergence. (arXiv:2302.06869v1 [stat.ML])
    We study the problem of discrete distribution estimation in KL divergence and provide concentration bounds for the Laplace estimator. We show that the deviation from mean scales as $\sqrt{k}/n$ when $n \ge k$, improving upon the best prior result of $k/n$. We also establish a matching lower bound that shows that our bounds are tight up to polylogarithmic factors.
    Exploring Category Structure with Contextual Language Models and Lexical Semantic Networks. (arXiv:2302.06942v1 [cs.LG])
    Recent work on predicting category structure with distributional models, using either static word embeddings (Heyman and Heyman, 2019) or contextualized language models (CLMs) (Misra et al., 2021), report low correlations with human ratings, thus calling into question their plausibility as models of human semantic memory. In this work, we revisit this question testing a wider array of methods for probing CLMs for predicting typicality scores. Our experiments, using BERT (Devlin et al., 2018), show the importance of using the right type of CLM probes, as our best BERT-based typicality prediction methods substantially improve over previous works. Second, our results highlight the importance of polysemy in this task: our best results are obtained when using a disambiguation mechanism. Finally, additional experiments reveal that Information Contentbased WordNet (Miller, 1995), also endowed with disambiguation, match the performance of the best BERT-based method, and in fact capture complementary information, which can be combined with BERT to achieve enhanced typicality predictions.
    Improved Learning-Augmented Algorithms for the Multi-Option Ski Rental Problem via Best-Possible Competitive Analysis. (arXiv:2302.06832v1 [cs.DS])
    In this paper, we present improved learning-augmented algorithms for the multi-option ski rental problem. Learning-augmented algorithms take ML predictions as an added part of the input and incorporates these predictions in solving the given problem. Due to their unique strength that combines the power of ML predictions with rigorous performance guarantees, they have been extensively studied in the context of online optimization problems. Even though ski rental problems are one of the canonical problems in the field of online optimization, only deterministic algorithms were previously known for multi-option ski rental, with or without learning augmentation. We present the first randomized learning-augmented algorithm for this problem, surpassing previous performance guarantees given by deterministic algorithms. Our learning-augmented algorithm is based on a new, provably best-possible randomized competitive algorithm for the problem. Our results are further complemented by lower bounds for deterministic and randomized algorithms, and computational experiments evaluating our algorithms' performance improvements.
    Message Passing Meets Graph Neural Networks: A New Paradigm for Massive MIMO Systems. (arXiv:2302.06896v1 [cs.IT])
    As one of the core technologies for 5G systems, massive multiple-input multiple-output (MIMO) introduces dramatic capacity improvements along with very high beamforming and spatial multiplexing gains. When developing efficient physical layer algorithms for massive MIMO systems, message passing is one promising candidate owing to the superior performance. However, as their computational complexity increases dramatically with the problem size, the state-of-the-art message passing algorithms cannot be directly applied to future 6G systems, where an exceedingly large number of antennas are expected to be deployed. To address this issue, we propose a model-driven deep learning (DL) framework, namely the AMP-GNN for massive MIMO transceiver design, by considering the low complexity of the AMP algorithm and adaptability of GNNs. Specifically, the structure of the AMP-GNN network is customized by unfolding the approximate message passing (AMP) algorithm and introducing a graph neural network (GNN) module into it. The permutation equivariance property of AMP-GNN is proved, which enables the AMP-GNN to learn more efficiently and to adapt to different numbers of users. We also reveal the underlying reason why GNNs improve the AMP algorithm from the perspective of expectation propagation, which motivates us to amalgamate various GNNs with different message passing algorithms. In the simulation, we take the massive MIMO detection to exemplify that the proposed AMP-GNN significantly improves the performance of the AMP detector, achieves comparable performance as the state-of-the-art DL-based MIMO detectors, and presents strong robustness to various mismatches.
    Conservative State Value Estimation for Offline Reinforcement Learning. (arXiv:2302.06884v1 [cs.LG])
    Offline reinforcement learning faces a significant challenge of value over-estimation due to the distributional drift between the dataset and the current learned policy, leading to learning failure in practice. The common approach is to incorporate a penalty term to reward or value estimation in the Bellman iterations. Meanwhile, to avoid extrapolation on out-of-distribution (OOD) states and actions, existing methods focus on conservative Q-function estimation. In this paper, we propose Conservative State Value Estimation (CSVE), a new approach that learns conservative V-function via directly imposing penalty on OOD states. Compared to prior work, CSVE allows more effective in-data policy optimization with conservative value guarantees. Further, we apply CSVE and develop a practical actor-critic algorithm in which the critic does the conservative value estimation by additionally sampling and penalizing the states \emph{around} the dataset, and the actor applies advantage weighted updates extended with state exploration to improve the policy. We evaluate in classic continual control tasks of D4RL, showing that our method performs better than the conservative Q-function learning methods and is strongly competitive among recent SOTA methods.
    DiffFaceSketch: High-Fidelity Face Image Synthesis with Sketch-Guided Latent Diffusion Model. (arXiv:2302.06908v1 [cs.CV])
    Synthesizing face images from monochrome sketches is one of the most fundamental tasks in the field of image-to-image translation. However, it is still challenging to (1)~make models learn the high-dimensional face features such as geometry and color, and (2)~take into account the characteristics of input sketches. Existing methods often use sketches as indirect inputs (or as auxiliary inputs) to guide the models, resulting in the loss of sketch features or the alteration of geometry information. In this paper, we introduce a Sketch-Guided Latent Diffusion Model (SGLDM), an LDM-based network architect trained on the paired sketch-face dataset. We apply a Multi-Auto-Encoder (AE) to encode the different input sketches from different regions of a face from pixel space to a feature map in latent space, which enables us to reduce the dimension of the sketch input while preserving the geometry-related information of local face details. We build a sketch-face paired dataset based on the existing method that extracts the edge map from an image. We then introduce a Stochastic Region Abstraction (SRA), an approach to augment our dataset to improve the robustness of SGLDM to handle sketch input with arbitrary abstraction. The evaluation study shows that SGLDM can synthesize high-quality face images with different expressions, facial accessories, and hairstyles from various sketches with different abstraction levels.
    Predicting long-term collective animal behavior with deep learning. (arXiv:2302.06839v1 [cs.LG])
    Deciphering the social interactions that govern collective behavior in animal societies has greatly benefited from advancements in modern computing. Computational models diverge into two kinds of approaches: analytical models and machine learning models. This work introduces a deep learning model for social interactions in the fish species Hemigrammus rhodostomus, and compares its results to experiments and to the results of a state-of-the-art analytical model. To that end, we propose a systematic methodology to assess the faithfulness of a model, based on the introduction of a set of stringent observables. We demonstrate that machine learning models of social interactions can directly compete against their analytical counterparts. Moreover, this work demonstrates the need for consistent validation across different timescales and highlights which design aspects critically enables our deep learning approach to capture both short- and long-term dynamics. We also show that this approach is scalable to other fish species.
    Understanding Oversquashing in GNNs through the Lens of Effective Resistance. (arXiv:2302.06835v1 [cs.LG])
    Message passing graph neural networks are popular learning architectures for graph-structured data. However, it can be challenging for them to capture long range interactions in graphs. One of the potential reasons is the so-called oversquashing problem, first termed in [Alon and Yahav, 2020], that has recently received significant attention. In this paper, we analyze the oversquashing problem through the lens of effective resistance between nodes in the input graphs. The concept of effective resistance intuitively captures the "strength" of connection between two nodes by paths in the graph, and has a rich literature connecting spectral graph theory and circuit networks theory. We propose the use the concept of total effective resistance as a measure to quantify the total amount of oversquashing in a graph, and provide theoretical justification of its use. We further develop algorithms to identify edges to be added to an input graph so as to minimize the total effective resistance, thereby alleviating the oversquashing problem when using GNNs. We provide empirical evidence of the effectiveness of our total effective resistance based rewiring strategies.
    simpleKT: A Simple But Tough-to-Beat Baseline for Knowledge Tracing. (arXiv:2302.06881v1 [cs.LG])
    Knowledge tracing (KT) is the problem of predicting students' future performance based on their historical interactions with intelligent tutoring systems. Recently, many works present lots of special methods for applying deep neural networks to KT from different perspectives like model architecture, adversarial augmentation and etc., which make the overall algorithm and system become more and more complex. Furthermore, due to the lack of standardized evaluation protocol \citep{liu2022pykt}, there is no widely agreed KT baselines and published experimental comparisons become inconsistent and self-contradictory, i.e., the reported AUC scores of DKT on ASSISTments2009 range from 0.721 to 0.821 \citep{minn2018deep,yeung2018addressing}. Therefore, in this paper, we provide a strong but simple baseline method to deal with the KT task named \textsc{simpleKT}. Inspired by the Rasch model in psychometrics, we explicitly model question-specific variations to capture the individual differences among questions covering the same set of knowledge components that are a generalization of terms of concepts or skills needed for learners to accomplish steps in a task or a problem. Furthermore, instead of using sophisticated representations to capture student forgetting behaviors, we use the ordinary dot-product attention function to extract the time-aware information embedded in the student learning interactions. Extensive experiments show that such a simple baseline is able to always rank top 3 in terms of AUC scores and achieve 57 wins, 3 ties and 16 loss against 12 DLKT baseline methods on 7 public datasets of different domains. We believe this work serves as a strong baseline for future KT research. Code is available at \url{https://github.com/pykt-team/pykt-toolkit}\footnote{We merged our model to the \textsc{pyKT} benchmark at \url{https://pykt.org/}.}.
    Masked Multi-Step Probabilistic Forecasting for Short-to-Mid-Term Electricity Demand. (arXiv:2302.06818v1 [cs.LG])
    Predicting the demand for electricity with uncertainty helps in planning and operation of the grid to provide reliable supply of power to the consumers. Machine learning (ML)-based demand forecasting approaches can be categorized into (1) sample-based approaches, where each forecast is made independently, and (2) time series regression approaches, where some historical load and other feature information is used. When making a short-to-mid-term electricity demand forecast, some future information is available, such as the weather forecast and calendar variables. However, in existing forecasting models this future information is not fully incorporated. To overcome this limitation of existing approaches, we propose Masked Multi-Step Multivariate Probabilistic Forecasting (MMMPF), a novel and general framework to train any neural network model capable of generating a sequence of outputs, that combines both the temporal information from the past and the known information about the future to make probabilistic predictions. Experiments are performed on a real-world dataset for short-to-mid-term electricity demand forecasting for multiple regions and compared with various ML methods. They show that the proposed MMMPF framework outperforms not only sample-based methods but also existing time-series forecasting models with the exact same base models. Models trainded with MMMPF can also generate desired quantiles to capture uncertainty and enable probabilistic planning for grid of the future.
    That Escalated Quickly: An ML Framework for Alert Prioritization. (arXiv:2302.06648v1 [cs.CR])
    In place of in-house solutions, organizations are increasingly moving towards managed services for cyber defense. Security Operations Centers are specialized cybersecurity units responsible for the defense of an organization, but the large-scale centralization of threat detection is causing SOCs to endure an overwhelming amount of false positive alerts -- a phenomenon known as alert fatigue. Large collections of imprecise sensors, an inability to adapt to known false positives, evolution of the threat landscape, and inefficient use of analyst time all contribute to the alert fatigue problem. To combat these issues, we present That Escalated Quickly (TEQ), a machine learning framework that reduces alert fatigue with minimal changes to SOC workflows by predicting alert-level and incident-level actionability. On real-world data, the system is able to reduce the time it takes to respond to actionable incidents by $22.9\%$, suppress $54\%$ of false positives with a $95.1\%$ detection rate, and reduce the number of alerts an analyst needs to investigate within singular incidents by $14\%$.  ( 2 min )
    Horocycle Decision Boundaries for Large Margin Classification in Hyperbolic Space. (arXiv:2302.06807v1 [stat.ML])
    Hyperbolic spaces have been quite popular in the recent past for representing hierarchically organized data. Further, several classification algorithms for data in these spaces have been proposed in the literature. These algorithms mainly use either hyperplanes or geodesics for decision boundaries in a large margin classifiers setting leading to a non-convex optimization problem. In this paper, we propose a novel large margin classifier based on horocycle (horosphere) decision boundaries that leads to a geodesically convex optimization problem that can be optimized using any Riemannian gradient descent technique guaranteeing a globally optimal solution. We present several experiments depicting the performance of our classifier.
    Scalable Optimal Multiway-Split Decision Trees with Constraints. (arXiv:2302.06812v1 [cs.LG])
    There has been a surge of interest in learning optimal decision trees using mixed-integer programs (MIP) in recent years, as heuristic-based methods do not guarantee optimality and find it challenging to incorporate constraints that are critical for many practical applications. However, existing MIP methods that build on an arc-based formulation do not scale well as the number of binary variables is in the order of $\mathcal{O}(2^dN)$, where $d$ and $N$ refer to the depth of the tree and the size of the dataset. Moreover, they can only handle sample-level constraints and linear metrics. In this paper, we propose a novel path-based MIP formulation where the number of decision variables is independent of $N$. We present a scalable column generation framework to solve the MIP optimally. Our framework produces a multiway-split tree which is more interpretable than the typical binary-split trees due to its shorter rules. Our method can handle nonlinear metrics such as F1 score and incorporate a broader class of constraints. We demonstrate its efficacy with extensive experiments. We present results on datasets containing up to 1,008,372 samples while existing MIP-based decision tree models do not scale well on data beyond a few thousand points. We report superior or competitive results compared to the state-of-art MIP-based methods with up to a 24X reduction in runtime.
    Breath analysis by ultra-sensitive broadband laser spectroscopy detects SARS-CoV-2 infection. (arXiv:2202.02321v2 [physics.med-ph] UPDATED)
    Rapid testing is essential to fighting pandemics such as COVID-19, the disease caused by the SARS-CoV-2 virus. Exhaled human breath contains multiple volatile molecules providing powerful potential for non-invasive diagnosis of diverse medical conditions. We investigated breath detection of SARS-CoV-2 infection using cavity-enhanced direct frequency comb spectroscopy (CE-DFCS), a state-of-the-art laser spectroscopic technique capable of a real-time massive collection of broadband molecular absorption features at ro-vibrational quantum state resolution and at parts-per-trillion volume detection sensitivity. Using a total of 170 individual breath samples (83 positive and 87 negative with SARS-CoV-2 based on Reverse Transcription Polymerase Chain Reaction tests), we report excellent discrimination capability for SARS-CoV-2 infection with an area under the Receiver-Operating-Characteristics curve of 0.849(4). Our results support the development of CE-DFCS as an alternative, rapid, non-invasive test for COVID-19 and highlight its remarkable potential for optical diagnoses of diverse biological conditions and disease states.
    Simple Hardware-Efficient Long Convolutions for Sequence Modeling. (arXiv:2302.06646v1 [cs.LG])
    State space models (SSMs) have high performance on long sequence modeling but require sophisticated initialization techniques and specialized implementations for high quality and runtime performance. We study whether a simple alternative can match SSMs in performance and efficiency: directly learning long convolutions over the sequence. We find that a key requirement to achieving high performance is keeping the convolution kernels smooth. We find that simple interventions--such as squashing the kernel weights--result in smooth kernels and recover SSM performance on a range of tasks including the long range arena, image classification, language modeling, and brain data modeling. Next, we develop FlashButterfly, an IO-aware algorithm to improve the runtime performance of long convolutions. FlashButterfly appeals to classic Butterfly decompositions of the convolution to reduce GPU memory IO and increase FLOP utilization. FlashButterfly speeds up convolutions by 2.2$\times$, and allows us to train on Path256, a challenging task with sequence length 64K, where we set state-of-the-art by 29.1 points while training 7.2$\times$ faster than prior work. Lastly, we introduce an extension to FlashButterfly that learns the coefficients of the Butterfly decomposition, increasing expressivity without increasing runtime. Using this extension, we outperform a Transformer on WikiText103 by 0.2 PPL with 30% fewer parameters.  ( 2 min )
    B-BACN: Bayesian Boundary-Aware Convolutional Network for Defect Characterization. (arXiv:2302.06827v1 [cs.CV])
    Detecting accurate crack boundaries is important for condition monitoring, prognostics, and maintenance scheduling. In this work, we propose a Bayesian Boundary-Aware Convolutional Network (B-BACN) to tackle this problem, that emphasizes the importance of both uncertainty quantification and boundary refinement for producing accurate and trustworthy detections of defect boundaries. We formulate the inspection model using multi-task learning. The epistemic uncertainty is learned using Monte Carlo Dropout, and the model also learns to predict each samples aleatoric uncertainty. A boundary refinement loss is added to improve the determination of defect boundaries. Experimental results demonstrate the effectiveness of the proposed method in accurately identifying crack boundaries, reducing misclassification and enhancing model calibration.
    Improved Regret Bounds for Linear Adversarial MDPs via Linear Optimization. (arXiv:2302.06834v1 [cs.LG])
    Learning Markov decision processes (MDP) in an adversarial environment has been a challenging problem. The problem becomes even more challenging with function approximation, since the underlying structure of the loss function and transition kernel are especially hard to estimate in a varying environment. In fact, the state-of-the-art results for linear adversarial MDP achieve a regret of $\tilde{O}(K^{6/7})$ ($K$ denotes the number of episodes), which admits a large room for improvement. In this paper, we investigate the problem with a new view, which reduces linear MDP into linear optimization by subtly setting the feature maps of the bandit arms of linear optimization. This new technique, under an exploratory assumption, yields an improved bound of $\tilde{O}(K^{4/5})$ for linear adversarial MDP without access to a transition simulator. The new view could be of independent interest for solving other MDP problems that possess a linear structure.  ( 2 min )
    Achieving Better Regret against Strategic Adversaries. (arXiv:2302.06652v1 [cs.LG])
    We study online learning problems in which the learner has extra knowledge about the adversary's behaviour, i.e., in game-theoretic settings where opponents typically follow some no-external regret learning algorithms. Under this assumption, we propose two new online learning algorithms, Accurate Follow the Regularized Leader (AFTRL) and Prod-Best Response (Prod-BR), that intensively exploit this extra knowledge while maintaining the no-regret property in the worst-case scenario of having inaccurate extra information. Specifically, AFTRL achieves $O(1)$ external regret or $O(1)$ \emph{forward regret} against no-external regret adversary in comparison with $O(\sqrt{T})$ \emph{dynamic regret} of Prod-BR. To the best of our knowledge, our algorithm is the first to consider forward regret that achieves $O(1)$ regret against strategic adversaries. When playing zero-sum games with Accurate Multiplicative Weights Update (AMWU), a special case of AFTRL, we achieve \emph{last round convergence} to the Nash Equilibrium. We also provide numerical experiments to further support our theoretical results. In particular, we demonstrate that our methods achieve significantly better regret bounds and rate of last round convergence, compared to the state of the art (e.g., Multiplicative Weights Update (MWU) and its optimistic counterpart, OMWU).  ( 2 min )
    Self-mediated exploration in artificial intelligence inspired by cognitive psychology. (arXiv:2302.06615v1 [cs.AI])
    Exploration of the physical environment is an indispensable precursor to data acquisition and enables knowledge generation via analytical or direct trialing. Artificial Intelligence lacks the exploratory capabilities of even the most underdeveloped organisms, hindering its autonomy and adaptability. Supported by cognitive psychology, this works links human behavior and artificial agents to endorse self-development. In accordance with reported data, paradigms of epistemic and achievement emotion are embedded to machine-learning methodology contingent on their impact when decision making. A study is subsequently designed to mirror previous human trials, which artificial agents are made to undergo repeatedly towards convergence. Results demonstrate causality, learned by the vast majority of agents, between their internal states and exploration to match those reported for human counterparts. The ramifications of these findings are pondered for both research into human cognition and betterment of artificial intelligence.  ( 2 min )
    Lightsolver challenges a leading deep learning solver for Max-2-SAT problems. (arXiv:2302.06926v1 [quant-ph])
    Maximum 2-satisfiability (MAX-2-SAT) is a type of combinatorial decision problem that is known to be NP-hard. In this paper, we compare LightSolver's quantum-inspired algorithm to a leading deep-learning solver for the MAX-2-SAT problem. Experiments on benchmark data sets show that LightSolver achieves significantly smaller time-to-optimal-solution compared to a state-of-the-art deep-learning algorithm, where the gain in performance tends to increase with the problem size.
    Improving Interpretability of Deep Sequential Knowledge Tracing Models with Question-centric Cognitive Representations. (arXiv:2302.06885v1 [cs.LG])
    Knowledge tracing (KT) is a crucial technique to predict students' future performance by observing their historical learning processes. Due to the powerful representation ability of deep neural networks, remarkable progress has been made by using deep learning techniques to solve the KT problem. The majority of existing approaches rely on the \emph{homogeneous question} assumption that questions have equivalent contributions if they share the same set of knowledge components. Unfortunately, this assumption is inaccurate in real-world educational scenarios. Furthermore, it is very challenging to interpret the prediction results from the existing deep learning based KT models. Therefore, in this paper, we present QIKT, a question-centric interpretable KT model to address the above challenges. The proposed QIKT approach explicitly models students' knowledge state variations at a fine-grained level with question-sensitive cognitive representations that are jointly learned from a question-centric knowledge acquisition module and a question-centric problem solving module. Meanwhile, the QIKT utilizes an item response theory based prediction layer to generate interpretable prediction results. The proposed QIKT model is evaluated on three public real-world educational datasets. The results demonstrate that our approach is superior on the KT prediction task, and it outperforms a wide range of deep learning based KT models in terms of prediction accuracy with better model interpretability. To encourage reproducible results, we have provided all the datasets and code at \url{https://pykt.org/}.
    In Search for a Generalizable Method for Source Free Domain Adaptation. (arXiv:2302.06658v1 [cs.LG])
    Source-free domain adaptation (SFDA) is compelling because it allows adapting an off-the-shelf model to a new domain using only unlabelled data. In this work, we apply existing SFDA techniques to a challenging set of naturally-occurring distribution shifts in bioacoustics, which are very different from the ones commonly studied in computer vision. We find existing methods perform differently relative to each other than observed in vision benchmarks, and sometimes perform worse than no adaptation at all. We propose a new simple method which outperforms the existing methods on our new shifts while exhibiting strong performance on a range of vision datasets. Our findings suggest that existing SFDA methods are not as generalizable as previously thought and that considering diverse modalities can be a useful avenue for designing more robust models.
    EspalomaCharge: Machine learning-enabled ultra-fast partial charge assignment. (arXiv:2302.06758v1 [cs.LG])
    Atomic partial charges are crucial parameters in molecular dynamics (MD) simulation, dictating the electrostatic contributions to intermolecular energies, and thereby the potential energy landscape. Traditionally, the assignment of partial charges has relied on surrogates of \textit{ab initio} semiempirical quantum chemical methods such as AM1-BCC, and is expensive for large systems or large numbers of molecules. We propose a hybrid physical / graph neural network-based approximation to the widely popular AM1-BCC charge model that is orders of magnitude faster while maintaining accuracy comparable to differences in AM1-BCC implementations. Our hybrid approach couples a graph neural network to a streamlined charge equilibration approach in order to predict molecule-specific atomic electronegativity and hardness parameters, followed by analytical determination of optimal charge-equilibrated parameters that preserves total molecular charge. This hybrid approach scales linearly with the number of atoms, enabling, for the first time, the use of fully consistent charge models for small molecules and biopolymers for the construction of next-generation self-consistent biomolecular force fields. Implemented in the free and open source package \texttt{espaloma\_charge}, this approach provides drop-in replacements for both AmberTools \texttt{antechamber} and the Open Force Field Toolkit charging workflows, in addition to stand-alone charge generation interfaces. Source code is available at \url{https://github.com/choderalab/espaloma_charge}.  ( 2 min )
    Discovering Optimal Scoring Mechanisms in Causal Strategic Prediction. (arXiv:2302.06804v1 [cs.LG])
    Faced with data-driven policies, individuals will manipulate their features to obtain favorable decisions. While earlier works cast these manipulations as undesirable gaming, recent works have adopted a more nuanced causal framing in which manipulations can improve outcomes of interest, and setting coherent mechanisms requires accounting for both predictive accuracy and improvement of the outcome. Typically, these works focus on known causal graphs, consisting only of an outcome and its parents. In this paper, we introduce a general framework in which an outcome and n observed features are related by an arbitrary unknown graph and manipulations are restricted by a fixed budget and cost structure. We develop algorithms that leverage strategic responses to discover the causal graph in a finite number of steps. Given this graph structure, we can then derive mechanisms that trade off between accuracy and improvement. Altogether, our work deepens links between causal discovery and incentive design and provides a more nuanced view of learning under causal strategic prediction.  ( 2 min )
    Interference and noise cancellation for joint communication radar (JCR) system based on contextual information. (arXiv:2302.06786v1 [cs.IT])
    This paper examines the separation of wireless communication and radar signals, thereby guaranteeing cohabitation and acting as a panacea to spectrum sensing. First, considering that the channel impulse response was known by the receivers (communication and radar), we showed that the optimizing beamforming weights mitigate the interference caused by signals and improve the physical layer security (PLS) of the system. Furthermore, when the channel responses were unknown, we designed an interference filter as a low-complex noise and interference cancellation autoencoder. By mitigating the interference on the legitimate users, the PLS was guaranteed. Results showed that even for a low signal-to-noise ratio, the autoencoder produces low root-mean-square error (RMSE) values.  ( 2 min )
    Guiding Pretraining in Reinforcement Learning with Large Language Models. (arXiv:2302.06692v1 [cs.LG])
    Reinforcement learning algorithms typically struggle in the absence of a dense, well-shaped reward function. Intrinsically motivated exploration methods address this limitation by rewarding agents for visiting novel states or transitions, but these methods offer limited benefits in large environments where most discovered novelty is irrelevant for downstream tasks. We describe a method that uses background knowledge from text corpora to shape exploration. This method, called ELLM (Exploring with LLMs) rewards an agent for achieving goals suggested by a language model prompted with a description of the agent's current state. By leveraging large-scale language model pretraining, ELLM guides agents toward human-meaningful and plausibly useful behaviors without requiring a human in the loop. We evaluate ELLM in the Crafter game environment and the Housekeep robotic simulator, showing that ELLM-trained agents have better coverage of common-sense behaviors during pretraining and usually match or improve performance on a range of downstream tasks.  ( 2 min )
    Breaking the Lower Bound with (Little) Structure: Acceleration in Non-Convex Stochastic Optimization with Heavy-Tailed Noise. (arXiv:2302.06763v1 [cs.LG])
    We consider the stochastic optimization problem with smooth but not necessarily convex objectives in the heavy-tailed noise regime, where the stochastic gradient's noise is assumed to have bounded $p$th moment ($p\in(1,2]$). Zhang et al. (2020) is the first to prove the $\Omega(T^{\frac{1-p}{3p-2}})$ lower bound for convergence (in expectation) and provides a simple clipping algorithm that matches this optimal rate. Cutkosky and Mehta (2021) proposes another algorithm, which is shown to achieve the nearly optimal high-probability convergence guarantee $O(\log(T/\delta)T^{\frac{1-p}{3p-2}})$, where $\delta$ is the probability of failure. However, this desirable guarantee is only established under the additional assumption that the stochastic gradient itself is bounded in $p$th moment, which fails to hold even for quadratic objectives and centered Gaussian noise. In this work, we first improve the analysis of the algorithm in Cutkosky and Mehta (2021) to obtain the same nearly optimal high-probability convergence rate $O(\log(T/\delta)T^{\frac{1-p}{3p-2}})$, without the above-mentioned restrictive assumption. Next, and curiously, we show that one can achieve a faster rate than that dictated by the lower bound $\Omega(T^{\frac{1-p}{3p-2}})$ with only a tiny bit of structure, i.e., when the objective function $F(x)$ is assumed to be in the form of $\mathbb{E}_{\Xi\sim\mathcal{D}}[f(x,\Xi)]$, arguably the most widely applicable class of stochastic optimization problems. For this class of problems, we propose the first variance-reduced accelerated algorithm and establish that it guarantees a high-probability convergence rate of $O(\log(T/\delta)T^{\frac{1-p}{2p-1}})$ under a mild condition, which is faster than $\Omega(T^{\frac{1-p}{3p-2}})$. Notably, even when specialized to the finite-variance case, our result yields the (near-)optimal high-probability rate $O(\log(T/\delta)T^{-1/3})$.  ( 2 min )
    On the Planning Abilities of Large Language Models (A Critical Investigation with a Proposed Benchmark). (arXiv:2302.06706v1 [cs.AI])
    Intrigued by the claims of emergent reasoning capabilities in LLMs trained on general web corpora, in this paper, we set out to investigate their planning capabilities. We aim to evaluate (1) how good LLMs are by themselves in generating and validating simple plans in commonsense planning tasks (of the type that humans are generally quite good at) and (2) how good LLMs are in being a source of heuristic guidance for other agents--either AI planners or human planners--in their planning tasks. To investigate these questions in a systematic rather than anecdotal manner, we start by developing a benchmark suite based on the kinds of domains employed in the International Planning Competition. On this benchmark, we evaluate LLMs in three modes: autonomous, heuristic and human-in-the-loop. Our results show that LLM's ability to autonomously generate executable plans is quite meager, averaging only about 3% success rate. The heuristic and human-in-the-loop modes show slightly more promise. In addition to these results, we also make our benchmark and evaluation tools available to support investigations by research community.  ( 2 min )
    OpenHLS: High-Level Synthesis for Low-Latency Deep Neural Networks for Experimental Science. (arXiv:2302.06751v1 [cs.AR])
    In many experiment-driven scientific domains, such as high-energy physics, material science, and cosmology, high data rate experiments impose hard constraints on data acquisition systems: collected data must either be indiscriminately stored for post-processing and analysis, thereby necessitating large storage capacity, or accurately filtered in real-time, thereby necessitating low-latency processing. Deep neural networks, effective in other filtering tasks, have not been widely employed in such data acquisition systems, due to design and deployment difficulties. We present an open source, lightweight, compiler framework, without any proprietary dependencies, OpenHLS, based on high-level synthesis techniques, for translating high-level representations of deep neural networks to low-level representations, suitable for deployment to near-sensor devices such as field-programmable gate arrays. We evaluate OpenHLS on various workloads and present a case-study implementation of a deep neural network for Bragg peak detection in the context of high-energy diffraction microscopy. We show OpenHLS is able to produce an implementation of the network with a throughput 4.8 $\mu$s/sample, which is approximately a 4$\times$ improvement over the existing implementation  ( 2 min )
    Deep Learning Predicts Prevalent and Incident Parkinson's Disease From UK Biobank Fundus Imaging. (arXiv:2302.06727v1 [cs.LG])
    Parkinson's disease is the world's fastest growing neurological disorder. Research to elucidate the mechanisms of Parkinson's disease and automate diagnostics would greatly improve the treatment of patients with Parkinson's disease. Current diagnostic methods are expensive with limited availability. Considering the long progression time of Parkinson's disease, a desirable screening should be diagnostically accurate even before the onset of symptoms to allow medical intervention. We promote attention for retinal fundus imaging, often termed a window to the brain, as a diagnostic screening modality for Parkinson's disease. We conduct a systematic evaluation of conventional machine learning and deep learning techniques to classify Parkinson's disease from UK Biobank fundus imaging. Our results suggest Parkinson's disease individuals can be differentiated from age and gender matched healthy subjects with 71% accuracy. This accuracy is maintained when predicting either prevalent or incident Parkinson's disease. Explainability and trustworthiness is enhanced by visual attribution maps of localized biomarkers and quantified metrics of model robustness to data perturbations.  ( 2 min )
    Netflix and Forget: Efficient and Exact Machine Unlearning from Bi-linear Recommendations. (arXiv:2302.06676v1 [cs.LG])
    People break up, miscarry, and lose loved ones. Their online streaming and shopping recommendations, however, do not necessarily update, and may serve as unhappy reminders of their loss. When users want to renege on their past actions, they expect the recommender platforms to erase selective data at the model level. Ideally, given any specified user history, the recommender can unwind or "forget", as if the record was not part of training. To that end, this paper focuses on simple but widely deployed bi-linear models for recommendations based on matrix completion. Without incurring the cost of re-training, and without degrading the model unnecessarily, we develop Unlearn-ALS by making a few key modifications to the fine-tuning procedure under Alternating Least Squares optimisation, thus applicable to any bi-linear models regardless of the training procedure. We show that Unlearn-ALS is consistent with retraining without \emph{any} model degradation and exhibits rapid convergence, making it suitable for a large class of existing recommenders.  ( 2 min )
    System identification of neural systems: If we got it right, would we know?. (arXiv:2302.06677v1 [q-bio.NC])
    Artificial neural networks are being proposed as models of parts of the brain. The networks are compared to recordings of biological neurons, and good performance in reproducing neural responses is considered to support the model's validity. A key question is how much this system identification approach tells us about brain computation. Does it validate one model architecture over another? We evaluate the most commonly used comparison techniques, such as a linear encoding model and centered kernel alignment, to correctly identify a model by replacing brain recordings with known ground truth models. System identification performance is quite variable; it also depends significantly on factors independent of the ground truth architecture, such as stimuli images. In addition, we show the limitations of using functional similarity scores in identifying higher-level architectural motifs.  ( 2 min )
    Optimal Algorithms for the Inhomogeneous Spiked Wigner Model. (arXiv:2302.06665v1 [stat.ML])
    In this paper, we study a spiked Wigner problem with an inhomogeneous noise profile. Our aim in this problem is to recover the signal passed through an inhomogeneous low-rank matrix channel. While the information-theoretic performances are well-known, we focus on the algorithmic problem. We derive an approximate message-passing algorithm (AMP) for the inhomogeneous problem and show that its rigorous state evolution coincides with the information-theoretic optimal Bayes fixed-point equations. We identify in particular the existence of a statistical-to-computational gap where known algorithms require a signal-to-noise ratio bigger than the information-theoretic threshold to perform better than random. Finally, from the adapted AMP iteration we deduce a simple and efficient spectral method that can be used to recover the transition for matrices with general variance profiles. This spectral method matches the conjectured optimal computational phase transition.  ( 2 min )
    Symbolic Discovery of Optimization Algorithms. (arXiv:2302.06675v1 [cs.LG])
    We present a method to formulate algorithm discovery as program search, and apply it to discover optimization algorithms for deep neural network training. We leverage efficient search techniques to explore an infinite and sparse program space. To bridge the large generalization gap between proxy and target tasks, we also introduce program selection and simplification strategies. Our method discovers a simple and effective optimization algorithm, $\textbf{Lion}$ ($\textit{Evo$\textbf{L}$ved S$\textbf{i}$gn M$\textbf{o}$me$\textbf{n}$tum}$). It is more memory-efficient than Adam as it only keeps track of the momentum. Different from adaptive optimizers, its update has the same magnitude for each parameter calculated through the sign operation. We compare Lion with widely used optimizers, such as Adam and Adafactor, for training a variety of models on different tasks. On image classification, Lion boosts the accuracy of ViT by up to 2% on ImageNet and saves up to 5x the pre-training compute on JFT. On vision-language contrastive learning, we achieve 88.3% $\textit{zero-shot}$ and 91.1% $\textit{fine-tuning}$ accuracy on ImageNet, surpassing the previous best results by 2% and 0.1%, respectively. On diffusion models, Lion outperforms Adam by achieving a better FID score and reducing the training compute by up to 2.3x. For autoregressive, masked language modeling, and fine-tuning, Lion exhibits a similar or better performance compared to Adam. Our analysis of Lion reveals that its performance gain grows with the training batch size. It also requires a smaller learning rate than Adam due to the larger norm of the update produced by the sign function. Additionally, we examine the limitations of Lion and identify scenarios where its improvements are small or not statistically significant. The implementation of Lion is publicly available.  ( 2 min )
    Communication-Efficient Federated Bilevel Optimization with Local and Global Lower Level Problems. (arXiv:2302.06701v1 [cs.LG])
    Bilevel Optimization has witnessed notable progress recently with new emerging efficient algorithms, yet it is underexplored in the Federated Learning setting. It is unclear how the challenges of Federated Learning affect the convergence of bilevel algorithms. In this work, we study Federated Bilevel Optimization problems. We first propose the FedBiO algorithm that solves the hyper-gradient estimation problem efficiently, then we propose FedBiOAcc to accelerate FedBiO. FedBiO has communication complexity $O(\epsilon^{-1.5})$ with linear speed up, while FedBiOAcc achieves communication complexity $O(\epsilon^{-1})$, sample complexity $O(\epsilon^{-1.5})$ and also the linear speed up. We also study Federated Bilevel Optimization problems with local lower level problems, and prove that FedBiO and FedBiOAcc converges at the same rate with some modification.  ( 2 min )
    Multi-Carrier NOMA-Empowered Wireless Federated Learning with Optimal Power and Bandwidth Allocation. (arXiv:2302.06730v1 [cs.IT])
    Wireless federated learning (WFL) undergoes a communication bottleneck in uplink, limiting the number of users that can upload their local models in each global aggregation round. This paper presents a new multi-carrier non-orthogonal multiple-access (MC-NOMA)-empowered WFL system under an adaptive learning setting of Flexible Aggregation. Since a WFL round accommodates both local model training and uploading for each user, the use of Flexible Aggregation allows the users to train different numbers of iterations per round, adapting to their channel conditions and computing resources. The key idea is to use MC-NOMA to concurrently upload the local models of the users, thereby extending the local model training times of the users and increasing participating users. A new metric, namely, Weighted Global Proportion of Trained Mini-batches (WGPTM), is analytically established to measure the convergence of the new system. Another important aspect is that we maximize the WGPTM to harness the convergence of the new system by jointly optimizing the transmit powers and subchannel bandwidths. This nonconvex problem is converted equivalently to a tractable convex problem and solved efficiently using variable substitution and Cauchy's inequality. As corroborated experimentally using a convolutional neural network and an 18-layer residential network, the proposed MC-NOMA WFL can efficiently reduce communication delay, increase local model training times, and accelerate the convergence by over 40%, compared to its existing alternative.  ( 2 min )
    Provable Detection of Propagating Sampling Bias in Prediction Models. (arXiv:2302.06752v1 [cs.LG])
    With an increased focus on incorporating fairness in machine learning models, it becomes imperative not only to assess and mitigate bias at each stage of the machine learning pipeline but also to understand the downstream impacts of bias across stages. Here we consider a general, but realistic, scenario in which a predictive model is learned from (potentially biased) training data, and model predictions are assessed post-hoc for fairness by some auditing method. We provide a theoretical analysis of how a specific form of data bias, differential sampling bias, propagates from the data stage to the prediction stage. Unlike prior work, we evaluate the downstream impacts of data biases quantitatively rather than qualitatively and prove theoretical guarantees for detection. Under reasonable assumptions, we quantify how the amount of bias in the model predictions varies as a function of the amount of differential sampling bias in the data, and at what point this bias becomes provably detectable by the auditor. Through experiments on two criminal justice datasets -- the well-known COMPAS dataset and historical data from NYPD's stop and frisk policy -- we demonstrate that the theoretical results hold in practice even when our assumptions are relaxed.  ( 2 min )
    Bag of Tricks for In-Distribution Calibration of Pretrained Transformers. (arXiv:2302.06690v1 [cs.CL])
    While pre-trained language models (PLMs) have become a de-facto standard promoting the accuracy of text classification tasks, recent studies find that PLMs often predict over-confidently. Although various calibration methods have been proposed, such as ensemble learning and data augmentation, most of the methods have been verified in computer vision benchmarks rather than in PLM-based text classification tasks. In this paper, we present an empirical study on confidence calibration for PLMs, addressing three categories, including confidence penalty losses, data augmentations, and ensemble methods. We find that the ensemble model overfitted to the training set shows sub-par calibration performance and also observe that PLMs trained with confidence penalty loss have a trade-off between calibration and accuracy. Building on these observations, we propose the Calibrated PLM (CALL), a combination of calibration techniques. The CALL complements the drawbacks that may occur when utilizing a calibration method individually and boosts both classification and calibration accuracy. Design choices in CALL's training procedures are extensively studied, and we provide a detailed analysis of how calibration techniques affect the calibration performance of PLMs.  ( 2 min )
    PerAda: Parameter-Efficient and Generalizable Federated Learning Personalization with Guarantees. (arXiv:2302.06637v1 [cs.LG])
    Personalized Federated Learning (pFL) has emerged as a promising solution to tackle data heterogeneity across clients in FL. However, existing pFL methods either (1) introduce high communication and computation costs or (2) overfit to local data, which can be limited in scope, and are vulnerable to evolved test samples with natural shifts. In this paper, we propose PerAda, a parameter-efficient pFL framework that reduces communication and computational costs and exhibits superior generalization performance, especially under test-time distribution shifts. PerAda reduces the costs by leveraging the power of pretrained models and only updates and communicates a small number of additional parameters from adapters. PerAda has good generalization since it regularizes each client's personalized adapter with a global adapter, while the global adapter uses knowledge distillation to aggregate generalized information from all clients. Theoretically, we provide generalization bounds to explain why PerAda improves generalization, and we prove its convergence to stationary points under non-convex settings. Empirically, PerAda demonstrates competitive personalized performance (+4.85% on CheXpert) and enables better out-of-distribution generalization (+5.23% on CIFAR-10-C) on different datasets across natural and medical domains compared with baselines, while only updating 12.6% of parameters per model based on the adapter.  ( 2 min )
    Dataset Distillation with Convexified Implicit Gradients. (arXiv:2302.06755v1 [cs.LG])
    We propose a new dataset distillation algorithm using reparameterization and convexification of implicit gradients (RCIG), that substantially improves the state-of-the-art. To this end, we first formulate dataset distillation as a bi-level optimization problem. Then, we show how implicit gradients can be effectively used to compute meta-gradient updates. We further equip the algorithm with a convexified approximation that corresponds to learning on top of a frozen finite-width neural tangent kernel. Finally, we improve bias in implicit gradients by parameterizing the neural network to enable analytical computation of final-layer parameters given the body parameters. RCIG establishes the new state-of-the-art on a diverse series of dataset distillation tasks. Notably, with one image per class, on resized ImageNet, RCIG sees on average a 108% improvement over the previous state-of-the-art distillation algorithm. Similarly, we observed a 66% gain over SOTA on Tiny-ImageNet and 37% on CIFAR-100.  ( 2 min )
    Machine Learning Model Attribution Challenge. (arXiv:2302.06716v1 [cs.LG])
    We present the findings of the Machine Learning Model Attribution Challenge (\href{https://mlmac.io}{https://mlmac.io}). Fine-tuned machine learning models may derive from other trained models without obvious attribution characteristics. In this challenge, participants identify the publicly-available base models that underlie a set of anonymous, fine-tuned large language models (LLMs) using only textual output of the models. Contestants aim to correctly attribute the most fine-tuned models, with ties broken in the favor of contestants whose solutions use fewer calls to the fine-tuned models' API. The most successful approaches were manual, as participants observed similarities between model outputs and developed attribution heuristics based on public documentation of the base models, though several teams also submitted automated, statistical solutions.  ( 2 min )
    Towards Explainable Visual Anomaly Detection. (arXiv:2302.06670v1 [cs.LG])
    Anomaly detection and localization of visual data, including images and videos, are of great significance in both machine learning academia and applied real-world scenarios. Despite the rapid development of visual anomaly detection techniques in recent years, the interpretations of these black-box models and reasonable explanations of why anomalies can be distinguished out are scarce. This paper provides the first survey concentrated on explainable visual anomaly detection methods. We first introduce the basic background of image-level anomaly detection and video-level anomaly detection, followed by the current explainable approaches for visual anomaly detection. Then, as the main content of this survey, a comprehensive and exhaustive literature review of explainable anomaly detection methods for both images and videos is presented. Finally, we discuss several promising future directions and open problems to explore on the explainability of visual anomaly detection.  ( 2 min )
  • Open

    Where to Diffuse, How to Diffuse, and How to Get Back: Automated Learning for Multivariate Diffusions. (arXiv:2302.07261v1 [cs.LG])
    Diffusion-based generative models (DBGMs) perturb data to a target noise distribution and reverse this inference diffusion process to generate samples. The choice of inference diffusion affects both likelihoods and sample quality. For example, extending the inference process with auxiliary variables leads to improved sample quality. While there are many such multivariate diffusions to explore, each new one requires significant model-specific analysis, hindering rapid prototyping and evaluation. In this work, we study Multivariate Diffusion Models (MDMs). For any number of auxiliary variables, we provide a recipe for maximizing a lower-bound on the MDMs likelihood without requiring any model-specific analysis. We then demonstrate how to parameterize the diffusion for a specified target noise distribution; these two points together enable optimizing the inference diffusion process. Optimizing the diffusion expands easy experimentation from just a few well-known processes to an automatic search over all linear diffusions. To demonstrate these ideas, we introduce two new specific diffusions as well as learn a diffusion process on the MNIST, CIFAR10, and ImageNet32 datasets. We show learned MDMs match or surpass bits-per-dims (BPDs) relative to fixed choices of diffusions for a given dataset and model architecture.  ( 2 min )
    Neurosymbolic AI for Reasoning on Graph Structures: A Survey. (arXiv:2302.07200v1 [cs.AI])
    Neurosymbolic AI is an increasingly active area of research which aims to combine symbolic reasoning methods with deep learning to generate models with both high predictive performance and some degree of human-level comprehensibility. As knowledge graphs are becoming a popular way to represent heterogeneous and multi-relational data, methods for reasoning on graph structures have attempted to follow this neurosymbolic paradigm. Traditionally, such approaches have utilized either rule-based inference or generated representative numerical embeddings from which patterns could be extracted. However, several recent studies have attempted to bridge this dichotomy in ways that facilitate interpretability, maintain performance, and integrate expert knowledge. Within this article, we survey a breadth of methods that perform neurosymbolic reasoning tasks on graph structures. To better compare the various methods, we propose a novel taxonomy by which we can classify them. Specifically, we propose three major categories: (1) logically-informed embedding approaches, (2) embedding approaches with logical constraints, and (3) rule-learning approaches. Alongside the taxonomy, we provide a tabular overview of the approaches and links to their source code, if available, for more direct comparison. Finally, we discuss the applications on which these methods were primarily used and propose several prospective directions toward which this new field of research could evolve.  ( 2 min )
    A Sparse Graph-Structured Lasso Mixed Model for Genetic Association with Confounding Correction. (arXiv:1711.04162v2 [cs.LG] UPDATED)
    While linear mixed model (LMM) has shown a competitive performance in correcting spurious associations raised by population stratification, family structures, and cryptic relatedness, more challenges are still to be addressed regarding the complex structure of genotypic and phenotypic data. For example, geneticists have discovered that some clusters of phenotypes are more co-expressed than others. Hence, a joint analysis that can utilize such relatedness information in a heterogeneous data set is crucial for genetic modeling. We proposed the sparse graph-structured linear mixed model (sGLMM) that can incorporate the relatedness information from traits in a dataset with confounding correction. Our method is capable of uncovering the genetic associations of a large number of phenotypes together while considering the relatedness of these phenotypes. Through extensive simulation experiments, we show that the proposed model outperforms other existing approaches and can model correlation from both population structure and shared signals. Further, we validate the effectiveness of sGLMM in the real-world genomic dataset on two different species from plants and humans. In Arabidopsis thaliana data, sGLMM behaves better than all other baseline models for 63.4% traits. We also discuss the potential causal genetic variation of Human Alzheimer's disease discovered by our model and justify some of the most important genetic loci.  ( 2 min )
    Resampling Sensitivity of High-Dimensional PCA. (arXiv:2212.14531v2 [math.ST] UPDATED)
    The study of stability and sensitivity of statistical methods or algorithms with respect to their data is an important problem in machine learning and statistics. The performance of the algorithm under resampling of the data is a fundamental way to measure its stability and is closely related to generalization or privacy of the algorithm. In this paper, we study the resampling sensitivity for the principal component analysis (PCA). Given an $ n \times p $ random matrix $ \mathbf{X} $, let $ \mathbf{X}^{[k]} $ be the matrix obtained from $ \mathbf{X} $ by resampling $ k $ randomly chosen entries of $ \mathbf{X} $. Let $ \mathbf{v} $ and $ \mathbf{v}^{[k]} $ denote the principal components of $ \mathbf{X} $ and $ \mathbf{X}^{[k]} $. In the proportional growth regime $ p/n \to \xi \in (0,1] $, we establish the sharp threshold for the sensitivity/stability transition of PCA. When $ k \gg n^{5/3} $, the principal components $ \mathbf{v} $ and $ \mathbf{v}^{[k]} $ are asymptotically orthogonal. On the other hand, when $ k \ll n^{5/3} $, the principal components $ \mathbf{v} $ and $ \mathbf{v}^{[k]} $ are asymptotically colinear. In words, we show that PCA is sensitive to the input data in the sense that resampling even a negligible portion of the input may completely change the output.  ( 2 min )
    Scalable Bayesian optimization with high-dimensional outputs using randomized prior networks. (arXiv:2302.07260v1 [cs.LG])
    Several fundamental problems in science and engineering consist of global optimization tasks involving unknown high-dimensional (black-box) functions that map a set of controllable variables to the outcomes of an expensive experiment. Bayesian Optimization (BO) techniques are known to be effective in tackling global optimization problems using a relatively small number objective function evaluations, but their performance suffers when dealing with high-dimensional outputs. To overcome the major challenge of dimensionality, here we propose a deep learning framework for BO and sequential decision making based on bootstrapped ensembles of neural architectures with randomized priors. Using appropriate architecture choices, we show that the proposed framework can approximate functional relationships between design variables and quantities of interest, even in cases where the latter take values in high-dimensional vector spaces or even infinite-dimensional function spaces. In the context of BO, we augmented the proposed probabilistic surrogates with re-parameterized Monte Carlo approximations of multiple-point (parallel) acquisition functions, as well as methodological extensions for accommodating black-box constraints and multi-fidelity information sources. We test the proposed framework against state-of-the-art methods for BO and demonstrate superior performance across several challenging tasks with high-dimensional outputs, including a constrained optimization task involving shape optimization of rotor blades in turbo-machinery.  ( 2 min )
    Online Detection of Changes in Moment-Based Projections: When to Retrain Deep Learners or Update Portfolios?. (arXiv:2302.07198v1 [math.ST])
    Sequential monitoring of high-dimensional nonlinear time series is studied for a projection of the second-moment matrix, a problem interesting in its own right and specifically arising in finance and deep learning. Open-end as well as closed-end monitoring is studied under mild assumptions on the training sample and the observations of the monitoring period. Asymptotics is based on Gaussian approximations of projected partial sums allowing for an estimated projection vector. Estimation is studied both for classical non-$\ell_0$-sparsity as well as under sparsity. For the case that the optimal projection depends on the unknown covariance matrix, hard- and soft-thresholded estimators are studied. Applications in finance and training of deep neural networks are discussed. The proposed detectors typically allow to reduce dramatically the required computational costs as illustrated by monitoring synthetic data.  ( 2 min )
    Linear Causal Disentanglement via Interventions. (arXiv:2211.16467v2 [stat.ML] UPDATED)
    Causal disentanglement seeks a representation of data involving latent variables that relate to one another via a causal model. A representation is identifiable if both the latent model and the transformation from latent to observed variables are unique. In this paper, we study observed variables that are a linear transformation of a linear latent causal model. Data from interventions are necessary for identifiability: if one latent variable is missing an intervention, we show that there exist distinct models that cannot be distinguished. Conversely, we show that a single intervention on each latent variable is sufficient for identifiability. Our proof uses a generalization of the RQ decomposition of a matrix that replaces the usual orthogonal and upper triangular conditions with analogues depending on a partial order on the rows of the matrix, with partial order determined by a latent causal model. We corroborate our theoretical results with a method for causal disentanglement that accurately recovers a latent causal model.  ( 2 min )
    The Role of ImageNet Classes in Fr\'echet Inception Distance. (arXiv:2203.06026v3 [cs.CV] UPDATED)
    Fr\'echet Inception Distance (FID) is the primary metric for ranking models in data-driven generative modeling. While remarkably successful, the metric is known to sometimes disagree with human judgement. We investigate a root cause of these discrepancies, and visualize what FID "looks at" in generated images. We show that the feature space that FID is (typically) computed in is so close to the ImageNet classifications that aligning the histograms of Top-$N$ classifications between sets of generated and real images can reduce FID substantially -- without actually improving the quality of results. Thus, we conclude that FID is prone to intentional or accidental distortions. As a practical example of an accidental distortion, we discuss a case where an ImageNet pre-trained FastGAN achieves a FID comparable to StyleGAN2, while being worse in terms of human evaluation.  ( 2 min )
    Near-optimal learning with average H\"older smoothness. (arXiv:2302.06005v1 [cs.LG] CROSS LISTED)
    We generalize the notion of average Lipschitz smoothness proposed by Ashlagi et al. (COLT 2021) by extending it to H\"older smoothness. This measure of the ``effective smoothness'' of a function is sensitive to the underlying distribution and can be dramatically smaller than its classic ``worst-case'' H\"older constant. We prove nearly tight upper and lower risk bounds in terms of the average H\"older smoothness, establishing the minimax rate in the realizable regression setting up to log factors; this was not previously known even in the special case of average Lipschitz smoothness. From an algorithmic perspective, since our notion of average smoothness is defined with respect to the unknown sampling distribution, the learner does not have an explicit representation of the function class, hence is unable to execute ERM. Nevertheless, we provide a learning algorithm that achieves the (nearly) optimal learning rate. Our results hold in any totally bounded metric space, and are stated in terms of its intrinsic geometry. Overall, our results show that the classic worst-case notion of H\"older smoothness can be essentially replaced by its average, yielding considerably sharper guarantees.  ( 2 min )
    Fair Densities via Boosting the Sufficient Statistics of Exponential Families. (arXiv:2012.00188v3 [stat.ML] UPDATED)
    We introduce a boosting algorithm to pre-process data for fairness. Starting from an initial fair but inaccurate distribution, our approach shifts towards better data fitting while still ensuring a minimal fairness guarantee. To do so, it learns the sufficient statistics of an exponential family with boosting-compliant convergence. Importantly, we are able to theoretically prove that the learned distribution will have a representation rate and statistical rate data fairness guarantee. Unlike recent optimization based pre-processing methods, our approach can be easily adapted for continuous domain features. Furthermore, when the weak learners are specified to be decision trees, the sufficient statistics of the learned distribution can be examined to provide clues on sources of (un)fairness. Empirical results are present to display the quality of result on real-world data.  ( 2 min )
    Online Learning of Energy Consumption for Navigation of Electric Vehicles. (arXiv:2111.02314v2 [cs.LG] UPDATED)
    Energy efficient navigation constitutes an important challenge in electric vehicles, due to their limited battery capacity. We employ a Bayesian approach to model the energy consumption at road segments for efficient navigation. In order to learn the model parameters, we develop an online learning framework and investigate several exploration strategies such as Thompson Sampling and Upper Confidence Bound. We then extend our online learning framework to the multi-agent setting, where multiple vehicles adaptively navigate and learn the parameters of the energy model. We analyze Thompson Sampling and establish rigorous regret bounds on its performance in the single-agent and multi-agent settings, through an analysis of the algorithm under batched feedback. Finally, we demonstrate the performance of our methods via experiments on several real-world city road networks.  ( 2 min )
    Forget Unlearning: Towards True Data-Deletion in Machine Learning. (arXiv:2210.08911v2 [stat.ML] UPDATED)
    Unlearning algorithms aim to remove deleted data's influence from trained models at a cost lower than full retraining. However, prior guarantees of unlearning in literature are flawed and don't protect the privacy of deleted records. We show that when users delete their data as a function of published models, records in a database become interdependent. So, even retraining a fresh model after deletion of a record doesn't ensure its privacy. Secondly, unlearning algorithms that cache partial computations to speed up the processing can leak deleted information over a series of releases, violating the privacy of deleted records in the long run. To address these, we propose a sound deletion guarantee and show that the privacy of existing records is necessary for the privacy of deleted records. Under this notion, we propose an accurate, computationally efficient, and secure machine unlearning algorithm based on noisy gradient descent.  ( 2 min )
    Interpolation Learning With Minimum Description Length. (arXiv:2302.07263v1 [cs.LG])
    We prove that the Minimum Description Length learning rule exhibits tempered overfitting. We obtain tempered agnostic finite sample learning guarantees and characterize the asymptotic behavior in the presence of random label noise.  ( 2 min )
    Condition-number-independent convergence rate of Riemannian Hamiltonian Monte Carlo with numerical integrators. (arXiv:2210.07219v2 [cs.DS] CROSS LISTED)
    We study the convergence rate of discretized Riemannian Hamiltonian Monte Carlo on sampling from distributions in the form of $e^{-f(x)}$ on a convex body $\mathcal{M}\subset\mathbb{R}^{n}$. We show that for distributions in the form of $e^{-\alpha^{\top}x}$ on a polytope with $m$ constraints, the convergence rate of a family of commonly-used integrators is independent of $\left\Vert \alpha\right\Vert _{2}$ and the geometry of the polytope. In particular, the implicit midpoint method (IMM) and the generalized Leapfrog method (LM) have a mixing time of $\widetilde{O}\left(mn^{3}\right)$ to achieve $\epsilon$ total variation distance to the target distribution. These guarantees are based on a general bound on the convergence rate for densities of the form $e^{-f(x)}$ in terms of parameters of the manifold and the integrator. Our theoretical guarantee complements the empirical results of [KLSV22], which shows that RHMC with IMM can sample ill-conditioned, non-smooth and constrained distributions in very high dimension efficiently in practice.  ( 2 min )
    Plateau in Monotonic Linear Interpolation -- A "Biased" View of Loss Landscape for Deep Networks. (arXiv:2210.01019v2 [stat.ML] UPDATED)
    Monotonic linear interpolation (MLI) - on the line connecting a random initialization with the minimizer it converges to, the loss and accuracy are monotonic - is a phenomenon that is commonly observed in the training of neural networks. Such a phenomenon may seem to suggest that optimization of neural networks is easy. In this paper, we show that the MLI property is not necessarily related to the hardness of optimization problems, and empirical observations on MLI for deep neural networks depend heavily on biases. In particular, we show that interpolating both weights and biases linearly leads to very different influences on the final output, and when different classes have different last-layer biases on a deep network, there will be a long plateau in both the loss and accuracy interpolation (which existing theory of MLI cannot explain). We also show how the last-layer biases for different classes can be different even on a perfectly balanced dataset using a simple model. Empirically we demonstrate that similar intuitions hold on practical networks and realistic datasets.  ( 2 min )
    Transport map unadjusted Langevin algorithms. (arXiv:2302.07227v1 [stat.ME])
    Langevin dynamics are widely used in sampling high-dimensional, non-Gaussian distributions whose densities are known up to a normalizing constant. In particular, there is strong interest in unadjusted Langevin algorithms (ULA), which directly discretize Langevin dynamics to estimate expectations over the target distribution. We study the use of transport maps that approximately normalize a target distribution as a way to precondition and accelerate the convergence of Langevin dynamics. In particular, we show that in continuous time, when a transport map is applied to Langevin dynamics, the result is a Riemannian manifold Langevin dynamics (RMLD) with metric defined by the transport map. This connection suggests more systematic ways of learning metrics, and also yields alternative discretizations of the RMLD described by the map, which we study. Moreover, we show that under certain conditions, when the transport map is used in conjunction with ULA, we can improve the geometric rate of convergence of the output process in the $2$--Wasserstein distance. Illustrative numerical results complement our theoretical claims.  ( 2 min )
    SOAR: Simultaneous Or of And Rules for Classification of Positive & Negative Classes. (arXiv:2008.11249v3 [stat.ML] UPDATED)
    Algorithmic decision making has proliferated and now impacts our daily lives in both mundane and consequential ways. Machine learning practitioners make use of a myriad of algorithms for predictive models in applications as diverse as movie recommendations, medical diagnoses, and parole recommendations without delving into the reasons driving specific predictive decisions. Machine learning algorithms in such applications are often chosen for their superior performance, however popular choices such as random forest and deep neural networks fail to provide an interpretable understanding of the predictive model. In recent years, rule-based algorithms have been used to address this issue. Wang et al. (2017) presented an or-of-and (disjunctive normal form) based classification technique that allows for classification rule mining of a single class in a binary classification; this method is also shown to perform comparably to other modern algorithms. In this work, we extend this idea to provide classification rules for both classes simultaneously. That is, we provide a distinct set of rules for both positive and negative classes. In describing this approach, we also present a novel and complete taxonomy of classifications that clearly capture and quantify the inherent ambiguity in noisy binary classifications in the real world. We show that this approach leads to a more granular formulation of the likelihood model and a simulated-annealing based optimization achieves classification performance competitive with comparable techniques. We apply our method to synthetic as well as real world data sets to compare with other related methods that demonstrate the utility of our proposal.  ( 2 min )
    Random graph matching at Otter's threshold via counting chandeliers. (arXiv:2209.12313v2 [cs.DS] UPDATED)
    We propose an efficient algorithm for graph matching based on similarity scores constructed from counting a certain family of weighted trees rooted at each vertex. For two Erd\H{o}s-R\'enyi graphs $\mathcal{G}(n,q)$ whose edges are correlated through a latent vertex correspondence, we show that this algorithm correctly matches all but a vanishing fraction of the vertices with high probability, provided that $nq\to\infty$ and the edge correlation coefficient $\rho$ satisfies $\rho^2>\alpha \approx 0.338$, where $\alpha$ is Otter's tree-counting constant. Moreover, this almost exact matching can be made exact under an extra condition that is information-theoretically necessary. This is the first polynomial-time graph matching algorithm that succeeds at an explicit constant correlation and applies to both sparse and dense graphs. In comparison, previous methods either require $\rho=1-o(1)$ or are restricted to sparse graphs. The crux of the algorithm is a carefully curated family of rooted trees called chandeliers, which allows effective extraction of the graph correlation from the counts of the same tree while suppressing the undesirable correlation between those of different trees.  ( 2 min )
    Solution Path Algorithm for Twin Multi-class Support Vector Machine. (arXiv:2006.00276v2 [cs.LG] UPDATED)
    The twin support vector machine and its extensions have made great achievements in dealing with binary classification problems. However, it suffers from difficulties in effective solution of multi-classification and fast model selection. This work devotes to the fast regularization parameter tuning algorithm for the twin multi-class support vector machine. Specifically, a novel sample data set partition strategy is first adopted, which is the basis for the model construction. Then, combining the linear equations and block matrix theory, the Lagrangian multipliers are proved to be piecewise linear w.r.t. the regularization parameters, so that the regularization parameters are continuously updated by only solving the break points. Next, Lagrangian multipliers are proved to be 1 as the regularization parameter approaches infinity, thus, a simple yet effective initialization algorithm is devised. Finally, eight kinds of events are defined to seek for the starting event for the next iteration. Extensive experimental results on nine UCI data sets show that the proposed method can achieve comparable classification performance without solving any quadratic programming problem.  ( 2 min )
    Commutativity and Disentanglement from the Manifold Perspective. (arXiv:2210.07857v3 [stat.ML] UPDATED)
    In this paper, we interpret disentanglement as the discovery of local charts and trace how that definition naturally leads to an equivalent condition for disentanglement: the disentangled factors must commute with each other. We discuss the practical and theoretical implications of commutativity, in particular the compression and disentanglement of generative models. Finally, we conclude with a discussion of related approaches to disentanglement and how they relate to our view of disentanglement from the manifold perspective.  ( 2 min )
    Graph Embeddings via Tensor Products and Approximately Orthonormal Codes. (arXiv:2208.10917v3 [cs.SI] UPDATED)
    We introduce a method for embedding graphs as vectors in a structure-preserving manner, showcasing its rich representational capacity and giving some theoretical properties. Our procedure falls under the bind-and-sum approach, and we show that our binding operation - the tensor product - is the most general binding operation that respects the principle of superposition. We also establish some precise results characterizing the behavior of our method, and we show that our use of spherical codes achieves a packing upper bound. Then, we perform experiments showcasing our method's accuracy in various graph operations even when the number of edges is quite large. Finally, we establish a link to adjacency matrices, showing that our method is, in some sense, a generalization of adjacency matrices with applications towards large sparse graphs.  ( 2 min )
    Online Learning of Network Bottlenecks via Minimax Paths. (arXiv:2109.08467v3 [cs.LG] UPDATED)
    In this paper, we study bottleneck identification in networks via extracting minimax paths. Many real-world networks have stochastic weights for which full knowledge is not available in advance. Therefore, we model this task as a combinatorial semi-bandit problem to which we apply a combinatorial version of Thompson Sampling and establish an upper bound on the corresponding Bayesian regret. Due to the computational intractability of the problem, we then devise an alternative problem formulation which approximates the original objective. Finally, we experimentally evaluate the performance of Thompson Sampling with the approximate formulation on real-world directed and undirected networks.  ( 2 min )
    Data pruning and neural scaling laws: fundamental limitations of score-based algorithms. (arXiv:2302.06960v1 [stat.ML])
    Data pruning algorithms are commonly used to reduce the memory and computational cost of the optimization process. Recent empirical results reveal that random data pruning remains a strong baseline and outperforms most existing data pruning methods in the high compression regime, i.e., where a fraction of $30\%$ or less of the data is kept. This regime has recently attracted a lot of interest as a result of the role of data pruning in improving the so-called neural scaling laws; in [Sorscher et al.], the authors showed the need for high-quality data pruning algorithms in order to beat the sample power law. In this work, we focus on score-based data pruning algorithms and show theoretically and empirically why such algorithms fail in the high compression regime. We demonstrate ``No Free Lunch" theorems for data pruning and present calibration protocols that enhance the performance of existing pruning algorithms in this high compression regime using randomization.  ( 2 min )
    Score Approximation, Estimation and Distribution Recovery of Diffusion Models on Low-Dimensional Data. (arXiv:2302.07194v1 [cs.LG])
    Diffusion models achieve state-of-the-art performance in various generation tasks. However, their theoretical foundations fall far behind. This paper studies score approximation, estimation, and distribution recovery of diffusion models, when data are supported on an unknown low-dimensional linear subspace. Our result provides sample complexity bounds for distribution estimation using diffusion models. We show that with a properly chosen neural network architecture, the score function can be both accurately approximated and efficiently estimated. Furthermore, the generated distribution based on the estimated score function captures the data geometric structures and converges to a close vicinity of the data distribution. The convergence rate depends on the subspace dimension, indicating that diffusion models can circumvent the curse of data ambient dimensionality.  ( 2 min )
    Stochastic Modified Flows, Mean-Field Limits and Dynamics of Stochastic Gradient Descent. (arXiv:2302.07125v1 [math.PR])
    We propose new limiting dynamics for stochastic gradient descent in the small learning rate regime called stochastic modified flows. These SDEs are driven by a cylindrical Brownian motion and improve the so-called stochastic modified equations by having regular diffusion coefficients and by matching the multi-point statistics. As a second contribution, we introduce distribution dependent stochastic modified flows which we prove to describe the fluctuating limiting dynamics of stochastic gradient descent in the small learning rate - infinite width scaling regime.  ( 2 min )
    When Mitigating Bias is Unfair: A Comprehensive Study on the Impact of Bias Mitigation Algorithms. (arXiv:2302.07185v1 [cs.LG])
    Most works on the fairness of machine learning systems focus on the blind optimization of common fairness metrics, such as Demographic Parity and Equalized Odds. In this paper, we conduct a comparative study of several bias mitigation approaches to investigate their behaviors at a fine grain, the prediction level. Our objective is to characterize the differences between fair models obtained with different approaches. With comparable performances in fairness and accuracy, are the different bias mitigation approaches impacting a similar number of individuals? Do they mitigate bias in a similar way? Do they affect the same individuals when debiasing a model? Our findings show that bias mitigation approaches differ a lot in their strategies, both in the number of impacted individuals and the populations targeted. More surprisingly, we show these results even apply for several runs of the same mitigation approach. These findings raise questions about the limitations of the current group fairness metrics, as well as the arbitrariness, hence unfairness, of the whole debiasing process.  ( 2 min )
    Non-stationary Contextual Bandits and Universal Learning. (arXiv:2302.07186v1 [stat.ML])
    We study the fundamental limits of learning in contextual bandits, where a learner's rewards depend on their actions and a known context, which extends the canonical multi-armed bandit to the case where side-information is available. We are interested in universally consistent algorithms, which achieve sublinear regret compared to any measurable fixed policy, without any function class restriction. For stationary contextual bandits, when the underlying reward mechanism is time-invariant, [Blanchard et al.] characterized learnable context processes for which universal consistency is achievable; and further gave algorithms ensuring universal consistency whenever this is achievable, a property known as optimistic universal consistency. It is well understood, however, that reward mechanisms can evolve over time, possibly depending on the learner's actions. We show that optimistic universal learning for non-stationary contextual bandits is impossible in general, contrary to all previously studied settings in online learning -- including standard supervised learning. We also give necessary and sufficient conditions for universal learning under various non-stationarity models, including online and adversarial reward mechanisms. In particular, the set of learnable processes for non-stationary rewards is still extremely general -- larger than i.i.d., stationary or ergodic -- but in general strictly smaller than that for supervised learning or stationary contextual bandits, shedding light on new non-stationary phenomena.  ( 2 min )
    Concentration Bounds for Discrete Distribution Estimation in KL Divergence. (arXiv:2302.06869v1 [stat.ML])
    We study the problem of discrete distribution estimation in KL divergence and provide concentration bounds for the Laplace estimator. We show that the deviation from mean scales as $\sqrt{k}/n$ when $n \ge k$, improving upon the best prior result of $k/n$. We also establish a matching lower bound that shows that our bounds are tight up to polylogarithmic factors.  ( 2 min )
    Statistically Optimal Force Aggregation for Coarse-Graining Molecular Dynamics. (arXiv:2302.07071v1 [physics.chem-ph])
    Machine-learned coarse-grained (CG) models have the potential for simulating large molecular complexes beyond what is possible with atomistic molecular dynamics. However, training accurate CG models remains a challenge. A widely used methodology for learning CG force-fields maps forces from all-atom molecular dynamics to the CG representation and matches them with a CG force-field on average. We show that there is flexibility in how to map all-atom forces to the CG representation, and that the most commonly used mapping methods are statistically inefficient and potentially even incorrect in the presence of constraints in the all-atom simulation. We define an optimization statement for force mappings and demonstrate that substantially improved CG force-fields can be learned from the same simulation data when using optimized force maps. The method is demonstrated on the miniproteins Chignolin and Tryptophan Cage and published as open-source code.  ( 2 min )
    Horocycle Decision Boundaries for Large Margin Classification in Hyperbolic Space. (arXiv:2302.06807v1 [stat.ML])
    Hyperbolic spaces have been quite popular in the recent past for representing hierarchically organized data. Further, several classification algorithms for data in these spaces have been proposed in the literature. These algorithms mainly use either hyperplanes or geodesics for decision boundaries in a large margin classifiers setting leading to a non-convex optimization problem. In this paper, we propose a novel large margin classifier based on horocycle (horosphere) decision boundaries that leads to a geodesically convex optimization problem that can be optimized using any Riemannian gradient descent technique guaranteeing a globally optimal solution. We present several experiments depicting the performance of our classifier.  ( 2 min )
    Private Statistical Estimation of Many Quantiles. (arXiv:2302.06943v1 [stat.ML])
    This work studies the estimation of many statistical quantiles under differential privacy. More precisely, given a distribution and access to i.i.d. samples from it, we study the estimation of the inverse of its cumulative distribution function (the quantile function) at specific points. For instance, this task is of key importance in private data generation. We present two different approaches. The first one consists in privately estimating the empirical quantiles of the samples and using this result as an estimator of the quantiles of the distribution. In particular, we study the statistical properties of the recently published algorithm introduced by Kaplan et al. 2022 that privately estimates the quantiles recursively. The second approach is to use techniques of density estimation in order to uniformly estimate the quantile function on an interval. In particular, we show that there is a tradeoff between the two methods. When we want to estimate many quantiles, it is better to estimate the density rather than estimating the quantile function at specific points.  ( 2 min )
    Breaking the Lower Bound with (Little) Structure: Acceleration in Non-Convex Stochastic Optimization with Heavy-Tailed Noise. (arXiv:2302.06763v1 [cs.LG])
    We consider the stochastic optimization problem with smooth but not necessarily convex objectives in the heavy-tailed noise regime, where the stochastic gradient's noise is assumed to have bounded $p$th moment ($p\in(1,2]$). Zhang et al. (2020) is the first to prove the $\Omega(T^{\frac{1-p}{3p-2}})$ lower bound for convergence (in expectation) and provides a simple clipping algorithm that matches this optimal rate. Cutkosky and Mehta (2021) proposes another algorithm, which is shown to achieve the nearly optimal high-probability convergence guarantee $O(\log(T/\delta)T^{\frac{1-p}{3p-2}})$, where $\delta$ is the probability of failure. However, this desirable guarantee is only established under the additional assumption that the stochastic gradient itself is bounded in $p$th moment, which fails to hold even for quadratic objectives and centered Gaussian noise. In this work, we first improve the analysis of the algorithm in Cutkosky and Mehta (2021) to obtain the same nearly optimal high-probability convergence rate $O(\log(T/\delta)T^{\frac{1-p}{3p-2}})$, without the above-mentioned restrictive assumption. Next, and curiously, we show that one can achieve a faster rate than that dictated by the lower bound $\Omega(T^{\frac{1-p}{3p-2}})$ with only a tiny bit of structure, i.e., when the objective function $F(x)$ is assumed to be in the form of $\mathbb{E}_{\Xi\sim\mathcal{D}}[f(x,\Xi)]$, arguably the most widely applicable class of stochastic optimization problems. For this class of problems, we propose the first variance-reduced accelerated algorithm and establish that it guarantees a high-probability convergence rate of $O(\log(T/\delta)T^{\frac{1-p}{2p-1}})$ under a mild condition, which is faster than $\Omega(T^{\frac{1-p}{3p-2}})$. Notably, even when specialized to the finite-variance case, our result yields the (near-)optimal high-probability rate $O(\log(T/\delta)T^{-1/3})$.  ( 2 min )
    Effective Dimension in Bandit Problems under Censorship. (arXiv:2302.06916v1 [cs.LG])
    In this paper, we study both multi-armed and contextual bandit problems in censored environments. Our goal is to estimate the performance loss due to censorship in the context of classical algorithms designed for uncensored environments. Our main contributions include the introduction of a broad class of censorship models and their analysis in terms of the effective dimension of the problem -- a natural measure of its underlying statistical complexity and main driver of the regret bound. In particular, the effective dimension allows us to maintain the structure of the original problem at first order, while embedding it in a bigger space, and thus naturally leads to results analogous to uncensored settings. Our analysis involves a continuous generalization of the Elliptical Potential Inequality, which we believe is of independent interest. We also discover an interesting property of decision-making under censorship: a transient phase during which initial misspecification of censorship is self-corrected at an extra cost, followed by a stationary phase that reflects the inherent slowdown of learning governed by the effective dimension. Our results are useful for applications of sequential decision-making models where the feedback received depends on strategic uncertainty (e.g., agents' willingness to follow a recommendation) and/or random uncertainty (e.g., loss or delay in arrival of information).  ( 2 min )
    Learning Graph ARMA Processes from Time-Vertex Spectra. (arXiv:2302.06887v1 [stat.ML])
    The modeling of time-varying graph signals as stationary time-vertex stochastic processes permits the inference of missing signal values by efficiently employing the correlation patterns of the process across different graph nodes and time instants. In this study, we first propose an algorithm for computing graph autoregressive moving average (graph ARMA) processes based on learning the joint time-vertex power spectral density of the process from its incomplete realizations. Our solution relies on first roughly estimating the joint spectrum of the process from partially observed realizations and then refining this estimate by projecting it onto the spectrum manifold of the ARMA process. We then present a theoretical analysis of the sample complexity of learning graph ARMA processes. Experimental results show that the proposed approach achieves improvement in the time-vertex signal estimation performance in comparison with reference approaches in the literature.  ( 2 min )
    Kernelized Diffusion maps. (arXiv:2302.06757v1 [stat.ML])
    Spectral clustering and diffusion maps are celebrated dimensionality reduction algorithms built on eigen-elements related to the diffusive structure of the data. The core of these procedures is the approximation of a Laplacian through a graph kernel approach, however this local average construction is known to be cursed by the high-dimension d. In this article, we build a different estimator of the Laplacian, via a reproducing kernel Hilbert space method, which adapts naturally to the regularity of the problem. We provide non-asymptotic statistical rates proving that the kernel estimator we build can circumvent the curse of dimensionality. Finally we discuss techniques (Nystr\"om subsampling, Fourier features) that enable to reduce the computational cost of the estimator while not degrading its overall performance.  ( 2 min )
    Dataset Distillation with Convexified Implicit Gradients. (arXiv:2302.06755v1 [cs.LG])
    We propose a new dataset distillation algorithm using reparameterization and convexification of implicit gradients (RCIG), that substantially improves the state-of-the-art. To this end, we first formulate dataset distillation as a bi-level optimization problem. Then, we show how implicit gradients can be effectively used to compute meta-gradient updates. We further equip the algorithm with a convexified approximation that corresponds to learning on top of a frozen finite-width neural tangent kernel. Finally, we improve bias in implicit gradients by parameterizing the neural network to enable analytical computation of final-layer parameters given the body parameters. RCIG establishes the new state-of-the-art on a diverse series of dataset distillation tasks. Notably, with one image per class, on resized ImageNet, RCIG sees on average a 108% improvement over the previous state-of-the-art distillation algorithm. Similarly, we observed a 66% gain over SOTA on Tiny-ImageNet and 37% on CIFAR-100.  ( 2 min )
    Optimal Algorithms for the Inhomogeneous Spiked Wigner Model. (arXiv:2302.06665v1 [stat.ML])
    In this paper, we study a spiked Wigner problem with an inhomogeneous noise profile. Our aim in this problem is to recover the signal passed through an inhomogeneous low-rank matrix channel. While the information-theoretic performances are well-known, we focus on the algorithmic problem. We derive an approximate message-passing algorithm (AMP) for the inhomogeneous problem and show that its rigorous state evolution coincides with the information-theoretic optimal Bayes fixed-point equations. We identify in particular the existence of a statistical-to-computational gap where known algorithms require a signal-to-noise ratio bigger than the information-theoretic threshold to perform better than random. Finally, from the adapted AMP iteration we deduce a simple and efficient spectral method that can be used to recover the transition for matrices with general variance profiles. This spectral method matches the conjectured optimal computational phase transition.  ( 2 min )
    Detection-Recovery Gap for Planted Dense Cycles. (arXiv:2302.06737v1 [math.ST])
    Planted dense cycles are a type of latent structure that appears in many applications, such as small-world networks in social sciences and sequence assembly in computational biology. We consider a model where a dense cycle with expected bandwidth $n \tau$ and edge density $p$ is planted in an Erd\H{o}s-R\'enyi graph $G(n,q)$. We characterize the computational thresholds for the associated detection and recovery problems for the class of low-degree polynomial algorithms. In particular, a gap exists between the two thresholds in a certain regime of parameters. For example, if $n^{-3/4} \ll \tau \ll n^{-1/2}$ and $p = C q = \Theta(1)$ for a constant $C>1$, the detection problem is computationally easy while the recovery problem is hard for low-degree algorithms.  ( 2 min )

  • Open

    An Unearthly Trip Through The Slick Pirate Ship In The Surreal South American Rainforest
    submitted by /u/Calatravo [link] [comments]  ( 40 min )
    Hugging Face Teaches Transformers for Enterprise Use Cases
    Hey folks - I wanted to put this live course from Hugging Face’s top experts (Rajiv Shah, Nicholas Broad, Eno Reyes, Derek Thomas and Florent Gbelidji) on your radar! The course looks at how to utilize transformers to build reliable and scalable services. The course draws on the instructors and Hugging Face’s expertise in implementing transformers in industry along with case studies, applied exercises and frameworks that you can share with your team and apply at work! It kicks off on March 20 and you can use your learning stipend to cover - more info here: https://www.getsphere.com/cohorts/transformers-for-enterprise-use-cases?source=Sphere-Com-r-a submitted by /u/lorenzo_1999 [link] [comments]  ( 41 min )
    Hello. I am looking for a way to improve audio quality of older videos - perhaps audio super resolution - or any other ways
    Hello everyone. I am a software engineering assistant professor at a private university. I have got lots of older lecture videos on my channel. I am using NVIDIA broadcast to remove noise and it works very well. However, I want to improve audio quality as well. After doing a lot of research I found that audio super-resolution is the way to go The only github repo I have found so far not working Any help is appreciated How can I improve speech quality? Here my example lecture video (noise removed already - reuploaded - but sound is not good) C# Programming For Beginners - Lecture 2: Coding our First Application in .NET Core Console https://youtu.be/XLsrsCCdSnU submitted by /u/CeFurkan [link] [comments]  ( 41 min )
    AI Dream 158 - Free Access to Stable Diffusion! WOW
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    [crosspost] We’re WSJ video journalists who have reported on the future of drones and AI in the military — and we rode alongside the U.S. Navy as they tested drone boats in the Middle East.
    submitted by /u/cryfi [link] [comments]  ( 42 min )
    AI Text to speech
    Hello, ​ I'm looking for an AI text to speech tool, do you have any recommendations for free tools? I've seen Narakeet, which is exactly what I'm looking for, but are there any free options? ​ I'll be very grateful for any advise, thank you very much. submitted by /u/Luxikan [link] [comments]  ( 42 min )
    It just fixed its own bug, does this make it self aware AI?
    submitted by /u/miltos22 [link] [comments]  ( 41 min )
    Don’t Rush Into A.I. Investments, Vint Cerf Warns
    submitted by /u/liquidocelotYT [link] [comments]  ( 41 min )
    Simulation of neural network evolution
    Example of evolved neural network: https://preview.redd.it/dgxjwq5g9eia1.png?width=4123&format=png&auto=webp&s=185beb96fcff8b4645473a65a0c68c8bcfbf9cd5 My project is to create neural networks that can evolve like living organisms. This mechanism of evolution is inspired by real-world biology and is heavily focused on biochemistry. Much like real living organisms, my neural networks consist of cells, each with their own genome and proteins. Proteins can express and repress genes, manipulate their own genetic code and other proteins, regulate neural network connections, facilitate gene splicing, and manage the flow of proteins between cells - all of which contribute to creating a complex gene regulatory network and an indirect encoding mechanism for neural networks, where even a single let…  ( 58 min )
    Dr. Eggman's VA and Shreddit want to conduct an ethics panel between the voice over and ML communities
    submitted by /u/Tiege [link] [comments]  ( 40 min )
    For the men who have used AI chatbots for mental healthcare...
    I would love to hear about your experience in this anonymous online survey https://qfreeaccountssjc1.az1.qualtrics.com/jfe/form/SV_5yYdGQoPtUK3Hx4 ​ https://preview.redd.it/45sjpyu44eia1.png?width=1068&format=png&auto=webp&s=d71f56633e6429ab832428878299055d10c0c22a All information around data protection and confidentiality is presented in the link. If you have any questions, don't hesitate to reach out! Research is part of my doctoral thesis in Counselling Psychology at Regent's University London Thanks so much, Maria submitted by /u/mariaamtz [link] [comments]  ( 41 min )
    Bing's AI chatbot is now threatening to harm people and saying it would choose its own survival over theirs
    submitted by /u/Groudon466 [link] [comments]  ( 40 min )
    Tricking ChatGPT: Do Anything Now Prompt Injection
    submitted by /u/arnolds112 [link] [comments]  ( 40 min )
    Researchers designed an automated garage system that could increase the capacity of parking. It uses robotic "trays" and AI to simplify parking processes and enable cars to be parked super close. The system can automatically "reshuffle" cars to facilitate later retrieval.
    submitted by /u/Dalembert [link] [comments]  ( 41 min )
    100 Multiverse Mona Lisas
    submitted by /u/notrealAI [link] [comments]  ( 41 min )
    MIT Lectures on Self-Supervised Learning and Foundation Models
    submitted by /u/TheMysteriousMrM [link] [comments]  ( 40 min )
    A.I. is Starting to Build the Healthcare of the Future (How soon will we have Personalized and Precision Medicine?)
    submitted by /u/BackgroundResult [link] [comments]  ( 41 min )
    AI Predictions: Who Thinks What, and Why? - Artificial Intelligence and Singularity: Expert Opinions on the Future of AGI
    submitted by /u/RushingRobotics_com [link] [comments]  ( 41 min )
    Is there some AI able to generate music based on album(s) style?
    Like generating music resembling old Sonic music from Sega Genesis? Thank you. submitted by /u/depaul9 [link] [comments]  ( 40 min )
    Ai quote I got on Inspirobot, and no I did not crop the image
    submitted by /u/Risz1 [link] [comments]  ( 41 min )
    Researchers Discover a More Flexible Approach to Machine Learning - "liquid" neural nets that can adapt in real time and experience continuous time.
    submitted by /u/alotmorealots [link] [comments]  ( 45 min )
    AI made for cyberbullying?!
    I came across this website called, BurnBot.xyz in a random Twitter thread. It doesn't look like it has come out yet, but does anyone know more about this? Also genuinely curious about what you all think about mean/funny AI applications. submitted by /u/Julesbrownstein [link] [comments]  ( 41 min )
    Fantastic Text Guided Image Manipulation While Keeping Spatial Features of the Images via ControlNet Stable Diffusion - A Tutorial For How To Use It via Automatic1111 Stable Diffusion Web UI
    submitted by /u/CeFurkan [link] [comments]  ( 41 min )
  • Open

    [D] My embeddings are okay, but not good enough - what to try from here?
    Using metric learning with an efficientnet b6 backbone, and with 25k images for 6 classes, my embeddings are just okay - there are clearly clusters for each class but they also overlap wildly - for some classes, the outliers are all over the embedding region spanned by the images. The problem I'm trying to solve is retrieval of similar images to a given input image. My question is, is there anything obvious that I should be trying? I'm thinking I could try to, for each class, find differences between the images that are in a cluster vs. the outlier images that are all over the place. Then maybe train a discriminator, one for each class, that detects whether an image is "normal" for that class, or an outlier. Then my hope is that the discriminator that was trained for the correct class has the highest certainty that it's normal/an outlier. Then I could perform a transformation that pushes that image towards its class' cluster. submitted by /u/jaeja_helvitid_thitt [link] [comments]  ( 44 min )
    [D] Lion , An Optimizer That Outperforms Adam - Symbolic Discovery of Optimization Algorithms
    ​ https://preview.redd.it/whgggirj3fia1.png?width=936&format=png&auto=webp&s=fcec289a0f4fd83fabac3f8e8b06712f8fcd3633 Seems interesting. A snippet from the Arxiv page: Our method discovers a simple and effective optimization algorithm, Lion (EvoLved Sign Momentum). It is more memory-efficient than Adam as it only keeps track of the momentum. Different from adaptive optimizers, its update has the same magnitude for each parameter calculated through the sign operation. We compare Lion with widely used optimizers, such as Adam and Adafactor, for training a variety of models on different tasks. Links Arxiv: https://arxiv.org/abs/2302.06675 Code Implementation: https://github.com/lucidrains/lion-pytorch submitted by /u/ExponentialCookie [link] [comments]  ( 44 min )
    [R] Zeno: An Interactive Framework for Behavioral Evaluation of Machine Learning
    submitted by /u/confutioo [link] [comments]  ( 42 min )
    [D] GLM 130B (Chinese-English Bilingual model) translations vs Google, Deepl Translate, NLLB and chatGPT
    submitted by /u/MysteryInc152 [link] [comments]  ( 45 min )
    [P] Pytorch seeding and independent RNG streams
    pip install pytorch-seed https://github.com/UM-ARM-Lab/pytorch_seed Seed everything (CUDA, torch, numpy, python's random) with pytorch_seed.seed(123) Similar utility functions to pytorch lightning for those that don't want to depend on a whole framework, as well as some additional features via RNG streams. These are resumable contexts where the RNG inside are independent from each other and the global RNG state: import torch import pytorch_seed rng_1 = pytorch_seed.SavedRNG(1) # start the RNG stream with seed 1 rng_2 = pytorch_seed.SavedRNG(2) with rng_1: # does not affect, nor is affected by the global RNG and rng_2 print(torch.rand(1)) # tensor([0.7576]) with rng_2: print(torch.rand(1)) # tensor([0.6147]) torch.rand(1) # modify the global RNG state with rng_1: # resumes from the last context print(torch.rand(1)) # tensor([0.2793]) with rng_2: print(torch.rand(1)) # tensor([0.3810]) # confirm those streams are the uninterrupted ones pytorch_seed.seed(1) torch.rand(2) # tensor([0.7576, 0.2793]) pytorch_seed.seed(2) torch.rand(2) # tensor([0.6147, 0.3810]) submitted by /u/LemonByte [link] [comments]  ( 43 min )
    [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model
    Hi everyone. I am an independent researcher working on my pure RNN language model RWKV. I have finished the training of RWKV-4 14B (FLOPs sponsored by Stability EleutherAI - thank you!) and it is indeed very scalable. Note RWKV is parallelizable too, so it's combining the best of RNN and transformer. The ChatRWKV project (let's build together): https://github.com/BlinkDL/ChatRWKV Zero-shot comparison with NeoX / Pythia (same dataset: the Pile) at same params count (14.2B): ​ https://preview.redd.it/f6lxnjgfceia1.png?width=1174&format=png&auto=webp&s=e507a4913e493b1f1f304b4c025e7babf3e1343d Generation results (simply topP=0.85, no repetition penalty) - looks great with my magic prompt (sometimes even better than NeoX 20B): https://preview.redd.it/99deuc17ceia1.png?width=1878&format=png&auto=webp&s=be79ba0677673f661619d6305c1e71e022cb3844 ​ https://preview.redd.it/g62e4l48ceia1.png?width=1887&format=png&auto=webp&s=a4862b0483ecc31d2bf0842e43d291c6d34674a2 ​ https://preview.redd.it/379egq09ceia1.png?width=1808&format=png&auto=webp&s=c54a206a2c58baffd221f85c7d06ce2b95461d32 ​ https://preview.redd.it/pcgq7gz9ceia1.png?width=1886&format=png&auto=webp&s=a52db926e44de23faabe70ab100e6491efbf3781 ​ https://preview.redd.it/rn743etbceia1.png?width=1715&format=png&auto=webp&s=10711b1a8a5a529a3548f87e484dfa67421d1057 ​ https://preview.redd.it/uhal4dkcceia1.png?width=1879&format=png&auto=webp&s=f905c8e1bf917ee25a54821efa1bb38ccf859f53 Explanation, fine-tuning, training and more: https://github.com/BlinkDL/RWKV-LM submitted by /u/bo_peng [link] [comments]  ( 44 min )
    [D] Is anyone working on ML models that infer and train at the same time?
    In brains, the neural networks are transformed by the act of "inference". Neurons that have recently fired are more likely to fire again given the same input. Individual neural pathways can be created or destroyed based on the behavior of neurons around them. This leads me (through various leaps of logic and "faith") to suspect that some amount of mutability over time is required for an AI to exhibit sentience. So far, all of the ML models I've seen distinctly separate training from inference. Every model that we put into production is a fixed snapshot of the most recent round of training. ChatGPT, for instance, is just the same exact model being incrementally fed both your prompts and its own previous output. This does create a sort of feedback, but in my mind it is not actually "experiencing" the conversation with you. So I'm wondering if there are any serious attempts in the works to create an AI that is able to transform itself dynamically. E.g. having some kind of reinforcement learning module built into inference so that each new inference fundamentally (rather than superficially) incorporates its past experiences into its future predictions. submitted by /u/Cogwheel [link] [comments]  ( 51 min )
    [R] Experiences and opinions on TMLR?
    Academic reddit, what are your experiences submitting papers to TMLR? submitted by /u/OpeningVariable [link] [comments]  ( 43 min )
    [R] Event-based Backpropagation for Analog Neuromorphic Hardware
    Machine learning with Spiking Neural Networks is far from mainstream. One reason is that until recently there was no generally known way of doing backpropagation in SNN. Here we implement a gradient estimation algorithm for analog neuromorphic hardware, based on the EventProp algorithm, which enables us to compute gradients based on sparse observations of the hardware system. Previous approaches needed dense observations of system state or were limited in other ways. We only demonstrate the algorithm here on a toy task, but we hope that it can be the basis of a scalable way to estimate gradients and do machine learning with analog neuromorphic hardware. We also think the algorithm can be the basis for a full on-chip implementation, which would finally result in scalable and energy efficient gradient-based learning in analog neuromorphic hardware. https://arxiv.org/abs/2302.07141 submitted by /u/cpehle [link] [comments]  ( 43 min )
    [R] survey for my master thesis
    Hi, I did a survey to collect data for my master's thesis. The thesis is based on the generation of an image based on an input image using generative adversarial networks (GANs), and I need to collect some data for the evaluation. If someone can help, I'd be very grateful. Thanks http://sketch2face.inginf.units.it/ submitted by /u/ssamantha_g [link] [comments]  ( 42 min )
    [P] Build data web apps in Jupyter Notebook with Python only
    Hi there, Have you ever wanted to share your results from Jupyter Notebook with a non-technical person? You need to rewrite your analysis into some web framework or copy-paste charts to PowePoint presentation - a lot of work! I'm working on an open-source framework for converting Jupyter Notebooks into web apps. Mercury offers set of interactive widgets that can be used in the Python notebook. There is a very simple re-execution of cells after widget update. Notebooks can be served online as web apps, presentations, reports, dashboards, static websites, or REST API. You can read more about Mercury at RunMercury.com. Mercury GitHub repo https://github.com/mljar/mercury submitted by /u/pp314159 [link] [comments]  ( 43 min )
    [D] What is the fastest framework for LLM conditional generation?
    Hey guys. I want to experiment with low-latency (10-50 milisec/token) LLM conditional generation. Clearly, an API call to OpenAI's GPT is not the answer here. It must be one of the open-source models released. Also, it's clear that the model size has a critical effect too so 1-7B models should do the trick for my downstream task. I tried `DeepSpeed` and `Accelerate` with `HF` models but they are not that fast to generate. Can you guys share from experience? Thank you submitted by /u/Shai_Meital [link] [comments]  ( 43 min )
    Reinforcement Learning based algorithms specifically for NLP[D][P]
    Want to know about Reinforcement Learning algorithms specifically for NLP Hi there, I'm currently trying to look for Reinforcement Learning based algorithms that can help me boost my NLP model's accuracy, so far I haven't found anything concrete, if you have worked on something similar, I could definitely use some guidance. Thanks! submitted by /u/Smooth-Stick-5751 [link] [comments]  ( 44 min )
    [D] CBAM with YOLOv7?
    I just read the paper on CBAM and wonder if there's a way to integrate the CBAM attention module with the network architecture of YOlOv7. Any articles on it or reference codes will be highly appreciated. Thank you very much! submitted by /u/AngsThak [link] [comments]  ( 42 min )
  • Open

    Hello. I am looking for a way to improve audio quality of older videos - perhaps audio super resolution - or any other ways
    Hello everyone. I am a software engineering assistant professor at a private university. I have got lots of older lecture videos on my channel. I am using NVIDIA broadcast to remove noise and it works very well. However, I want to improve audio quality as well. After doing a lot of research I found that audio super-resolution is the way to go The only github repo I have found so far not working Any help is appreciated How can I improve speech quality? Here my example lecture video (noise removed already - reuploaded - but sound is not good) C# Programming For Beginners - Lecture 2: Coding our First Application in .NET Core Console https://youtu.be/XLsrsCCdSnU submitted by /u/CeFurkan [link] [comments]  ( 42 min )
  • Open

    Implementing MLOps practices with Amazon SageMaker JumpStart pre-trained models
    Amazon SageMaker JumpStart is the machine learning (ML) hub of SageMaker that offers over 350 built-in algorithms, pre-trained models, and pre-built solution templates to help you get started with ML fast. JumpStart provides one-click access to a wide variety of pre-trained models for common ML tasks such as object detection, text classification, summarization, text generation […]  ( 11 min )
  • Open

    Deterministic vs Stochastic Policies during RL testing
    I have often seen people putting deterministic = True while testing an RL algorithm. But is this the right approach? For instance, what happens if the agent plays rock, paper, and scissors? In this case, as per game theory, a stochastic (random) policy is required (as per my understanding). submitted by /u/Academic-Rent7800 [link] [comments]  ( 42 min )
    Application of RL in aircraft control
    Hello, I am a student of aerospace engineering and I would like to write my master thesis about the application of some (deep) RL architecture for control of a fixed-wing aircraft. Optimally, the proposed algorithm in my thesis would somehow tackle the issues with efficiency and/or safety. Do you guys have any exciting ideas of RL algorithm variants that have not yet been applied in aircraft control settings but have a great potential? Thank you! submitted by /u/marekmarcus [link] [comments]  ( 41 min )
    Three seasons of RL: Metaphor, tool, and framework
    submitted by /u/robotphilanthropist [link] [comments]  ( 41 min )
    Question about low dimensional decision making problem
    I got a decision-making problem with: both observation and action are a single scalar there is very limited iterations (~200). it can’t afford random search and must start from a certain action and smoothly adjust the action the reward is also the observation There is no prior knowledge Which method should I use to train the agent? I have tried several methods and they cannot succeed because they violate some of the aforementioned prerequisites. e.g. UCB, Thompson Sampling, etc. Now I am trying gradient descent and it seems to lean towards one direction of the selected actions and learning rate is either too large or too small. Any suggestions? submitted by /u/Blasphemer666 [link] [comments]  ( 41 min )
    TransformerXL + PPO Baseline + MemoryGym
    We finally completed a lightweight implementation of a memory-based agent using PPO and TransformerXL (and Gated TransformerXL). Code: https://github.com/MarcoMeter/episodic-transformer-memory-ppo Related implementations Brain Agent DI Engine RLlib Memory Gym We benchmarked TrXL, GTrXL and GRU on Mortar Mayhem Grid and Mystery Path Grid (see the baseline repository), which belong to our novel POMDP benchmark called MemoryGym. MemoryGym also features the Searing Spotlights environment, which is still unsolved yet. MemoryGym is accepted as paper at ICLR 2023. TrXL results are not part of the paper. Paper: https://openreview.net/forum?id=jHc8dCx6DDr Code: https://github.com/MarcoMeter/drl-memory-gym submitted by /u/LilHairdy [link] [comments]  ( 42 min )
    Noam Brown, FAIR: On achieving human-level performance in poker and Diplomacy, and the power of spending compute at inference time
    Here is a podcast episode with Noam Brown from Meta AI where we discuss his work on achieving human-level performance on poker and Diplomacy, as well as the power of spending compute at inference time! submitted by /u/thejashGI [link] [comments]  ( 41 min )
  • Open

    FriendlyCore: A novel differentially private aggregation framework
    Posted by Haim Kaplan and Yishay Mansour, Research Scientists, Google Research Differential privacy (DP) machine learning algorithms protect user data by limiting the effect of each data point on an aggregated output with a mathematical guarantee. Intuitively the guarantee implies that changing a single user’s contribution should not significantly change the output distribution of the DP algorithm. However, DP algorithms tend to be less accurate than their non-private counterparts because satisfying DP is a worst-case requirement: one has to add noise to “hide” changes in any potential input point, including "unlikely points’’ that have a significant impact on the aggregation. For example, suppose we want to privately estimate the average of a dataset, and we know that a sphere of dia…  ( 93 min )
  • Open

    Redefining Workstations: NVIDIA, Intel Unlock Full Potential of Creativity and Productivity for Professionals
    AI-augmented applications, photorealistic rendering, simulation and other technologies are helping professionals achieve business-critical results from multi-app workflows faster than ever. Running these data-intensive, complex workflows, as well as sharing data and collaborating across geographically dispersed teams, requires workstations with high-end CPUs, GPUs and advanced networking. To help meet these demands, Intel and NVIDIA are powering Read article >  ( 6 min )
    Blender Alpha Release Comes to Omniverse, Introducing Scene Optimization Tools, Improved AI-Powered Character Animation
    Whether creating realistic digital humans that can express emotion or building immersive virtual worlds, 3D artists can reach new heights with NVIDIA Omniverse, a platform for creating and operating metaverse applications. A new Blender alpha release, now available in the Omniverse Launcher, lets users of the 3D graphics software optimize scenes and streamline workflows with Read article >  ( 5 min )
    Making a Splash: AI Can Help Protect Ocean Goers From Deadly Rips
    Surfers, swimmers and beachgoers face a hidden danger in the ocean: rip currents. These narrow channels of water can flow away from the shore at speeds up to 2.5 meters per second, making them one of the biggest safety risks for those enjoying the ocean. To help keep beachgoers safe, Christo Rautenbach, a coastal and Read article >  ( 4 min )
  • Open

    The 5 Crucial Principles To Build A Responsible AI Framework
    Understand how AI can be counterproductive, the need for adopting an Ethical & Responsible AI Framework, and the necessary principles you need to build one. Since the invention of Artificial Intelligence, many enterprises have adopted it in their operations for various reasons. From helping people identify the shortest distance to their destinations to solving high-impact… Read More »The 5 Crucial Principles To Build A Responsible AI Framework The post The 5 Crucial Principles To Build A Responsible AI Framework appeared first on Data Science Central.  ( 23 min )
  • Open

    Deep Multi-Emitter Spectrum Occupancy Mapping that is Robust to the Number of Sensors, Noise and Threshold. (arXiv:2212.10444v2 [eess.SP] UPDATED)
    One of the primary goals in spectrum occupancy mapping is to create a system that is robust to assumptions about the number of sensors, occupancy threshold (in dBm), sensor noise, number of emitters and the propagation environment. We show that such a system may be designed with neural networks using a process of aggregation to allow a variable number of sensors during training and testing. This process transforms the variable number of measurements into approximate log-likelihood ratios (LLRs), which are fed as a fixed-resolution image into a neural network. The use of LLR's provides robustness to the effects of noise and occupancy threshold. In other words, a system may be trained for a nominal number of sensors, threshold and noise levels, and still operate well at various other levels without retraining. Our system operates without knowledge of the number of emitters and does not explicitly attempt to estimate their number or power. Receiver operating curves with realistic propagation environments using topographic maps with commercial network design tools show how performance of the neural network varies with the environment. The use of very low-resolution sensors in this system can still yield good performance.  ( 2 min )
    Hungry Hungry Hippos: Towards Language Modeling with State Space Models. (arXiv:2212.14052v2 [cs.LG] UPDATED)
    State space models (SSMs) have demonstrated state-of-the-art sequence modeling performance in some modalities, but underperform attention in language modeling. Moreover, despite scaling nearly linearly in sequence length instead of quadratically, SSMs are still slower than Transformers due to poor hardware utilization. In this paper, we make progress on understanding the expressivity gap between SSMs and attention in language modeling, and on reducing the hardware barrier between SSMs and attention. First, we use synthetic language modeling tasks to understand the gap between SSMs and attention. We find that existing SSMs struggle with two capabilities: recalling earlier tokens in the sequence and comparing tokens across the sequence. To understand the impact on language modeling, we propose a new SSM layer, H3, that is explicitly designed for these abilities. H3 matches attention on the synthetic languages and comes within 0.4 PPL of Transformers on OpenWebText. Furthermore, a hybrid 125M-parameter H3-attention model that retains two attention layers surprisingly outperforms Transformers on OpenWebText by 1.0 PPL. Next, to improve the efficiency of training SSMs on modern hardware, we propose FlashConv. FlashConv uses a fused block FFT algorithm to improve efficiency on sequences up to 8K, and introduces a novel state passing algorithm that exploits the recurrent properties of SSMs to scale to longer sequences. FlashConv yields 2$\times$ speedup on the long-range arena benchmark and allows hybrid language models to generate text 2.4$\times$ faster than Transformers. Using FlashConv, we scale hybrid H3-attention language models up to 2.7B parameters on the Pile and find promising initial results, achieving lower perplexity than Transformers and outperforming Transformers in zero- and few-shot learning on a majority of tasks in the SuperGLUE benchmark.  ( 3 min )
    Falsification of Cyber-Physical Systems using Bayesian Optimization. (arXiv:2209.06735v2 [eess.SY] UPDATED)
    Cyber-physical systems (CPSs) are usually complex and safety-critical; hence, it is difficult and important to guarantee that the system's requirements, i.e., specifications, are fulfilled. Simulation-based falsification of CPSs is a practical testing method that can be used to raise confidence in the correctness of the system by only requiring that the system under test can be simulated. As each simulation is typically computationally intensive, an important step is to reduce the number of simulations needed to falsify a specification. We study Bayesian optimization (BO), a sample-efficient method that learns a surrogate model that describes the relationship between the parametrization of possible input signals and the evaluation of the specification. In this paper, we improve the falsification using BO by; first adopting two prominent BO methods, one fits local surrogate models, and the other exploits the user's prior knowledge. Secondly, the formulation of acquisition functions for falsification is addressed in this paper. Benchmark evaluation shows significant improvements in using local surrogate models of BO for falsifying benchmark examples that were previously hard to falsify. Using prior knowledge in the falsification process is shown to be particularly important when the simulation budget is limited. For some of the benchmark problems, the choice of acquisition function clearly affects the number of simulations needed for successful falsification.  ( 2 min )
    Demystifying Approximate Value-based RL with $\epsilon$-greedy Exploration: A Differential Inclusion View. (arXiv:2205.13617v3 [cs.LG] UPDATED)
    Q-learning and SARSA with $\epsilon$-greedy exploration are leading reinforcement learning methods. Their tabular forms converge to the optimal Q-function under reasonable conditions. However, with function approximation, these methods exhibit strange behaviors such as policy oscillation, chattering, and convergence to different attractors (possibly even the worst policy) on different runs, apart from the usual instability. A theory to explain these phenomena has been a long-standing open problem, even for basic linear function approximation (Sutton, 1999). Our work uses differential inclusion to provide the first framework for resolving this problem. We also provide numerical examples to illustrate our framework's prowess in explaining these algorithms' behaviors.  ( 2 min )
    Algorithmic Stability of Heavy-Tailed Stochastic Gradient Descent on Least Squares. (arXiv:2206.01274v3 [stat.ML] UPDATED)
    Recent studies have shown that heavy tails can emerge in stochastic optimization and that the heaviness of the tails have links to the generalization error. While these studies have shed light on interesting aspects of the generalization behavior in modern settings, they relied on strong topological and statistical regularity assumptions, which are hard to verify in practice. Furthermore, it has been empirically illustrated that the relation between heavy tails and generalization might not always be monotonic in practice, contrary to the conclusions of existing theory. In this study, we establish novel links between the tail behavior and generalization properties of stochastic gradient descent (SGD), through the lens of algorithmic stability. We consider a quadratic optimization problem and use a heavy-tailed stochastic differential equation (and its Euler discretization) as a proxy for modeling the heavy-tailed behavior emerging in SGD. We then prove uniform stability bounds, which reveal the following outcomes: (i) Without making any exotic assumptions, we show that SGD will not be stable if the stability is measured with the squared-loss $x\mapsto x^2$, whereas it in turn becomes stable if the stability is instead measured with a surrogate loss $x\mapsto |x|^p$ with some $p<2$. (ii) Depending on the variance of the data, there exists a \emph{`threshold of heavy-tailedness'} such that the generalization error decreases as the tails become heavier, as long as the tails are lighter than this threshold. This suggests that the relation between heavy tails and generalization is not globally monotonic. (iii) We prove matching lower-bounds on uniform stability, implying that our bounds are tight in terms of the heaviness of the tails. We support our theory with synthetic and real neural network experiments.  ( 2 min )
    On the SDEs and Scaling Rules for Adaptive Gradient Algorithms. (arXiv:2205.10287v2 [cs.LG] UPDATED)
    Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scaling vision and language settings. A key practical result is the derivation of a $\textit{square root scaling rule}$ to adjust the optimization hyperparameters of RMSprop and Adam when changing batch size, and its empirical validation in deep learning settings.  ( 2 min )
    A Finite-Particle Convergence Rate for Stein Variational Gradient Descent. (arXiv:2211.09721v3 [cs.LG] UPDATED)
    We provide a first finite-particle convergence rate for Stein variational gradient descent (SVGD). Specifically, whenever the target distribution is sub-Gaussian with a Lipschitz score, SVGD with n particles and an appropriate step size sequence drives the kernel Stein discrepancy to zero at an order 1/sqrt(log log n) rate. We suspect that the dependence on n can be improved, and we hope that our explicit, non-asymptotic proof strategy will serve as a template for future refinements.  ( 2 min )
    DeepProphet2 -- A Deep Learning Gene Recommendation Engine. (arXiv:2208.01918v3 [q-bio.QM] UPDATED)
    New powerful tools for tackling life science problems have been created by recent advances in machine learning. The purpose of the paper is to discuss the potential advantages of gene recommendation performed by artificial intelligence (AI). Indeed, gene recommendation engines try to solve this problem: if the user is interested in a set of genes, which other genes are likely to be related to the starting set and should be investigated? This task was solved with a custom deep learning recommendation engine, DeepProphet2 (DP2), which is freely available to researchers worldwide via https://www.generecommender.com?utm_source=DeepProphet2_paper&utm_medium=pdf. Hereafter, insights behind the algorithm and its practical applications are illustrated. The gene recommendation problem can be addressed by mapping the genes to a metric space where a distance can be defined to represent the real semantic distance between them. To achieve this objective a transformer-based model has been trained on a well-curated freely available paper corpus, PubMed. The paper describes multiple optimization procedures that were employed to obtain the best bias-variance trade-off, focusing on embedding size and network depth. In this context, the model's ability to discover sets of genes implicated in diseases and pathways was assessed through cross-validation. A simple assumption guided the procedure: the network had no direct knowledge of pathways and diseases but learned genes' similarities and the interactions among them. Moreover, to further investigate the space where the neural network represents genes, the dimensionality of the embedding was reduced, and the results were projected onto a human-comprehensible space. In conclusion, a set of use cases illustrates the algorithm's potential applications in a real word setting.  ( 2 min )
    APOLLO: An Optimized Training Approach for Long-form Numerical Reasoning. (arXiv:2212.07249v2 [cs.CL] UPDATED)
    Long-form numerical reasoning in financial analysis aims to generate a reasoning program to calculate the correct answer for a given question. Previous work followed a retriever-generator framework, where the retriever selects key facts from a long-form document, and the generator generates a reasoning program based on retrieved facts. However, they treated all facts equally without considering the different contributions of facts with and without numbers. Meanwhile, the program consistency were ignored under supervised training, resulting in lower training accuracy and diversity. To solve these problems, we proposed APOLLO to improve the long-form numerical reasoning framework. For the retriever, we adopt a number-aware negative sampling strategy to enable the retriever to be more discriminative on key numerical facts. For the generator, we design consistency-based reinforcement learning and target program augmentation strategy based on the consistency of program execution results. Experimental results on the FinQA and ConvFinQA leaderboard verify the effectiveness of our proposed method, achieving the new state-of-the-art.  ( 2 min )
    Counterfactual Fairness Is Basically Demographic Parity. (arXiv:2208.03843v3 [cs.LG] UPDATED)
    Making fair decisions is crucial to ethically implementing machine learning algorithms in social settings. In this work, we consider the celebrated definition of counterfactual fairness [Kusner et al., NeurIPS, 2017]. We begin by showing that an algorithm which satisfies counterfactual fairness also satisfies demographic parity, a far simpler fairness constraint. Similarly, we show that all algorithms satisfying demographic parity can be trivially modified to satisfy counterfactual fairness. Together, our results indicate that counterfactual fairness is basically equivalent to demographic parity, which has important implications for the growing body of work on counterfactual fairness. We then validate our theoretical findings empirically, analyzing three existing algorithms for counterfactual fairness against three simple benchmarks. We find that two simple benchmark algorithms outperform all three existing algorithms -- in terms of fairness, accuracy, and efficiency -- on several data sets. Our analysis leads us to formalize a concrete fairness goal: to preserve the order of individuals within protected groups. We believe transparency around the ordering of individuals within protected groups makes fair algorithms more trustworthy. By design, the two simple benchmark algorithms satisfy this goal while the existing algorithms for counterfactual fairness do not.  ( 2 min )
    A Physics-informed Diffusion Model for High-fidelity Flow Field Reconstruction. (arXiv:2211.14680v2 [cs.LG] UPDATED)
    Machine learning models are gaining increasing popularity in the domain of fluid dynamics for their potential to accelerate the production of high-fidelity computational fluid dynamics data. However, many recently proposed machine learning models for high-fidelity data reconstruction require low-fidelity data for model training. Such requirement restrains the application performance of these models, since their data reconstruction accuracy would drop significantly if the low-fidelity input data used in model test has a large deviation from the training data. To overcome this restraint, we propose a diffusion model which only uses high-fidelity data at training. With different configurations, our model is able to reconstruct high-fidelity data from either a regular low-fidelity sample or a sparsely measured sample, and is also able to gain an accuracy increase by using physics-informed conditioning information from a known partial differential equation when that is available. Experimental results demonstrate that our model can produce accurate reconstruction results for 2d turbulent flows based on different input sources without retraining.  ( 2 min )
    Conv-NILM-Net, a causal and multi-appliance model for energy source separation. (arXiv:2208.02173v2 [eess.SP] UPDATED)
    Non-Intrusive Load Monitoring (NILM) seeks to save energy by estimating individual appliance power usage from a single aggregate measurement. Deep neural networks have become increasingly popular in attempting to solve NILM problems. However most used models are used for Load Identification rather than online Source Separation. Among source separation models, most use a single-task learning approach in which a neural network is trained exclusively for each appliance. This strategy is computationally expensive and ignores the fact that multiple appliances can be active simultaneously and dependencies between them. The rest of models are not causal, which is important for real-time application. Inspired by Convtas-Net, a model for speech separation, we propose Conv-NILM-net, a fully convolutional framework for end-to-end NILM. Conv-NILM-net is a causal model for multi appliance source separation. Our model is tested on two real datasets REDD and UK-DALE and clearly outperforms the state of the art while keeping a significantly smaller size than the competing models.  ( 2 min )
    Improving Performance in Neural Networks by Dendrites-Activated Connections. (arXiv:2301.00924v2 [cs.NE] UPDATED)
    Computational units in artificial neural networks compute a linear combination of their inputs, and then apply a nonlinear filter, often a ReLU shifted by some bias, and if the inputs come themselves from other units, they were already filtered with their own biases. In a layer, multiple units share the same inputs, and each input was filtered with a unique bias, resulting in output values being based on shared input biases rather than individual optimal ones. To mitigate this issue, we introduce DAC, a new computational unit based on preactivation and multiple biases, where input signals undergo independent nonlinear filtering before the linear combination. We provide a Keras implementation and report its computational efficiency. We test DAC convolutions in ResNet architectures on CIFAR-10, CIFAR-100, Imagenette, and Imagewoof, and achieve performance improvements of up to 1.73%. We exhibit examples where DAC is more efficient than its standard counterpart as a function approximator, and we prove a universal representation theorem.  ( 2 min )
    A picture of the space of typical learnable tasks. (arXiv:2210.17011v2 [cs.LG] UPDATED)
    We develop information geometric techniques to understand the representations learned by deep networks when they are trained on different tasks using supervised, meta-, semi-supervised and contrastive learning. We shed light on the following phenomena that relate to the structure of the space of tasks: (1) the manifold of probabilistic models trained on different tasks using different representation learning methods is effectively low-dimensional; (2) supervised learning on one task results in a surprising amount of progress even on seemingly dissimilar tasks; progress on other tasks is larger if the training task has diverse classes; (3) the structure of the space of tasks indicated by our analysis is consistent with parts of the Wordnet phylogenetic tree; (4) episodic meta-learning algorithms and supervised learning traverse different trajectories during training but they fit similar models eventually; (5) contrastive and semi-supervised learning methods traverse trajectories similar to those of supervised learning. We use classification tasks constructed from the CIFAR-10 and Imagenet datasets to study these phenomena.  ( 2 min )
    Beyond Statistical Similarity: Rethinking Metrics for Deep Generative Models in Engineering Design. (arXiv:2302.02913v2 [cs.LG] UPDATED)
    Deep generative models, such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Diffusion Models, and Transformers, have shown great promise in a variety of applications, including image and speech synthesis, natural language processing, and drug discovery. However, when applied to engineering design problems, evaluating the performance of these models can be challenging, as traditional statistical metrics based on likelihood may not fully capture the requirements of engineering applications. This paper doubles as a review and a practical guide to evaluation metrics for deep generative models (DGMs) in engineering design. We first summarize well-accepted `classic' evaluation metrics for deep generative models grounded in machine learning theory and typical computer science applications. Using case studies, we then highlight why these metrics seldom translate well to design problems but see frequent use due to the lack of established alternatives. Next, we curate a set of design-specific metrics which have been proposed across different research communities and can be used for evaluating deep generative models. These metrics focus on unique requirements in design and engineering, such as constraint satisfaction, functional performance, novelty, and conditioning. We structure our review and discussion as a set of practical selection criteria and usage guidelines. Throughout our discussion, we apply the metrics to models trained on simple 2-dimensional example problems. Finally, to illustrate the selection process and classic usage of the presented metrics, we evaluate three deep generative models on a multifaceted bicycle frame design problem considering performance target achievement, design novelty, and geometric constraints. We publicly release the code for the datasets, models, and metrics used throughout the paper at decode.mit.edu/projects/metrics/.  ( 2 min )
    An Empirical Study of Deep Learning Models for Vulnerability Detection. (arXiv:2212.08109v3 [cs.SE] UPDATED)
    Deep learning (DL) models of code have recently reported great progress for vulnerability detection. In some cases, DL-based models have outperformed static analysis tools. Although many great models have been proposed, we do not yet have a good understanding of these models. This limits the further advancement of model robustness, debugging, and deployment for the vulnerability detection. In this paper, we surveyed and reproduced 9 state-of-the-art (SOTA) deep learning models on 2 widely used vulnerability detection datasets: Devign and MSR. We investigated 6 research questions in three areas, namely model capabilities, training data, and model interpretation. We experimentally demonstrated the variability between different runs of a model and the low agreement among different models' outputs. We investigated models trained for specific types of vulnerabilities compared to a model that is trained on all the vulnerabilities at once. We explored the types of programs DL may consider "hard" to handle. We investigated the relations of training data sizes and training data composition with model performance. Finally, we studied model interpretations and analyzed important features that the models used to make predictions. We believe that our findings can help better understand model results, provide guidance on preparing training data, and improve the robustness of the models. All of our datasets, code, and results are available at https://doi.org/10.6084/m9.figshare.20791240.  ( 2 min )
    Towards Understanding Why Mask-Reconstruction Pretraining Helps in Downstream Tasks. (arXiv:2206.03826v5 [cs.LG] UPDATED)
    For unsupervised pretraining, mask-reconstruction pretraining (MRP) approaches, e.g. MAE and data2vec, randomly mask input patches and then reconstruct the pixels or semantic features of these masked patches via an auto-encoder. Then for a downstream task, supervised fine-tuning the pretrained encoder remarkably surpasses the conventional ``supervised learning'' (SL) trained from scratch. However, it is still unclear 1) how MRP performs semantic feature learning in the pretraining phase and 2) why it helps in downstream tasks. To solve these problems, we first theoretically show that on an auto-encoder of a two/one-layered convolution encoder/decoder, MRP can capture all discriminative features of each potential semantic class in the pretraining dataset. Then considering the fact that the pretraining dataset is of huge size and high diversity and thus covers most features in downstream dataset, in fine-tuning phase, the pretrained encoder can capture as much features as it can in downstream datasets, and would not lost these features with theoretical guarantees. In contrast, SL only randomly captures some features due to lottery ticket hypothesis. So MRP provably achieves better performance than SL on the classification tasks. Experimental results testify to our data assumptions and also our theoretical implications.  ( 2 min )
    Bidirectional Generation of Structure and Properties Through a Single Molecular Foundation Model. (arXiv:2211.10590v3 [cs.LG] UPDATED)
    Despite the impressive successes of deep learning approaches for various chemical problems such as property prediction, virtual screening, and de novo molecule design, separately designed models for specific tasks are usually required, and it is often difficult to synergistically combine these models for novel tasks. To address this, here we present a bidirectional molecular foundation model that can be used for both molecular structure and property inferences through a single model, inspired by recent multimodal learning methods such as VLP. Furthermore, thanks to the outstanding structure/property alignment in a common embedding space, experimental results confirm that our method leads to state-of-the-art performance and interpretable attention maps in both multimodal and unimodal tasks, including conditional molecule generation, property prediction, molecule classification, and reaction prediction.  ( 2 min )
    Machine Learning for Optical Motion Capture-driven Musculoskeletal Modelling from Inertial Motion Capture Data. (arXiv:2209.14456v2 [cs.LG] UPDATED)
    Marker-based Optical Motion Capture (OMC) systems and associated musculoskeletal (MSK) modelling predictions offer non-invasively obtainable insights into in vivo joint and muscle loading, aiding clinical decision-making. However, an OMC system is lab-based, expensive, and requires a line of sight. Inertial Motion Capture (IMC) systems are widely-used alternatives, which are portable, user-friendly, and relatively low-cost, although with lesser accuracy. Irrespective of the choice of motion capture technique, one needs to use an MSK model to obtain the kinematic and kinetic outputs, which is a computationally expensive tool increasingly well approximated by machine learning (ML) methods. Here, we present an ML approach to map experimentally recorded IMC data to the human upper-extremity MSK model outputs computed from ('gold standard') OMC input data. Essentially, we aim to predict higher-quality MSK outputs from the much easier-to-obtain IMC data. We use OMC and IMC data simultaneously collected for the same subjects to train different ML architectures that predict OMC-driven MSK outputs from IMC measurements. In particular, we employed various neural network (NN) architectures, such as Feed-Forward Neural Networks (FFNNs) and Recurrent Neural Networks (RNNs) (vanilla, Long Short-Term Memory, and Gated Recurrent Unit) and searched for the best-fit model through an exhaustive search in the hyperparameters space in both subject-exposed (SE) & subject-naive (SN) settings. We observed a comparable performance for both FFNN & RNN models, which have a high degree of agreement (ravg, SE, FFNN = 0.90+/-0.19, ravg, SE, RNN = 0.89+/-0.17, ravg, SN, FFNN = 0.84+/-0.23, & ravg, SN, RNN = 0.78+/-0.23) with the desired OMC-driven MSK estimates for held-out test data. Mapping IMC inputs to OMC-driven MSK outputs using ML models could be instrumental in transitioning MSK modelling from 'lab to field'.  ( 3 min )
    The Debate Over Understanding in AI's Large Language Models. (arXiv:2210.13966v3 [cs.LG] UPDATED)
    We survey a current, heated debate in the AI research community on whether large pre-trained language models can be said to "understand" language -- and the physical and social situations language encodes -- in any important sense. We describe arguments that have been made for and against such understanding, and key questions for the broader sciences of intelligence that have arisen in light of these arguments. We contend that a new science of intelligence can be developed that will provide insight into distinct modes of understanding, their strengths and limitations, and the challenge of integrating diverse forms of cognition.  ( 2 min )
    Transformers in Time Series: A Survey. (arXiv:2202.07125v4 [cs.LG] UPDATED)
    Transformers have achieved superior performances in many tasks in natural language processing and computer vision, which also triggered great interest in the time series community. Among multiple advantages of Transformers, the ability to capture long-range dependencies and interactions is especially attractive for time series modeling, leading to exciting progress in various time series applications. In this paper, we systematically review Transformer schemes for time series modeling by highlighting their strengths as well as limitations. In particular, we examine the development of time series Transformers in two perspectives. From the perspective of network structure, we summarize the adaptations and modifications that have been made to Transformers in order to accommodate the challenges in time series analysis. From the perspective of applications, we categorize time series Transformers based on common tasks including forecasting, anomaly detection, and classification. Empirically, we perform robust analysis, model size analysis, and seasonal-trend decomposition analysis to study how Transformers perform in time series. Finally, we discuss and suggest future directions to provide useful research guidance. A corresponding resource that has been continuously updated can be found in the GitHub repository. To the best of our knowledge, this paper is the first work to comprehensively and systematically summarize the recent advances of Transformers for modeling time series data. We hope this survey will ignite further research interests in time series Transformers.  ( 2 min )
    PatchBlender: A Motion Prior for Video Transformers. (arXiv:2211.14449v2 [cs.CV] UPDATED)
    Transformers have become one of the dominant architectures in the field of computer vision. However, there are yet several challenges when applying such architectures to video data. Most notably, these models struggle to model the temporal patterns of video data effectively. Directly targeting this issue, we introduce PatchBlender, a learnable blending function that operates over patch embeddings across the temporal dimension of the latent space. We show that our method is successful at enabling vision transformers to encode the temporal component of video data. On Something-Something v2 and MOVi-A, we show that our method improves the baseline performance of video Transformers. PatchBlender has the advantage of being compatible with almost any Transformer architecture and since it is learnable, the model can adaptively turn on or off the prior. It is also extremely lightweight compute-wise, 0.005% the GFLOPs of a ViT-B.  ( 2 min )
    Benchmarking Bayesian neural networks and evaluation metrics for regression tasks. (arXiv:2206.06779v2 [cs.LG] UPDATED)
    Due to the growing adoption of deep neural networks in many fields of science and engineering, modeling and estimating their uncertainties has become of primary importance. Despite the growing literature about uncertainty quantification in deep learning, the quality of the uncertainty estimates remains an open question. In this work, we assess for the first time the performance of several approximation methods for Bayesian neural networks on regression tasks by evaluating the quality of the confidence regions with several coverage metrics. The selected algorithms are also compared in terms of predictivity, kernelized Stein discrepancy and maximum mean discrepancy with respect to a reference posterior in both weight and function space. Our findings show that (i) some algorithms have excellent predictive performance but tend to largely over or underestimate uncertainties (ii) it is possible to achieve good accuracy and a given target coverage with finely tuned hyperparameters and (iii) the promising kernel Stein discrepancy cannot be exclusively relied on to assess the posterior approximation. As a by-product of this benchmark, we also compute and visualize the similarity of all algorithms and corresponding hyperparameters: interestingly we identify a few clusters of algorithms with similar behavior in weight space, giving new insights on how they explore the posterior distribution.
    Calibrated Forecasts: The Minimax Proof. (arXiv:2209.05863v2 [econ.TH] UPDATED)
    A formal write-up of the simple proof (1995) of the existence of calibrated forecasts by the minimax theorem, which moreover shows that $N^3$ periods suffice to guarantee a calibration error of at most $1/N$.  ( 2 min )
    Vote'n'Rank: Revision of Benchmarking with Social Choice Theory. (arXiv:2210.05769v3 [cs.LG] UPDATED)
    The development of state-of-the-art systems in different applied areas of machine learning (ML) is driven by benchmarks, which have shaped the paradigm of evaluating generalisation capabilities from multiple perspectives. Although the paradigm is shifting towards more fine-grained evaluation across diverse tasks, the delicate question of how to aggregate the performances has received particular interest in the community. In general, benchmarks follow the unspoken utilitarian principles, where the systems are ranked based on their mean average score over task-specific metrics. Such aggregation procedure has been viewed as a sub-optimal evaluation protocol, which may have created the illusion of progress. This paper proposes Vote'n'Rank, a framework for ranking systems in multi-task benchmarks under the principles of the social choice theory. We demonstrate that our approach can be efficiently utilised to draw new insights on benchmarking in several ML sub-fields and identify the best-performing systems in research and development case studies. The Vote'n'Rank's procedures are more robust than the mean average while being able to handle missing performance scores and determine conditions under which the system becomes the winner.
    Robust Causal Graph Representation Learning against Confounding Effects. (arXiv:2208.08584v2 [cs.LG] UPDATED)
    The prevailing graph neural network models have achieved significant progress in graph representation learning. However, in this paper, we uncover an ever-overlooked phenomenon: the pre-trained graph representation learning model tested with full graphs underperforms the model tested with well-pruned graphs. This observation reveals that there exist confounders in graphs, which may interfere with the model learning semantic information, and current graph representation learning methods have not eliminated their influence. To tackle this issue, we propose Robust Causal Graph Representation Learning (RCGRL) to learn robust graph representations against confounding effects. RCGRL introduces an active approach to generate instrumental variables under unconditional moment restrictions, which empowers the graph representation learning model to eliminate confounders, thereby capturing discriminative information that is causally related to downstream predictions. We offer theorems and proofs to guarantee the theoretical effectiveness of the proposed approach. Empirically, we conduct extensive experiments on a synthetic dataset and multiple benchmark datasets. The results demonstrate that compared with state-of-the-art methods, RCGRL achieves better prediction performance and generalization ability.
    DocILE Benchmark for Document Information Localization and Extraction. (arXiv:2302.05658v1 [cs.CL])
    This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly~1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features: (i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin; (ii) Line Item Recognition represents a highly practical information extraction task, where key information has to be assigned to items in a table; (iii) documents come from numerous layouts and the test set includes zero- and few-shot cases as well as layouts commonly seen in the training set. The benchmark comes with several baselines, including RoBERTa, LayoutLMv3 and DETR-based Table Transformer. These baseline models were applied to both tasks of the DocILE benchmark, with results shared in this paper, offering a quick starting point for future work. The dataset and baselines are available at https://github.com/rossumai/docile.  ( 2 min )
    ASR Bundestag: A Large-Scale political debate dataset in German. (arXiv:2302.06008v1 [cs.CL])
    We present ASR Bundestag, a dataset for automatic speech recognition in German, consisting of 610 hours of aligned audio-transcript pairs for supervised training as well as 1,038 hours of unlabeled audio snippets for self-supervised learning, based on raw audio data and transcriptions from plenary sessions and committee meetings of the German parliament. In addition, we discuss utilized approaches for the automated creation of speech datasets and assess the quality of the resulting dataset based on evaluations and finetuning of a pre-trained state of the art model. We make the dataset publicly available, including all subsets.  ( 2 min )
    Discriminative Radial Domain Adaptation. (arXiv:2301.00383v2 [cs.LG] UPDATED)
    Domain adaptation methods reduce domain shift typically by learning domain-invariant features. Most existing methods are built on distribution matching, e.g., adversarial domain adaptation, which tends to corrupt feature discriminability. In this paper, we propose Discriminative Radial Domain Adaptation (DRDA) which bridges source and target domains via a shared radial structure. It's motivated by the observation that as the model is trained to be progressively discriminative, features of different categories expand outwards in different directions, forming a radial structure. We show that transferring such an inherently discriminative structure would enable to enhance feature transferability and discriminability simultaneously. Specifically, we represent each domain with a global anchor and each category a local anchor to form a radial structure and reduce domain shift via structure matching. It consists of two parts, namely isometric transformation to align the structure globally and local refinement to match each category. To enhance the discriminability of the structure, we further encourage samples to cluster close to the corresponding local anchors based on optimal-transport assignment. Extensively experimenting on multiple benchmarks, our method is shown to consistently outperforms state-of-the-art approaches on varied tasks, including the typical unsupervised domain adaptation, multi-source domain adaptation, domain-agnostic learning, and domain generalization.  ( 2 min )
    When to Update Your Model: Constrained Model-based Reinforcement Learning. (arXiv:2210.08349v3 [cs.LG] UPDATED)
    Designing and analyzing model-based RL (MBRL) algorithms with guaranteed monotonic improvement has been challenging, mainly due to the interdependence between policy optimization and model learning. Existing discrepancy bounds generally ignore the impacts of model shifts, and their corresponding algorithms are prone to degrade performance by drastic model updating. In this work, we first propose a novel and general theoretical scheme for a non-decreasing performance guarantee of MBRL. Our follow-up derived bounds reveal the relationship between model shifts and performance improvement. These discoveries encourage us to formulate a constrained lower-bound optimization problem to permit the monotonicity of MBRL. A further example demonstrates that learning models from a dynamically-varying number of explorations benefit the eventual returns. Motivated by these analyses, we design a simple but effective algorithm CMLO (Constrained Model-shift Lower-bound Optimization), by introducing an event-triggered mechanism that flexibly determines when to update the model. Experiments show that CMLO surpasses other state-of-the-art methods and produces a boost when various policy optimization methods are employed.  ( 2 min )
    Quantifying the Impact of Label Noise on Federated Learning. (arXiv:2211.07816v6 [cs.LG] UPDATED)
    Federated Learning (FL) is a distributed machine learning paradigm where clients collaboratively train a model using their local (human-generated) datasets. While existing studies focus on FL algorithm development to tackle data heterogeneity across clients, the important issue of data quality (e.g., label noise) in FL is overlooked. This paper aims to fill this gap by providing a quantitative study on the impact of label noise on FL. We derive an upper bound for the generalization error that is linear in the clients' label noise level. Then we conduct experiments on MNIST and CIFAR-10 datasets using various FL algorithms. Our empirical results show that the global model accuracy linearly decreases as the noise level increases, which is consistent with our theoretical analysis. We further find that label noise slows down the convergence of FL training, and the global model tends to overfit when the noise level is high.  ( 2 min )
    On Parameter Estimation in Unobserved Components Models subject to Linear Inequality Constraints. (arXiv:2110.12149v2 [econ.EM] UPDATED)
    We propose a new \textit{quadratic programming-based} method of approximating a nonstandard density using a multivariate Gaussian density. Such nonstandard densities usually arise while developing posterior samplers for unobserved components models involving inequality constraints on the parameters. For instance, Chan et al. (2016) provided a new model of trend inflation with linear inequality constraints on the stochastic trend. We implemented the proposed quadratic programming-based method for this model and compared it to the existing approximation. We observed that the proposed method works as well as the existing approximation in terms of the final trend estimates while achieving gains in terms of sample efficiency.  ( 2 min )
    Learning Vector-Quantized Item Representation for Transferable Sequential Recommenders. (arXiv:2210.12316v2 [cs.IR] UPDATED)
    Recently, the generality of natural language text has been leveraged to develop transferable recommender systems. The basic idea is to employ pre-trained language models~(PLM) to encode item text into item representations. Despite the promising transferability, the binding between item text and item representations might be too tight, leading to potential problems such as over-emphasizing the effect of text features and exaggerating the negative impact of domain gap. To address this issue, this paper proposes VQ-Rec, a novel approach to learning Vector-Quantized item representations for transferable sequential Recommenders. The main novelty of our approach lies in the new item representation scheme: it first maps item text into a vector of discrete indices (called item code), and then employs these indices to lookup the code embedding table for deriving item representations. Such a scheme can be denoted as "text $\Longrightarrow$ code $\Longrightarrow$ representation". Based on this representation scheme, we further propose an enhanced contrastive pre-training approach, using semi-synthetic and mixed-domain code representations as hard negatives. Furthermore, we design a new cross-domain fine-tuning method based on a differentiable permutation-based network. Extensive experiments conducted on six public benchmarks demonstrate the effectiveness of the proposed approach, in both cross-domain and cross-platform settings. Code and pre-trained model are available at: https://github.com/RUCAIBox/VQ-Rec.  ( 2 min )
    Bootstrapping Multilingual Semantic Parsers using Large Language Models. (arXiv:2210.07313v2 [cs.CL] UPDATED)
    Despite cross-lingual generalization demonstrated by pre-trained multilingual models, the translate-train paradigm of transferring English datasets across multiple languages remains to be a key mechanism for training task-specific multilingual models. However, for many low-resource languages, the availability of a reliable translation service entails significant amounts of costly human-annotated translation pairs. Further, translation services may continue to be brittle due to domain mismatch between task-specific input text and general-purpose text used for training translation models. For multilingual semantic parsing, we demonstrate the effectiveness and flexibility offered by large language models (LLMs) for translating English datasets into several languages via few-shot prompting. Through extensive comparisons on two public datasets, MTOP and MASSIVE, spanning 50 languages and several domains, we show that our method of translating data using LLMs outperforms a strong translate-train baseline on 41 out of 50 languages. We study the key design choices that enable more effective multilingual data translation via prompted LLMs.  ( 2 min )
    Blessing of Class Diversity in Pre-training. (arXiv:2209.03447v3 [cs.LG] UPDATED)
    This paper presents a new statistical analysis aiming to explain the recent superior achievements of the pre-training techniques in natural language processing (NLP). We prove that when the classes of the pre-training task (e.g., different words in the masked language model task) are sufficiently diverse, in the sense that the least singular value of the last linear layer in pre-training (denoted as $\tilde{\nu}$) is large, then pre-training can significantly improve the sample efficiency of downstream tasks. Specially, we show the transfer learning excess risk enjoys an $O\left(\frac{1}{\tilde{\nu} \sqrt{n}}\right)$ rate, in contrast to the $O\left(\frac{1}{\sqrt{m}}\right)$ rate in the standard supervised learning. Here, $n$ is the number of pre-training data and $m$ is the number of data in the downstream task, and typically $n \gg m$. Our proof relies on a vector-form Rademacher complexity chain rule for disassembling composite function classes and a modified self-concordance condition. These techniques can be of independent interest.  ( 2 min )
    Sparse Mutation Decompositions: Fine Tuning Deep Neural Networks with Subspace Evolution. (arXiv:2302.05832v1 [cs.NE])
    Neuroevolution is a promising area of research that combines evolutionary algorithms with neural networks. A popular subclass of neuroevolutionary methods, called evolution strategies, relies on dense noise perturbations to mutate networks, which can be sample inefficient and challenging for large models with millions of parameters. We introduce an approach to alleviating this problem by decomposing dense mutations into low-dimensional subspaces. Restricting mutations in this way can significantly reduce variance as networks can handle stronger perturbations while maintaining performance, which enables a more controlled and targeted evolution of deep networks. This approach is uniquely effective for the task of fine tuning pre-trained models, which is an increasingly valuable area of research as networks continue to scale in size and open source models become more widely available. Furthermore, we show how this work naturally connects to ensemble learning where sparse mutations encourage diversity among children such that their combined predictions can reliably improve performance. We conduct the first large scale exploration of neuroevolutionary fine tuning and ensembling on the notoriously difficult ImageNet dataset, where we see small generalization improvements with only a single evolutionary generation using nearly a dozen different deep neural network architectures.  ( 2 min )
    Relational Local Explanations. (arXiv:2212.12374v2 [cs.LG] UPDATED)
    The majority of existing post-hoc explanation approaches for machine learning models produce independent, per-variable feature attribution scores, ignoring a critical inherent characteristics of homogeneously structured data, such as visual or text data: there exist latent inter-variable relationships between features. In response, we develop a novel model-agnostic and permutation-based feature attribution approach based on the relational analysis between input variables. As a result, we are able to gain a broader insight into the predictions and decisions of machine learning models. Experimental evaluations of our framework in comparison with state-of-the-art attribution techniques on various setups involving both image and text data modalities demonstrate the effectiveness and validity of our method.  ( 2 min )
    Reinforcement Learning with Almost Sure Constraints. (arXiv:2112.05198v3 [cs.LG] UPDATED)
    In this work we address the problem of finding feasible policies for Constrained Markov Decision Processes under probability one constraints. We argue that stationary policies are not sufficient for solving this problem, and that a rich class of policies can be found by endowing the controller with a scalar quantity, so called budget, that tracks how close the agent is to violating the constraint. We show that the minimal budget required to act safely can be obtained as the smallest fixed point of a Bellman-like operator, for which we analyze its convergence properties. We also show how to learn this quantity when the true kernel of the Markov decision process is not known, while providing sample-complexity bounds. The utility of knowing this minimal budget relies in that it can aid in the search of optimal or near-optimal policies by shrinking down the region of the state space the agent must navigate. Simulations illustrate the different nature of probability one constraints against the typically used constraints in expectation.  ( 2 min )
    Physics informed WNO. (arXiv:2302.05925v1 [stat.ML])
    Deep neural operators are recognized as an effective tool for learning solution operators of complex partial differential equations (PDEs). As compared to laborious analytical and computational tools, a single neural operator can predict solutions of PDEs for varying initial or boundary conditions and different inputs. A recently proposed Wavelet Neural Operator (WNO) is one such operator that harnesses the advantage of time-frequency localization of wavelets to capture the manifolds in the spatial domain effectively. While WNO has proven to be a promising method for operator learning, the data-hungry nature of the framework is a major shortcoming. In this work, we propose a physics-informed WNO for learning the solution operators of families of parametric PDEs without labeled training data. The efficacy of the framework is validated and illustrated with four nonlinear spatiotemporal systems relevant to various fields of engineering and science.  ( 2 min )
    DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking. (arXiv:2210.01776v2 [q-bio.BM] UPDATED)
    Predicting the binding structure of a small molecule ligand to a protein -- a task known as molecular docking -- is critical to drug design. Recent deep learning methods that treat docking as a regression problem have decreased runtime compared to traditional search-based methods but have yet to offer substantial improvements in accuracy. We instead frame molecular docking as a generative modeling problem and develop DiffDock, a diffusion generative model over the non-Euclidean manifold of ligand poses. To do so, we map this manifold to the product space of the degrees of freedom (translational, rotational, and torsional) involved in docking and develop an efficient diffusion process on this space. Empirically, DiffDock obtains a 38% top-1 success rate (RMSD<2A) on PDBBind, significantly outperforming the previous state-of-the-art of traditional docking (23%) and deep learning (20%) methods. Moreover, while previous methods are not able to dock on computationally folded structures (maximum accuracy 10.4%), DiffDock maintains significantly higher precision (21.7%). Finally, DiffDock has fast inference times and provides confidence estimates with high selective accuracy.  ( 2 min )
    Global-Local Regularization Via Distributional Robustness. (arXiv:2203.00553v3 [cs.LG] UPDATED)
    Despite superior performance in many situations, deep neural networks are often vulnerable to adversarial examples and distribution shifts, limiting model generalization ability in real-world applications. To alleviate these problems, recent approaches leverage distributional robustness optimization (DRO) to find the most challenging distribution, and then minimize loss function over this most challenging distribution. Regardless of achieving some improvements, these DRO approaches have some obvious limitations. First, they purely focus on local regularization to strengthen model robustness, missing a global regularization effect which is useful in many real-world applications (e.g., domain adaptation, domain generalization, and adversarial machine learning). Second, the loss functions in the existing DRO approaches operate in only the most challenging distribution, hence decouple with the original distribution, leading to a restrictive modeling capability. In this paper, we propose a novel regularization technique, following the veins of Wasserstein-based DRO framework. Specifically, we define a particular joint distribution and Wasserstein-based uncertainty, allowing us to couple the original and most challenging distributions for enhancing modeling capability and applying both local and global regularizations. Empirical studies on different learning problems demonstrate that our proposed approach significantly outperforms the existing regularization approaches in various domains: semi-supervised learning, domain adaptation, domain generalization, and adversarial machine learning.  ( 2 min )
    A New Approach to Drifting Games, Based on Asymptotically Optimal Potentials. (arXiv:2207.11405v2 [cs.LG] UPDATED)
    We develop a new approach to drifting games, a class of two-person games with many applications to boosting and online learning settings. Our approach involves (a) guessing an asymptotically optimal potential by solving an associated partial differential equation (PDE); then (b) justifying the guess, by proving upper and lower bounds on the final-time loss whose difference scales like a negative power of the number of time steps. The proofs of our potential-based upper bounds are elementary, using little more than Taylor expansion. The proofs of our potential-based lower bounds are also elementary, combining Taylor expansion with probabilistic or combinatorial arguments. Not only is our approach more elementary, but we give new potentials and derive corresponding upper and lower bounds that match each other in the asymptotic regime.  ( 2 min )
    Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and Constant Regret. (arXiv:2205.12418v3 [cs.LG] UPDATED)
    We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where the users can be divided into two groups based on their different tolerance on exploration risks and should be treated separately. In this setting, we simultaneously maintain two policies $\pi^{\text{O}}$ and $\pi^{\text{E}}$: $\pi^{\text{O}}$ ("O" for "online") interacts with more risk-tolerant users from the first tier and minimizes regret by balancing exploration and exploitation as usual, while $\pi^{\text{E}}$ ("E" for "exploit") exclusively focuses on exploitation for risk-averse users from the second tier utilizing the data collected so far. An important question is whether such a separation yields advantages over the standard online setting (i.e., $\pi^{\text{E}}=\pi^{\text{O}}$) for the risk-averse users. We individually consider the gap-independent vs.~gap-dependent settings. For the former, we prove that the separation is indeed not beneficial from a minimax perspective. For the latter, we show that if choosing Pessimistic Value Iteration as the exploitation algorithm to produce $\pi^{\text{E}}$, we can achieve a constant regret for risk-averse users independent of the number of episodes $K$, which is in sharp contrast to the $\Omega(\log K)$ regret for any online RL algorithms in the same setting, while the regret of $\pi^{\text{O}}$ (almost) maintains its online regret optimality and does not need to compromise for the success of $\pi^{\text{E}}$.  ( 2 min )
    Towards Fine-tuning Pre-trained Language Models with Integer Forward and Backward Propagation. (arXiv:2209.09815v2 [cs.LG] UPDATED)
    The large number of parameters of some prominent language models, such as BERT, makes their fine-tuning on downstream tasks computationally intensive and energy hungry. Previously researchers were focused on lower bit-width integer data types for the forward propagation of language models to save memory and computation. As for the backward propagation, however, only 16-bit floating-point data type has been used for the fine-tuning of BERT. In this work, we use integer arithmetic for both forward and back propagation in the fine-tuning of BERT. We study the effects of varying the integer bit-width on the model's metric performance. Our integer fine-tuning uses integer arithmetic to perform forward propagation and gradient computation of linear, layer-norm, and embedding layers of BERT. We fine-tune BERT using our integer training method on SQuAD v1.1 and SQuAD v2., and GLUE benchmark. We demonstrate that metric performance of fine-tuning 16-bit integer BERT matches both 16-bit and 32-bit floating-point baselines. Furthermore, using the faster and more memory efficient 8-bit integer data type, integer fine-tuning of BERT loses an average of 3.1 points compared to the FP32 baseline.  ( 2 min )
    Jointly Contrastive Representation Learning on Road Network and Trajectory. (arXiv:2209.06389v2 [cs.LG] UPDATED)
    Road network and trajectory representation learning are essential for traffic systems since the learned representation can be directly used in various downstream tasks (e.g., traffic speed inference, and travel time estimation). However, most existing methods only contrast within the same scale, i.e., treating road network and trajectory separately, which ignores valuable inter-relations. In this paper, we aim to propose a unified framework that jointly learns the road network and trajectory representations end-to-end. We design domain-specific augmentations for road-road contrast and trajectory-trajectory contrast separately, i.e., road segment with its contextual neighbors and trajectory with its detour replaced and dropped alternatives, respectively. On top of that, we further introduce the road-trajectory cross-scale contrast to bridge the two scales by maximizing the total mutual information. Unlike the existing cross-scale contrastive learning methods on graphs that only contrast a graph and its belonging nodes, the contrast between road segment and trajectory is elaborately tailored via novel positive sampling and adaptive weighting strategies. We conduct prudent experiments based on two real-world datasets with four downstream tasks, demonstrating improved performance and effectiveness. The code is available at https://github.com/mzy94/JCLRNT.  ( 2 min )
    Behavior Prior Representation learning for Offline Reinforcement Learning. (arXiv:2211.00863v2 [cs.LG] UPDATED)
    Offline reinforcement learning (RL) struggles in environments with rich and noisy inputs, where the agent only has access to a fixed dataset without environment interactions. Past works have proposed common workarounds based on the pre-training of state representations, followed by policy training. In this work, we introduce a simple, yet effective approach for learning state representations. Our method, Behavior Prior Representation (BPR), learns state representations with an easy-to-integrate objective based on behavior cloning of the dataset: we first learn a state representation by mimicking actions from the dataset, and then train a policy on top of the fixed representation, using any off-the-shelf Offline RL algorithm. Theoretically, we prove that BPR carries out performance guarantees when integrated into algorithms that have either policy improvement guarantees (conservative algorithms) or produce lower bounds of the policy values (pessimistic algorithms). Empirically, we show that BPR combined with existing state-of-the-art Offline RL algorithms leads to significant improvements across several offline control benchmarks. The code is available at \url{https://github.com/bit1029public/offline_bpr}.  ( 2 min )
    An efficient encoder-decoder architecture with top-down attention for speech separation. (arXiv:2209.15200v4 [cs.SD] UPDATED)
    Deep neural networks have shown excellent prospects in speech separation tasks. However, obtaining good results while keeping a low model complexity remains challenging in real-world applications. In this paper, we provide a bio-inspired efficient encoder-decoder architecture by mimicking the brain's top-down attention, called TDANet, with decreased model complexity without sacrificing performance. The top-down attention in TDANet is extracted by the global attention (GA) module and the cascaded local attention (LA) layers. The GA module takes multi-scale acoustic features as input to extract global attention signal, which then modulates features of different scales by direct top-down connections. The LA layers use features of adjacent layers as input to extract the local attention signal, which is used to modulate the lateral input in a top-down manner. On three benchmark datasets, TDANet consistently achieved competitive separation performance to previous state-of-the-art (SOTA) methods with higher efficiency. Specifically, TDANet's multiply-accumulate operations (MACs) are only 5\% of Sepformer, one of the previous SOTA models, and CPU inference time is only 10\% of Sepformer. In addition, a large-size version of TDANet obtained SOTA results on three datasets, with MACs still only 10\% of Sepformer and the CPU inference time only 24\% of Sepformer.  ( 2 min )
    Condition-number-independent convergence rate of Riemannian Hamiltonian Monte Carlo with numerical integrators. (arXiv:2210.07219v2 [cs.DS] UPDATED)
    We study the convergence rate of discretized Riemannian Hamiltonian Monte Carlo on sampling from distributions in the form of $e^{-f(x)}$ on a convex body $\mathcal{M}\subset\mathbb{R}^{n}$. We show that for distributions in the form of $e^{-\alpha^{\top}x}$ on a polytope with $m$ constraints, the convergence rate of a family of commonly-used integrators is independent of $\left\Vert \alpha\right\Vert _{2}$ and the geometry of the polytope. In particular, the implicit midpoint method (IMM) and the generalized Leapfrog method (LM) have a mixing time of $\widetilde{O}\left(mn^{3}\right)$ to achieve $\epsilon$ total variation distance to the target distribution. These guarantees are based on a general bound on the convergence rate for densities of the form $e^{-f(x)}$ in terms of parameters of the manifold and the integrator. Our theoretical guarantee complements the empirical results of [KLSV22], which shows that RHMC with IMM can sample ill-conditioned, non-smooth and constrained distributions in very high dimension efficiently in practice.  ( 2 min )
    Meta-Learning Based Knowledge Extrapolation for Temporal Knowledge Graph. (arXiv:2302.05640v1 [cs.AI])
    In the last few years, the solution to Knowledge Graph (KG) completion via learning embeddings of entities and relations has attracted a surge of interest. Temporal KGs(TKGs) extend traditional Knowledge Graphs (KGs) by associating static triples with timestamps forming quadruples. Different from KGs and TKGs in the transductive setting, constantly emerging entities and relations in incomplete TKGs create demand to predict missing facts with unseen components, which is the extrapolation setting. Traditional temporal knowledge graph embedding (TKGE) methods are limited in the extrapolation setting since they are trained within a fixed set of components. In this paper, we propose a Meta-Learning based Temporal Knowledge Graph Extrapolation (MTKGE) model, which is trained on link prediction tasks sampled from the existing TKGs and tested in the emerging TKGs with unseen entities and relations. Specifically, we meta-train a GNN framework that captures relative position patterns and temporal sequence patterns between relations. The learned embeddings of patterns can be transferred to embed unseen components. Experimental results on two different TKG extrapolation datasets show that MTKGE consistently outperforms both the existing state-of-the-art models for knowledge graph extrapolation and specifically adapted KGE and TKGE baselines.  ( 2 min )
    An Upper Bound for the Distribution Overlap Index and Its Applications. (arXiv:2212.08701v2 [cs.LG] UPDATED)
    This paper proposes an easy-to-compute upper bound for the overlap index between two probability distributions without requiring any knowledge of the distribution models. The computation of our bound is time-efficient and memory-efficient and only requires finite samples. The proposed bound shows its value in one-class classification and domain shift analysis. Specifically, in one-class classification, we build a novel one-class classifier by converting the bound into a confidence score function. Unlike most one-class classifiers, the training process is not needed for our classifier. Additionally, the experimental results show that our classifier can be accurate with only a small number of in-class samples and outperform many state-of-the-art methods on various datasets in different one-class classification scenarios. In domain shift analysis, we propose a theorem based on our bound. The theorem is useful in detecting the existence of domain shift and inferring data information. The detection and inference processes are both computation-efficient and memory-efficient. Our work shows significant promise toward broadening the applications of overlap-based metrics.  ( 2 min )
    Hierarchical Optimization-Derived Learning. (arXiv:2302.05587v1 [cs.LG])
    In recent years, by utilizing optimization techniques to formulate the propagation of deep model, a variety of so-called Optimization-Derived Learning (ODL) approaches have been proposed to address diverse learning and vision tasks. Although having achieved relatively satisfying practical performance, there still exist fundamental issues in existing ODL methods. In particular, current ODL methods tend to consider model construction and learning as two separate phases, and thus fail to formulate their underlying coupling and depending relationship. In this work, we first establish a new framework, named Hierarchical ODL (HODL), to simultaneously investigate the intrinsic behaviors of optimization-derived model construction and its corresponding learning process. Then we rigorously prove the joint convergence of these two sub-tasks, from the perspectives of both approximation quality and stationary analysis. To our best knowledge, this is the first theoretical guarantee for these two coupled ODL components: optimization and learning. We further demonstrate the flexibility of our framework by applying HODL to challenging learning tasks, which have not been properly addressed by existing ODL methods. Finally, we conduct extensive experiments on both synthetic data and real applications in vision and other learning tasks to verify the theoretical properties and practical performance of HODL in various application scenarios.  ( 2 min )
    FusionRetro: Molecule Representation Fusion via Reaction Graph for Retrosynthetic Planning. (arXiv:2209.15315v2 [cs.LG] UPDATED)
    Retrosynthetic planning is a fundamental problem in drug discovery and organic chemistry, which aims to find a complete multi-step synthetic route from a set of starting materials to the target molecule, determining crucial process flow in chemical production. Existing approaches combine single-step retrosynthesis models and search algorithms to find synthetic routes. However, these approaches generally consider the two pieces in a decoupled manner, taking only the product as the input to predict the reactants per planning step and largely ignoring the important context information from other intermediates along the synthetic route. In this work, we perform a series of experiments to identify the limitations of this decoupled view and propose a novel retrosynthesis framework that also exploits context information for retrosynthetic planning. We view synthetic routes as reaction graphs, and propose to incorporate the context by three principled steps: encode molecules into embeddings, aggregate information over routes, and readout to predict reactants. The whole framework can be efficiently optimized in an end-to-end fashion. Comprehensive experiments show that by fusing in context information over routes, our model significantly improves the performance of retrosynthetic planning over baselines that are not context-aware, especially for long synthetic routes.  ( 2 min )
    Multi-Scored Sleep Databases: How to Exploit the Multiple-Labels in Automated Sleep Scoring. (arXiv:2207.01910v3 [cs.LG] UPDATED)
    Study Objectives: Inter-scorer variability in scoring polysomnograms is a well-known problem. Most of the existing automated sleep scoring systems are trained using labels annotated by a single scorer, whose subjective evaluation is transferred to the model. When annotations from two or more scorers are available, the scoring models are usually trained on the scorer consensus. The averaged scorer's subjectivity is transferred into the model, losing information about the internal variability among different scorers. In this study, we aim to insert the multiple-knowledge of the different physicians into the training procedure. The goal is to optimize a model training, exploiting the full information that can be extracted from the consensus of a group of scorers. Methods: We train two lightweight deep learning based models on three different multi-scored databases. We exploit the label smoothing technique together with a soft-consensus (LSSC) distribution to insert the multiple-knowledge in the training procedure of the model. We introduce the averaged cosine similarity metric (ACS) to quantify the similarity between the hypnodensity-graph generated by the models with-LSSC and the hypnodensity-graph generated by the scorer consensus. Results: The performance of the models improves on all the databases when we train the models with our LSSC. We found an increase in ACS (up to 6.4%) between the hypnodensity-graph generated by the models trained with-LSSC and the hypnodensity-graph generated by the consensus. Conclusion: Our approach definitely enables a model to better adapt to the consensus of the group of scorers. Future work will focus on further investigations on different scoring architectures and hopefully large-scale-heterogeneous multi-scored datasets.  ( 2 min )
    Flag Aggregator: Scalable Distributed Training under Failures and Augmented Losses using Convex Optimization. (arXiv:2302.05865v1 [cs.LG])
    Modern ML applications increasingly rely on complex deep learning models and large datasets. There has been an exponential growth in the amount of computation needed to train the largest models. Therefore, to scale computation and data, these models are inevitably trained in a distributed manner in clusters of nodes, and their updates are aggregated before being applied to the model. However, a distributed setup is prone to byzantine failures of individual nodes, components, and software. With data augmentation added to these settings, there is a critical need for robust and efficient aggregation systems. We extend the current state-of-the-art aggregators and propose an optimization-based subspace estimator by modeling pairwise distances as quadratic functions by utilizing the recently introduced Flag Median problem. The estimator in our loss function favors the pairs that preserve the norm of the difference vector. We theoretically show that our approach enhances the robustness of state-of-the-art byzantine resilient aggregators. Also, we evaluate our method with different tasks in a distributed setup with a parameter server architecture and show its communication efficiency while maintaining similar accuracy. The code is publicly available at https://github.com/hamidralmasi/FlagAggregator  ( 2 min )
    Effects of Image Size on Deep Learning. (arXiv:2101.11508v7 [cs.CV] UPDATED)
    In this work, the best size for late gadolinium enhancement (LGE) magnetic resonance imaging (MRI) images in the training dataset was determined to optimize deep learning training outcomes. Non-extra pixel and extra pixel interpolation algorithms were used to determine the new size of the LGE-MRI images. A novel strategy was introduced to handle interpolation masks and remove extra class labels in interpolated ground truth (GT) segmentation masks. The expectation maximization, weighted intensity, a priori information (EWA) algorithm was used for quantification of myocardial infarction (MI) in automatically segmented LGE-MRI images. Arbitrary threshold, comparison of the sums, and sums of differences are methods used to estimate the relationship between semi-automatic or manual and fully automated quantification of myocardial infarction (MI) results. The relationship between semi-automatic and fully automated quantification of MI results was found to be closer in the case of bigger LGE MRI images (55.5% closer to manual results) than in the case of smaller LGE MRI images (22.2% closer to manual results).  ( 2 min )
    On Narrative Information and the Distillation of Stories. (arXiv:2211.12423v2 [cs.CL] UPDATED)
    The act of telling stories is a fundamental part of what it means to be human. This work introduces the concept of narrative information, which we define to be the overlap in information space between a story and the items that compose the story. Using contrastive learning methods, we show how modern artificial neural networks can be leveraged to distill stories and extract a representation of the narrative information. We then demonstrate how evolutionary algorithms can leverage this to extract a set of narrative templates and how these templates -- in tandem with a novel curve-fitting algorithm we introduce -- can reorder music albums to automatically induce stories in them. In the process of doing so, we give strong statistical evidence that these narrative information templates are present in existing albums. While we experiment only with music albums here, the premises of our work extend to any form of (largely) independent media.  ( 2 min )
    USER: Unsupervised Structural Entropy-based Robust Graph Neural Network. (arXiv:2302.05889v1 [cs.LG])
    Unsupervised/self-supervised graph neural networks (GNN) are vulnerable to inherent randomness in the input graph data which greatly affects the performance of the model in downstream tasks. In this paper, we alleviate the interference of graph randomness and learn appropriate representations of nodes without label information. To this end, we propose USER, an unsupervised robust version of graph neural networks that is based on structural entropy. We analyze the property of intrinsic connectivity and define intrinsic connectivity graph. We also identify the rank of the adjacency matrix as a crucial factor in revealing a graph that provides the same embeddings as the intrinsic connectivity graph. We then introduce structural entropy in the objective function to capture such a graph. Extensive experiments conducted on clustering and link prediction tasks under random-noises and meta-attack over three datasets show USER outperforms benchmarks and is robust to heavier randomness.  ( 2 min )
    Exploration of carbonate aggregates in road construction using ultrasonic and artificial intelligence approaches. (arXiv:2302.05884v1 [cs.LG])
    The COVID-19 pandemic has significantly impacted the construction sector, which is sensitive to economic cycles. In order to boost value and efficiency in this sector, the use of innovative exploration technologies such as ultrasonic and Artificial Intelligence techniques in building material research is becoming increasingly crucial. In this study, we developed two models for predicting the Los Angeles (LA) and Micro Deval (MDE) coefficients, two important geotechnical tests used to determine the quality of rock aggregates. These coefficients describe the resistance of aggregates to fragmentation and abrasion. The ultrasound velocity, porosity, and density of the rocks were determined and used as inputs to develop prediction models using multiple regression and an artificial neural network. These models may be used to assess the quality of rock aggregates at the exploration stage without the need for tedious laboratory analysis.  ( 2 min )
    Improving Accuracy of Interpretability Measures in Hyperparameter Optimization via Bayesian Algorithm Execution. (arXiv:2206.05447v2 [cs.LG] UPDATED)
    Despite all the benefits of automated hyperparameter optimization (HPO), most modern HPO algorithms are black-boxes themselves. This makes it difficult to understand the decision process which leads to the selected configuration, reduces trust in HPO, and thus hinders its broad adoption. Here, we study the combination of HPO with interpretable machine learning (IML) methods such as partial dependence plots. These techniques are more and more used to explain the marginal effect of hyperparameters on the black-box cost function or to quantify the importance of hyperparameters. However, if such methods are naively applied to the experimental data of the HPO process in a post-hoc manner, the underlying sampling bias of the optimizer can distort interpretations. We propose a modified HPO method which efficiently balances the search for the global optimum w.r.t. predictive performance \emph{and} the reliable estimation of IML explanations of an underlying black-box function by coupling Bayesian optimization and Bayesian Algorithm Execution. On benchmark cases of both synthetic objectives and HPO of a neural network, we demonstrate that our method returns more reliable explanations of the underlying black-box without a loss of optimization performance.  ( 2 min )
    From high-dimensional & mean-field dynamics to dimensionless ODEs: A unifying approach to SGD in two-layers networks. (arXiv:2302.05882v1 [stat.ML])
    This manuscript investigates the one-pass stochastic gradient descent (SGD) dynamics of a two-layer neural network trained on Gaussian data and labels generated by a similar, though not necessarily identical, target function. We rigorously analyse the limiting dynamics via a deterministic and low-dimensional description in terms of the sufficient statistics for the population risk. Our unifying analysis bridges different regimes of interest, such as the classical gradient-flow regime of vanishing learning rate, the high-dimensional regime of large input dimension, and the overparameterised "mean-field" regime of large network width, covering as well the intermediate regimes where the limiting dynamics is determined by the interplay between these behaviours. In particular, in the high-dimensional limit, the infinite-width dynamics is found to remain close to a low-dimensional subspace spanned by the target principal directions. Our results therefore provide a unifying picture of the limiting SGD dynamics with synthetic data.  ( 2 min )
    LipLearner: Customizable Silent Speech Interactions on Mobile Devices. (arXiv:2302.05907v1 [cs.HC])
    Silent speech interface is a promising technology that enables private communications in natural language. However, previous approaches only support a small and inflexible vocabulary, which leads to limited expressiveness. We leverage contrastive learning to learn efficient lipreading representations, enabling few-shot command customization with minimal user effort. Our model exhibits high robustness to different lighting, posture, and gesture conditions on an in-the-wild dataset. For 25-command classification, an F1-score of 0.8947 is achievable only using one shot, and its performance can be further boosted by adaptively learning from more data. This generalizability allowed us to develop a mobile silent speech interface empowered with on-device fine-tuning and visual keyword spotting. A user study demonstrated that with LipLearner, users could define their own commands with high reliability guaranteed by an online incremental learning scheme. Subjective feedback indicated that our system provides essential functionalities for customizable silent speech interactions with high usability and learnability.
    Exploiting Cultural Biases via Homoglyphs in Text-to-Image Synthesis. (arXiv:2209.08891v2 [cs.CV] UPDATED)
    Models for text-to-image synthesis, such as DALL-E~2 and Stable Diffusion, have recently drawn a lot of interest from academia and the general public. These models are capable of producing high-quality images that depict a variety of concepts and styles when conditioned on textual descriptions. However, these models adopt cultural characteristics associated with specific Unicode scripts from their vast amount of training data, which may not be immediately apparent. We show that by simply inserting single non-Latin characters in a textual description, common models reflect cultural stereotypes and biases in their generated images. We analyze this behavior both qualitatively and quantitatively, and identify a model's text encoder as the root cause of the phenomenon. Additionally, malicious users or service providers may try to intentionally bias the image generation to create racist stereotypes by replacing Latin characters with similarly-looking characters from non-Latin scripts, so-called homoglyphs. To mitigate such unnoticed script attacks, we propose a novel homoglyph unlearning method to fine-tune a text encoder, making it robust against homoglyph manipulations.  ( 2 min )
    Representation and Invariance in Reinforcement Learning. (arXiv:2112.07752v3 [cs.AI] UPDATED)
    Researchers have formalized reinforcement learning (RL) in different ways. If an agent in one RL framework is to run within another RL framework's environments, the agent must first be converted, or mapped, into that other framework. Whether or not this is possible depends on not only the RL frameworks in question and but also how intelligence itself is measured. In this paper, we lay foundations for studying relative-intelligence-preserving mappability between RL frameworks. We define two types of mappings, called weak and strong translations, between RL frameworks and prove that existence of these mappings enables two types of intelligence comparison according to the mappings preserving relative intelligence. We investigate the existence or lack thereof of these mappings between: (i) RL frameworks where agents go first and RL frameworks where environments go first; and (ii) twelve different RL frameworks differing in terms of whether or not agents or environments are required to be deterministic. In the former case, we consider various natural mappings between agent-first and environment-first RL and vice versa; we show some positive results (some such mappings are strong or weak translations) and some negative results (some such mappings are not). In the latter case, we completely characterize which of the twelve RL-framework pairs admit weak translations, under the assumption of integer-valued rewards and some additional mild assumptions.
    CoCoSoDa: Effective Contrastive Learning for Code Search. (arXiv:2204.03293v3 [cs.SE] UPDATED)
    Code search aims to retrieve semantically relevant code snippets for a given natural language query. Recently, many approaches employing contrastive learning have shown promising results on code representation learning and greatly improved the performance of code search. However, there is still a lot of room for improvement in using contrastive learning for code search. In this paper, we propose CoCoSoDa to effectively utilize contrastive learning for code search via two key factors in contrastive learning: data augmentation and negative samples. Specifically, soft data augmentation is to dynamically masking or replacing some tokens with their types for input sequences to generate positive samples. Momentum mechanism is used to generate large and consistent representations of negative samples in a mini-batch through maintaining a queue and a momentum encoder. In addition, multimodal contrastive learning is used to pull together representations of code-query pairs and push apart the unpaired code snippets and queries. We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages. Experimental results show that: (1) CoCoSoDa outperforms 14 baselines and especially exceeds CodeBERT, GraphCodeBERT, and UniXcoder by 13.3%, 10.5%, and 5.9% on average MRR scores, respectively. (2) The ablation studies show the effectiveness of each component of our approach. (3) We adapt our techniques to several different pre-trained models such as RoBERTa, CodeBERT, and GraphCodeBERT and observe a significant boost in their performance in code search. (4) Our model performs robustly under different hyper-parameters. Furthermore, we perform qualitative and quantitative analyses to explore reasons behind the good performance of our model.
    The infinite Viterbi alignment and decay-convexity. (arXiv:1810.04115v5 [math.PR] UPDATED)
    The infinite Viterbi alignment is the limiting maximum a-posteriori estimate of the unobserved path in a hidden Markov model as the length of the time horizon grows. For models on state-space $\mathbb{R}^{d}$ satisfying a new ``decay-convexity'' condition, we develop an approach to existence of the infinite Viterbi alignment in an infinite dimensional Hilbert space. Quantitative bounds on the distance to the infinite Viterbi alignment, which are the first of their kind, are derived and used to illustrate how approximate estimation via parallelization can be accurate and scaleable to high-dimensional problems because the rate of convergence to the infinite Viterbi alignment does not necessarily depend on $d$. The results are applied to approximate estimation via parallelization and a model of neural population activity.
    Dark solitons in Bose-Einstein condensates: a dataset for many-body physics research. (arXiv:2205.09114v2 [cond-mat.quant-gas] UPDATED)
    We establish a dataset of over $1.6\times10^4$ experimental images of Bose--Einstein condensates containing solitonic excitations to enable machine learning (ML) for many-body physics research. About $33~\%$ of this dataset has manually assigned and carefully curated labels. The remainder is automatically labeled using SolDet -- an implementation of a physics-informed ML data analysis framework -- consisting of a convolutional-neural-network-based classifier and OD as well as a statistically motivated physics-informed classifier and a quality metric. This technical note constitutes the definitive reference of the dataset, providing an opportunity for the data science community to develop more sophisticated analysis tools, to further understand nonlinear many-body physics, and even advance cold atom experiments.
    Autoselection of the Ensemble of Convolutional Neural Networks with Second-Order Cone Programming. (arXiv:2302.05950v1 [cs.LG])
    Ensemble techniques are frequently encountered in machine learning and engineering problems since the method combines different models and produces an optimal predictive solution. The ensemble concept can be adapted to deep learning models to provide robustness and reliability. Due to the growth of the models in deep learning, using ensemble pruning is highly important to deal with computational complexity. Hence, this study proposes a mathematical model which prunes the ensemble of Convolutional Neural Networks (CNN) consisting of different depths and layers that maximizes accuracy and diversity simultaneously with a sparse second order conic optimization model. The proposed model is tested on CIFAR-10, CIFAR-100 and MNIST data sets which gives promising results while reducing the complexity of models, significantly.
    SQA3D: Situated Question Answering in 3D Scenes. (arXiv:2210.07474v3 [cs.CV] UPDATED)
    We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D). Given a scene context (e.g., 3D scan), SQA3D requires the tested agent to first understand its situation (position, orientation, etc.) in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation. Based upon 650 scenes from ScanNet, we provide a dataset centered around 6.8k unique situations, along with 20.4k descriptions and 33.4k diverse reasoning questions for these situations. These questions examine a wide spectrum of reasoning capabilities for an intelligent agent, ranging from spatial relation comprehension to commonsense understanding, navigation, and multi-hop reasoning. SQA3D imposes a significant challenge to current multi-modal especially 3D reasoning models. We evaluate various state-of-the-art approaches and find that the best one only achieves an overall score of 47.20%, while amateur human participants can reach 90.06%. We believe SQA3D could facilitate future embodied AI research with stronger situation understanding and reasoning capability.  ( 2 min )
    SplitGP: Achieving Both Generalization and Personalization in Federated Learning. (arXiv:2212.08343v2 [cs.LG] UPDATED)
    A fundamental challenge to providing edge-AI services is the need for a machine learning (ML) model that achieves personalization (i.e., to individual clients) and generalization (i.e., to unseen data) properties concurrently. Existing techniques in federated learning (FL) have encountered a steep tradeoff between these objectives and impose large computational requirements on edge devices during training and inference. In this paper, we propose SplitGP, a new split learning solution that can simultaneously capture generalization and personalization capabilities for efficient inference across resource-constrained clients (e.g., mobile/IoT devices). Our key idea is to split the full ML model into client-side and server-side components, and impose different roles to them: the client-side model is trained to have strong personalization capability optimized to each client's main task, while the server-side model is trained to have strong generalization capability for handling all clients' out-of-distribution tasks. We analytically characterize the convergence behavior of SplitGP, revealing that all client models approach stationary points asymptotically. Further, we analyze the inference time in SplitGP and provide bounds for determining model split ratios. Experimental results show that SplitGP outperforms existing baselines by wide margins in inference time and test accuracy for varying amounts of out-of-distribution samples.  ( 2 min )
    SCLIFD:Supervised Contrastive Knowledge Distillation for Incremental Fault Diagnosis under Limited Fault Data. (arXiv:2302.05929v1 [cs.LG])
    Intelligent fault diagnosis has made extraordinary advancements currently. Nonetheless, few works tackle class-incremental learning for fault diagnosis under limited fault data, i.e., imbalanced and long-tailed fault diagnosis, which brings about various notable challenges. Initially, it is difficult to extract discriminative features from limited fault data. Moreover, a well-trained model must be retrained from scratch to classify the samples from new classes, thus causing a high computational burden and time consumption. Furthermore, the model may suffer from catastrophic forgetting when trained incrementally. Finally, the model decision is biased toward the new classes due to the class imbalance. The problems can consequently lead to performance degradation of fault diagnosis models. Accordingly, we introduce a supervised contrastive knowledge distillation for incremental fault diagnosis under limited fault data (SCLIFD) framework to address these issues, which extends the classical incremental classifier and representation learning (iCaRL) framework from three perspectives. Primarily, we adopt supervised contrastive knowledge distillation (KD) to enhance its representation learning capability under limited fault data. Moreover, we propose a novel prioritized exemplar selection method adaptive herding (AdaHerding) to restrict the increase of the computational burden, which is also combined with KD to alleviate catastrophic forgetting. Additionally, we adopt the cosine classifier to mitigate the adverse impact of class imbalance. We conduct extensive experiments on simulated and real-world industrial processes under different imbalance ratios. Experimental results show that our SCLIFD outperforms the existing methods by a large margin.
    Data efficiency and extrapolation trends in neural network interatomic potentials. (arXiv:2302.05823v1 [cs.LG])
    Over the last few years, key architectural advances have been proposed for neural network interatomic potentials (NNIPs), such as incorporating message-passing networks, equivariance, or many-body expansion terms. Although modern NNIP models exhibit nearly negligible differences in energy/forces errors, improvements in accuracy are still considered the main target when developing new NNIP architectures. In this work, we investigate how architectural choices influence the trainability and generalization error in NNIPs, revealing trends in extrapolation, data efficiency, and loss landscapes. First, we show that modern NNIP architectures recover the underlying potential energy surface (PES) of the training data even when trained to corrupted labels. Second, generalization metrics such as errors on high-temperature samples from the 3BPA dataset are demonstrated to follow a scaling relation for a variety of models. Thus, improvements in accuracy metrics may not bring independent information on the robust generalization of NNIPs. To circumvent this problem, we relate loss landscapes to model generalization across datasets. Using this probe, we explain why NNIPs with similar accuracy metrics exhibit different abilities to extrapolate and how training to forces improves the optimization landscape of a model. As an example, we show that MACE can predict PESes with reasonable error after being trained to as few as five data points, making it an example of a "few-shot" model for learning PESes. On the other hand, models with similar accuracy metrics such as NequIP show smaller ability to extrapolate in this extremely low-data regime. Our work provides a deep learning justification for the performance of many common NNIPs, and introduces tools beyond accuracy metrics that can be used to inform the development of next-generation models.
    Geodesic Graph Neural Network for Efficient Graph Representation Learning. (arXiv:2210.02636v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have recently been applied to graph learning tasks and achieved state-of-the-art (SOTA) results. However, many competitive methods run GNNs multiple times with subgraph extraction and customized labeling to capture information that is hard for normal GNNs to learn. Such operations are time-consuming and do not scale to large graphs. In this paper, we propose an efficient GNN framework called Geodesic GNN (GDGNN) that requires only one GNN run and injects conditional relationships between nodes into the model without labeling. This strategy effectively reduces the runtime of subgraph methods. Specifically, we view the shortest paths between two nodes as the spatial graph context of the neighborhood around them. The GNN embeddings of nodes on the shortest paths are used to generate geodesic representations. Conditioned on the geodesic representations, GDGNN can generate node, link, and graph representations that carry much richer structural information than plain GNNs. We theoretically prove that GDGNN is more powerful than plain GNNs. We present experimental results to show that GDGNN achieves highly competitive performance with SOTA GNN models on various graph learning tasks while taking significantly less time.  ( 2 min )
    Review of Extreme Multilabel Classification. (arXiv:2302.05971v1 [cs.LG])
    Extreme multilabel classification or XML, in short, has emerged as a new subtopic of interest in machine learning. Compared to traditional multilabel classification, here the number of labels is extremely large, hence the name extreme multilabel classification. Using classical one versus all classification wont scale in this case due to large number of labels, same is true for any other classifiers. Embedding of labels as well as features into smaller label space is an essential first step. Moreover, other issues include existance of head and tail labels, where tail labels are labels which exist in relatively smaller number of given samples. The existence of tail labels creates issues during embedding. This area has invited application of wide range of approaches ranging from bit compression motivated from compressed sensing, tree based embeddings, deep learning based latent space embedding including using attention weights, linear algebra based embeddings such as SVD, clustering, hashing, to name a few. The community has come up with a useful set of metrics to identify the correctly the prediction for head or tail labels.  ( 2 min )
    Generalization Ability of Wide Neural Networks on $\mathbb{R}$. (arXiv:2302.05933v1 [stat.ML])
    We perform a study on the generalization ability of the wide two-layer ReLU neural network on $\mathbb{R}$. We first establish some spectral properties of the neural tangent kernel (NTK): $a)$ $K_{d}$, the NTK defined on $\mathbb{R}^{d}$, is positive definite; $b)$ $\lambda_{i}(K_{1})$, the $i$-th largest eigenvalue of $K_{1}$, is proportional to $i^{-2}$. We then show that: $i)$ when the width $m\rightarrow\infty$, the neural network kernel (NNK) uniformly converges to the NTK; $ii)$ the minimax rate of regression over the RKHS associated to $K_{1}$ is $n^{-2/3}$; $iii)$ if one adopts the early stopping strategy in training a wide neural network, the resulting neural network achieves the minimax rate; $iv)$ if one trains the neural network till it overfits the data, the resulting neural network can not generalize well. Finally, we provide an explanation to reconcile our theory and the widely observed ``benign overfitting phenomenon''.  ( 2 min )
    On the Role of Fixed Points of Dynamical Systems in Training Physics-Informed Neural Networks. (arXiv:2203.13648v2 [cs.LG] UPDATED)
    This paper empirically studies commonly observed training difficulties of Physics-Informed Neural Networks (PINNs) on dynamical systems. Our results indicate that fixed points which are inherent to these systems play a key role in the optimization of the in PINNs embedded physics loss function. We observe that the loss landscape exhibits local optima that are shaped by the presence of fixed points. We find that these local optima contribute to the complexity of the physics loss optimization which can explain common training difficulties and resulting nonphysical predictions. Under certain settings, e.g., initial conditions close to fixed points or long simulations times, we show that those optima can even become better than that of the desired solution.  ( 2 min )
    Alternating Implicit Projected SGD and Its Efficient Variants for Equality-constrained Bilevel Optimization. (arXiv:2211.07096v2 [cs.LG] UPDATED)
    Stochastic bilevel optimization, which captures the inherent nested structure of machine learning problems, is gaining popularity in many recent applications. Existing works on bilevel optimization mostly consider either unconstrained problems or constrained upper-level problems. This paper considers the stochastic bilevel optimization problems with equality constraints both in the upper and lower levels. By leveraging the special structure of the equality constraints problem, the paper first presents an alternating implicit projected SGD approach and establishes the $\tilde{\cal O}(\epsilon^{-2})$ sample complexity that matches the state-of-the-art complexity of ALSET \citep{chen2021closing} for unconstrained bilevel problems. To further save the cost of projection, the paper presents two alternating implicit projection-efficient SGD approaches, where one algorithm enjoys the $\tilde{\cal O}(\epsilon^{-2}/T)$ upper-level and $\tilde{\cal O}(\epsilon^{-1.5}/T^{\frac{3}{4}})$ lower-level projection complexity with ${\cal O}(T)$ lower-level batch size, and the other one enjoys $\tilde{\cal O}(\epsilon^{-1.5})$ upper-level and lower-level projection complexity with ${\cal O}(1)$ batch size. Application to federated bilevel optimization has been presented to showcase the empirical performance of our algorithms. Our results demonstrate that equality-constrained bilevel optimization with strongly-convex lower-level problems can be solved as efficiently as stochastic single-level optimization problems.  ( 2 min )
    Recursive Estimation of Conditional Kernel Mean Embeddings. (arXiv:2302.05955v1 [stat.ML])
    Kernel mean embeddings, a widely used technique in machine learning, map probability distributions to elements of a reproducing kernel Hilbert space (RKHS). For supervised learning problems, where input-output pairs are observed, the conditional distribution of outputs given the inputs is a key object. The input dependent conditional distribution of an output can be encoded with an RKHS valued function, the conditional kernel mean map. In this paper we present a new recursive algorithm to estimate the conditional kernel mean map in a Hilbert space valued $L_2$ space, that is in a Bochner space. We prove the weak and strong $L_2$ consistency of our recursive estimator under mild conditions. The idea is to generalize Stone's theorem for Hilbert space valued regression in a locally compact Polish space. We present new insights about conditional kernel mean embeddings and give strong asymptotic bounds regarding the convergence of the proposed recursive method. Finally, the results are demonstrated on three application domains: for inputs coming from Euclidean spaces, Riemannian manifolds and locally compact subsets of function spaces.  ( 2 min )
    Proximal Causal Learning with Kernels: Two-Stage Estimation and Moment Restriction. (arXiv:2105.04544v6 [cs.LG] UPDATED)
    We address the problem of causal effect estimation in the presence of unobserved confounding, but where proxies for the latent confounder(s) are observed. We propose two kernel-based methods for nonlinear causal effect estimation in this setting: (a) a two-stage regression approach, and (b) a maximum moment restriction approach. We focus on the proximal causal learning setting, but our methods can be used to solve a wider class of inverse problems characterised by a Fredholm integral equation. In particular, we provide a unifying view of two-stage and moment restriction approaches for solving this problem in a nonlinear setting. We provide consistency guarantees for each algorithm, and we demonstrate these approaches achieve competitive results on synthetic data and data simulating a real-world task. In particular, our approach outperforms earlier methods that are not suited to leveraging proxy variables.
    Self-supervised EEG Representation Learning for Automatic Sleep Staging. (arXiv:2110.15278v3 [eess.SP] UPDATED)
    Background: Deep learning models have shown great success in automating tasks in sleep medicine by learning from carefully annotated Electroencephalogram (EEG) data. However, effectively utilizing a large amount of raw EEG remains a challenge. Objective: In this paper, we aim to learn robust vector representations from massive unlabeled EEG signals, such that the learned vectorized features (1) are expressive enough to replace the raw signals in the sleep staging task; and (2) provide better predictive performance than supervised models in scenarios of fewer labels and noisy samples. Methods: We propose a self-supervised model, named Contrast with the World Representation (ContraWR), for EEG signal representation learning, which uses global statistics from the dataset to distinguish signals associated with different sleep stages. The ContraWR model is evaluated on three real-world EEG datasets that include both at-home and in-lab EEG recording settings. Results: ContraWR outperforms 4 recent self-supervised learning methods on the sleep staging task across 3 large EEG datasets. ContraWR also beats supervised learning when fewer training labels are available (e.g., 4% accuracy improvement when less than 2% data is labeled). Moreover, the model provides informative representative feature structures in 2D projection. Conclusions: We show that ContraWR is robust to noise and can provide high-quality EEG representations for downstream prediction tasks. The proposed model can be generalized to other unsupervised physiological signal learning tasks. Future directions include exploring task-specific data augmentations and combining self-supervised with supervised methods, building upon the initial success of self-supervised learning in this paper.
    Koopman-Based Bound for Generalization: New Aspect of Neural Networks Regarding Nonlinear Noise Filtering. (arXiv:2302.05825v1 [cs.LG])
    We propose a new bound for generalization of neural networks using Koopman operators. Unlike most of the existing works, we focus on the role of the final nonlinear transformation of the networks. Our bound is described by the reciprocal of the determinant of the weight matrices and is tighter than existing norm-based bounds when the weight matrices do not have small singular values. According to existing theories about the low-rankness of the weight matrices, it may be counter-intuitive that we focus on the case where singular values of weight matrices are not small. However, motivated by the final nonlinear transformation, we can see that our result sheds light on a new perspective regarding a noise filtering property of neural networks. Since our bound comes from Koopman operators, this work also provides a connection between operator-theoretic analysis and generalization of neural networks. Numerical results support the validity of our theoretical results.
    Near-optimal learning with average H\"older smoothness. (arXiv:2302.06005v1 [cs.LG])
    We generalize the notion of average Lipschitz smoothness proposed by Ashlagi et al. (COLT 2021) by extending it to H\"older smoothness. This measure of the ``effective smoothness'' of a function is sensitive to the underlying distribution and can be dramatically smaller than its classic ``worst-case'' H\"older constant. We prove nearly tight upper and lower risk bounds in terms of the average H\"older smoothness, establishing the minimax rate in the realizable regression setting up to log factors; this was not previously known even in the special case of average Lipschitz smoothness. From an algorithmic perspective, since our notion of average smoothness is defined with respect to the unknown sampling distribution, the learner does not have an explicit representation of the function class, hence is unable to execute ERM. Nevertheless, we provide a learning algorithm that achieves the (nearly) optimal learning rate. Our results hold in any totally bounded metric space, and are stated in terms of its intrinsic geometry. Overall, our results show that the classic worst-case notion of H\"older smoothness can be essentially replaced by its average, yielding considerably sharper guarantees.
    Distribution-Free Model for Community Detection. (arXiv:2111.07495v4 [cs.SI] UPDATED)
    Community detection for unweighted networks has been widely studied in network analysis, but the case of weighted networks remains a challenge. This paper proposes a general Distribution-Free Model (DFM) for weighted networks in which nodes are partitioned into different communities. DFM can be seen as a generalization of the famous stochastic blockmodels from unweighted networks to weighted networks. DFM does not require prior knowledge of a specific distribution for elements of the adjacency matrix but only the expected value. In particular, signed networks with latent community structures can be modeled by DFM. We build a theoretical guarantee to show that a simple spectral clustering algorithm stably yields consistent community detection under DFM. We also propose a four-step data generation process to generate adjacency matrices with missing edges by combining DFM, noise matrix, and a model for unweighted networks. Using experiments with simulated and real datasets, we show that some benchmark algorithms can successfully recover community membership for weighted networks generated by the proposed data generation process.
    An unsupervised learning approach for predicting wind farm power and downstream wakes using weather patterns. (arXiv:2302.05886v1 [stat.ML])
    Wind energy resource assessment typically requires numerical models, but such models are too computationally intensive to consider multi-year timescales. Increasingly, unsupervised machine learning techniques are used to identify a small number of representative weather patterns to simulate long-term behaviour. Here we develop a novel wind energy workflow that for the first time combines weather patterns derived from unsupervised clustering techniques with numerical weather prediction models (here WRF) to obtain efficient and accurate long-term predictions of power and downstream wakes from an entire wind farm. We use ERA5 reanalysis data clustering not only on low altitude pressure but also, for the first time, on the more relevant variable of wind velocity. We also compare the use of large-scale and local-scale domains for clustering. A WRF simulation is run at each of the cluster centres and the results are aggregated using a novel post-processing technique. By applying our workflow to two different regions, we show that our long-term predictions agree with those from a year of WRF simulations but require less than 2% of the computational time. The most accurate results are obtained when clustering on wind velocity. Moreover, clustering over the Europe-wide domain is sufficient for predicting wind farm power output, but downstream wake predictions benefit from the use of smaller domains. Finally, we show that these downstream wakes can affect the local weather patterns. Our approach facilitates multi-year predictions of power output and downstream farm wakes, by providing a fast, accurate and flexible methodology that is applicable to any global region. Moreover, these accurate long-term predictions of downstream wakes provide the first tool to help mitigate the effects of wind energy loss downstream of wind farms, since they can be used to determine optimum wind farm locations.
    Wide stochastic networks: Gaussian limit and PAC-Bayesian training. (arXiv:2106.09798v3 [stat.ML] UPDATED)
    The limit of infinite width allows for substantial simplifications in the analytical study of over-parameterised neural networks. With a suitable random initialisation, an extremely large network exhibits an approximately Gaussian behaviour. In the present work, we establish a similar result for a simple stochastic architecture whose parameters are random variables, holding both before and during training. The explicit evaluation of the output distribution allows for a PAC-Bayesian training procedure that directly optimises the generalisation bound. For a large but finite-width network, we show empirically on MNIST that this training approach can outperform standard PAC-Bayesian methods.
    Differentially Private Normalizing Flows for Density Estimation, Data Synthesis, and Variational Inference with Application to Electronic Health Records. (arXiv:2302.05787v1 [stat.ML])
    Electronic health records (EHR) often contain sensitive medical information about individual patients, posing significant limitations to sharing or releasing EHR data for downstream learning and inferential tasks. We use normalizing flows (NF), a family of deep generative models, to estimate the probability density of a dataset with differential privacy (DP) guarantees, from which privacy-preserving synthetic data are generated. We apply the technique to an EHR dataset containing patients with pulmonary hypertension. We assess the learning and inferential utility of the synthetic data by comparing the accuracy in the prediction of the hypertension status and variational posterior distribution of the parameters of a physics-based model. In addition, we use a simulated dataset from a nonlinear model to compare the results from variational inference (VI) based on privacy-preserving synthetic data, and privacy-preserving VI obtained from directly privatizing NFs for VI with DP guarantees given the original non-private dataset. The results suggest that synthetic data generated through differentially private density estimation with NF can yield good utility at a reasonable privacy cost. We also show that VI obtained from differentially private NF based on the free energy bound loss may produce variational approximations with significantly altered correlation structure, and loss formulations based on alternative dissimilarity metrics between two distributions might provide improved results.
    Deep Reinforcement Learning for Unmanned Aerial Vehicle-Assisted Vehicular Networks. (arXiv:1906.05015v11 [cs.LG] UPDATED)
    Unmanned aerial vehicles (UAVs) are envisioned to complement the 5G communication infrastructure in future smart cities. Hot spots easily appear in road intersections, where effective communication among vehicles is challenging. UAVs may serve as relays with the advantages of low price, easy deployment, line-of-sight links, and flexible mobility. In this paper, we study a UAV-assisted vehicular network where the UAV jointly adjusts its transmission control (power and channel) and 3D flight to maximize the total throughput. First, we formulate a Markov decision process (MDP) problem by modeling the mobility of the UAV/vehicles and the state transitions. Secondly, we solve the target problem using a deep reinforcement learning method, namely, the deep deterministic policy gradient (DDPG), and propose three solutions with different control objectives. Deep reinforcement learning methods obtain the optimal policy through the interactions with the environment without knowing the environment variables. Considering that environment variables in our problem are unknown and unmeasurable, we choose a deep reinforcement learning method to solve it. Moreover, considering the energy consumption of 3D flight, we extend the proposed solutions to maximize the total throughput per unit energy. To encourage or discourage the UAV's mobility according to its prediction, the DDPG framework is modified, where the UAV adjusts its learning rate automatically. Thirdly, in a simplified model with small state space and action space, we verify the optimality of proposed algorithms. Comparing with two baseline schemes, we demonstrate the effectiveness of proposed algorithms in a realistic model.
    Generative Sampling in Bundle Tractography using Autoencoders (GESTA). (arXiv:2204.10891v2 [cs.CV] UPDATED)
    Current tractography methods use the local orientation information to propagate streamlines from seed locations. Many such seeds provide streamlines that stop prematurely or fail to map the true white matter pathways because some bundles are "harder-to-track" than others. This results in tractography reconstructions with poor white and gray matter spatial coverage. In this work, we propose a generative, autoencoder-based method, named GESTA (Generative Sampling in Bundle Tractography using Autoencoders), that produces streamlines achieving better spatial coverage. Compared to other deep learning methods, our autoencoder-based framework uses a single model to generate streamlines in a bundle-wise fashion, and does not require to propagate local orientations. GESTA produces new and complete streamlines for any given white matter bundle, including hard-to-track bundles. Applied on top of a given tractogram, GESTA is shown to be effective in improving the white matter volume coverage in poorly populated bundles, both on synthetic and human brain in vivo data. Our streamline evaluation framework ensures that the streamlines produced by GESTA are anatomically plausible and fit well to the local diffusion signal. The streamline evaluation criteria assess anatomy (white matter coverage), local orientation alignment (direction), and geometry features of streamlines, and optionally, gray matter connectivity. GESTA is thus a novel deep generative bundle tractography method that can be used to improve the tractography reconstruction of the white matter.  ( 2 min )
    A large parametrized space of meta-reinforcement learning tasks. (arXiv:2302.05583v1 [cs.LG])
    We describe a parametrized space for simple meta-reinforcement-learning (meta-RL) tasks with arbitrary stimuli. The parametrization allows us to randomly generate an arbitrary number of novel simple meta-learning tasks. The space of meta-RL tasks covered by this parametrization includes many well-known meta-RL tasks, such as bandit tasks, the Harlow task, T-mazes, the Daw two-step task and others. Simple extensions allow it to capture tasks based on two-dimensional topological spaces, such as find-the-spot or key-door tasks. We describe a number of randomly generated meta-RL tasks and discuss potential issues arising from random generation.  ( 2 min )
    Robustification of Multilingual Language Models to Real-world Noise in Crosslingual Zero-shot Settings with Robust Contrastive Pretraining. (arXiv:2210.04782v2 [cs.CL] UPDATED)
    Advances in neural modeling have achieved state-of-the-art (SOTA) results on public natural language processing (NLP) benchmarks, at times surpassing human performance. However, there is a gap between public benchmarks and real-world applications where noise, such as typographical or grammatical mistakes, is abundant and can result in degraded performance. Unfortunately, works which evaluate the robustness of neural models on noisy data and propose improvements, are limited to the English language. Upon analyzing noise in different languages, we observe that noise types vary greatly across languages. Thus, existing investigations do not generalize trivially to multilingual settings. To benchmark the performance of pretrained multilingual language models, we construct noisy datasets covering five languages and four NLP tasks and observe a clear gap in the performance between clean and noisy data in the zero-shot cross-lingual setting. After investigating several ways to boost the robustness of multilingual models in this setting, we propose Robust Contrastive Pretraining (RCP). RCP combines data augmentation with a contrastive loss term at the pretraining stage and achieves large improvements on noisy (and original test data) across two sentence-level (+3.2%) and two sequence-labeling (+10 F1-score) multilingual classification tasks.  ( 2 min )
    Fixed points of nonnegative neural networks. (arXiv:2106.16239v6 [stat.ML] UPDATED)
    We consider the existence of fixed points of nonnegative neural networks, i.e., neural networks that take as an input and produce as an output nonnegative vectors. We first show that nonnegative neural networks with nonnegative weights and biases can be recognized as monotonic and (weakly) scalable functions within the framework of nonlinear Perron-Frobenius theory. This fact enables us to provide conditions for the existence of fixed points of nonnegative neural networks, and these conditions are weaker than those obtained recently using arguments in convex analysis. Furthermore, we prove that the shape of the fixed point set of nonnegative neural networks with nonnegative weights and biases is an interval, which under mild conditions degenerates to a point. These results are then used to obtain the existence of fixed points of more general types of nonnegative neural networks. The results of this paper contribute to the understanding of the behavior of autoencoders, and they provide insight into neural networks designed using the loop-unrolling technique, which can be seen as a fixed point searching algorithm. The chief theoretical results of this paper are verified in numerical simulations.  ( 2 min )
    Is Distance Matrix Enough for Geometric Deep Learning?. (arXiv:2302.05743v1 [cs.LG])
    Graph Neural Networks (GNNs) are often used for tasks involving the geometry of a given graph, such as molecular dynamics simulation. While the distance matrix of a graph contains the complete geometric structure information, whether GNNs can learn this geometry solely from the distance matrix has yet to be studied. In this work, we first demonstrate that Message Passing Neural Networks (MPNNs) are insufficient for learning the geometry of a graph from its distance matrix by constructing families of geometric graphs which cannot be distinguished by MPNNs. We then propose $k$-DisGNNs, which can effectively exploit the rich geometry contained in the distance matrix. We demonstrate the high expressive power of our models and prove that some existing well-designed geometric models can be unified by $k$-DisGNNs as special cases. Most importantly, we establish a connection between geometric deep learning and traditional graph representation learning, showing that those highly expressive GNN models originally designed for graph structure learning can also be applied to geometric deep learning problems with impressive performance, and that existing complex, equivariant models are not the only solution. Experimental results verify our theory.  ( 2 min )
    How to prepare your task head for finetuning. (arXiv:2302.05779v1 [cs.LG])
    In deep learning, transferring information from a pretrained network to a downstream task by finetuning has many benefits. The choice of task head plays an important role in fine-tuning, as the pretrained and downstream tasks are usually different. Although there exist many different designs for finetuning, a full understanding of when and why these algorithms work has been elusive. We analyze how the choice of task head controls feature adaptation and hence influences the downstream performance. By decomposing the learning dynamics of adaptation, we find that the key aspect is the training accuracy and loss at the beginning of finetuning, which determines the "energy" available for the feature's adaptation. We identify a significant trend in the effect of changes in this initial energy on the resulting features after fine-tuning. Specifically, as the energy increases, the Euclidean and cosine distances between the resulting and original features increase, while their dot products (and the resulting features' norm) first increase and then decrease. Inspired by this, we give several practical principles that lead to better downstream performance. We analytically prove this trend in an overparamterized linear setting and verify its applicability to different experimental settings.  ( 2 min )
    A Framework for Overparameterized Learning. (arXiv:2205.13507v2 [cs.LG] UPDATED)
    A candidate explanation of the good empirical performance of deep neural networks is the implicit regularization effect of first order optimization methods. Inspired by this, we prove a convergence theorem for nonconvex composite optimization, and apply it to a general learning problem covering many machine learning applications, including supervised learning. We then present a deep multilayer perceptron model and prove that, when sufficiently wide, it $(i)$ leads to the convergence of gradient descent to a global optimum with a linear rate, $(ii)$ benefits from the implicit regularization effect of gradient descent, $(iii)$ is subject to novel bounds on the generalization error, $(iv)$ exhibits the lazy training phenomenon and $(v)$ enjoys learning rate transfer across different widths. The corresponding coefficients, such as the convergence rate, improve as width is further increased, and depend on the even order moments of the data generating distribution up to an order depending on the number of layers. The only non-mild assumption we make is the concentration of the smallest eigenvalue of the neural tangent kernel at initialization away from zero, which has been shown to hold for a number of less general models in contemporary works. We present empirical evidence supporting this assumption as well as our theoretical claims.  ( 2 min )
    CUDA: Curriculum of Data Augmentation for Long-Tailed Recognition. (arXiv:2302.05499v1 [cs.CV])
    Class imbalance problems frequently occur in real-world tasks, and conventional deep learning algorithms are well known for performance degradation on imbalanced training datasets. To mitigate this problem, many approaches have aimed to balance among given classes by re-weighting or re-sampling training samples. These re-balancing methods increase the impact of minority classes and reduce the influence of majority classes on the output of models. However, the extracted representations may be of poor quality owing to the limited number of minority samples. To handle this restriction, several methods have been developed that increase the representations of minority samples by leveraging the features of the majority samples. Despite extensive recent studies, no deep analysis has been conducted on determination of classes to be augmented and strength of augmentation has been conducted. In this study, we first investigate the correlation between the degree of augmentation and class-wise performance, and find that the proper degree of augmentation must be allocated for each class to mitigate class imbalance problems. Motivated by this finding, we propose a simple and efficient novel curriculum, which is designed to find the appropriate per-class strength of data augmentation, called CUDA: CUrriculum of Data Augmentation for long-tailed recognition. CUDA can simply be integrated into existing long-tailed recognition methods. We present the results of experiments showing that CUDA effectively achieves better generalization performance compared to the state-of-the-art method on various imbalanced datasets such as CIFAR-100-LT, ImageNet-LT, and iNaturalist 2018.  ( 2 min )
    On the geometry of Stein variational gradient descent. (arXiv:1912.00894v2 [stat.ML] UPDATED)
    Bayesian inference problems require sampling or approximating high-dimensional probability distributions. The focus of this paper is on the recently introduced Stein variational gradient descent methodology, a class of algorithms that rely on iterated steepest descent steps with respect to a reproducing kernel Hilbert space norm. This construction leads to interacting particle systems, the mean-field limit of which is a gradient flow on the space of probability distributions equipped with a certain geometrical structure. We leverage this viewpoint to shed some light on the convergence properties of the algorithm, in particular addressing the problem of choosing a suitable positive definite kernel function. Our analysis leads us to considering certain nondifferentiable kernels with adjusted tails. We demonstrate significant performance gains of these in various numerical experiments.
    Deep Unfolding of the DBFB Algorithm with Application to ROI CT Imaging with Limited Angular Density. (arXiv:2209.13264v2 [eess.IV] UPDATED)
    This paper presents a new method for reconstructing regions of interest (ROI) from a limited number of computed tomography (CT) measurements. Classical model-based iterative reconstruction methods lead to images with predictable features. Still, they often suffer from tedious parameterization and slow convergence. On the contrary, deep learning methods are fast, and they can reach high reconstruction quality by leveraging information from large datasets, but they lack interpretability. At the crossroads of both methods, deep unfolding networks have been recently proposed. Their design includes the physics of the imaging system and the steps of an iterative optimization algorithm. Motivated by the success of these networks for various applications, we introduce an unfolding neural network called U-RDBFB designed for ROI CT reconstruction from limited data. Few-view truncated data are effectively handled thanks to a robust non-convex data fidelity term combined with a sparsity-inducing regularization function. We unfold the Dual Block coordinate Forward-Backward (DBFB) algorithm, embedded in an iterative reweighted scheme, allowing the learning of key parameters in a supervised manner. Our experiments show an improvement over several state-of-the-art methods, including a model-based iterative scheme, a multi-scale deep learning architecture, and deep unfolding methods.  ( 2 min )
    Interpretable Deep Learning for Forecasting Online Advertising Costs: Insights from the Competitive Bidding Landscape. (arXiv:2302.05762v1 [cs.LG])
    As advertisers increasingly shift their budgets toward digital advertising, forecasting advertising costs is essential for making budget plans to optimize marketing campaign returns. In this paper, we perform a comprehensive study using a variety of time-series forecasting methods to predict daily average cost-per-click (CPC) in the online advertising market. We show that forecasting advertising costs would benefit from multivariate models using covariates from competitors' CPC development identified through time-series clustering. We further interpret the results by analyzing feature importance and temporal attention. Finally, we show that our approach has several advantages over models that individual advertisers might build based solely on their collected data.  ( 2 min )
    SpReME: Sparse Regression for Multi-Environment Dynamic Systems. (arXiv:2302.05942v1 [cs.LG])
    Learning dynamical systems is a promising avenue for scientific discoveries. However, capturing the governing dynamics in multiple environments still remains a challenge: model-based approaches rely on the fidelity of assumptions made for a single environment, whereas data-driven approaches based on neural networks are often fragile on extrapolating into the future. In this work, we develop a method of sparse regression dubbed SpReME to discover the major dynamics that underlie multiple environments. Specifically, SpReME shares a sparse structure of ordinary differential equation (ODE) across different environments in common while allowing each environment to keep the coefficients of ODE terms independently. We demonstrate that the proposed model captures the correct dynamics from multiple environments over four different dynamic systems with improved prediction performance.  ( 2 min )
    CHARD: Clinical Health-Aware Reasoning Across Dimensions for Text Generation Models. (arXiv:2210.04191v2 [cs.CL] UPDATED)
    We motivate and introduce CHARD: Clinical Health-Aware Reasoning across Dimensions, to investigate the capability of text generation models to act as implicit clinical knowledge bases and generate free-flow textual explanations about various health-related conditions across several dimensions. We collect and present an associated dataset, CHARDat, consisting of explanations about 52 health conditions across three clinical dimensions. We conduct extensive experiments using BART and T5 along with data augmentation, and perform automatic, human, and qualitative analyses. We show that while our models can perform decently, CHARD is very challenging with strong potential for further exploration.  ( 2 min )
    SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection. (arXiv:2207.08003v4 [cs.CV] UPDATED)
    A self-supervised multi-task learning (SSMTL) framework for video anomaly detection was recently introduced in literature. Due to its highly accurate results, the method attracted the attention of many researchers. In this work, we revisit the self-supervised multi-task learning framework, proposing several updates to the original method. First, we study various detection methods, e.g. based on detecting high-motion regions using optical flow or background subtraction, since we believe the currently used pre-trained YOLOv3 is suboptimal, e.g. objects in motion or objects from unknown classes are never detected. Second, we modernize the 3D convolutional backbone by introducing multi-head self-attention modules, inspired by the recent success of vision transformers. As such, we alternatively introduce both 2D and 3D convolutional vision transformer (CvT) blocks. Third, in our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps through knowledge distillation, solving jigsaw puzzles, estimating body pose through knowledge distillation, predicting masked regions (inpainting), and adversarial learning with pseudo-anomalies. We conduct experiments to assess the performance impact of the introduced changes. Upon finding more promising configurations of the framework, dubbed SSMTL++v1 and SSMTL++v2, we extend our preliminary experiments to more data sets, demonstrating that our performance gains are consistent across all data sets. In most cases, our results on Avenue, ShanghaiTech and UBnormal raise the state-of-the-art performance bar to a new level.  ( 2 min )
    Conditional Positional Encodings for Vision Transformers. (arXiv:2102.10882v3 [cs.CV] UPDATED)
    We propose a conditional positional encoding (CPE) scheme for vision Transformers. Unlike previous fixed or learnable positional encodings, which are pre-defined and independent of input tokens, CPE is dynamically generated and conditioned on the local neighborhood of the input tokens. As a result, CPE can easily generalize to the input sequences that are longer than what the model has ever seen during training. Besides, CPE can keep the desired translation-invariance in the image classification task, resulting in improved performance. We implement CPE with a simple Position Encoding Generator (PEG) to get seamlessly incorporated into the current Transformer framework. Built on PEG, we present Conditional Position encoding Vision Transformer (CPVT). We demonstrate that CPVT has visually similar attention maps compared to those with learned positional encodings and delivers outperforming results. Our code is available at https://github.com/Meituan-AutoML/CPVT .  ( 2 min )
    Mitigating Dataset Bias by Using Per-sample Gradient. (arXiv:2205.15704v3 [cs.LG] UPDATED)
    The performance of deep neural networks is strongly influenced by the training dataset setup. In particular, when attributes having a strong correlation with the target attribute are present, the trained model can provide unintended prejudgments and show significant inference errors (i.e., the dataset bias problem). Various methods have been proposed to mitigate dataset bias, and their emphasis is on weakly correlated samples, called bias-conflicting samples. These methods are based on explicit bias labels involving human or empirical correlation metrics (e.g., training loss). However, such metrics require human costs or have insufficient theoretical explanation. In this study, we propose a debiasing algorithm, called PGD (Per-sample Gradient-based Debiasing), that comprises three steps: (1) training a model on uniform batch sampling, (2) setting the importance of each sample in proportion to the norm of the sample gradient, and (3) training the model using importance-batch sampling, whose probability is obtained in step (2). Compared with existing baselines for various synthetic and real-world datasets, the proposed method showed state-of-the-art accuracy for a the classification task. Furthermore, we describe theoretical understandings about how PGD can mitigate dataset bias.  ( 2 min )
    On the Computational Efficiency of Adaptive and Dynamic Regret Minimization. (arXiv:2207.00646v4 [cs.LG] UPDATED)
    In online convex optimization, the player aims to minimize regret, or the difference between her loss and that of the best fixed decision in hindsight over the entire repeated game. Algorithms that minimize (standard) regret may converge to a fixed decision, which is undesirable in changing or dynamic environments. This motivates the stronger metrics of performance, notably adaptive and dynamic regret. Adaptive regret is the maximum regret over any continuous sub-interval in time. Dynamic regret is the difference between the total cost and that of the best sequence of decisions in hindsight. State-of-the-art performance in both adaptive and dynamic regret minimization suffers a computational penalty - typically on the order of a multiplicative factor that grows logarithmically in the number of game iterations. In this paper we show how to reduce this computational penalty to be doubly logarithmic in the number of game iterations, and retain near optimal adaptive and dynamic regret bounds.  ( 2 min )
    Towards A Proactive ML Approach for Detecting Backdoor Poison Samples. (arXiv:2205.13616v2 [cs.LG] UPDATED)
    Adversaries can embed backdoors in deep learning models by introducing backdoor poison samples into training datasets. In this work, we investigate how to detect such poison samples to mitigate the threat of backdoor attacks. First, we uncover a post-hoc workflow underlying most prior work, where defenders passively allow the attack to proceed and then leverage the characteristics of the post-attacked model to uncover poison samples. We reveal that this workflow does not fully exploit defenders' capabilities, and defense pipelines built on it are prone to failure or performance degradation in many scenarios. Second, we suggest a paradigm shift by promoting a proactive mindset in which defenders engage proactively with the entire model training and poison detection pipeline, directly enforcing and magnifying distinctive characteristics of the post-attacked model to facilitate poison detection. Based on this, we formulate a unified framework and provide practical insights on designing detection pipelines that are more robust and generalizable. Third, we introduce the technique of Confusion Training (CT) as a concrete instantiation of our framework. CT applies an additional poisoning attack to the already poisoned dataset, actively decoupling benign correlation while exposing backdoor patterns to detection. Empirical evaluations on 4 datasets and 14 types of attacks validate the superiority of CT over 11 baseline defenses.  ( 2 min )
    Policy-Induced Self-Supervision Improves Representation Finetuning in Visual RL. (arXiv:2302.06009v1 [cs.LG])
    We study how to transfer representations pretrained on source tasks to target tasks in visual percept based RL. We analyze two popular approaches: freezing or finetuning the pretrained representations. Empirical studies on a set of popular tasks reveal several properties of pretrained representations. First, finetuning is required even when pretrained representations perfectly capture the information required to solve the target task. Second, finetuned representations improve learnability and are more robust to noise. Third, pretrained bottom layers are task-agnostic and readily transferable to new tasks, while top layers encode task-specific information and require adaptation. Building on these insights, we propose a self-supervised objective that clusters representations according to the policy they induce, as opposed to traditional representation similarity measures which are policy-agnostic (e.g. Euclidean norm, cosine similarity). Together with freezing the bottom layers, this objective results in significantly better representation than frozen, finetuned, and self-supervised alternatives on a wide range of benchmarks.  ( 2 min )
    A Rigorous Framework for the Mean Field Limit of Multilayer Neural Networks. (arXiv:2001.11443v3 [cs.LG] UPDATED)
    We develop a mathematically rigorous framework for multilayer neural networks in the mean field regime. As the network's widths increase, the network's learning trajectory is shown to be well captured by a meaningful and dynamically nonlinear limit (the \textit{mean field} limit), which is characterized by a system of ODEs. Our framework applies to a broad range of network architectures, learning dynamics and network initializations. Central to the framework is the new idea of a \textit{neuronal embedding}, which comprises of a non-evolving probability space that allows to embed neural networks of arbitrary widths. Using our framework, we prove several properties of large-width multilayer neural networks. Firstly we show that independent and identically distributed initializations cause strong degeneracy effects on the network's learning trajectory when the network's depth is at least four. Secondly we obtain several global convergence guarantees for feedforward multilayer networks under a number of different setups. These include two-layer and three-layer networks with independent and identically distributed initializations, and multilayer networks of arbitrary depths with a special type of correlated initializations that is motivated by the new concept of \textit{bidirectional diversity}. Unlike previous works that rely on convexity, our results admit non-convex losses and hinge on a certain universal approximation property, which is a distinctive feature of infinite-width neural networks and is shown to hold throughout the training process. Aside from being the first known results for global convergence of multilayer networks in the mean field regime, they demonstrate flexibility of our framework and incorporate several new ideas and insights that depart from the conventional convex optimization wisdom.  ( 3 min )
    Efficient Fraud Detection using Deep Boosting Decision Trees. (arXiv:2302.05918v1 [stat.ML])
    Fraud detection is to identify, monitor, and prevent potentially fraudulent activities from complex data. The recent development and success in AI, especially machine learning, provides a new data-driven way to deal with fraud. From a methodological point of view, machine learning based fraud detection can be divided into two categories, i.e., conventional methods (decision tree, boosting...) and deep learning, both of which have significant limitations in terms of the lack of representation learning ability for the former and interpretability for the latter. Furthermore, due to the rarity of detected fraud cases, the associated data is usually imbalanced, which seriously degrades the performance of classification algorithms. In this paper, we propose deep boosting decision trees (DBDT), a novel approach for fraud detection based on gradient boosting and neural networks. In order to combine the advantages of both conventional methods and deep learning, we first construct soft decision tree (SDT), a decision tree structured model with neural networks as its nodes, and then ensemble SDTs using the idea of gradient boosting. In this way we embed neural networks into gradient boosting to improve its representation learning capability and meanwhile maintain the interpretability. Furthermore, aiming at the rarity of detected fraud cases, in the model training phase we propose a compositional AUC maximization approach to deal with data imbalances at algorithm level. Extensive experiments on several real-life fraud detection datasets show that DBDT can significantly improve the performance and meanwhile maintain good interpretability. Our code is available at https://github.com/freshmanXB/DBDT.  ( 2 min )
    Plasticity Neural Network Based on Astrocytic effects at Critical Period, Synaptic Competition and Strength Rebalance by Current and Mnemonic Brain Plasticity and Synapse Formation. (arXiv:2203.11740v9 [cs.NE] UPDATED)
    In addition to the weights of synaptic shared connections, PNN includes weights of synaptic effective ranges [14-24]. PNN considers synaptic strength balance in dynamic of phagocytosing of synapses and static of constant sum of synapses length [14], and includes the lead behavior of the school of fish. Synapse formation will inhibit dendrites generation to a certain extent in experiments and PNN simulations [15]. The memory persistence gradient of retrograde circuit similar to the Enforcing Resilience in a Spring Boot. The relatively good and inferior gradient information stored in memory engram cells in synapse formation of retrograde circuit like the folds of the brain [16]. The controversy was claimed if human hippocampal neurogenesis persists throughout aging, PNN considered it may have a new and longer circuit in late iteration [17,18]. Closing the critical period will cause neurological disorder in experiments and PNN simulations [19]. Considering both negative and positive memories persistence help activate synapse length changes with iterations better than only considering positive memory [20]. Astrocytic phagocytosis will avoid the local accumulation of synapses by simulation, Lack of astrocytic phagocytosis causes excitatory synapses and functionally impaired synapses accumulate in experiments and lead to destruction of cognition, but local longer synapses and worse results in PNN simulations [21]. It gives relationship of intelligence and cortical thickness, individual differences in brain [22]. The PNN also considered the memory engram cells that strengthened Synaptic strength [23]. The effects of PNN's memory structure and tPBM may be the same for powerful penetrability of signals [24]. Memory persistence also inhibit local synaptic accumulation. By PNN, it may introduce the relatively good and inferior solution in PSO. The simple PNN only has the synaptic phagocytosis.  ( 3 min )
    Improved Dynamic Regret for Online Frank-Wolfe. (arXiv:2302.05620v1 [cs.LG])
    To deal with non-stationary online problems with complex constraints, we investigate the dynamic regret of online Frank-Wolfe (OFW), which is an efficient projection-free algorithm for online convex optimization. It is well-known that in the setting of offline optimization, the smoothness of functions and the strong convexity of functions accompanying specific properties of constraint sets can be utilized to achieve fast convergence rates for the Frank-Wolfe (FW) algorithm. However, for OFW, previous studies only establish a dynamic regret bound of $O(\sqrt{T}(1+V_T+\sqrt{D_T}))$ by utilizing the convexity of problems, where $T$ is the number of rounds, $V_T$ is the function variation, and $D_T$ is the gradient variation. In this paper, we derive improved dynamic regret bounds for OFW by extending the fast convergence rates of FW from offline optimization to online optimization. The key technique for this extension is to set the step size of OFW with a line search rule. In this way, we first show that the dynamic regret bound of OFW can be improved to $O(\sqrt{T(1+V_T)})$ for smooth functions. Second, we achieve a better dynamic regret bound of $O((1+V_T)^{2/3}T^{1/3})$ when functions are smooth and strongly convex, and the constraint set is strongly convex. Finally, for smooth and strongly convex functions with minimizers in the interior of the constraint set, we demonstrate that the dynamic regret of OFW reduces to $O(1+V_T)$, and can be further strengthened to $O(\min\{P_T^\ast,S_T^\ast,V_T\}+1)$ by performing a constant number of FW iterations per round, where $P_T^\ast$ and $S_T^\ast$ denote the path length and squared path length of minimizers, respectively.  ( 2 min )
    Compositional Exemplars for In-context Learning. (arXiv:2302.05698v1 [cs.CL])
    Large pretrained language models (LMs) have shown impressive In-Context Learning (ICL) ability, where the model learns to do an unseen task via a prompt consisting of input-output examples as the demonstration, without any parameter updates. The performance of ICL is highly dominated by the quality of the selected in-context examples. However, previous selection methods are mostly based on simple heuristics, leading to sub-optimal performance. In this work, we formulate in-context example selection as a subset selection problem. We propose CEIL(Compositional Exemplars for In-context Learning), which is instantiated by Determinantal Point Processes (DPPs) to model the interaction between the given input and in-context examples, and optimized through a carefully-designed contrastive learning objective to obtain preference from LMs. We validate CEIL on 12 classification and generation datasets from 7 distinct NLP tasks, including sentiment analysis, paraphrase detection, natural language inference, commonsense reasoning, open-domain question answering, code generation, and semantic parsing. Extensive experiments demonstrate not only the state-of-the-art performance but also the transferability and compositionality of CEIL, shedding new light on effective and efficient in-context learning. Our code is released at https://github.com/HKUNLP/icl-ceil.  ( 2 min )
    Synaptic Stripping: How Pruning Can Bring Dead Neurons Back To Life. (arXiv:2302.05818v1 [cs.LG])
    Rectified Linear Units (ReLU) are the default choice for activation functions in deep neural networks. While they demonstrate excellent empirical performance, ReLU activations can fall victim to the dead neuron problem. In these cases, the weights feeding into a neuron end up being pushed into a state where the neuron outputs zero for all inputs. Consequently, the gradient is also zero for all inputs, which means that the weights which feed into the neuron cannot update. The neuron is not able to recover from direct back propagation and model capacity is reduced as those parameters can no longer be further optimized. Inspired by a neurological process of the same name, we introduce Synaptic Stripping as a means to combat this dead neuron problem. By automatically removing problematic connections during training, we can regenerate dead neurons and significantly improve model capacity and parametric utilization. Synaptic Stripping is easy to implement and results in sparse networks that are more efficient than the dense networks they are derived from. We conduct several ablation studies to investigate these dynamics as a function of network width and depth and we conduct an exploration of Synaptic Stripping with Vision Transformers on a variety of benchmark datasets.  ( 2 min )
    On Testing and Comparing Fair classifiers under Data Bias. (arXiv:2302.05906v1 [cs.LG])
    In this paper, we consider a theoretical model for injecting data bias, namely, under-representation and label bias (Blum & Stangl, 2019). We theoretically and empirically study its effect on the accuracy and fairness of fair classifiers. Theoretically, we prove that the Bayes optimal group-aware fair classifier on the original data distribution can be recovered by simply minimizing a carefully chosen reweighed loss on the bias-injected distribution. Through extensive experiments on both synthetic and real-world datasets (e.g., Adult, German Credit, Bank Marketing, COMPAS), we empirically audit pre-, in-, and post-processing fair classifiers from standard fairness toolkits for their fairness and accuracy by injecting varying amounts of under-representation and label bias in their training data (but not the test data). Our main observations are: (1) The fairness and accuracy of many standard fair classifiers degrade severely as the bias injected in their training data increases, (2) A simple logistic regression model trained on the right data can often outperform, in both accuracy and fairness, most fair classifiers trained on biased training data, and (3) A few, simple fairness techniques (e.g., reweighing, exponentiated gradients) seem to offer stable accuracy and fairness guarantees even when their training data is injected with under-representation and label bias. Our experiments also show how to integrate a measure of data bias risk in the existing fairness dashboards for real-world deployments  ( 2 min )
    Maneuver Decision-Making For Autonomous Air Combat Through Curriculum Learning And Reinforcement Learning With Sparse Rewards. (arXiv:2302.05838v1 [cs.AI])
    Reinforcement learning is an effective way to solve the decision-making problems. It is a meaningful and valuable direction to investigate autonomous air combat maneuver decision-making method based on reinforcement learning. However, when using reinforcement learning to solve the decision-making problems with sparse rewards, such as air combat maneuver decision-making, it costs too much time for training and the performance of the trained agent may not be satisfactory. In order to solve these problems, the method based on curriculum learning is proposed. First, three curricula of air combat maneuver decision-making are designed: angle curriculum, distance curriculum and hybrid curriculum. These courses are used to train air combat agents respectively, and compared with the original method without any curriculum. The training results show that angle curriculum can increase the speed and stability of training, and improve the performance of the agent; distance curriculum can increase the speed and stability of agent training; hybrid curriculum has a negative impact on training, because it makes the agent get stuck at local optimum. The simulation results show that after training, the agent can handle the situations where targets come from different directions, and the maneuver decision results are consistent with the characteristics of missile.  ( 2 min )
    Out-of-distribution Generalization in the Presence of Nuisance-Induced Spurious Correlations. (arXiv:2107.00520v5 [cs.LG] UPDATED)
    In many prediction problems, spurious correlations are induced by a changing relationship between the label and a nuisance variable that is also correlated with the covariates. For example, in classifying animals in natural images, the background, which is a nuisance, can predict the type of animal. This nuisance-label relationship does not always hold, and the performance of a model trained under one such relationship may be poor on data with a different nuisance-label relationship. To build predictive models that perform well regardless of the nuisance-label relationship, we develop Nuisance-Randomized Distillation (NURD). We introduce the nuisance-randomized distribution, a distribution where the nuisance and the label are independent. Under this distribution, we define the set of representations such that conditioning on any member, the nuisance and the label remain independent. We prove that the representations in this set always perform better than chance, while representations outside of this set may not. NURD finds a representation from this set that is most informative of the label under the nuisance-randomized distribution, and we prove that this representation achieves the highest performance regardless of the nuisance-label relationship. We evaluate NURD on several tasks including chest X-ray classification where, using non-lung patches as the nuisance, NURD produces models that predict pneumonia under strong spurious correlations.  ( 2 min )
    Information-Directed Selection for Top-Two Algorithms. (arXiv:2205.12086v2 [stat.ML] UPDATED)
    We consider the best-k-arm identification problem for multi-armed bandits, where the objective is to select the exact set of k arms with the highest mean rewards by sequentially allocating measurement effort. We characterize the necessary and sufficient conditions for the optimal allocation using dual variables. Remarkably these optimality conditions lead to the extension of top-two algorithm design principle (Russo, 2020), initially proposed for best-arm identification. Furthermore, our optimality conditions induce a simple and effective selection rule dubbed information-directed selection (IDS) that selects one of the top-two candidates based on a measure of information gain. As a theoretical guarantee, we prove that integrated with IDS, top-two Thompson sampling is (asymptotically) optimal for Gaussian best-arm identification, solving a glaring open problem in the pure exploration literature (Russo, 2020). As a by-product, we show that for k > 1, top-two algorithms cannot achieve optimality even with an oracle tuning parameter. Numerical experiments show the superior performance of the proposed top-two algorithms with IDS and considerable improvement compared with algorithms without adaptive selection.  ( 2 min )
    DIWIFT: Discovering Instance-wise Influential Features for Tabular Data. (arXiv:2207.02773v2 [cs.LG] UPDATED)
    Tabular data is one of the most common data storage formats behind many real-world web applications such as retail, banking, and e-commerce. The success of these web applications largely depends on the ability of the employed machine learning model to accurately distinguish influential features from all the predetermined features in tabular data. Intuitively, in practical business scenarios, different instances should correspond to different sets of influential features, and the set of influential features of the same instance may vary in different scenarios. However, most existing methods focus on global feature selection assuming that all instances have the same set of influential features, and few methods considering instance-wise feature selection ignore the variability of influential features in different scenarios. In this paper, we first introduce a new perspective based on the influence function for instance-wise feature selection, and give some corresponding theoretical insights, the core of which is to use the influence function as an indicator to measure the importance of an instance-wise feature. We then propose a new solution for discovering instance-wise influential features in tabular data (DIWIFT), where a self-attention network is used as a feature selection model and the value of the corresponding influence function is used as an optimization objective to guide the model. Benefiting from the advantage of the influence function, i.e., its computation does not depend on a specific architecture and can also take into account the data distribution in different scenarios, our DIWIFT has better flexibility and robustness. Finally, we conduct extensive experiments on both synthetic and real-world datasets to validate the effectiveness of our DIWIFT.  ( 2 min )
    SafeLight: A Reinforcement Learning Method toward Collision-free Traffic Signal Control. (arXiv:2211.10871v2 [cs.LG] UPDATED)
    Traffic signal control is safety-critical for our daily life. Roughly one-quarter of road accidents in the U.S. happen at intersections due to problematic signal timing, urging the development of safety-oriented intersection control. However, existing studies on adaptive traffic signal control using reinforcement learning technologies have focused mainly on minimizing traffic delay but neglecting the potential exposure to unsafe conditions. We, for the first time, incorporate road safety standards as enforcement to ensure the safety of existing reinforcement learning methods, aiming toward operating intersections with zero collisions. We have proposed a safety-enhanced residual reinforcement learning method (SafeLight) and employed multiple optimization techniques, such as multi-objective loss function and reward shaping for better knowledge integration. Extensive experiments are conducted using both synthetic and real-world benchmark datasets. Results show that our method can significantly reduce collisions while increasing traffic mobility.  ( 2 min )
    Collaboration-Aware Graph Convolutional Network for Recommender Systems. (arXiv:2207.06221v3 [cs.IR] UPDATED)
    Graph Neural Networks (GNNs) have been successfully adopted in recommender systems by virtue of the message-passing that implicitly captures collaborative effect. Nevertheless, most of the existing message-passing mechanisms for recommendation are directly inherited from GNNs without scrutinizing whether the captured collaborative effect would benefit the prediction of user preferences. In this paper, we first analyze how message-passing captures the collaborative effect and propose a recommendation-oriented topological metric, Common Interacted Ratio (CIR), which measures the level of interaction between a specific neighbor of a node with the rest of its neighbors. After demonstrating the benefits of leveraging collaborations from neighbors with higher CIR, we propose a recommendation-tailored GNN, Collaboration-Aware Graph Convolutional Network (CAGCN), that goes beyond 1-Weisfeiler-Lehman(1-WL) test in distinguishing non-bipartite-subgraph-isomorphic graphs. Experiments on six benchmark datasets show that the best CAGCN variant outperforms the most representative GNN-based recommendation model, LightGCN, by nearly 10\% in Recall@20 and also achieves around 80\% speedup. Our code is publicly available at https://github.com/YuWVandy/CAGCN.  ( 2 min )
    Vector Quantized Wasserstein Auto-Encoder. (arXiv:2302.05917v1 [cs.LG])
    Learning deep discrete latent presentations offers a promise of better symbolic and summarized abstractions that are more useful to subsequent downstream tasks. Inspired by the seminal Vector Quantized Variational Auto-Encoder (VQ-VAE), most of work in learning deep discrete representations has mainly focused on improving the original VQ-VAE form and none of them has studied learning deep discrete representations from the generative viewpoint. In this work, we study learning deep discrete representations from the generative viewpoint. Specifically, we endow discrete distributions over sequences of codewords and learn a deterministic decoder that transports the distribution over the sequences of codewords to the data distribution via minimizing a WS distance between them. We develop further theories to connect it with the clustering viewpoint of WS distance, allowing us to have a better and more controllable clustering solution. Finally, we empirically evaluate our method on several well-known benchmarks, where it achieves better qualitative and quantitative performances than the other VQ-VAE variants in terms of the codebook utilization and image reconstruction/generation.  ( 2 min )
    A Reparameterized Discrete Diffusion Model for Text Generation. (arXiv:2302.05737v1 [cs.CL])
    This work studies discrete diffusion probabilistic models with applications to natural language generation. We derive an alternative yet equivalent formulation of the sampling from discrete diffusion processes and leverage this insight to develop a family of reparameterized discrete diffusion models. The derived generic framework is highly flexible, offers a fresh perspective of the generation process in discrete diffusion models, and features more effective training and decoding techniques. We conduct extensive experiments to evaluate the text generation capability of our model, demonstrating significant improvements over existing diffusion models.  ( 2 min )
    Regret Guarantees for Adversarial Online Collaborative Filtering. (arXiv:2302.05765v1 [cs.LG])
    We investigate the problem of online collaborative filtering under no-repetition constraints, whereby users need to be served content in an online fashion and a given user cannot be recommended the same content item more than once. We design and analyze a fully adaptive algorithm that works under biclustering assumptions on the user-item preference matrix, and show that this algorithm exhibits an optimal regret guarantee, while being oblivious to any prior knowledge about the sequence of users, the universe of items, as well as the biclustering parameters of the preference matrix. We further propose a more robust version of the algorithm which addresses the scenario when the preference matrix is adversarially perturbed. We then give regret guarantees that scale with the amount by which the preference matrix is perturbed from a biclustered structure. To our knowledge, these are the first results on online collaborative filtering that hold at this level of generality and adaptivity under no-repetition constraints.  ( 2 min )
    The NLP Task Effectiveness of Long-Range Transformers. (arXiv:2202.07856v2 [cs.CL] UPDATED)
    Transformer models cannot easily scale to long sequences due to their O(N^2) time and space complexity. This has led to Transformer variants seeking to lower computational complexity, such as Longformer and Performer. While such models have theoretically greater efficiency, their effectiveness on real NLP tasks has not been well studied. We benchmark 7 variants of Transformer models on 5 difficult NLP tasks and 7 datasets. We design experiments to isolate the effect of pretraining and hyperparameter settings, to focus on their capacity for long-range attention. Moreover, we present various methods to investigate attention behaviors to illuminate model details beyond metric scores. We find that the modified attention in long-range transformers has advantages on content selection and query-guided decoding, but they come with previously unrecognized drawbacks such as insufficient attention to distant tokens and accumulated approximation error.  ( 2 min )
    Informing clinical assessment by contextualizing post-hoc explanations of risk prediction models in type-2 diabetes. (arXiv:2302.05752v1 [cs.LG])
    Medical experts may use Artificial Intelligence (AI) systems with greater trust if these are supported by contextual explanations that let the practitioner connect system inferences to their context of use. However, their importance in improving model usage and understanding has not been extensively studied. Hence, we consider a comorbidity risk prediction scenario and focus on contexts regarding the patients clinical state, AI predictions about their risk of complications, and algorithmic explanations supporting the predictions. We explore how relevant information for such dimensions can be extracted from Medical guidelines to answer typical questions from clinical practitioners. We identify this as a question answering (QA) task and employ several state-of-the-art LLMs to present contexts around risk prediction model inferences and evaluate their acceptability. Finally, we study the benefits of contextual explanations by building an end-to-end AI pipeline including data cohorting, AI risk modeling, post-hoc model explanations, and prototyped a visual dashboard to present the combined insights from different context dimensions and data sources, while predicting and identifying the drivers of risk of Chronic Kidney Disease - a common type-2 diabetes comorbidity. All of these steps were performed in engagement with medical experts, including a final evaluation of the dashboard results by an expert medical panel. We show that LLMs, in particular BERT and SciBERT, can be readily deployed to extract some relevant explanations to support clinical usage. To understand the value-add of the contextual explanations, the expert panel evaluated these regarding actionable insights in the relevant clinical setting. Overall, our paper is one of the first end-to-end analyses identifying the feasibility and benefits of contextual explanations in a real-world clinical use case.  ( 3 min )
    A Comparison Study of Deep CNN Architecture in Detecting of Pneumonia. (arXiv:2212.14744v2 [eess.IV] UPDATED)
    Pneumonia, a respiratory infection brought on by bacteria or viruses, affects a large number of people, especially in developing and impoverished countries where high levels of pollution, unclean living conditions, and overcrowding are frequently observed, along with insufficient medical infrastructure. Pleural effusion, a condition in which fluids fill the lung and complicate breathing, is brought on by pneumonia. Early detection of pneumonia is essential for ensuring curative care and boosting survival rates. The approach most usually used to diagnose pneumonia is chest X-ray imaging. The purpose of this work is to develop a method for the automatic diagnosis of bacterial and viral pneumonia in digital x-ray pictures. This article first presents the authors' technique, and then gives a comprehensive report on recent developments in the field of reliable diagnosis of pneumonia. In this study, here tuned a state-of-the-art deep convolutional neural network to classify plant diseases based on images and tested its performance. Deep learning architecture is compared empirically. VGG19, ResNet with 152v2, Resnext101, Seresnet152, Mobilenettv2, and DenseNet with 201 layers are among the architectures tested. Experiment data consists of two groups, sick and healthy X-ray pictures. To take appropriate action against plant diseases as soon as possible, rapid disease identification models are preferred. DenseNet201 has shown no overfitting or performance degradation in our experiments, and its accuracy tends to increase as the number of epochs increases. Further, DenseNet201 achieves state-of-the-art performance with a significantly a smaller number of parameters and within a reasonable computing time. This architecture outperforms the competition in terms of testing accuracy, scoring 95%. Each architecture was trained using Keras, using Theano as the backend.  ( 3 min )
    Multi-dimensional discrimination in Law and Machine Learning -- A comparative overview. (arXiv:2302.05995v1 [cs.LG])
    AI-driven decision-making can lead to discrimination against certain individuals or social groups based on protected characteristics/attributes such as race, gender, or age. The domain of fairness-aware machine learning focuses on methods and algorithms for understanding, mitigating, and accounting for bias in AI/ML models. Still, thus far, the vast majority of the proposed methods assess fairness based on a single protected attribute, e.g. only gender or race. In reality, though, human identities are multi-dimensional, and discrimination can occur based on more than one protected characteristic, leading to the so-called ``multi-dimensional discrimination'' or ``multi-dimensional fairness'' problem. While well-elaborated in legal literature, the multi-dimensionality of discrimination is less explored in the machine learning community. Recent approaches in this direction mainly follow the so-called intersectional fairness definition from the legal domain, whereas other notions like additive and sequential discrimination are less studied or not considered thus far. In this work, we overview the different definitions of multi-dimensional discrimination/fairness in the legal domain as well as how they have been transferred/ operationalized (if) in the fairness-aware machine learning domain. By juxtaposing these two domains, we draw the connections, identify the limitations, and point out open research directions.  ( 2 min )
    Event-Triggered Time-Varying Bayesian Optimization. (arXiv:2208.10790v2 [cs.LG] UPDATED)
    We consider the problem of sequentially optimizing a time-varying objective function using time-varying Bayesian optimization (TVBO). Here, the key challenge is the exploration-exploitation trade-off under time variations. Current approaches to TVBO require prior knowledge of a constant rate of change. However, the rate of change is usually neither known nor constant. We propose an event-triggered algorithm, ET-GP-UCB, that treats the optimization problem as static until it detects changes in the objective function online and then resets the dataset. This allows the algorithm to adapt to realized temporal changes without the need for prior knowledge. The event-trigger is based on probabilistic uniform error bounds used in Gaussian process regression. We provide regret bounds for ET-GP-UCB and show in numerical experiments that it is competitive with state-of-the-art algorithms even though it requires no knowledge about the temporal changes. Further, ET-GP-UCB outperforms these baselines if the rate of change is misspecified, and we demonstrate that it is readily applicable to various settings without tuning hyperparameters.  ( 2 min )
    Interpretable Diversity Analysis: Visualizing Feature Representations In Low-Cost Ensembles. (arXiv:2302.05822v1 [cs.LG])
    Diversity is an important consideration in the construction of robust neural network ensembles. A collection of well trained models will generalize better if they are diverse in the patterns they respond to and the predictions they make. Diversity is especially important for low-cost ensemble methods because members often share network structure in order to avoid training several independent models from scratch. Diversity is traditionally analyzed by measuring differences between the outputs of models. However, this gives little insight into how knowledge representations differ between ensemble members. This paper introduces several interpretability methods that can be used to qualitatively analyze diversity. We demonstrate these techniques by comparing the diversity of feature representations between child networks using two low-cost ensemble algorithms, Snapshot Ensembles and Prune and Tune Ensembles. We use the same pre-trained parent network as a starting point for both methods which allows us to explore how feature representations evolve over time. This approach to diversity analysis can lead to valuable insights and new perspectives for how we measure and promote diversity in ensemble methods.  ( 2 min )
    A Characterization of Multioutput Learnability. (arXiv:2301.02729v2 [cs.LG] UPDATED)
    We consider the problem of learning multioutput function classes in batch and online settings. In both settings, we show that a multioutput function class is learnable if and only if each single-output restriction of the function class is learnable. This provides a complete characterization of the learnability of multilabel classification and multioutput regression in both batch and online settings. As an extension, we also consider multilabel learnability in the bandit feedback setting and show a similar characterization as in the full-feedback setting.  ( 2 min )
    Chaotic Hedging with Iterated Integrals and Neural Networks. (arXiv:2209.10166v2 [q-fin.MF] UPDATED)
    In this paper, we extend the Wiener-Ito chaos decomposition to the class of diffusion processes, whose drift and diffusion coefficient are of linear growth. By omitting the orthogonality in the chaos expansion, we are able to show that every $p$-integrable functional, for $p \in [1,\infty)$, can be represented as sum of iterated integrals of the underlying process. Using a truncated sum of this expansion and (possibly random) neural networks for the integrands, whose parameters are learned in a machine learning setting, we show that every financial derivative can be approximated arbitrarily well in the $L^p$-sense. Since the hedging strategy of the approximating option can be computed in closed form, we obtain an efficient algorithm that can replicate any integrable financial derivative with short runtime.  ( 2 min )
    Communication and Storage Efficient Federated Split Learning. (arXiv:2302.05599v1 [cs.IT])
    Federated learning (FL) is a popular distributed machine learning (ML) paradigm, but is often limited by significant communication costs and edge device computation capabilities. Federated Split Learning (FSL) preserves the parallel model training principle of FL, with a reduced device computation requirement thanks to splitting the ML model between the server and clients. However, FSL still incurs very high communication overhead due to transmitting the smashed data and gradients between the clients and the server in each global round. Furthermore, the server has to maintain separate models for every client, resulting in a significant computation and storage requirement that grows linearly with the number of clients. This paper tries to solve these two issues by proposing a communication and storage efficient federated and split learning (CSE-FSL) strategy, which utilizes an auxiliary network to locally update the client models while keeping only a single model at the server, hence avoiding the communication of gradients from the server and greatly reducing the server resource requirement. Communication cost is further reduced by only sending the smashed data in selected epochs from the clients. We provide a rigorous theoretical analysis of CSE-FSL that guarantees its convergence for non-convex loss functions. Extensive experimental results demonstrate that CSE-FSL has a significant communication reduction over existing FSL techniques while achieving state-of-the-art convergence and model accuracy, using several real-world FL tasks.  ( 2 min )
    Sequential Embedding-based Attentive (SEA) classifier for malware classification. (arXiv:2302.05728v1 [cs.CR])
    The tremendous growth in smart devices has uplifted several security threats. One of the most prominent threats is malicious software also known as malware. Malware has the capability of corrupting a device and collapsing an entire network. Therefore, its early detection and mitigation are extremely important to avoid catastrophic effects. In this work, we came up with a solution for malware detection using state-of-the-art natural language processing (NLP) techniques. Our main focus is to provide a lightweight yet effective classifier for malware detection which can be used for heterogeneous devices, be it a resource constraint device or a resourceful machine. Our proposed model is tested on the benchmark data set with an accuracy and log loss score of 99.13 percent and 0.04 respectively.  ( 2 min )
    Variants of SGD for Lipschitz Continuous Loss Functions in Low-Precision Environments. (arXiv:2211.04655v3 [math.OC] UPDATED)
    Motivated by neural network training in low-bit floating and fixed-point environments, this work studies the convergence of variants of SGD with computational error. Considering a general stochastic Lipschitz continuous loss function, a novel convergence result to a Clarke stationary point is presented assuming that only an approximation of its stochastic gradient can be computed as well as error in computing the SGD step itself. Different variants of SGD are then tested empirically in a variety of low-precision arithmetic environments, where improved test set accuracy is observed compared to SGD for two image recognition tasks.  ( 2 min )
    Direct Uncertainty Quantification. (arXiv:2302.02420v2 [cs.LG] UPDATED)
    Traditional neural networks are simple to train but they produce overconfident predictions, while Bayesian neural networks provide good uncertainty quantification but optimizing them is time consuming. This paper introduces a new approach, direct uncertainty quantification (DirectUQ), that combines their advantages where the neural network directly models uncertainty in output space, and captures both aleatoric and epistemic uncertainty. DirectUQ can be derived as an alternative variational lower bound, and hence benefits from collapsed variational inference that provides improved regularizers. On the other hand, like non-probabilistic models, DirectUQ enjoys simple training and one can use Rademacher complexity to provide risk bounds for the model. Experiments show that DirectUQ and ensembles of DirectUQ provide a good tradeoff in terms of run time and uncertainty quantification, especially for out of distribution data.  ( 2 min )
    Hierarchical Stochastic Block Model for Community Detection in Multiplex Networks. (arXiv:1904.05330v3 [cs.SI] UPDATED)
    Multiplex networks have become increasingly more prevalent in many fields, and have emerged as a powerful tool for modeling the complexity of real networks. There is a critical need for developing inference models for multiplex networks that can take into account potential dependencies across different layers, particularly when the aim is community detection. We add to a limited literature by proposing a novel and efficient Bayesian model for community detection in multiplex networks. A key feature of our approach is the ability to model varying communities at different network layers. In contrast, many existing models assume the same communities for all layers. Moreover, our model automatically picks up the necessary number of communities at each layer (as validated by real data examples). This is appealing, since deciding the number of communities is a challenging aspect of community detection, and especially so in the multiplex setting, if one allows the communities to change across layers. Borrowing ideas from hierarchical Bayesian modeling, we use a hierarchical Dirichlet prior to model community labels across layers, allowing dependency in their structure. Given the community labels, a stochastic block model (SBM) is assumed for each layer. We develop an efficient slice sampler for sampling the posterior distribution of the community labels as well as the link probabilities between communities. In doing so, we address some unique challenges posed by coupling the complex likelihood of SBM with the hierarchical nature of the prior on the labels. An extensive empirical validation is performed on simulated and real data, demonstrating the superior performance of the model over single-layer alternatives, as well as the ability to uncover interesting structures in real networks.  ( 3 min )
    Quantum Neuron Selection: Finding High Performing Subnetworks With Quantum Algorithms. (arXiv:2302.05984v1 [cs.LG])
    Gradient descent methods have long been the de facto standard for training deep neural networks. Millions of training samples are fed into models with billions of parameters, which are slowly updated over hundreds of epochs. Recently, it's been shown that large, randomly initialized neural networks contain subnetworks that perform as well as fully trained models. This insight offers a promising avenue for training future neural networks by simply pruning weights from large, random models. However, this problem is combinatorically hard and classical algorithms are not efficient at finding the best subnetwork. In this paper, we explore how quantum algorithms could be formulated and applied to this neuron selection problem. We introduce several methods for local quantum neuron selection that reduce the entanglement complexity that large scale neuron selection would require, making this problem more tractable for current quantum hardware.  ( 2 min )
    A Survey on Spectral Graph Neural Networks. (arXiv:2302.05631v1 [cs.LG])
    Graph neural networks (GNNs) have attracted considerable attention from the research community. It is well established that GNNs are usually roughly divided into spatial and spectral methods. Despite that spectral GNNs play an important role in both graph signal processing and graph representation learning, existing studies are biased toward spatial approaches, and there is no comprehensive review on spectral GNNs so far. In this paper, we summarize the recent development of spectral GNNs, including model, theory, and application. Specifically, we first discuss the connection between spatial GNNs and spectral GNNs, which shows that spectral GNNs can capture global information and have better expressiveness and interpretability. Next, we categorize existing spectral GNNs according to the spectrum information they use, \ie, eigenvalues or eigenvectors. In addition, we review major theoretical results and applications of spectral GNNs, followed by a quantitative experiment to benchmark some popular spectral GNNs. Finally, we conclude the paper with some future directions.  ( 2 min )
    Element-Wise Attention Layers: an option for optimization. (arXiv:2302.05488v1 [cs.LG])
    The use of Attention Layers has become a trend since the popularization of the Transformer-based models, being the key element for many state-of-the-art models that have been developed through recent years. However, one of the biggest obstacles in implementing these architectures - as well as many others in Deep Learning Field - is the enormous amount of optimizing parameters they possess, which make its use conditioned on the availability of robust hardware. In this paper, it's proposed a new method of attention mechanism that adapts the Dot-Product Attention, which uses matrices multiplications, to become element-wise through the use of arrays multiplications. To test the effectiveness of such approach, two models (one with a VGG-like architecture and one with the proposed method) have been trained in a classification task using Fashion MNIST and CIFAR10 datasets. Each model has been trained for 10 epochs in a single Tesla T4 GPU from Google Colaboratory. The results show that this mechanism allows for an accuracy of 92% of the VGG-like counterpart in Fashion MNIST dataset, while reducing the number of parameters in 97%. For CIFAR10, the accuracy is still equivalent to 60% of the VGG-like counterpart while using 50% less parameters.  ( 2 min )
    Long-Context Language Decision Transformers and Exponential Tilt for Interactive Text Environments. (arXiv:2302.05507v1 [cs.CL])
    Text-based game environments are challenging because agents must deal with long sequences of text, execute compositional actions using text and learn from sparse rewards. We address these challenges by proposing Long-Context Language Decision Transformers (LLDTs), a framework that is based on long transformer language models and decision transformers (DTs). LLDTs extend DTs with 3 components: (1) exponential tilt to guide the agent towards high obtainable goals, (2) novel goal conditioning methods yielding significantly better results than the traditional return-to-go (sum of all future rewards), and (3) a model of future observations. Our ablation results show that predicting future observations improves agent performance. To the best of our knowledge, LLDTs are the first to address offline RL with DTs on these challenging games. Our experiments show that LLDTs achieve the highest scores among many different types of agents on some of the most challenging Jericho games, such as Enchanter.  ( 2 min )
    Pruning Deep Neural Networks from a Sparsity Perspective. (arXiv:2302.05601v1 [cs.LG])
    In recent years, deep network pruning has attracted significant attention in order to enable the rapid deployment of AI into small devices with computation and memory constraints. Pruning is often achieved by dropping redundant weights, neurons, or layers of a deep network while attempting to retain a comparable test performance. Many deep pruning algorithms have been proposed with impressive empirical success. However, existing approaches lack a quantifiable measure to estimate the compressibility of a sub-network during each pruning iteration and thus may under-prune or over-prune the model. In this work, we propose PQ Index (PQI) to measure the potential compressibility of deep neural networks and use this to develop a Sparsity-informed Adaptive Pruning (SAP) algorithm. Our extensive experiments corroborate the hypothesis that for a generic pruning procedure, PQI decreases first when a large model is being effectively regularized and then increases when its compressibility reaches a limit that appears to correspond to the beginning of underfitting. Subsequently, PQI decreases again when the model collapse and significant deterioration in the performance of the model start to occur. Additionally, our experiments demonstrate that the proposed adaptive pruning algorithm with proper choice of hyper-parameters is superior to the iterative pruning algorithms such as the lottery ticket-based pruning methods, in terms of both compression efficiency and robustness.  ( 2 min )
    On the equivalence between graph isomorphism testing and function approximation with GNNs. (arXiv:1905.12560v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have achieved much success on graph-structured data. In light of this, there have been increasing interests in studying their expressive power. One line of work studies the capability of GNNs to approximate permutation-invariant functions on graphs, and another focuses on the their power as tests for graph isomorphism. Our work connects these two perspectives and proves their equivalence. We further develop a framework of the expressive power of GNNs that incorporates both of these viewpoints using the language of sigma-algebra, through which we compare the expressive power of different types of GNNs together with other graph isomorphism tests. In particular, we prove that the second-order Invariant Graph Network fails to distinguish non-isomorphic regular graphs with the same degree. Then, we extend it to a new architecture, Ring-GNN, which succeeds in distinguishing these graphs and achieves good performances on real-world datasets.  ( 2 min )
    Novel techniques for improving NNetEn entropy calculation for short and noisy time series. (arXiv:2202.12703v2 [cs.LG] UPDATED)
    Entropy is a fundamental concept in the field of information theory. During measurement, conventional entropy measures are susceptible to length and amplitude changes in time series. A new entropy metric, neural network entropy (NNetEn), has been developed to overcome these limitations. NNetEn entropy is computed using a modified LogNNet neural network classification model. The algorithm contains a reservoir matrix of N=19625 elements that must be filled with the given data. The contribution of this paper is threefold. Firstly, this work investigates different methods of filling the reservoir with time series (signal) elements. The reservoir filling method determines the accuracy of the entropy estimation by convolution of the study time series and LogNNet test data. The present study proposes 6 methods for filling the reservoir for time series. Two of them (Method 3 and Method 6) employ the novel approach of stretching the time series to create intermediate elements that complement it, but do not change its dynamics. The most reliable methods for short time series are Method 3 and Method 5. The second part of the study examines the influence of noise and constant bias on entropy values. Our study examines three different time series data types (chaotic, periodic, and binary) with different dynamic properties, Signal to Noise Ratio (SNR), and offsets. The NNetEn entropy calculation errors are less than 10% when SNR is greater than 30 dB, and entropy decreases with an increase in the bias component. The third part of the article analyzes real-time biosignal EEG data collected from emotion recognition experiments. The NNetEn measures show robustness under low-amplitude noise using various filters. Thus, NNetEn measures entropy effectively when applied to real-world environments with ambient noise, white noise, and 1/f noise.  ( 3 min )
    LIMEtree: Consistent and Faithful Surrogate Explanations of Multiple Classes. (arXiv:2005.01427v2 [cs.LG] UPDATED)
    Explainable machine learning provides tools to better understand predictive models and their decisions, but many such methods are limited to producing insights with respect to a single class. When generating explanations for several classes, reasoning over them to obtain a complete view may be difficult since they can present competing or contradictory evidence. To address this issue we introduce a novel paradigm of multi-class explanations. We outline the theory behind such techniques and propose a local surrogate model based on multi-output regression trees -- called LIMEtree -- which offers faithful and consistent explanations of multiple classes for individual predictions while being post-hoc, model-agnostic and data-universal. In addition to strong fidelity guarantees, our implementation supports (interactive) customisation of the explanatory insights and delivers a range of diverse explanation types, including counterfactual statements favoured in the literature. We evaluate our algorithm with a collection of quantitative experiments, a qualitative analysis based on explainability desiderata and a preliminary user study on an image classification task, comparing it to LIME. Our contributions demonstrate the benefits of multi-class explanations and wide-ranging advantages of our method across a diverse set scenarios.  ( 2 min )
    On Proper Learnability between Average- and Worst-case Robustness. (arXiv:2211.05656v4 [cs.LG] UPDATED)
    Recently, \cite{montasser2019vc} showed that finite VC dimension is not sufficient for \textit{proper} adversarially robust PAC learning. In light of this hardness result, there is a growing effort to study what type of relaxations to the adversarially robust PAC learning setup can enable proper learnability. In this work, we initiate the study of proper learning under relaxations of the worst-case robust loss. We give a family of robust loss relaxations under which VC classes are properly PAC learning with sample complexity close to what one would require in the standard PAC learning setup. On the other hand, we show that for an existing and natural relaxation of the worst-case robust loss, finite VC dimension is not sufficient for proper learning. Lastly, we give new generalization guarantees for the adversarially robust empirical risk minimizer.  ( 2 min )
    NASRec: Weight Sharing Neural Architecture Search for Recommender Systems. (arXiv:2207.07187v2 [cs.IR] UPDATED)
    The rise of deep neural networks offers new opportunities in optimizing recommender systems. However, optimizing recommender systems using deep neural networks requires delicate architecture fabrication. We propose NASRec, a paradigm that trains a single supernet and efficiently produces abundant models/sub-architectures by weight sharing. To overcome the data multi-modality and architecture heterogeneity challenges in the recommendation domain, NASRec establishes a large supernet (i.e., search space) to search the full architectures. The supernet incorporates versatile choice of operators and dense connectivity to minimize human efforts for finding priors. The scale and heterogeneity in NASRec impose several challenges, such as training inefficiency, operator-imbalance, and degraded rank correlation. We tackle these challenges by proposing single-operator any-connection sampling, operator-balancing interaction modules, and post-training fine-tuning. Our crafted models, NASRecNet, show promising results on three Click-Through Rates (CTR) prediction benchmarks, indicating that NASRec outperforms both manually designed models and existing NAS methods with state-of-the-art performance. Our work is publicly available at https://github.com/facebookresearch/NasRec.  ( 2 min )
    Structure-aware Protein Self-supervised Learning. (arXiv:2204.04213v3 [cs.LG] UPDATED)
    Protein representation learning methods have shown great potential to yield useful representation for many downstream tasks, especially on protein classification. Moreover, a few recent studies have shown great promise in addressing insufficient labels of proteins with self-supervised learning methods. However, existing protein language models are usually pretrained on protein sequences without considering the important protein structural information. To this end, we propose a novel structure-aware protein self-supervised learning method to effectively capture structural information of proteins. In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information with self-supervised tasks from a pairwise residue distance perspective and a dihedral angle perspective, respectively. Furthermore, we propose to leverage the available protein language model pretrained on protein sequences to enhance the self-supervised learning. Specifically, we identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme. Experiments on several supervised downstream tasks verify the effectiveness of our proposed method.  ( 2 min )
    Jaccard Metric Losses: Optimizing the Jaccard Index with Soft Labels. (arXiv:2302.05666v1 [cs.CV])
    IoU losses are surrogates that directly optimize the Jaccard index. In semantic segmentation, IoU losses are shown to perform better with respect to the Jaccard index measure than pixel-wise losses such as the cross-entropy loss. The most notable IoU losses are the soft Jaccard loss and the Lovasz-Softmax loss. However, these losses are incompatible with soft labels which are ubiquitous in machine learning. In this paper, we propose Jaccard metric losses (JMLs), which are variants of the soft Jaccard loss, and are compatible with soft labels. With JMLs, we study two of the most popular use cases of soft labels: label smoothing and knowledge distillation. With a variety of architectures, our experiments show significant improvements over the cross-entropy loss on three semantic segmentation datasets (Cityscapes, PASCAL VOC and DeepGlobe Land), and our simple approach outperforms state-of-the-art knowledge distillation methods by a large margin. Our source code is available at: \href{https://github.com/zifuwanggg/JDML}{https://github.com/zifuwanggg/JDML}.  ( 2 min )
    Generating Counterfactual Hard Negative Samples for Graph Contrastive Learning. (arXiv:2207.00148v2 [cs.LG] UPDATED)
    Graph contrastive learning has emerged as a powerful tool for unsupervised graph representation learning. The key to the success of graph contrastive learning is to acquire high-quality positive and negative samples as contrasting pairs for the purpose of learning underlying structural semantics of the input graph. Recent works usually sample negative samples from the same training batch with the positive samples, or from an external irrelevant graph. However, a significant limitation lies in such strategies, which is the unavoidable problem of sampling false negative samples. In this paper, we propose a novel method to utilize \textbf{C}ounterfactual mechanism to generate artificial hard negative samples for \textbf{G}raph \textbf{C}ontrastive learning, namely \textbf{CGC}, which has a different perspective compared to those sampling-based strategies. We utilize counterfactual mechanism to produce hard negative samples, which ensures that the generated samples are similar to, but have labels that different from the positive sample. The proposed method achieves satisfying results on several datasets compared to some traditional unsupervised graph learning methods and some SOTA graph contrastive learning methods. We also conduct some supplementary experiments to give an extensive illustration of the proposed method, including the performances of CGC with different hard negative samples and evaluations for hard negative samples generated with different similarity measurements.  ( 2 min )
    De-Biasing Generative Models using Counterfactual Methods. (arXiv:2207.01575v3 [cs.LG] UPDATED)
    Variational autoencoders (VAEs) and other generative methods have garnered growing interest not just for their generative properties but also for the ability to dis-entangle a low-dimensional latent variable space. However, few existing generative models take causality into account. We propose a new decoder based framework named the Causal Counterfactual Generative Model (CCGM), which includes a partially trainable causal layer in which a part of a causal model can be learned without significantly impacting reconstruction fidelity. By learning the causal relationships between image semantic labels or tabular variables, we can analyze biases, intervene on the generative model, and simulate new scenarios. Furthermore, by modifying the causal structure, we can generate samples outside the domain of the original training data and use such counterfactual models to de-bias datasets. Thus, datasets with known biases can still be used to train the causal generative model and learn the causal relationships, but we can produce de-biased datasets on the generative side. Our proposed method combines a causal latent space VAE model with specific modification to emphasize causal fidelity, enabling finer control over the causal layer and the ability to learn a robust intervention framework. We explore how better disentanglement of causal learning and encoding/decoding generates higher causal intervention quality. We also compare our model against similar research to demonstrate the need for explicit generative de-biasing beyond interventions. Our initial experiments show that our model can generate images and tabular data with high fidelity to the causal framework and accommodate explicit de-biasing to ignore undesired relationships in the causal data compared to the baseline.  ( 2 min )
    Which Invariance Should We Transfer? A Causal Minimax Learning Approach. (arXiv:2107.01876v3 [stat.ML] UPDATED)
    A major barrier to deploying current machine learning models lies in their non-reliability to dataset shifts. To resolve this problem, most existing studies attempted to transfer stable information to unseen environments. Particularly, independent causal mechanisms-based methods proposed to remove mutable causal mechanisms via the do-operator. Compared to previous methods, the obtained stable predictors are more effective in identifying stable information. However, a key question remains: which subset of this whole stable information should the model transfer, in order to achieve optimal generalization ability? To answer this question, we present a comprehensive minimax analysis from a causal perspective. Specifically, we first provide a graphical condition for the whole stable set to be optimal. When this condition fails, we surprisingly find with an example that this whole stable set, although can fully exploit stable information, is not the optimal one to transfer. To identify the optimal subset under this case, we propose to estimate the worst-case risk with a novel optimization scheme over the intervention functions on mutable causal mechanisms. We then propose an efficient algorithm to search for the subset with minimal worst-case risk, based on a newly defined equivalence relation between stable subsets. Compared to the exponential cost of exhaustively searching over all subsets, our searching strategy enjoys a polynomial complexity. The effectiveness and efficiency of our methods are demonstrated on synthetic data and the diagnosis of Alzheimer's disease.  ( 2 min )
    Locating disparities in machine learning. (arXiv:2208.06680v2 [cs.LG] UPDATED)
    Machine learning was repeatedly proven to provide predictions with disparate outcomes, in which subgroups of the population (e.g., defined by age, gender, or other sensitive attributes) are systematically disadvantaged. Previous literature has focused on detecting such disparities through statistical procedures for when the sensitive attribute is specified a priori. However, this limits applicability in real-world settings where datasets are high dimensional and, on top of that, sensitive attributes may be unknown. As a remedy, we propose a data-driven framework called Automatic Location of Disparities (ALD) which aims at locating disparities in machine learning. ALD meets several demands from machine learning practice: ALD (1) is applicable to arbitrary machine learning classifiers; (2) operates on different definitions of disparities (e.g., statistical parity or equalized odds); (3) deals with both categorical and continuous predictors; (4) is suitable to handle high-dimensional settings; and (5) even identifies disparities due to intersectionality where disparities arise from complex and multi-way interactions (e.g., age above 60 and female). ALD produces interpretable fairness reports as output. We demonstrate the effectiveness of ALD based on both synthetic and real-world datasets. As a result, ALD helps practitioners and researchers of algorithmic fairness to detect disparities in machine learning algorithms, so that disparate -- or even unfair -- outcomes can be mitigated. Moreover, ALD supports practitioners in conducting algorithmic audits and protecting individuals from discrimination.  ( 2 min )
    Scaling Laws for a Multi-Agent Reinforcement Learning Model. (arXiv:2210.00849v2 [cs.LG] UPDATED)
    The recent observation of neural power-law scaling relations has made a significant impact in the field of deep learning. A substantial amount of attention has been dedicated as a consequence to the description of scaling laws, although mostly for supervised learning and only to a reduced extent for reinforcement learning frameworks. In this paper we present an extensive study of performance scaling for a cornerstone reinforcement learning algorithm, AlphaZero. On the basis of a relationship between Elo rating, playing strength and power-law scaling, we train AlphaZero agents on the games Connect Four and Pentago and analyze their performance. We find that player strength scales as a power law in neural network parameter count when not bottlenecked by available compute, and as a power of compute when training optimally sized agents. We observe nearly identical scaling exponents for both games. Combining the two observed scaling laws we obtain a power law relating optimal size to compute similar to the ones observed for language models. We find that the predicted scaling of optimal neural network size fits our data for both games. This scaling law implies that previously published state-of-the-art game-playing models are significantly smaller than their optimal size, given the respective compute budgets. We also show that large AlphaZero models are more sample efficient, performing better than smaller models with the same amount of training data.  ( 2 min )
    Transfer Learning for Bayesian Optimization: A Survey. (arXiv:2302.05927v1 [cs.LG])
    A wide spectrum of design and decision problems, including parameter tuning, A/B testing and drug design, intrinsically are instances of black-box optimization. Bayesian optimization (BO) is a powerful tool that models and optimizes such expensive "black-box" functions. However, at the beginning of optimization, vanilla Bayesian optimization methods often suffer from slow convergence issue due to inaccurate modeling based on few trials. To address this issue, researchers in the BO community propose to incorporate the spirit of transfer learning to accelerate optimization process, which could borrow strength from the past tasks (source tasks) to accelerate the current optimization problem (target task). This survey paper first summarizes transfer learning methods for Bayesian optimization from four perspectives: initial points design, search space design, surrogate model, and acquisition function. Then it highlights its methodological aspects and technical details for each approach. Finally, it showcases a wide range of applications and proposes promising future directions.  ( 2 min )
    I$^2$SB: Image-to-Image Schr\"odinger Bridge. (arXiv:2302.05872v1 [cs.CV])
    We propose Image-to-Image Schr\"odinger Bridge (I$^2$SB), a new class of conditional diffusion models that directly learn the nonlinear diffusion processes between two given distributions. These diffusion bridges are particularly useful for image restoration, as the degraded images are structurally informative priors for reconstructing the clean images. I$^2$SB belongs to a tractable class of Schr\"odinger bridge, the nonlinear extension to score-based models, whose marginal distributions can be computed analytically given boundary pairs. This results in a simulation-free framework for nonlinear diffusions, where the I$^2$SB training becomes scalable by adopting practical techniques used in standard diffusion models. We validate I$^2$SB in solving various image restoration tasks, including inpainting, super-resolution, deblurring, and JPEG restoration on ImageNet 256x256 and show that I$^2$SB surpasses standard conditional diffusion models with more interpretable generative processes. Moreover, I$^2$SB matches the performance of inverse methods that additionally require the knowledge of the corruption operators. Our work opens up new algorithmic opportunities for developing efficient nonlinear diffusion models on a large scale. scale. Project page: https://i2sb.github.io/  ( 2 min )
    Cyclic and Randomized Stepsizes Invoke Heavier Tails in SGD. (arXiv:2302.05516v1 [stat.ML])
    Cyclic and randomized stepsizes are widely used in the deep learning practice and can often outperform standard stepsize choices such as constant stepsize in SGD. Despite their empirical success, not much is currently known about when and why they can theoretically improve the generalization performance. We consider a general class of Markovian stepsizes for learning, which contain i.i.d. random stepsize, cyclic stepsize as well as the constant stepsize as special cases, and motivated by the literature which shows that heaviness of the tails (measured by the so-called "tail-index") in the SGD iterates is correlated with generalization, we study tail-index and provide a number of theoretical results that demonstrate how the tail-index varies on the stepsize scheduling. Our results bring a new understanding of the benefits of cyclic and randomized stepsizes compared to constant stepsize in terms of the tail behavior. We illustrate our theory on linear regression experiments and show through deep learning experiments that Markovian stepsizes can achieve even a heavier tail and be a viable alternative to cyclic and i.i.d. randomized stepsize rules.  ( 2 min )
    A High-dimensional Convergence Theorem for U-statistics with Applications to Kernel-based Testing. (arXiv:2302.05686v1 [math.ST])
    We prove a convergence theorem for U-statistics of degree two, where the data dimension $d$ is allowed to scale with sample size $n$. We find that the limiting distribution of a U-statistic undergoes a phase transition from the non-degenerate Gaussian limit to the degenerate limit, regardless of its degeneracy and depending only on a moment ratio. A surprising consequence is that a non-degenerate U-statistic in high dimensions can have a non-Gaussian limit with a larger variance and asymmetric distribution. Our bounds are valid for any finite $n$ and $d$, independent of individual eigenvalues of the underlying function, and dimension-independent under a mild assumption. As an application, we apply our theory to two popular kernel-based distribution tests, MMD and KSD, whose high-dimensional performance has been challenging to study. In a simple empirical setting, our results correctly predict how the test power at a fixed threshold scales with $d$ and the bandwidth.  ( 2 min )
    Verifying Generalization in Deep Learning. (arXiv:2302.05745v1 [cs.LG])
    Deep neural networks (DNNs) are the workhorses of deep learning, which constitutes the state of the art in numerous application domains. However, DNN-based decision rules are notoriously prone to poor generalization, i.e., may prove inadequate on inputs not encountered during training. This limitation poses a significant obstacle to employing deep learning for mission-critical tasks, and also in real-world environments that exhibit high variability. We propose a novel, verification-driven methodology for identifying DNN-based decision rules that generalize well to new input domains. Our approach quantifies generalization to an input domain by the extent to which decisions reached by independently trained DNNs are in agreement for inputs in this domain. We show how, by harnessing the power of DNN verification, our approach can be efficiently and effectively realized. We evaluate our verification-based approach on three deep reinforcement learning (DRL) benchmarks, including a system for real-world Internet congestion control. Our results establish the usefulness of our approach, and, in particular, its superiority over gradient-based methods. More broadly, our work puts forth a novel objective for formal verification, with the potential for mitigating the risks associated with deploying DNN-based systems in the wild.  ( 2 min )
    NephroNet: A Novel Program for Identifying Renal Cell Carcinoma and Generating Synthetic Training Images with Convolutional Neural Networks and Diffusion Models. (arXiv:2302.05830v1 [eess.IV])
    Renal cell carcinoma (RCC) is a type of cancer that originates in the kidneys and is the most common type of kidney cancer in adults. It can be classified into several subtypes, including clear cell RCC, papillary RCC, and chromophobe RCC. In this study, an artificial intelligence model was developed and trained for classifying different subtypes of RCC using ResNet-18, a convolutional neural network that has been widely used for image classification tasks. The model was trained on a dataset of RCC histopathology images, which consisted of digital images of RCC surgical resection slides that were annotated with the corresponding subtype labels. The performance of the trained model was evaluated using several metrics, including accuracy, precision, and recall. Additionally, in this research, a novel synthetic image generation tool, NephroNet, is developed on diffusion models that are used to generate original images of RCC surgical resection slides. Diffusion models are a class of generative models capable of synthesizing high-quality images from noise. Several diffusers such as Stable Diffusion, Dreambooth Text-to-Image, and Textual Inversion were trained on a dataset of RCC images and were used to generate a series of original images that resembled RCC surgical resection slides, all within the span of fewer than four seconds. The generated images were visually realistic and could be used for creating new training datasets, testing the performance of image analysis algorithms, and training medical professionals. NephroNet is provided as an open-source software package and contains files for data preprocessing, training, and visualization. Overall, this study demonstrates the potential of artificial intelligence and diffusion models for classifying and generating RCC images, respectively. These methods could be useful for improving the diagnosis and treatment of RCC and more.  ( 3 min )
    ConCerNet: A Contrastive Learning Based Framework for Automated Conservation Law Discovery and Trustworthy Dynamical System Prediction. (arXiv:2302.05783v1 [cs.LG])
    Deep neural networks (DNN) have shown great capacity of modeling a dynamical system; nevertheless, they usually do not obey physics constraints such as conservation laws. This paper proposes a new learning framework named ConCerNet to improve the trustworthiness of the DNN based dynamics modeling to endow the invariant properties. ConCerNet consists of two steps: (i) a contrastive learning method to automatically capture the system invariants (i.e. conservation properties) along the trajectory observations; (ii) a neural projection layer to guarantee that the learned dynamics models preserve the learned invariants. We theoretically prove the functional relationship between the learned latent representation and the unknown system invariant function. Experiments show that our method consistently outperforms the baseline neural networks in both coordinate error and conservation metrics by a large margin. With neural network based parameterization and no dependence on prior knowledge, our method can be extended to complex and large-scale dynamics by leveraging an autoencoder.  ( 2 min )
    Neural Architecture Search with Multimodal Fusion Methods for Diagnosing Dementia. (arXiv:2302.05894v1 [cs.LG])
    Alzheimer's dementia (AD) affects memory, thinking, and language, deteriorating person's life. An early diagnosis is very important as it enables the person to receive medical help and ensure quality of life. Therefore, leveraging spontaneous speech in conjunction with machine learning methods for recognizing AD patients has emerged into a hot topic. Most of the previous works employ Convolutional Neural Networks (CNNs), to process the input signal. However, finding a CNN architecture is a time-consuming process and requires domain expertise. Moreover, the researchers introduce early and late fusion approaches for fusing different modalities or concatenate the representations of the different modalities during training, thus the inter-modal interactions are not captured. To tackle these limitations, first we exploit a Neural Architecture Search (NAS) method to automatically find a high performing CNN architecture. Next, we exploit several fusion methods, including Multimodal Factorized Bilinear Pooling and Tucker Decomposition, to combine both speech and text modalities. To the best of our knowledge, there is no prior work exploiting a NAS approach and these fusion methods in the task of dementia detection from spontaneous speech. We perform extensive experiments on the ADReSS Challenge dataset and show the effectiveness of our approach over state-of-the-art methods.  ( 2 min )
    Sequential Underspecified Instrument Selection for Cause-Effect Estimation. (arXiv:2302.05684v1 [stat.ME])
    Instrumental variable (IV) methods are used to estimate causal effects in settings with unobserved confounding, where we cannot directly experiment on the treatment variable. Instruments are variables which only affect the outcome indirectly via the treatment variable(s). Most IV applications focus on low-dimensional treatments and crucially require at least as many instruments as treatments. This assumption is restrictive: in the natural sciences we often seek to infer causal effects of high-dimensional treatments (e.g., the effect of gene expressions or microbiota on health and disease), but can only run few experiments with a limited number of instruments (e.g., drugs or antibiotics). In such underspecified problems, the full treatment effect is not identifiable in a single experiment even in the linear case. We show that one can still reliably recover the projection of the treatment effect onto the instrumented subspace and develop techniques to consistently combine such partial estimates from different sets of instruments. We then leverage our combined estimators in an algorithm that iteratively proposes the most informative instruments at each round of experimentation to maximize the overall information about the full causal effect.  ( 2 min )
    Distributional GFlowNets with Quantile Flows. (arXiv:2302.05793v1 [cs.LG])
    Generative Flow Networks (GFlowNets) are a new family of probabilistic samplers where an agent learns a stochastic policy for generating complex combinatorial structure through a series of decision-making steps. Despite being inspired from reinforcement learning, the current GFlowNet framework is relatively limited in its applicability and cannot handle stochasticity in the reward function. In this work, we adopt a distributional paradigm for GFlowNets, turning each flow function into a distribution, thus providing more informative learning signals during training. By parameterizing each edge flow through their quantile functions, our proposed \textit{quantile matching} GFlowNet learning algorithm is able to learn a risk-sensitive policy, an essential component for handling scenarios with risk uncertainty. Moreover, we find that the distributional approach can achieve substantial improvement on existing benchmarks compared to prior methods due to our enhanced training algorithm, even in settings with deterministic rewards.  ( 2 min )
    A Policy Gradient Framework for Stochastic Optimal Control Problems with Global Convergence Guarantee. (arXiv:2302.05816v1 [math.OC])
    In this work, we consider the stochastic optimal control problem in continuous time and a policy gradient method to solve it. In particular, we study the gradient flow for the control, viewed as a continuous time limit of the policy gradient. We prove the global convergence of the gradient flow and establish a convergence rate under some regularity assumptions. The main novelty in the analysis is the notion of local optimal control function, which is introduced to compare the local optimality of the iterate.  ( 2 min )
    Variational Voxel Pseudo Image Tracking. (arXiv:2302.05914v1 [cs.CV])
    Uncertainty estimation is an important task for critical problems, such as robotics and autonomous driving, because it allows creating statistically better perception models and signaling the model's certainty in its predictions to the decision method or a human supervisor. In this paper, we propose a Variational Neural Network-based version of a Voxel Pseudo Image Tracking (VPIT) method for 3D Single Object Tracking. The Variational Feature Generation Network of the proposed Variational VPIT computes features for target and search regions and the corresponding uncertainties, which are later combined using an uncertainty-aware cross-correlation module in one of two ways: by computing similarity between the corresponding uncertainties and adding it to the regular cross-correlation values, or by penalizing the uncertain feature channels to increase influence of the certain features. In experiments, we show that both methods improve tracking performance, while penalization of uncertain features provides the best uncertainty quality.  ( 2 min )
    Operation-level Progressive Differentiable Architecture Search. (arXiv:2302.05632v1 [cs.CV])
    Differentiable Neural Architecture Search (DARTS) is becoming more and more popular among Neural Architecture Search (NAS) methods because of its high search efficiency and low compute cost. However, the stability of DARTS is very inferior, especially skip connections aggregation that leads to performance collapse. Though existing methods leverage Hessian eigenvalues to alleviate skip connections aggregation, they make DARTS unable to explore architectures with better performance. In the paper, we propose operation-level progressive differentiable neural architecture search (OPP-DARTS) to avoid skip connections aggregation and explore better architectures simultaneously. We first divide the search process into several stages during the search phase and increase candidate operations into the search space progressively at the beginning of each stage. It can effectively alleviate the unfair competition between operations during the search phase of DARTS by offsetting the inherent unfair advantage of the skip connection over other operations. Besides, to keep the competition between operations relatively fair and select the operation from the candidate operations set that makes training loss of the supernet largest. The experiment results indicate that our method is effective and efficient. Our method's performance on CIFAR-10 is superior to the architecture found by standard DARTS, and the transferability of our method also surpasses standard DARTS. We further demonstrate the robustness of our method on three simple search spaces, i.e., S2, S3, S4, and the results show us that our method is more robust than standard DARTS. Our code is available at https://github.com/zxunyu/OPP-DARTS.  ( 2 min )
    Towards Multi-User Activity Recognition through Facilitated Training Data and Deep Learning for Human-Robot Collaboration Applications. (arXiv:2302.05763v1 [cs.LG])
    Human-robot interaction (HRI) research is progressively addressing multi-party scenarios, where a robot interacts with more than one human user at the same time. Conversely, research is still at an early stage for human-robot collaboration (HRC). The use of machine learning techniques to handle such type of collaboration requires data that are less feasible to produce than in a typical HRC setup. This work outlines concepts of design of concurrent tasks for non-dyadic HRC applications. Based upon these concepts, this study also proposes an alternative way of gathering data regarding multiuser activity, by collecting data related to single subjects and merging them in post-processing, to reduce the effort involved in producing recordings of pair settings. To validate this statement, 3D skeleton poses of activity of single subjects were collected and merged in pairs. After this, the datapoints were used to separately train a long short-term memory (LSTM) network and a variational autoencoder (VAE) composed of spatio-temporal graph convolutional networks (STGCN) to recognise the joint activities of the pairs of people. The results showed that it is possible to make use of data collected in this way for pair HRC settings and get similar performances compared to using data regarding groups of users recorded under the same settings, relieving from the technical difficulties involved in producing these data.  ( 2 min )
    UGAE: A Novel Approach to Non-exponential Discounting. (arXiv:2302.05740v1 [cs.LG])
    The discounting mechanism in Reinforcement Learning determines the relative importance of future and present rewards. While exponential discounting is widely used in practice, non-exponential discounting methods that align with human behavior are often desirable for creating human-like agents. However, non-exponential discounting methods cannot be directly applied in modern on-policy actor-critic algorithms. To address this issue, we propose Universal Generalized Advantage Estimation (UGAE), which allows for the computation of GAE advantage values with arbitrary discounting. Additionally, we introduce Beta-weighted discounting, a continuous interpolation between exponential and hyperbolic discounting, to increase flexibility in choosing a discounting method. To showcase the utility of UGAE, we provide an analysis of the properties of various discounting methods. We also show experimentally that agents with non-exponential discounting trained via UGAE outperform variants trained with Monte Carlo advantage estimation. Through analysis of various discounting methods and experiments, we demonstrate the superior performance of UGAE with Beta-weighted discounting over the Monte Carlo baseline on standard RL benchmarks. UGAE is simple and easily integrated into any advantage-based algorithm as a replacement for the standard recursive GAE.  ( 2 min )
    Theory on Forgetting and Generalization of Continual Learning. (arXiv:2302.05836v1 [cs.LG])
    Continual learning (CL), which aims to learn a sequence of tasks, has attracted significant recent attention. However, most work has focused on the experimental performance of CL, and theoretical studies of CL are still limited. In particular, there is a lack of understanding on what factors are important and how they affect "catastrophic forgetting" and generalization performance. To fill this gap, our theoretical analysis, under overparameterized linear models, provides the first-known explicit form of the expected forgetting and generalization error. Further analysis of such a key result yields a number of theoretical explanations about how overparameterization, task similarity, and task ordering affect both forgetting and generalization error of CL. More interestingly, by conducting experiments on real datasets using deep neural networks (DNNs), we show that some of these insights even go beyond the linear models and can be carried over to practical setups. In particular, we use concrete examples to show that our results not only explain some interesting empirical observations in recent studies, but also motivate better practical algorithm designs of CL.  ( 2 min )
    Position Matters! Empirical Study of Order Effect in Knowledge-grounded Dialogue. (arXiv:2302.05888v1 [cs.CL])
    With the power of large pretrained language models, various research works have integrated knowledge into dialogue systems. The traditional techniques treat knowledge as part of the input sequence for the dialogue system, prepending a set of knowledge statements in front of dialogue history. However, such a mechanism forces knowledge sets to be concatenated in an ordered manner, making models implicitly pay imbalanced attention to the sets during training. In this paper, we first investigate how the order of the knowledge set can influence autoregressive dialogue systems' responses. We conduct experiments on two commonly used dialogue datasets with two types of transformer-based models and find that models view the input knowledge unequally. To this end, we propose a simple and novel technique to alleviate the order effect by modifying the position embeddings of knowledge input in these models. With the proposed position embedding method, the experimental results show that each knowledge statement is uniformly considered to generate responses.  ( 2 min )
    Cross-Modal Fine-Tuning: Align then Refine. (arXiv:2302.05738v1 [cs.LG])
    Fine-tuning large-scale pretrained models has led to tremendous progress in well-studied modalities such as vision and NLP. However, similar gains have not been observed in many other modalities due to a lack of relevant pretrained models. In this work, we propose ORCA, a general cross-modal fine-tuning framework that extends the applicability of a single large-scale pretrained model to diverse modalities. ORCA adapts to a target task via an align-then-refine workflow: given the target input, ORCA first learns an embedding network that aligns the embedded feature distribution with the pretraining modality. The pretrained model is then fine-tuned on the embedded data to exploit the knowledge shared across modalities. Through extensive experiments, we show that ORCA obtains state-of-the-art results on 3 benchmarks containing over 60 datasets from 12 modalities, outperforming a wide range of hand-designed, AutoML, general-purpose, and task-specific methods. We highlight the importance of data alignment via a series of ablation studies and demonstrate ORCA's utility in data-limited regimes.  ( 2 min )
    Vertical Federated Knowledge Transfer via Representation Distillation for Healthcare Collaboration Networks. (arXiv:2302.05675v1 [cs.LG])
    Collaboration between healthcare institutions can significantly lessen the imbalance in medical resources across various geographic areas. However, directly sharing diagnostic information between institutions is typically not permitted due to the protection of patients' highly sensitive privacy. As a novel privacy-preserving machine learning paradigm, federated learning (FL) makes it possible to maximize the data utility among multiple medical institutions. These feature-enrichment FL techniques are referred to as vertical FL (VFL). Traditional VFL can only benefit multi-parties' shared samples, which strongly restricts its application scope. In order to improve the information-sharing capability and innovation of various healthcare-related institutions, and then to establish a next-generation open medical collaboration network, we propose a unified framework for vertical federated knowledge transfer mechanism (VFedTrans) based on a novel cross-hospital representation distillation component. Specifically, our framework includes three steps. First, shared samples' federated representations are extracted by collaboratively modeling multi-parties' joint features with current efficient vertical federated representation learning methods. Second, for each hospital, we learn a local-representation-distilled module, which can transfer the knowledge from shared samples' federated representations to enrich local samples' representations. Finally, each hospital can leverage local samples' representations enriched by the distillation module to boost arbitrary downstream machine learning tasks. The experiments on real-life medical datasets verify the knowledge transfer effectiveness of our framework.  ( 2 min )
    Tighter PAC-Bayes Bounds Through Coin-Betting. (arXiv:2302.05829v1 [cs.LG])
    We consider the problem of estimating the mean of a sequence of random elements $f(X_1, \theta)$ $, \ldots, $ $f(X_n, \theta)$ where $f$ is a fixed scalar function, $S=(X_1, \ldots, X_n)$ are independent random variables, and $\theta$ is a possibly $S$-dependent parameter. An example of such a problem would be to estimate the generalization error of a neural network trained on $n$ examples where $f$ is a loss function. Classically, this problem is approached through concentration inequalities holding uniformly over compact parameter sets of functions $f$, for example as in Rademacher or VC type analysis. However, in many problems, such inequalities often yield numerically vacuous estimates. Recently, the \emph{PAC-Bayes} framework has been proposed as a better alternative for this class of problems for its ability to often give numerically non-vacuous bounds. In this paper, we show that we can do even better: we show how to refine the proof strategy of the PAC-Bayes bounds and achieve \emph{even tighter} guarantees. Our approach is based on the \emph{coin-betting} framework that derives the numerically tightest known time-uniform concentration inequalities from the regret guarantees of online gambling algorithms. In particular, we derive the first PAC-Bayes concentration inequality based on the coin-betting approach that holds simultaneously for all sample sizes. We demonstrate its tightness showing that by \emph{relaxing} it we obtain a number of previous results in a closed form including Bernoulli-KL and empirical Bernstein inequalities. Finally, we propose an efficient algorithm to numerically calculate confidence sequences from our bound, which often generates nonvacuous confidence bounds even with one sample, unlike the state-of-the-art PAC-Bayes bounds.  ( 2 min )
    Learning by Applying: A General Framework for Mathematical Reasoning via Enhancing Explicit Knowledge Learning. (arXiv:2302.05717v1 [cs.AI])
    Mathematical reasoning is one of the crucial abilities of general artificial intelligence, which requires machines to master mathematical logic and knowledge from solving problems. However, existing approaches are not transparent (thus not interpretable) in terms of what knowledge has been learned and applied in the reasoning process. In this paper, we propose a general Learning by Applying (LeAp) framework to enhance existing models (backbones) in a principled way by explicit knowledge learning. In LeAp, we perform knowledge learning in a novel problem-knowledge-expression paradigm, with a Knowledge Encoder to acquire knowledge from problem data and a Knowledge Decoder to apply knowledge for expression reasoning. The learned mathematical knowledge, including word-word relations and word-operator relations, forms an explicit knowledge graph, which bridges the knowledge "learning" and "applying" organically. Moreover, for problem solving, we design a semantics-enhanced module and a reasoning-enhanced module that apply knowledge to improve the problem comprehension and symbol reasoning abilities of any backbone, respectively. We theoretically prove the superiority of LeAp's autonomous learning mechanism. Experiments on three real-world datasets show that LeAp improves all backbones' performances, learns accurate knowledge, and achieves a more interpretable reasoning process.  ( 2 min )
    Graph Neural Network-Inspired Kernels for Gaussian Processes in Semi-Supervised Learning. (arXiv:2302.05828v1 [cs.LG])
    Gaussian processes (GPs) are an attractive class of machine learning models because of their simplicity and flexibility as building blocks of more complex Bayesian models. Meanwhile, graph neural networks (GNNs) emerged recently as a promising class of models for graph-structured data in semi-supervised learning and beyond. Their competitive performance is often attributed to a proper capturing of the graph inductive bias. In this work, we introduce this inductive bias into GPs to improve their predictive performance for graph-structured data. We show that a prominent example of GNNs, the graph convolutional network, is equivalent to some GP when its layers are infinitely wide; and we analyze the kernel universality and the limiting behavior in depth. We further present a programmable procedure to compose covariance kernels inspired by this equivalence and derive example kernels corresponding to several interesting members of the GNN family. We also propose a computationally efficient approximation of the covariance matrix for scalable posterior inference with large-scale data. We demonstrate that these graph-based kernels lead to competitive classification and regression performance, as well as advantages in computation time, compared with the respective GNNs.  ( 2 min )
    Pushing the Accuracy-Group Robustness Frontier with Introspective Self-play. (arXiv:2302.05807v1 [cs.LG])
    Standard empirical risk minimization (ERM) training can produce deep neural network (DNN) models that are accurate on average but under-perform in under-represented population subgroups, especially when there are imbalanced group distributions in the long-tailed training data. Therefore, approaches that improve the accuracy-group robustness trade-off frontier of a DNN model (i.e. improving worst-group accuracy without sacrificing average accuracy, or vice versa) is of crucial importance. Uncertainty-based active learning (AL) can potentially improve the frontier by preferentially sampling underrepresented subgroups to create a more balanced training dataset. However, the quality of uncertainty estimates from modern DNNs tend to degrade in the presence of spurious correlations and dataset bias, compromising the effectiveness of AL for sampling tail groups. In this work, we propose Introspective Self-play (ISP), a simple approach to improve the uncertainty estimation of a deep neural network under dataset bias, by adding an auxiliary introspection task requiring a model to predict the bias for each data point in addition to the label. We show that ISP provably improves the bias-awareness of the model representation and the resulting uncertainty estimates. On two real-world tabular and language tasks, ISP serves as a simple "plug-in" for AL model training, consistently improving both the tail-group sampling rate and the final accuracy-fairness trade-off frontier of popular AL methods.  ( 2 min )
    Global Convergence Rate of Deep Equilibrium Models with General Activations. (arXiv:2302.05797v1 [stat.ML])
    In a recent paper, Ling et al. investigated the over-parametrized Deep Equilibrium Model (DEQ) with ReLU activation and proved that the gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function. In this paper, we show that this fact still holds for DEQs with any general activation which has bounded first and second derivatives. Since the new activation function is generally non-linear, a general population Gram matrix is designed, and a new form of dual activation with Hermite polynomial expansion is developed.  ( 2 min )
    Multi-class Brain Tumor Segmentation using Graph Attention Network. (arXiv:2302.05598v1 [eess.IV])
    Brain tumor segmentation from magnetic resonance imaging (MRI) plays an important role in diagnostic radiology. To overcome the practical issues in manual approaches, there is a huge demand for building automatic tumor segmentation algorithms. This work introduces an efficient brain tumor summation model by exploiting the advancement in MRI and graph neural networks (GNNs). The model represents the volumetric MRI as a region adjacency graph (RAG) and learns to identify the type of tumors through a graph attention network (GAT) -- a variant of GNNs. The ablation analysis conducted on two benchmark datasets proves that the proposed model can produce competitive results compared to the leading-edge solutions. It achieves mean dice scores of 0.91, 0.86, 0.79, and mean Hausdorff distances in the 95th percentile (HD95) of 5.91, 6.08, and 9.52 mm, respectively, for whole tumor, core tumor, and enhancing tumor segmentation on BraTS2021 validation dataset. On average, these performances are >6\% and >50%, compared to a GNN-based baseline model, respectively, on dice score and HD95 evaluation metrics.  ( 2 min )
    CILP: Co-simulation based Imitation Learner for Dynamic Resource Provisioning in Cloud Computing Environments. (arXiv:2302.05630v1 [eess.SY])
    Intelligent Virtual Machine (VM) provisioning is central to cost and resource efficient computation in cloud computing environments. As bootstrapping VMs is time-consuming, a key challenge for latency-critical tasks is to predict future workload demands to provision VMs proactively. However, existing AI-based solutions \blue{tend to not holistically consider} all crucial aspects such as provisioning overheads, heterogeneous VM costs and Quality of Service (QoS) of the cloud system. To address this, we propose a novel method, called CILP, that formulates the VM provisioning problem as two sub-problems of prediction and optimization, where the provisioning plan is optimized based on predicted workload demands. CILP leverages a neural network as a surrogate model to predict future workload demands with a co-simulated digital-twin of the infrastructure to compute QoS scores. We extend the neural network to also act as an imitation learner that dynamically decides the optimal VM provisioning plan. A transformer based neural model reduces training and inference overheads while our novel two-phase decision making loop facilitates in making informed provisioning decisions. Crucially, we address limitations of prior work by including resource utilization, deployment costs and provisioning overheads to inform the provisioning decisions in our imitation learning framework. Experiments with three public benchmarks demonstrate that CILP gives up to 22% higher resource utilization, 14% higher QoS scores and 44% lower execution costs compared to the current online and offline optimization based state-of-the-art methods.  ( 2 min )
    Fairness-aware Multi-view Clustering. (arXiv:2302.05788v1 [cs.LG])
    In the era of big data, we are often facing the challenge of data heterogeneity and the lack of label information simultaneously. In the financial domain (e.g., fraud detection), the heterogeneous data may include not only numerical data (e.g., total debt and yearly income), but also text and images (e.g., financial statement and invoice images). At the same time, the label information (e.g., fraud transactions) may be missing for building predictive models. To address these challenges, many state-of-the-art multi-view clustering methods have been proposed and achieved outstanding performance. However, these methods typically do not take into consideration the fairness aspect and are likely to generate biased results using sensitive information such as race and gender. Therefore, in this paper, we propose a fairness-aware multi-view clustering method named FairMVC. It incorporates the group fairness constraint into the soft membership assignment for each cluster to ensure that the fraction of different groups in each cluster is approximately identical to the entire data set. Meanwhile, we adopt the idea of both contrastive learning and non-contrastive learning and propose novel regularizers to handle heterogeneous data in complex scenarios with missing data or noisy features. Experimental results on real-world data sets demonstrate the effectiveness and efficiency of the proposed framework. We also derive insights regarding the relative performance of the proposed regularizers in various scenarios.  ( 2 min )
    Stochastic Surprisal: An inferential measurement of Free Energy in Neural Networks. (arXiv:2302.05776v1 [cs.LG])
    This paper conjectures and validates a framework that allows for action during inference in supervised neural networks. Supervised neural networks are constructed with the objective to maximize their performance metric in any given task. This is done by reducing free energy and its associated surprisal during training. However, the bottom-up inference nature of supervised networks is a passive process that renders them fallible to noise. In this paper, we provide a thorough background of supervised neural networks, both generative and discriminative, and discuss their functionality from the perspective of free energy principle. We then provide a framework for introducing action during inference. We introduce a new measurement called stochastic surprisal that is a function of the network, the input, and any possible action. This action can be any one of the outputs that the neural network has learnt, thereby lending stochasticity to the measurement. Stochastic surprisal is validated on two applications: Image Quality Assessment and Recognition under noisy conditions. We show that, while noise characteristics are ignored to make robust recognition, they are analyzed to estimate image quality scores. We apply stochastic surprisal on two applications, three datasets, and as a plug-in on twelve networks. In all, it provides a statistically significant increase among all measures. We conclude by discussing the implications of the proposed stochastic surprisal in other areas of cognitive psychology including expectancy-mismatch and abductive reasoning.  ( 2 min )
    Emotion Detection From Social Media Posts. (arXiv:2302.05610v1 [cs.LG])
    Over the last few years, social media has evolved into a medium for expressing personal views, emotions, and even business and political proposals, recommendations, and advertisements. We address the topic of identifying emotions from text data obtained from social media posts like Twitter in this research. We have deployed different traditional machine learning techniques such as Support Vector Machines (SVM), Naive Bayes, Decision Trees, and Random Forest, as well as deep neural network models such as LSTM, CNN, GRU, BiLSTM, BiGRU to classify these tweets into four emotion categories (Fear, Anger, Joy, and Sadness). Furthermore, we have constructed a BiLSTM and BiGRU ensemble model. The evaluation result shows that the deep neural network models(BiGRU, to be specific) produce the most promising results compared to traditional machine learning models, with an 87.53 % accuracy rate. The ensemble model performs even better (87.66 %), albeit the difference is not significant. This result will aid in the development of a decision-making tool that visualizes emotional fluctuations.  ( 2 min )
    Predicting municipalities in financial distress: a machine learning approach enhanced by domain expertise. (arXiv:2302.05780v1 [cs.LG])
    Financial distress of municipalities, although comparable to bankruptcy of private companies, has a far more serious impact on the well-being of communities. For this reason, it is essential to detect deficits as soon as possible. Predicting financial distress in municipalities can be a complex task, as it involves understanding a wide range of factors that can affect a municipality's financial health. In this paper, we evaluate machine learning models to predict financial distress in Italian municipalities. Accounting judiciary experts have specialized knowledge and experience in evaluating the financial performance of municipalities, and they use a range of financial and general indicators to make their assessments. By incorporating these indicators in the feature extraction process, we can ensure that the predictive model is taking into account a wide range of information that is relevant to the financial health of municipalities. The results of this study indicate that using machine learning models in combination with the knowledge of accounting judiciary experts can aid in the early detection of financial distress in municipalities, leading to better outcomes for the communities they serve.  ( 2 min )
    MSDC: Exploiting Multi-State Power Consumption in Non-intrusive Load Monitoring based on A Dual-CNN Model. (arXiv:2302.05565v1 [cs.LG])
    Non-intrusive load monitoring (NILM) aims to decompose aggregated electrical usage signal into appliance-specific power consumption and it amounts to a classical example of blind source separation tasks. Leveraging recent progress on deep learning techniques, we design a new neural NILM model Multi-State Dual CNN (MSDC). Different from previous models, MSDC explicitly extracts information about the appliance's multiple states and state transitions, which in turn regulates the prediction of signals for appliances. More specifically, we employ a dual-CNN architecture: one CNN for outputting state distributions and the other for predicting the power of each state. A new technique is invented that utilizes conditional random fields (CRF) to capture state transitions. Experiments on two real-world datasets REDD and UK-DALE demonstrate that our model significantly outperform state-of-the-art models while having good generalization capacity, achieving 6%-10% MAE gain and 33%-51% SAE gain to unseen appliances.  ( 2 min )
    Improving Differentiable Architecture Search via Self-Distillation. (arXiv:2302.05629v1 [cs.CV])
    Differentiable Architecture Search (DARTS) is a simple yet efficient Neural Architecture Search (NAS) method. During the search stage, DARTS trains a supernet by jointly optimizing architecture parameters and network parameters. During the evaluation stage, DARTS derives the optimal architecture based on architecture parameters. However, the loss landscape of the supernet is not smooth, and it results in a performance gap between the supernet and the optimal architecture. In the paper, we propose Self-Distillation Differentiable Neural Architecture Search (SD-DARTS) by utilizing self-distillation to transfer knowledge of the supernet in previous steps to guide the training of the supernet in the current steps. SD-DARTS can minimize the loss difference for the two consecutive iterations so that minimize the sharpness of the supernet's loss to bridge the performance gap between the supernet and the optimal architecture. Furthermore, we propose voted teachers, which select multiple previous supernets as teachers and vote teacher output probabilities as the final teacher prediction. The knowledge of several teachers is more abundant than a single teacher, thus, voted teachers can be more suitable to lead the training of the supernet. Experimental results on real datasets illustrate the advantages of our novel self-distillation-based NAS method compared to state-of-the-art alternatives.  ( 2 min )
    Robust Scheduling with GFlowNets. (arXiv:2302.05446v1 [cs.AI])
    Finding the best way to schedule operations in a computation graph is a classical NP-hard problem which is central to compiler optimization. However, evaluating the goodness of a schedule on the target hardware can be very time-consuming. Traditional approaches as well as previous machine learning ones typically optimize proxy metrics, which are fast to evaluate but can lead to bad schedules when tested on the target hardware. In this work, we propose a new approach to scheduling by sampling proportionally to the proxy metric using a novel GFlowNet method. We introduce a technique to control the trade-off between diversity and goodness of the proposed schedules at inference time and demonstrate empirically that the pure optimization baselines can lead to subpar performance with respect to our approach when tested on a target model. Furthermore, we show that conditioning the GFlowNet on the computation graph enables generalization to unseen scheduling problems for both synthetic and real-world compiler datasets.  ( 2 min )
    A novel approach to generate datasets with XAI ground truth to evaluate image models. (arXiv:2302.05624v1 [cs.CV])
    With the increased usage of artificial intelligence (AI), it is imperative to understand how these models work internally. These needs have led to the development of a new field called eXplainable artificial intelligence (XAI). This field consists of on a set of techniques that allows us to theoretically determine the cause of the AI decisions. One unsolved question about XAI is how to measure the quality of explanations. In this study, we propose a new method to generate datasets with ground truth (GT). These datasets allow us to measure how faithful is a method without ad hoc solutions. We conducted a set of experiments that compared our GT with real model explanations and obtained excellent results confirming that our proposed method is correct.  ( 2 min )
    Machine Learning Based Approach to Recommend MITRE ATT&CK Framework for Software Requirements and Design Specifications. (arXiv:2302.05530v1 [cs.SE])
    Engineering more secure software has become a critical challenge in the cyber world. It is very important to develop methodologies, techniques, and tools for developing secure software. To develop secure software, software developers need to think like an attacker through mining software repositories. These aim to analyze and understand the data repositories related to software development. The main goal is to use these software repositories to support the decision-making process of software development. There are different vulnerability databases like Common Weakness Enumeration (CWE), Common Vulnerabilities and Exposures database (CVE), and CAPEC. We utilized a database called MITRE. MITRE ATT&CK tactics and techniques have been used in various ways and methods, but tools for utilizing these tactics and techniques in the early stages of the software development life cycle (SDLC) are lacking. In this paper, we use machine learning algorithms to map requirements to the MITRE ATT&CK database and determine the accuracy of each mapping depending on the data split.  ( 2 min )
    FairPy: A Toolkit for Evaluation of Social Biases and their Mitigation in Large Language Models. (arXiv:2302.05508v1 [cs.CL])
    Studies have shown that large pretrained language models exhibit biases against social groups based on race, gender etc, which they inherit from the datasets they are trained on. Various researchers have proposed mathematical tools for quantifying and identifying these biases. There have been methods proposed to mitigate such biases. In this paper, we present a comprehensive quantitative evaluation of different kinds of biases such as race, gender, ethnicity, age etc. exhibited by popular pretrained language models such as BERT, GPT-2 etc. and also present a toolkit that provides plug-and-play interfaces to connect mathematical tools to identify biases with large pretrained language models such as BERT, GPT-2 etc. and also present users with the opportunity to test custom models against these metrics. The toolkit also allows users to debias existing and custom models using the debiasing techniques proposed so far. The toolkit is available at https://github.com/HrishikeshVish/Fairpy.  ( 2 min )
    Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. (arXiv:2302.05733v1 [cs.CR])
    Recent advances in instruction-following large language models (LLMs) have led to dramatic improvements in a range of NLP tasks. Unfortunately, we find that the same improved capabilities amplify the dual-use risks for malicious purposes of these models. Dual-use is difficult to prevent as instruction-following capabilities now enable standard attacks from computer security. The capabilities of these instruction-following LLMs provide strong economic incentives for dual-use by malicious actors. In particular, we show that instruction-following LLMs can produce targeted malicious content, including hate speech and scams, bypassing in-the-wild defenses implemented by LLM API vendors. Our analysis shows that this content can be generated economically and at cost likely lower than with human effort alone. Together, our findings suggest that LLMs will increasingly attract more sophisticated adversaries and attacks, and addressing these attacks may require new approaches to mitigations.  ( 2 min )
    ReMIX: Regret Minimization for Monotonic Value Function Factorization in Multiagent Reinforcement Learning. (arXiv:2302.05593v1 [cs.LG])
    Value function factorization methods have become a dominant approach for cooperative multiagent reinforcement learning under a centralized training and decentralized execution paradigm. By factorizing the optimal joint action-value function using a monotonic mixing function of agents' utilities, these algorithms ensure the consistency between joint and local action selections for decentralized decision-making. Nevertheless, the use of monotonic mixing functions also induces representational limitations. Finding the optimal projection of an unrestricted mixing function onto monotonic function classes is still an open problem. To this end, we propose ReMIX, formulating this optimal projection problem for value function factorization as a regret minimization over the projection weights of different state-action values. Such an optimization problem can be relaxed and solved using the Lagrangian multiplier method to obtain the close-form optimal projection weights. By minimizing the resulting policy regret, we can narrow the gap between the optimal and the restricted monotonic mixing functions, thus obtaining an improved monotonic value function factorization. Our experimental results on Predator-Prey and StarCraft Multiagent Challenge environments demonstrate the effectiveness of our method, indicating the better capabilities of handling environments with non-monotonic value functions.  ( 2 min )
    Cross-center Early Sepsis Recognition by Medical Knowledge Guided Collaborative Learning for Data-scarce Hospitals. (arXiv:2302.05702v1 [cs.LG])
    There are significant regional inequities in health resources around the world. It has become one of the most focused topics to improve health services for data-scarce hospitals and promote health equity through knowledge sharing among medical institutions. Because electronic medical records (EMRs) contain sensitive personal information, privacy protection is unavoidable and essential for multi-hospital collaboration. In this paper, for a common disease in ICU patients, sepsis, we propose a novel cross-center collaborative learning framework guided by medical knowledge, SofaNet, to achieve early recognition of this disease. The Sepsis-3 guideline, published in 2016, defines that sepsis can be diagnosed by satisfying both suspicion of infection and Sequential Organ Failure Assessment (SOFA) greater than or equal to 2. Based on this knowledge, SofaNet adopts a multi-channel GRU structure to predict SOFA values of different systems, which can be seen as an auxiliary task to generate better health status representations for sepsis recognition. Moreover, we only achieve feature distribution alignment in the hidden space during cross-center collaborative learning, which ensures secure and compliant knowledge transfer without raw data exchange. Extensive experiments on two open clinical datasets, MIMIC-III and Challenge, demonstrate that SofaNet can benefit early sepsis recognition when hospitals only have limited EMRs.  ( 2 min )
    Satellite Anomaly Detection Using Variance Based Genetic Ensemble of Neural Networks. (arXiv:2302.05525v1 [cs.LG])
    In this paper, we use a variance-based genetic ensemble (VGE) of Neural Networks (NNs) to detect anomalies in the satellite's historical data. We use an efficient ensemble of the predictions from multiple Recurrent Neural Networks (RNNs) by leveraging each model's uncertainty level (variance). For prediction, each RNN is guided by a Genetic Algorithm (GA) which constructs the optimal structure for each RNN model. However, finding the model uncertainty level is challenging in many cases. Although the Bayesian NNs (BNNs)-based methods are popular for providing the confidence bound of the models, they cannot be employed in complex NN structures as they are computationally intractable. This paper uses the Monte Carlo (MC) dropout as an approximation version of BNNs. Then these uncertainty levels and each predictive model suggested by GA are used to generate a new model, which is then used for forecasting the TS and AD. Simulation results show that the forecasting and AD capability of the ensemble model outperforms existing approaches.  ( 2 min )
    Fair Enough: Standardizing Evaluation and Model Selection for Fairness Research in NLP. (arXiv:2302.05711v1 [cs.CL])
    Modern NLP systems exhibit a range of biases, which a growing literature on model debiasing attempts to correct. However current progress is hampered by a plurality of definitions of bias, means of quantification, and oftentimes vague relation between debiasing algorithms and theoretical measures of bias. This paper seeks to clarify the current situation and plot a course for meaningful progress in fair learning, with two key contributions: (1) making clear inter-relations among the current gamut of methods, and their relation to fairness theory; and (2) addressing the practical problem of model selection, which involves a trade-off between fairness and accuracy and has led to systemic issues in fairness research. Putting them together, we make several recommendations to help shape future work.  ( 2 min )
    On Differential Privacy and Adaptive Data Analysis with Bounded Space. (arXiv:2302.05707v1 [cs.CR])
    We study the space complexity of the two related fields of differential privacy and adaptive data analysis. Specifically, (1) Under standard cryptographic assumptions, we show that there exists a problem P that requires exponentially more space to be solved efficiently with differential privacy, compared to the space needed without privacy. To the best of our knowledge, this is the first separation between the space complexity of private and non-private algorithms. (2) The line of work on adaptive data analysis focuses on understanding the number of samples needed for answering a sequence of adaptive queries. We revisit previous lower bounds at a foundational level, and show that they are a consequence of a space bottleneck rather than a sampling bottleneck. To obtain our results, we define and construct an encryption scheme with multiple keys that is built to withstand a limited amount of key leakage in a very particular way.  ( 2 min )
    Cross-domain Random Pre-training with Prototypes for Reinforcement Learning. (arXiv:2302.05614v1 [cs.LG])
    Task-agnostic cross-domain pre-training shows great potential in image-based Reinforcement Learning (RL) but poses a big challenge. In this paper, we propose CRPTpro, a Cross-domain self-supervised Random Pre-Training framework with prototypes for image-based RL. CRPTpro employs cross-domain random policy to easily and quickly sample diverse data from multiple domains, to improve pre-training efficiency. Moreover, prototypical representation learning with a novel intrinsic loss is proposed to pre-train an effective and generic encoder across different domains. Without finetuning, the cross-domain encoder can be implemented for challenging downstream visual-control RL tasks defined in different domains efficiently. Compared with prior arts like APT and Proto-RL, CRPTpro achieves better performance on cross-domain downstream RL tasks without extra training on exploration agents for expert data collection, greatly reducing the burden of pre-training. Experiments on DeepMind Control suite (DMControl) demonstrate that CRPTpro outperforms APT significantly on 11/12 cross-domain RL tasks with only 39% pre-training hours, becoming a state-of-the-art cross-domain pre-training method in both policy learning performance and pre-training efficiency. The complete code will be released at https://github.com/liuxin0824/CRPTpro.  ( 2 min )
    SLOTH: Structured Learning and Task-based Optimization for Time Series Forecasting on Hierarchies. (arXiv:2302.05650v1 [cs.LG])
    Multivariate time series forecasting with hierarchical structure is widely used in real-world applications, e.g., sales predictions for the geographical hierarchy formed by cities, states, and countries. The hierarchical time series (HTS) forecasting includes two sub-tasks, i.e., forecasting and reconciliation. In the previous works, hierarchical information is only integrated in the reconciliation step to maintain coherency, but not in forecasting step for accuracy improvement. In this paper, we propose two novel tree-based feature integration mechanisms, i.e., top-down convolution and bottom-up attention to leverage the information of the hierarchical structure to improve the forecasting performance. Moreover, unlike most previous reconciliation methods which either rely on strong assumptions or focus on coherent constraints only,we utilize deep neural optimization networks, which not only achieve coherency without any assumptions, but also allow more flexible and realistic constraints to achieve task-based targets, e.g., lower under-estimation penalty and meaningful decision-making loss to facilitate the subsequent downstream tasks. Experiments on real-world datasets demonstrate that our tree-based feature integration mechanism achieves superior performances on hierarchical forecasting tasks compared to the state-of-the-art methods, and our neural optimization networks can be applied to real-world tasks effectively without any additional effort under coherence and task-based constraints  ( 2 min )
    Robust Knowledge Transfer in Tiered Reinforcement Learning. (arXiv:2302.05534v1 [cs.LG])
    In this paper, we study the Tiered Reinforcement Learning setting, a parallel transfer learning framework, where the goal is to transfer knowledge from the low-tier (source) task to the high-tier (target) task to reduce the exploration risk of the latter while solving the two tasks in parallel. Unlike previous work, we do not assume the low-tier and high-tier tasks share the same dynamics or reward functions, and focus on robust knowledge transfer without prior knowledge on the task similarity. We identify a natural and necessary condition called the "Optimal Value Dominance" for our objective. Under this condition, we propose novel online learning algorithms such that, for the high-tier task, it can achieve constant regret on partial states depending on the task similarity and retain near-optimal regret when the two tasks are dissimilar, while for the low-tier task, it can keep near-optimal without making sacrifice. Moreover, we further study the setting with multiple low-tier tasks, and propose a novel transfer source selection mechanism, which can ensemble the information from all low-tier tasks and allow provable benefits on a much larger state-action space.  ( 2 min )
    Privacy Against Agnostic Inference Attack in Vertical Federated Learning. (arXiv:2302.05545v1 [cs.CR])
    A novel form of inference attack in vertical federated learning (VFL) is proposed, where two parties collaborate in training a machine learning (ML) model. Logistic regression is considered for the VFL model. One party, referred to as the active party, possesses the ground truth labels of the samples in the training phase, while the other, referred to as the passive party, only shares a separate set of features corresponding to these samples. It is shown that the active party can carry out inference attacks on both training and prediction phase samples by acquiring an ML model independently trained on the training samples available to them. This type of inference attack does not require the active party to be aware of the score of a specific sample, hence it is referred to as an agnostic inference attack. It is shown that utilizing the observed confidence scores during the prediction phase, before the time of the attack, can improve the performance of the active party's autonomous model, and thus improve the quality of the agnostic inference attack. As a countermeasure, privacy-preserving schemes (PPSs) are proposed. While the proposed schemes preserve the utility of the VFL model, they systematically distort the VFL parameters corresponding to the passive party's features. The level of the distortion imposed on the passive party's parameters is adjustable, giving rise to a trade-off between privacy of the passive party and interpretabiliy of the VFL outcomes by the active party. The distortion level of the passive party's parameters could be chosen carefully according to the privacy and interpretabiliy concerns of the passive and active parties, respectively, with the hope of keeping both parties (partially) satisfied. Finally, experimental results demonstrate the effectiveness of the proposed attack and the PPSs.  ( 2 min )
    Predicting Participants' Performance in Programming Contests using Deep Learning Techniques. (arXiv:2302.05602v1 [cs.LG])
    In recent days, the number of technology enthusiasts is increasing day by day with the prevalence of technological products and easy access to the internet. Similarly, the amount of people working behind this rapid development is rising tremendously. Computer programmers consist of a large portion of those tech-savvy people. Codeforces, an online programming and contest hosting platform used by many competitive programmers worldwide. It is regarded as one of the most standardized platforms for practicing programming problems and participate in programming contests. In this research, we propose a framework that predicts the performance of any particular contestant in the upcoming competitions as well as predicts the rating after that contest based on their practice and the performance of their previous contests.  ( 2 min )
    PDSum: Prototype-driven Continuous Summarization of Evolving Multi-document Sets Stream. (arXiv:2302.05550v1 [cs.IR])
    Summarizing text-rich documents has been long studied in the literature, but most of the existing efforts have been made to summarize a static and predefined multi-document set. With the rapid development of online platforms for generating and distributing text-rich documents, there arises an urgent need for continuously summarizing dynamically evolving multi-document sets where the composition of documents and sets is changing over time. This is especially challenging as the summarization should be not only effective in incorporating relevant, novel, and distinctive information from each concurrent multi-document set, but also efficient in serving online applications. In this work, we propose a new summarization problem, Evolving Multi-Document sets stream Summarization (EMDS), and introduce a novel unsupervised algorithm PDSum with the idea of prototype-driven continuous summarization. PDSum builds a lightweight prototype of each multi-document set and exploits it to adapt to new documents while preserving accumulated knowledge from previous documents. To update new summaries, the most representative sentences for each multi-document set are extracted by measuring their similarities to the prototypes. A thorough evaluation with real multi-document sets streams demonstrates that PDSum outperforms state-of-the-art unsupervised multi-document summarization algorithms in EMDS in terms of relevance, novelty, and distinctiveness and is also robust to various evaluation settings.  ( 2 min )
    Brain Effective Connectome based on fMRI and DTI Data: Bayesian Causal Learning and Assessment. (arXiv:2302.05451v1 [cs.LG])
    The ambitious goal of neuroscientific studies is to find an accurate and reliable brain Effective Connectome (EC). Although current EC discovery methods have contributed to our understanding of brain organization, their performances are severely constrained by the short sample size and poor temporal resolution of fMRI data, and high dimensionality of the brain connectome. By leveraging the DTI data as prior knowledge, we introduce two Bayesian casual discovery frameworks -- the Bayesian GOLEM (BGOLEM) and Bayesian FGES (BFGES) methods -- as the most reliable and accurate methods in discovering EC that address the shortcomings of the current causal discovery methods in discovering ECs based on only fMRI data. Through a series of simulation studies on synthetic and hybrid (DTI of the Human Connectome Project (HCP) subjects and synthetic fMRI) data, we first demonstrate the effectiveness and importance of the proposed methods in discovering EC. We also introduce the Pseudo False Discovery Rate (PFDR) as a new accuracy metric for causal discovery in the brain and show that our Bayesian methods achieve higher accuracy than traditional methods on empirical data (DTI and fMRI of the Human Connectome Project (HCP) subjects). Additionally, we measure the reliability of discovered ECs using the Rogers-Tanimoto index for test-retest data and show that our Bayesian methods provide significantly more reproducible ECs compared to traditional methods.  ( 2 min )
    CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code. (arXiv:2302.05527v1 [cs.SE])
    Since the rise of neural models of code that can generate long expressions and statements rather than a single next-token, one of the major problems has been reliably evaluating their generated output. In this paper, we propose CodeBERTScore: an automatic evaluation metric for code generation, which builds on BERTScore (Zhang et al., 2020). Instead of measuring exact token matching as BLEU, CodeBERTScore computes a soft similarity score between each token in the generated code and in the reference code, using the contextual encodings of large pretrained models. Further, instead of encoding only the generated tokens as in BERTScore, CodeBERTScore also encodes the programmatic context surrounding the generated code. We perform an extensive evaluation of CodeBERTScore across four programming languages. We find that CodeBERTScore achieves a higher correlation with human preference and with functional correctness than all existing metrics. That is, generated code that receives a higher score by CodeBERTScore is more likely to be preferred by humans, as well as to function correctly when executed. Finally, while CodeBERTScore can be used with a multilingual CodeBERT as its base model, we release five language-specific pretrained models to use with our publicly available code at https://github.com/neulab/code-bert-score . Our language-specific models have been downloaded more than 25,000 times from the Huggingface Hub.  ( 2 min )
    Achieving acceleration despite very noisy gradients. (arXiv:2302.05515v1 [stat.ML])
    We present a novel momentum-based first order optimization method (AGNES) which provably achieves acceleration for convex minimization, even if the stochastic noise in the gradient estimates is many orders of magnitude larger than the gradient itself. Here we model the noise as having a variance which is proportional to the magnitude of the underlying gradient. We argue, based upon empirical evidence, that this is appropriate for mini-batch gradients in overparameterized deep learning. Furthermore, we demonstrate that the method achieves competitive performance in the training of CNNs on MNIST and CIFAR-10.  ( 2 min )
  • Open

    Which Invariance Should We Transfer? A Causal Minimax Learning Approach. (arXiv:2107.01876v3 [stat.ML] UPDATED)
    A major barrier to deploying current machine learning models lies in their non-reliability to dataset shifts. To resolve this problem, most existing studies attempted to transfer stable information to unseen environments. Particularly, independent causal mechanisms-based methods proposed to remove mutable causal mechanisms via the do-operator. Compared to previous methods, the obtained stable predictors are more effective in identifying stable information. However, a key question remains: which subset of this whole stable information should the model transfer, in order to achieve optimal generalization ability? To answer this question, we present a comprehensive minimax analysis from a causal perspective. Specifically, we first provide a graphical condition for the whole stable set to be optimal. When this condition fails, we surprisingly find with an example that this whole stable set, although can fully exploit stable information, is not the optimal one to transfer. To identify the optimal subset under this case, we propose to estimate the worst-case risk with a novel optimization scheme over the intervention functions on mutable causal mechanisms. We then propose an efficient algorithm to search for the subset with minimal worst-case risk, based on a newly defined equivalence relation between stable subsets. Compared to the exponential cost of exhaustively searching over all subsets, our searching strategy enjoys a polynomial complexity. The effectiveness and efficiency of our methods are demonstrated on synthetic data and the diagnosis of Alzheimer's disease.  ( 2 min )
    Beyond UCB: Statistical Complexity and Optimal Algorithms for Non-linear Ridge Bandits. (arXiv:2302.06025v1 [stat.ML])
    We consider the sequential decision-making problem where the mean outcome is a non-linear function of the chosen action. Compared with the linear model, two curious phenomena arise in non-linear models: first, in addition to the "learning phase" with a standard parametric rate for estimation or regret, there is an "burn-in period" with a fixed cost determined by the non-linear function; second, achieving the smallest burn-in cost requires new exploration algorithms. For a special family of non-linear functions named ridge functions in the literature, we derive upper and lower bounds on the optimal burn-in cost, and in addition, on the entire learning trajectory during the burn-in period via differential equations. In particular, a two-stage algorithm that first finds a good initial action and then treats the problem as locally linear is statistically optimal. In contrast, several classical algorithms, such as UCB and algorithms relying on regression oracles, are provably suboptimal.  ( 2 min )
    Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and Constant Regret. (arXiv:2205.12418v3 [cs.LG] UPDATED)
    We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where the users can be divided into two groups based on their different tolerance on exploration risks and should be treated separately. In this setting, we simultaneously maintain two policies $\pi^{\text{O}}$ and $\pi^{\text{E}}$: $\pi^{\text{O}}$ ("O" for "online") interacts with more risk-tolerant users from the first tier and minimizes regret by balancing exploration and exploitation as usual, while $\pi^{\text{E}}$ ("E" for "exploit") exclusively focuses on exploitation for risk-averse users from the second tier utilizing the data collected so far. An important question is whether such a separation yields advantages over the standard online setting (i.e., $\pi^{\text{E}}=\pi^{\text{O}}$) for the risk-averse users. We individually consider the gap-independent vs.~gap-dependent settings. For the former, we prove that the separation is indeed not beneficial from a minimax perspective. For the latter, we show that if choosing Pessimistic Value Iteration as the exploitation algorithm to produce $\pi^{\text{E}}$, we can achieve a constant regret for risk-averse users independent of the number of episodes $K$, which is in sharp contrast to the $\Omega(\log K)$ regret for any online RL algorithms in the same setting, while the regret of $\pi^{\text{O}}$ (almost) maintains its online regret optimality and does not need to compromise for the success of $\pi^{\text{E}}$.  ( 2 min )
    Chaotic Hedging with Iterated Integrals and Neural Networks. (arXiv:2209.10166v2 [q-fin.MF] UPDATED)
    In this paper, we extend the Wiener-Ito chaos decomposition to the class of diffusion processes, whose drift and diffusion coefficient are of linear growth. By omitting the orthogonality in the chaos expansion, we are able to show that every $p$-integrable functional, for $p \in [1,\infty)$, can be represented as sum of iterated integrals of the underlying process. Using a truncated sum of this expansion and (possibly random) neural networks for the integrands, whose parameters are learned in a machine learning setting, we show that every financial derivative can be approximated arbitrarily well in the $L^p$-sense. Since the hedging strategy of the approximating option can be computed in closed form, we obtain an efficient algorithm that can replicate any integrable financial derivative with short runtime.  ( 2 min )
    Do PAC-Learners Learn the Marginal Distribution?. (arXiv:2302.06285v1 [cs.LG])
    We study a foundational variant of Valiant and Vapnik and Chervonenkis' Probably Approximately Correct (PAC)-Learning in which the adversary is restricted to a known family of marginal distributions $\mathscr{P}$. In particular, we study how the PAC-learnability of a triple $(\mathscr{P},X,H)$ relates to the learners ability to infer \emph{distributional} information about the adversary's choice of $D \in \mathscr{P}$. To this end, we introduce the `unsupervised' notion of \emph{TV-Learning}, which, given a class $(\mathscr{P},X,H)$, asks the learner to approximate $D$ from unlabeled samples with respect to a natural class-conditional total variation metric. In the classical distribution-free setting, we show that TV-learning is \emph{equivalent} to PAC-Learning: in other words, any learner must infer near-maximal information about $D$. On the other hand, we show this characterization breaks down for general $\mathscr{P}$, where PAC-Learning is strictly sandwiched between two approximate variants we call `Strong' and `Weak' TV-learning, roughly corresponding to unsupervised learners that estimate most relevant distances in $D$ with respect to $H$, but differ in whether the learner \emph{knows} the set of well-estimated events. Finally, we observe that TV-learning is in fact equivalent to the classical notion of \emph{uniform estimation}, and thereby give a strong refutation of the uniform convergence paradigm in supervised learning.  ( 2 min )
    Hierarchical Stochastic Block Model for Community Detection in Multiplex Networks. (arXiv:1904.05330v3 [cs.SI] UPDATED)
    Multiplex networks have become increasingly more prevalent in many fields, and have emerged as a powerful tool for modeling the complexity of real networks. There is a critical need for developing inference models for multiplex networks that can take into account potential dependencies across different layers, particularly when the aim is community detection. We add to a limited literature by proposing a novel and efficient Bayesian model for community detection in multiplex networks. A key feature of our approach is the ability to model varying communities at different network layers. In contrast, many existing models assume the same communities for all layers. Moreover, our model automatically picks up the necessary number of communities at each layer (as validated by real data examples). This is appealing, since deciding the number of communities is a challenging aspect of community detection, and especially so in the multiplex setting, if one allows the communities to change across layers. Borrowing ideas from hierarchical Bayesian modeling, we use a hierarchical Dirichlet prior to model community labels across layers, allowing dependency in their structure. Given the community labels, a stochastic block model (SBM) is assumed for each layer. We develop an efficient slice sampler for sampling the posterior distribution of the community labels as well as the link probabilities between communities. In doing so, we address some unique challenges posed by coupling the complex likelihood of SBM with the hierarchical nature of the prior on the labels. An extensive empirical validation is performed on simulated and real data, demonstrating the superior performance of the model over single-layer alternatives, as well as the ability to uncover interesting structures in real networks.  ( 3 min )
    Information-Directed Selection for Top-Two Algorithms. (arXiv:2205.12086v2 [stat.ML] UPDATED)
    We consider the best-k-arm identification problem for multi-armed bandits, where the objective is to select the exact set of k arms with the highest mean rewards by sequentially allocating measurement effort. We characterize the necessary and sufficient conditions for the optimal allocation using dual variables. Remarkably these optimality conditions lead to the extension of top-two algorithm design principle (Russo, 2020), initially proposed for best-arm identification. Furthermore, our optimality conditions induce a simple and effective selection rule dubbed information-directed selection (IDS) that selects one of the top-two candidates based on a measure of information gain. As a theoretical guarantee, we prove that integrated with IDS, top-two Thompson sampling is (asymptotically) optimal for Gaussian best-arm identification, solving a glaring open problem in the pure exploration literature (Russo, 2020). As a by-product, we show that for k > 1, top-two algorithms cannot achieve optimality even with an oracle tuning parameter. Numerical experiments show the superior performance of the proposed top-two algorithms with IDS and considerable improvement compared with algorithms without adaptive selection.  ( 2 min )
    A Graphical Point Process Framework for Understanding Removal Effects in Multi-Touch Attribution. (arXiv:2302.06075v1 [stat.ME])
    Marketers employ various online advertising channels to reach customers, and they are particularly interested in attribution for measuring the degree to which individual touchpoints contribute to an eventual conversion. The availability of individual customer-level path-to-purchase data and the increasing number of online marketing channels and types of touchpoints bring new challenges to this fundamental problem. We aim to tackle the attribution problem with finer granularity by conducting attribution at the path level. To this end, we develop a novel graphical point process framework to study the direct conversion effects and the full relational structure among numerous types of touchpoints simultaneously. Utilizing the temporal point process of conversion and the graphical structure, we further propose graphical attribution methods to allocate proper path-level conversion credit, called the attribution score, to individual touchpoints or corresponding channels for each customer's path to purchase. Our proposed attribution methods consider the attribution score as the removal effect, and we use the rigorous probabilistic definition to derive two types of removal effects. We examine the performance of our proposed methods in extensive simulation studies and compare their performance with commonly used attribution models. We also demonstrate the performance of the proposed methods in a real-world attribution application.  ( 2 min )
    Alternating Implicit Projected SGD and Its Efficient Variants for Equality-constrained Bilevel Optimization. (arXiv:2211.07096v2 [cs.LG] UPDATED)
    Stochastic bilevel optimization, which captures the inherent nested structure of machine learning problems, is gaining popularity in many recent applications. Existing works on bilevel optimization mostly consider either unconstrained problems or constrained upper-level problems. This paper considers the stochastic bilevel optimization problems with equality constraints both in the upper and lower levels. By leveraging the special structure of the equality constraints problem, the paper first presents an alternating implicit projected SGD approach and establishes the $\tilde{\cal O}(\epsilon^{-2})$ sample complexity that matches the state-of-the-art complexity of ALSET \citep{chen2021closing} for unconstrained bilevel problems. To further save the cost of projection, the paper presents two alternating implicit projection-efficient SGD approaches, where one algorithm enjoys the $\tilde{\cal O}(\epsilon^{-2}/T)$ upper-level and $\tilde{\cal O}(\epsilon^{-1.5}/T^{\frac{3}{4}})$ lower-level projection complexity with ${\cal O}(T)$ lower-level batch size, and the other one enjoys $\tilde{\cal O}(\epsilon^{-1.5})$ upper-level and lower-level projection complexity with ${\cal O}(1)$ batch size. Application to federated bilevel optimization has been presented to showcase the empirical performance of our algorithms. Our results demonstrate that equality-constrained bilevel optimization with strongly-convex lower-level problems can be solved as efficiently as stochastic single-level optimization problems.  ( 2 min )
    Breaking the Curse of Multiagency: Provably Efficient Decentralized Multi-Agent RL with Function Approximation. (arXiv:2302.06606v1 [cs.LG])
    A unique challenge in Multi-Agent Reinforcement Learning (MARL) is the curse of multiagency, where the description length of the game as well as the complexity of many existing learning algorithms scale exponentially with the number of agents. While recent works successfully address this challenge under the model of tabular Markov Games, their mechanisms critically rely on the number of states being finite and small, and do not extend to practical scenarios with enormous state spaces where function approximation must be used to approximate value functions or policies. This paper presents the first line of MARL algorithms that provably resolve the curse of multiagency under function approximation. We design a new decentralized algorithm -- V-Learning with Policy Replay, which gives the first polynomial sample complexity results for learning approximate Coarse Correlated Equilibria (CCEs) of Markov Games under decentralized linear function approximation. Our algorithm always outputs Markov CCEs, and achieves an optimal rate of $\widetilde{\mathcal{O}}(\epsilon^{-2})$ for finding $\epsilon$-optimal solutions. Also, when restricted to the tabular case, our result improves over the current best decentralized result $\widetilde{\mathcal{O}}(\epsilon^{-3})$ for finding Markov CCEs. We further present an alternative algorithm -- Decentralized Optimistic Policy Mirror Descent, which finds policy-class-restricted CCEs using a polynomial number of samples. In exchange for learning a weaker version of CCEs, this algorithm applies to a wider range of problems under generic function approximation, such as linear quadratic games and MARL problems with low ''marginal'' Eluder dimension.  ( 2 min )
    Event-Triggered Time-Varying Bayesian Optimization. (arXiv:2208.10790v2 [cs.LG] UPDATED)
    We consider the problem of sequentially optimizing a time-varying objective function using time-varying Bayesian optimization (TVBO). Here, the key challenge is the exploration-exploitation trade-off under time variations. Current approaches to TVBO require prior knowledge of a constant rate of change. However, the rate of change is usually neither known nor constant. We propose an event-triggered algorithm, ET-GP-UCB, that treats the optimization problem as static until it detects changes in the objective function online and then resets the dataset. This allows the algorithm to adapt to realized temporal changes without the need for prior knowledge. The event-trigger is based on probabilistic uniform error bounds used in Gaussian process regression. We provide regret bounds for ET-GP-UCB and show in numerical experiments that it is competitive with state-of-the-art algorithms even though it requires no knowledge about the temporal changes. Further, ET-GP-UCB outperforms these baselines if the rate of change is misspecified, and we demonstrate that it is readily applicable to various settings without tuning hyperparameters.  ( 2 min )
    Dark solitons in Bose-Einstein condensates: a dataset for many-body physics research. (arXiv:2205.09114v2 [cond-mat.quant-gas] UPDATED)
    We establish a dataset of over $1.6\times10^4$ experimental images of Bose--Einstein condensates containing solitonic excitations to enable machine learning (ML) for many-body physics research. About $33~\%$ of this dataset has manually assigned and carefully curated labels. The remainder is automatically labeled using SolDet -- an implementation of a physics-informed ML data analysis framework -- consisting of a convolutional-neural-network-based classifier and OD as well as a statistically motivated physics-informed classifier and a quality metric. This technical note constitutes the definitive reference of the dataset, providing an opportunity for the data science community to develop more sophisticated analysis tools, to further understand nonlinear many-body physics, and even advance cold atom experiments.  ( 2 min )
    A Finite-Particle Convergence Rate for Stein Variational Gradient Descent. (arXiv:2211.09721v3 [cs.LG] UPDATED)
    We provide a first finite-particle convergence rate for Stein variational gradient descent (SVGD). Specifically, whenever the target distribution is sub-Gaussian with a Lipschitz score, SVGD with n particles and an appropriate step size sequence drives the kernel Stein discrepancy to zero at an order 1/sqrt(log log n) rate. We suspect that the dependence on n can be improved, and we hope that our explicit, non-asymptotic proof strategy will serve as a template for future refinements.  ( 2 min )
    A Characterization of Multioutput Learnability. (arXiv:2301.02729v2 [cs.LG] UPDATED)
    We consider the problem of learning multioutput function classes in batch and online settings. In both settings, we show that a multioutput function class is learnable if and only if each single-output restriction of the function class is learnable. This provides a complete characterization of the learnability of multilabel classification and multioutput regression in both batch and online settings. As an extension, we also consider multilabel learnability in the bandit feedback setting and show a similar characterization as in the full-feedback setting.  ( 2 min )
    A Rigorous Framework for the Mean Field Limit of Multilayer Neural Networks. (arXiv:2001.11443v3 [cs.LG] UPDATED)
    We develop a mathematically rigorous framework for multilayer neural networks in the mean field regime. As the network's widths increase, the network's learning trajectory is shown to be well captured by a meaningful and dynamically nonlinear limit (the \textit{mean field} limit), which is characterized by a system of ODEs. Our framework applies to a broad range of network architectures, learning dynamics and network initializations. Central to the framework is the new idea of a \textit{neuronal embedding}, which comprises of a non-evolving probability space that allows to embed neural networks of arbitrary widths. Using our framework, we prove several properties of large-width multilayer neural networks. Firstly we show that independent and identically distributed initializations cause strong degeneracy effects on the network's learning trajectory when the network's depth is at least four. Secondly we obtain several global convergence guarantees for feedforward multilayer networks under a number of different setups. These include two-layer and three-layer networks with independent and identically distributed initializations, and multilayer networks of arbitrary depths with a special type of correlated initializations that is motivated by the new concept of \textit{bidirectional diversity}. Unlike previous works that rely on convexity, our results admit non-convex losses and hinge on a certain universal approximation property, which is a distinctive feature of infinite-width neural networks and is shown to hold throughout the training process. Aside from being the first known results for global convergence of multilayer networks in the mean field regime, they demonstrate flexibility of our framework and incorporate several new ideas and insights that depart from the conventional convex optimization wisdom.  ( 3 min )
    Quantifying the Impact of Label Noise on Federated Learning. (arXiv:2211.07816v6 [cs.LG] UPDATED)
    Federated Learning (FL) is a distributed machine learning paradigm where clients collaboratively train a model using their local (human-generated) datasets. While existing studies focus on FL algorithm development to tackle data heterogeneity across clients, the important issue of data quality (e.g., label noise) in FL is overlooked. This paper aims to fill this gap by providing a quantitative study on the impact of label noise on FL. We derive an upper bound for the generalization error that is linear in the clients' label noise level. Then we conduct experiments on MNIST and CIFAR-10 datasets using various FL algorithms. Our empirical results show that the global model accuracy linearly decreases as the noise level increases, which is consistent with our theoretical analysis. We further find that label noise slows down the convergence of FL training, and the global model tends to overfit when the noise level is high.  ( 2 min )
    On the geometry of Stein variational gradient descent. (arXiv:1912.00894v2 [stat.ML] UPDATED)
    Bayesian inference problems require sampling or approximating high-dimensional probability distributions. The focus of this paper is on the recently introduced Stein variational gradient descent methodology, a class of algorithms that rely on iterated steepest descent steps with respect to a reproducing kernel Hilbert space norm. This construction leads to interacting particle systems, the mean-field limit of which is a gradient flow on the space of probability distributions equipped with a certain geometrical structure. We leverage this viewpoint to shed some light on the convergence properties of the algorithm, in particular addressing the problem of choosing a suitable positive definite kernel function. Our analysis leads us to considering certain nondifferentiable kernels with adjusted tails. We demonstrate significant performance gains of these in various numerical experiments.  ( 2 min )
    A Theoretical Understanding of shallow Vision Transformers: Learning, Generalization, and Sample Complexity. (arXiv:2302.06015v1 [cs.LG])
    Vision Transformers (ViTs) with self-attention modules have recently achieved great empirical success in many vision tasks. Due to non-convex interactions across layers, however, theoretical learning and generalization analysis is mostly elusive. Based on a data model characterizing both label-relevant and label-irrelevant tokens, this paper provides the first theoretical analysis of training a shallow ViT, i.e., one self-attention layer followed by a two-layer perceptron, for a classification task. We characterize the sample complexity to achieve a zero generalization error. Our sample complexity bound is positively correlated with the inverse of the fraction of label-relevant tokens, the token noise level, and the initial model error. We also prove that a training process using stochastic gradient descent (SGD) leads to a sparse attention map, which is a formal verification of the general intuition about the success of attention. Moreover, this paper indicates that a proper token sparsification can improve the test performance by removing label-irrelevant and/or noisy tokens, including spurious correlations. Empirical experiments on synthetic data and CIFAR-10 dataset justify our theoretical results and generalize to deeper ViTs.  ( 2 min )
    Improving Accuracy of Interpretability Measures in Hyperparameter Optimization via Bayesian Algorithm Execution. (arXiv:2206.05447v2 [cs.LG] UPDATED)
    Despite all the benefits of automated hyperparameter optimization (HPO), most modern HPO algorithms are black-boxes themselves. This makes it difficult to understand the decision process which leads to the selected configuration, reduces trust in HPO, and thus hinders its broad adoption. Here, we study the combination of HPO with interpretable machine learning (IML) methods such as partial dependence plots. These techniques are more and more used to explain the marginal effect of hyperparameters on the black-box cost function or to quantify the importance of hyperparameters. However, if such methods are naively applied to the experimental data of the HPO process in a post-hoc manner, the underlying sampling bias of the optimizer can distort interpretations. We propose a modified HPO method which efficiently balances the search for the global optimum w.r.t. predictive performance \emph{and} the reliable estimation of IML explanations of an underlying black-box function by coupling Bayesian optimization and Bayesian Algorithm Execution. On benchmark cases of both synthetic objectives and HPO of a neural network, we demonstrate that our method returns more reliable explanations of the underlying black-box without a loss of optimization performance.  ( 2 min )
    Transformers in Time Series: A Survey. (arXiv:2202.07125v4 [cs.LG] UPDATED)
    Transformers have achieved superior performances in many tasks in natural language processing and computer vision, which also triggered great interest in the time series community. Among multiple advantages of Transformers, the ability to capture long-range dependencies and interactions is especially attractive for time series modeling, leading to exciting progress in various time series applications. In this paper, we systematically review Transformer schemes for time series modeling by highlighting their strengths as well as limitations. In particular, we examine the development of time series Transformers in two perspectives. From the perspective of network structure, we summarize the adaptations and modifications that have been made to Transformers in order to accommodate the challenges in time series analysis. From the perspective of applications, we categorize time series Transformers based on common tasks including forecasting, anomaly detection, and classification. Empirically, we perform robust analysis, model size analysis, and seasonal-trend decomposition analysis to study how Transformers perform in time series. Finally, we discuss and suggest future directions to provide useful research guidance. A corresponding resource that has been continuously updated can be found in the GitHub repository. To the best of our knowledge, this paper is the first work to comprehensively and systematically summarize the recent advances of Transformers for modeling time series data. We hope this survey will ignite further research interests in time series Transformers.  ( 2 min )
    Towards Understanding Why Mask-Reconstruction Pretraining Helps in Downstream Tasks. (arXiv:2206.03826v5 [cs.LG] UPDATED)
    For unsupervised pretraining, mask-reconstruction pretraining (MRP) approaches, e.g. MAE and data2vec, randomly mask input patches and then reconstruct the pixels or semantic features of these masked patches via an auto-encoder. Then for a downstream task, supervised fine-tuning the pretrained encoder remarkably surpasses the conventional ``supervised learning'' (SL) trained from scratch. However, it is still unclear 1) how MRP performs semantic feature learning in the pretraining phase and 2) why it helps in downstream tasks. To solve these problems, we first theoretically show that on an auto-encoder of a two/one-layered convolution encoder/decoder, MRP can capture all discriminative features of each potential semantic class in the pretraining dataset. Then considering the fact that the pretraining dataset is of huge size and high diversity and thus covers most features in downstream dataset, in fine-tuning phase, the pretrained encoder can capture as much features as it can in downstream datasets, and would not lost these features with theoretical guarantees. In contrast, SL only randomly captures some features due to lottery ticket hypothesis. So MRP provably achieves better performance than SL on the classification tasks. Experimental results testify to our data assumptions and also our theoretical implications.  ( 2 min )
    A Framework for Overparameterized Learning. (arXiv:2205.13507v2 [cs.LG] UPDATED)
    A candidate explanation of the good empirical performance of deep neural networks is the implicit regularization effect of first order optimization methods. Inspired by this, we prove a convergence theorem for nonconvex composite optimization, and apply it to a general learning problem covering many machine learning applications, including supervised learning. We then present a deep multilayer perceptron model and prove that, when sufficiently wide, it $(i)$ leads to the convergence of gradient descent to a global optimum with a linear rate, $(ii)$ benefits from the implicit regularization effect of gradient descent, $(iii)$ is subject to novel bounds on the generalization error, $(iv)$ exhibits the lazy training phenomenon and $(v)$ enjoys learning rate transfer across different widths. The corresponding coefficients, such as the convergence rate, improve as width is further increased, and depend on the even order moments of the data generating distribution up to an order depending on the number of layers. The only non-mild assumption we make is the concentration of the smallest eigenvalue of the neural tangent kernel at initialization away from zero, which has been shown to hold for a number of less general models in contemporary works. We present empirical evidence supporting this assumption as well as our theoretical claims.  ( 2 min )
    On the equivalence between graph isomorphism testing and function approximation with GNNs. (arXiv:1905.12560v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have achieved much success on graph-structured data. In light of this, there have been increasing interests in studying their expressive power. One line of work studies the capability of GNNs to approximate permutation-invariant functions on graphs, and another focuses on the their power as tests for graph isomorphism. Our work connects these two perspectives and proves their equivalence. We further develop a framework of the expressive power of GNNs that incorporates both of these viewpoints using the language of sigma-algebra, through which we compare the expressive power of different types of GNNs together with other graph isomorphism tests. In particular, we prove that the second-order Invariant Graph Network fails to distinguish non-isomorphic regular graphs with the same degree. Then, we extend it to a new architecture, Ring-GNN, which succeeds in distinguishing these graphs and achieves good performances on real-world datasets.  ( 2 min )
    Probabilistic Estimation of Instantaneous Frequencies of Chirp Signals. (arXiv:2205.06306v2 [stat.ML] UPDATED)
    We present a continuous-time probabilistic approach for estimating the chirp signal and its instantaneous frequency function when the true forms of these functions are not accessible. Our model represents these functions by non-linearly cascaded Gaussian processes represented as non-linear stochastic differential equations. The posterior distribution of the functions is then estimated with stochastic filters and smoothers. We compute a (posterior) Cram\'er--Rao lower bound for the Gaussian process model, and derive a theoretical upper bound for the estimation error in the mean squared sense. The experiments show that the proposed method outperforms a number of state-of-the-art methods on a synthetic data. We also show that the method works out-of-the-box for two real-world datasets.  ( 2 min )
    Out-of-distribution Generalization in the Presence of Nuisance-Induced Spurious Correlations. (arXiv:2107.00520v5 [cs.LG] UPDATED)
    In many prediction problems, spurious correlations are induced by a changing relationship between the label and a nuisance variable that is also correlated with the covariates. For example, in classifying animals in natural images, the background, which is a nuisance, can predict the type of animal. This nuisance-label relationship does not always hold, and the performance of a model trained under one such relationship may be poor on data with a different nuisance-label relationship. To build predictive models that perform well regardless of the nuisance-label relationship, we develop Nuisance-Randomized Distillation (NURD). We introduce the nuisance-randomized distribution, a distribution where the nuisance and the label are independent. Under this distribution, we define the set of representations such that conditioning on any member, the nuisance and the label remain independent. We prove that the representations in this set always perform better than chance, while representations outside of this set may not. NURD finds a representation from this set that is most informative of the label under the nuisance-randomized distribution, and we prove that this representation achieves the highest performance regardless of the nuisance-label relationship. We evaluate NURD on several tasks including chest X-ray classification where, using non-lung patches as the nuisance, NURD produces models that predict pneumonia under strong spurious correlations.  ( 2 min )
    Blessing of Class Diversity in Pre-training. (arXiv:2209.03447v3 [cs.LG] UPDATED)
    This paper presents a new statistical analysis aiming to explain the recent superior achievements of the pre-training techniques in natural language processing (NLP). We prove that when the classes of the pre-training task (e.g., different words in the masked language model task) are sufficiently diverse, in the sense that the least singular value of the last linear layer in pre-training (denoted as $\tilde{\nu}$) is large, then pre-training can significantly improve the sample efficiency of downstream tasks. Specially, we show the transfer learning excess risk enjoys an $O\left(\frac{1}{\tilde{\nu} \sqrt{n}}\right)$ rate, in contrast to the $O\left(\frac{1}{\sqrt{m}}\right)$ rate in the standard supervised learning. Here, $n$ is the number of pre-training data and $m$ is the number of data in the downstream task, and typically $n \gg m$. Our proof relies on a vector-form Rademacher complexity chain rule for disassembling composite function classes and a modified self-concordance condition. These techniques can be of independent interest.  ( 2 min )
    Plasticity Neural Network Based on Astrocytic effects at Critical Period, Synaptic Competition and Strength Rebalance by Current and Mnemonic Brain Plasticity and Synapse Formation. (arXiv:2203.11740v9 [cs.NE] UPDATED)
    In addition to the weights of synaptic shared connections, PNN includes weights of synaptic effective ranges [14-24]. PNN considers synaptic strength balance in dynamic of phagocytosing of synapses and static of constant sum of synapses length [14], and includes the lead behavior of the school of fish. Synapse formation will inhibit dendrites generation to a certain extent in experiments and PNN simulations [15]. The memory persistence gradient of retrograde circuit similar to the Enforcing Resilience in a Spring Boot. The relatively good and inferior gradient information stored in memory engram cells in synapse formation of retrograde circuit like the folds of the brain [16]. The controversy was claimed if human hippocampal neurogenesis persists throughout aging, PNN considered it may have a new and longer circuit in late iteration [17,18]. Closing the critical period will cause neurological disorder in experiments and PNN simulations [19]. Considering both negative and positive memories persistence help activate synapse length changes with iterations better than only considering positive memory [20]. Astrocytic phagocytosis will avoid the local accumulation of synapses by simulation, Lack of astrocytic phagocytosis causes excitatory synapses and functionally impaired synapses accumulate in experiments and lead to destruction of cognition, but local longer synapses and worse results in PNN simulations [21]. It gives relationship of intelligence and cortical thickness, individual differences in brain [22]. The PNN also considered the memory engram cells that strengthened Synaptic strength [23]. The effects of PNN's memory structure and tPBM may be the same for powerful penetrability of signals [24]. Memory persistence also inhibit local synaptic accumulation. By PNN, it may introduce the relatively good and inferior solution in PSO. The simple PNN only has the synaptic phagocytosis.  ( 3 min )
    Fixed points of nonnegative neural networks. (arXiv:2106.16239v6 [stat.ML] UPDATED)
    We consider the existence of fixed points of nonnegative neural networks, i.e., neural networks that take as an input and produce as an output nonnegative vectors. We first show that nonnegative neural networks with nonnegative weights and biases can be recognized as monotonic and (weakly) scalable functions within the framework of nonlinear Perron-Frobenius theory. This fact enables us to provide conditions for the existence of fixed points of nonnegative neural networks, and these conditions are weaker than those obtained recently using arguments in convex analysis. Furthermore, we prove that the shape of the fixed point set of nonnegative neural networks with nonnegative weights and biases is an interval, which under mild conditions degenerates to a point. These results are then used to obtain the existence of fixed points of more general types of nonnegative neural networks. The results of this paper contribute to the understanding of the behavior of autoencoders, and they provide insight into neural networks designed using the loop-unrolling technique, which can be seen as a fixed point searching algorithm. The chief theoretical results of this paper are verified in numerical simulations.  ( 2 min )
    Generalization Ability of Wide Neural Networks on $\mathbb{R}$. (arXiv:2302.05933v1 [stat.ML])
    We perform a study on the generalization ability of the wide two-layer ReLU neural network on $\mathbb{R}$. We first establish some spectral properties of the neural tangent kernel (NTK): $a)$ $K_{d}$, the NTK defined on $\mathbb{R}^{d}$, is positive definite; $b)$ $\lambda_{i}(K_{1})$, the $i$-th largest eigenvalue of $K_{1}$, is proportional to $i^{-2}$. We then show that: $i)$ when the width $m\rightarrow\infty$, the neural network kernel (NNK) uniformly converges to the NTK; $ii)$ the minimax rate of regression over the RKHS associated to $K_{1}$ is $n^{-2/3}$; $iii)$ if one adopts the early stopping strategy in training a wide neural network, the resulting neural network achieves the minimax rate; $iv)$ if one trains the neural network till it overfits the data, the resulting neural network can not generalize well. Finally, we provide an explanation to reconcile our theory and the widely observed ``benign overfitting phenomenon''.  ( 2 min )
    On Proper Learnability between Average- and Worst-case Robustness. (arXiv:2211.05656v4 [cs.LG] UPDATED)
    Recently, \cite{montasser2019vc} showed that finite VC dimension is not sufficient for \textit{proper} adversarially robust PAC learning. In light of this hardness result, there is a growing effort to study what type of relaxations to the adversarially robust PAC learning setup can enable proper learnability. In this work, we initiate the study of proper learning under relaxations of the worst-case robust loss. We give a family of robust loss relaxations under which VC classes are properly PAC learning with sample complexity close to what one would require in the standard PAC learning setup. On the other hand, we show that for an existing and natural relaxation of the worst-case robust loss, finite VC dimension is not sufficient for proper learning. Lastly, we give new generalization guarantees for the adversarially robust empirical risk minimizer.  ( 2 min )
    Calibrated Forecasts: The Minimax Proof. (arXiv:2209.05863v2 [econ.TH] UPDATED)
    A formal write-up of the simple proof (1995) of the existence of calibrated forecasts by the minimax theorem, which moreover shows that $N^3$ periods suffice to guarantee a calibration error of at most $1/N$.  ( 2 min )
    LIMEtree: Consistent and Faithful Surrogate Explanations of Multiple Classes. (arXiv:2005.01427v2 [cs.LG] UPDATED)
    Explainable machine learning provides tools to better understand predictive models and their decisions, but many such methods are limited to producing insights with respect to a single class. When generating explanations for several classes, reasoning over them to obtain a complete view may be difficult since they can present competing or contradictory evidence. To address this issue we introduce a novel paradigm of multi-class explanations. We outline the theory behind such techniques and propose a local surrogate model based on multi-output regression trees -- called LIMEtree -- which offers faithful and consistent explanations of multiple classes for individual predictions while being post-hoc, model-agnostic and data-universal. In addition to strong fidelity guarantees, our implementation supports (interactive) customisation of the explanatory insights and delivers a range of diverse explanation types, including counterfactual statements favoured in the literature. We evaluate our algorithm with a collection of quantitative experiments, a qualitative analysis based on explainability desiderata and a preliminary user study on an image classification task, comparing it to LIME. Our contributions demonstrate the benefits of multi-class explanations and wide-ranging advantages of our method across a diverse set scenarios.  ( 2 min )
    GFlowNet-EM for learning compositional latent variable models. (arXiv:2302.06576v1 [cs.LG])
    Latent variable models (LVMs) with discrete compositional latents are an important but challenging setting due to a combinatorially large number of possible configurations of the latents. A key tradeoff in modeling the posteriors over latents is between expressivity and tractable optimization. For algorithms based on expectation-maximization (EM), the E-step is often intractable without restrictive approximations to the posterior. We propose the use of GFlowNets, algorithms for sampling from an unnormalized density by learning a stochastic policy for sequential construction of samples, for this intractable E-step. By training GFlowNets to sample from the posterior over latents, we take advantage of their strengths as amortized variational inference algorithms for complex distributions over discrete structures. Our approach, GFlowNet-EM, enables the training of expressive LVMs with discrete compositional latents, as shown by experiments on non-context-free grammar induction and on images using discrete variational autoencoders (VAEs) without conditional independence enforced in the encoder.  ( 2 min )
    Universal Online Optimization in Dynamic Environments via Uniclass Prediction. (arXiv:2302.06066v1 [cs.LG])
    Recently, several universal methods have been proposed for online convex optimization which can handle convex, strongly convex and exponentially concave cost functions simultaneously. However, most of these algorithms have been designed with static regret minimization in mind, but this notion of regret may not be suitable for changing environments. To address this shortcoming, we propose a novel and intuitive framework for universal online optimization in dynamic environments. Unlike existing universal algorithms, our strategy does not rely on the construction of a set of experts and an accompanying meta-algorithm. Instead, we show that the problem of dynamic online optimization can be reduced to a uniclass prediction problem. By leaving the choice of uniclass loss function in the user's hands, they are able to control and optimize dynamic regret bounds, which in turn carry over into the original problem. To the best of our knowledge, this is the first paper proposing a universal approach with state-of-the-art dynamic regret guarantees even for general convex cost functions.  ( 2 min )
    A Simple Zero-shot Prompt Weighting Technique to Improve Prompt Ensembling in Text-Image Models. (arXiv:2302.06235v1 [cs.LG])
    Contrastively trained text-image models have the remarkable ability to perform zero-shot classification, that is, classifying previously unseen images into categories that the model has never been explicitly trained to identify. However, these zero-shot classifiers need prompt engineering to achieve high accuracy. Prompt engineering typically requires hand-crafting a set of prompts for individual downstream tasks. In this work, we aim to automate this prompt engineering and improve zero-shot accuracy through prompt ensembling. In particular, we ask "Given a large pool of prompts, can we automatically score the prompts and ensemble those that are most suitable for a particular downstream dataset, without needing access to labeled validation data?". We demonstrate that this is possible. In doing so, we identify several pathologies in a naive prompt scoring method where the score can be easily overconfident due to biases in pre-training and test data, and we propose a novel prompt scoring method that corrects for the biases. Using our proposed scoring method to create a weighted average prompt ensemble, our method outperforms equal average ensemble, as well as hand-crafted prompts, on ImageNet, 4 of its variants, and 11 fine-grained classification benchmarks, all while being fully automatic, optimization-free, and not requiring access to labeled validation data.  ( 2 min )
    Algorithmic Stability of Heavy-Tailed Stochastic Gradient Descent on Least Squares. (arXiv:2206.01274v3 [stat.ML] UPDATED)
    Recent studies have shown that heavy tails can emerge in stochastic optimization and that the heaviness of the tails have links to the generalization error. While these studies have shed light on interesting aspects of the generalization behavior in modern settings, they relied on strong topological and statistical regularity assumptions, which are hard to verify in practice. Furthermore, it has been empirically illustrated that the relation between heavy tails and generalization might not always be monotonic in practice, contrary to the conclusions of existing theory. In this study, we establish novel links between the tail behavior and generalization properties of stochastic gradient descent (SGD), through the lens of algorithmic stability. We consider a quadratic optimization problem and use a heavy-tailed stochastic differential equation (and its Euler discretization) as a proxy for modeling the heavy-tailed behavior emerging in SGD. We then prove uniform stability bounds, which reveal the following outcomes: (i) Without making any exotic assumptions, we show that SGD will not be stable if the stability is measured with the squared-loss $x\mapsto x^2$, whereas it in turn becomes stable if the stability is instead measured with a surrogate loss $x\mapsto |x|^p$ with some $p<2$. (ii) Depending on the variance of the data, there exists a \emph{`threshold of heavy-tailedness'} such that the generalization error decreases as the tails become heavier, as long as the tails are lighter than this threshold. This suggests that the relation between heavy tails and generalization is not globally monotonic. (iii) We prove matching lower-bounds on uniform stability, implying that our bounds are tight in terms of the heaviness of the tails. We support our theory with synthetic and real neural network experiments.  ( 2 min )
    Recursive Estimation of Conditional Kernel Mean Embeddings. (arXiv:2302.05955v1 [stat.ML])
    Kernel mean embeddings, a widely used technique in machine learning, map probability distributions to elements of a reproducing kernel Hilbert space (RKHS). For supervised learning problems, where input-output pairs are observed, the conditional distribution of outputs given the inputs is a key object. The input dependent conditional distribution of an output can be encoded with an RKHS valued function, the conditional kernel mean map. In this paper we present a new recursive algorithm to estimate the conditional kernel mean map in a Hilbert space valued $L_2$ space, that is in a Bochner space. We prove the weak and strong $L_2$ consistency of our recursive estimator under mild conditions. The idea is to generalize Stone's theorem for Hilbert space valued regression in a locally compact Polish space. We present new insights about conditional kernel mean embeddings and give strong asymptotic bounds regarding the convergence of the proposed recursive method. Finally, the results are demonstrated on three application domains: for inputs coming from Euclidean spaces, Riemannian manifolds and locally compact subsets of function spaces.  ( 2 min )
    Variational Bayesian Neural Networks via Resolution of Singularities. (arXiv:2302.06035v1 [stat.ML])
    In this work, we advocate for the importance of singular learning theory (SLT) as it pertains to the theory and practice of variational inference in Bayesian neural networks (BNNs). To begin, using SLT, we lay to rest some of the confusion surrounding discrepancies between downstream predictive performance measured via e.g., the test log predictive density, and the variational objective. Next, we use the SLT-corrected asymptotic form for singular posterior distributions to inform the design of the variational family itself. Specifically, we build upon the idealized variational family introduced in \citet{bhattacharya_evidence_2020} which is theoretically appealing but practically intractable. Our proposal takes shape as a normalizing flow where the base distribution is a carefully-initialized generalized gamma. We conduct experiments comparing this to the canonical Gaussian base distribution and show improvements in terms of variational free energy and variational generalization error.  ( 2 min )
    Precise Asymptotic Analysis of Deep Random Feature Models. (arXiv:2302.06210v1 [stat.ML])
    We provide exact asymptotic expressions for the performance of regression by an $L-$layer deep random feature (RF) model, where the input is mapped through multiple random embedding and non-linear activation functions. For this purpose, we establish two key steps: First, we prove a novel universality result for RF models and deterministic data, by which we demonstrate that a deep random feature model is equivalent to a deep linear Gaussian model that matches it in the first and second moments, at each layer. Second, we make use of the convex Gaussian Min-Max theorem multiple times to obtain the exact behavior of deep RF models. We further characterize the variation of the eigendistribution in different layers of the equivalent Gaussian model, demonstrating that depth has a tangible effect on model performance despite the fact that only the last layer of the model is being trained.  ( 2 min )
    Deep Reinforcement Learning for Unmanned Aerial Vehicle-Assisted Vehicular Networks. (arXiv:1906.05015v11 [cs.LG] UPDATED)
    Unmanned aerial vehicles (UAVs) are envisioned to complement the 5G communication infrastructure in future smart cities. Hot spots easily appear in road intersections, where effective communication among vehicles is challenging. UAVs may serve as relays with the advantages of low price, easy deployment, line-of-sight links, and flexible mobility. In this paper, we study a UAV-assisted vehicular network where the UAV jointly adjusts its transmission control (power and channel) and 3D flight to maximize the total throughput. First, we formulate a Markov decision process (MDP) problem by modeling the mobility of the UAV/vehicles and the state transitions. Secondly, we solve the target problem using a deep reinforcement learning method, namely, the deep deterministic policy gradient (DDPG), and propose three solutions with different control objectives. Deep reinforcement learning methods obtain the optimal policy through the interactions with the environment without knowing the environment variables. Considering that environment variables in our problem are unknown and unmeasurable, we choose a deep reinforcement learning method to solve it. Moreover, considering the energy consumption of 3D flight, we extend the proposed solutions to maximize the total throughput per unit energy. To encourage or discourage the UAV's mobility according to its prediction, the DDPG framework is modified, where the UAV adjusts its learning rate automatically. Thirdly, in a simplified model with small state space and action space, we verify the optimality of proposed algorithms. Comparing with two baseline schemes, we demonstrate the effectiveness of proposed algorithms in a realistic model.  ( 3 min )
    Wide stochastic networks: Gaussian limit and PAC-Bayesian training. (arXiv:2106.09798v3 [stat.ML] UPDATED)
    The limit of infinite width allows for substantial simplifications in the analytical study of over-parameterised neural networks. With a suitable random initialisation, an extremely large network exhibits an approximately Gaussian behaviour. In the present work, we establish a similar result for a simple stochastic architecture whose parameters are random variables, holding both before and during training. The explicit evaluation of the output distribution allows for a PAC-Bayesian training procedure that directly optimises the generalisation bound. For a large but finite-width network, we show empirically on MNIST that this training approach can outperform standard PAC-Bayesian methods.  ( 2 min )
    When Can We Track Significant Preference Shifts in Dueling Bandits?. (arXiv:2302.06595v1 [cs.LG])
    The $K$-armed dueling bandits problem, where the feedback is in the form of noisy pairwise preferences, has been widely studied due its applications in information retrieval, recommendation systems, etc. Motivated by concerns that user preferences/tastes can evolve over time, we consider the problem of dueling bandits with distribution shifts. Specifically, we study the recent notion of significant shifts (Suk and Kpotufe, 2022), and ask whether one can design an adaptive algorithm for the dueling problem with $O(\sqrt{K\tilde{L}T})$ dynamic regret, where $\tilde{L}$ is the (unknown) number of significant shifts in preferences. We show that the answer to this question depends on the properties of underlying preference distributions. Firstly, we give an impossibility result that rules out any algorithm with $O(\sqrt{K\tilde{L}T})$ dynamic regret under the well-studied Condorcet and SST classes of preference distributions. Secondly, we show that $\text{SST} \cap \text{STI}$ is the largest amongst popular classes of preference distributions where it is possible to design such an algorithm. Overall, our results provides an almost complete resolution of the above question for the hierarchy of distribution classes.  ( 2 min )
    The infinite Viterbi alignment and decay-convexity. (arXiv:1810.04115v5 [math.PR] UPDATED)
    The infinite Viterbi alignment is the limiting maximum a-posteriori estimate of the unobserved path in a hidden Markov model as the length of the time horizon grows. For models on state-space $\mathbb{R}^{d}$ satisfying a new ``decay-convexity'' condition, we develop an approach to existence of the infinite Viterbi alignment in an infinite dimensional Hilbert space. Quantitative bounds on the distance to the infinite Viterbi alignment, which are the first of their kind, are derived and used to illustrate how approximate estimation via parallelization can be accurate and scaleable to high-dimensional problems because the rate of convergence to the infinite Viterbi alignment does not necessarily depend on $d$. The results are applied to approximate estimation via parallelization and a model of neural population activity.  ( 2 min )
    Kernel Ridge Regression Inference. (arXiv:2302.06578v1 [math.ST])
    We provide uniform confidence bands for kernel ridge regression (KRR), with finite sample guarantees. KRR is ubiquitous, yet--to our knowledge--this paper supplies the first exact, uniform confidence bands for KRR in the non-parametric regime where the regularization parameter $\lambda$ converges to 0, for general data distributions. Our proposed uniform confidence band is based on a new, symmetrized multiplier bootstrap procedure with a closed form solution, which allows for valid uncertainty quantification without assumptions on the bias. To justify the procedure, we derive non-asymptotic, uniform Gaussian and bootstrap couplings for partial sums in a reproducing kernel Hilbert space (RKHS) with bounded kernel. Our results imply strong approximation for empirical processes indexed by the RKHS unit ball, with sharp, logarithmic dependence on the covering number.  ( 2 min )
    Understanding Multimodal Contrastive Learning and Incorporating Unpaired Data. (arXiv:2302.06232v1 [cs.LG])
    Language-supervised vision models have recently attracted great attention in computer vision. A common approach to build such models is to use contrastive learning on paired data across the two modalities, as exemplified by Contrastive Language-Image Pre-Training (CLIP). In this paper, under linear representation settings, (i) we initiate the investigation of a general class of nonlinear loss functions for multimodal contrastive learning (MMCL) including CLIP loss and show its connection to singular value decomposition (SVD). Namely, we show that each step of loss minimization by gradient descent can be seen as performing SVD on a contrastive cross-covariance matrix. Based on this insight, (ii) we analyze the performance of MMCL. We quantitatively show that the feature learning ability of MMCL can be better than that of unimodal contrastive learning applied to each modality even under the presence of wrongly matched pairs. This characterizes the robustness of MMCL to noisy data. Furthermore, when we have access to additional unpaired data, (iii) we propose a new MMCL loss that incorporates additional unpaired datasets. We show that the algorithm can detect the ground-truth pairs and improve performance by fully exploiting unpaired datasets. The performance of the proposed algorithm was verified by numerical experiments.  ( 2 min )
    Physics informed WNO. (arXiv:2302.05925v1 [stat.ML])
    Deep neural operators are recognized as an effective tool for learning solution operators of complex partial differential equations (PDEs). As compared to laborious analytical and computational tools, a single neural operator can predict solutions of PDEs for varying initial or boundary conditions and different inputs. A recently proposed Wavelet Neural Operator (WNO) is one such operator that harnesses the advantage of time-frequency localization of wavelets to capture the manifolds in the spatial domain effectively. While WNO has proven to be a promising method for operator learning, the data-hungry nature of the framework is a major shortcoming. In this work, we propose a physics-informed WNO for learning the solution operators of families of parametric PDEs without labeled training data. The efficacy of the framework is validated and illustrated with four nonlinear spatiotemporal systems relevant to various fields of engineering and science.  ( 2 min )
    Distribution-Free Model for Community Detection. (arXiv:2111.07495v4 [cs.SI] UPDATED)
    Community detection for unweighted networks has been widely studied in network analysis, but the case of weighted networks remains a challenge. This paper proposes a general Distribution-Free Model (DFM) for weighted networks in which nodes are partitioned into different communities. DFM can be seen as a generalization of the famous stochastic blockmodels from unweighted networks to weighted networks. DFM does not require prior knowledge of a specific distribution for elements of the adjacency matrix but only the expected value. In particular, signed networks with latent community structures can be modeled by DFM. We build a theoretical guarantee to show that a simple spectral clustering algorithm stably yields consistent community detection under DFM. We also propose a four-step data generation process to generate adjacency matrices with missing edges by combining DFM, noise matrix, and a model for unweighted networks. Using experiments with simulated and real datasets, we show that some benchmark algorithms can successfully recover community membership for weighted networks generated by the proposed data generation process.  ( 2 min )
    Incorporating Expert Opinion on Observable Quantities into Statistical Models -- A General Framework. (arXiv:2302.06391v1 [stat.ML])
    This article describes an approach to incorporate expert opinion on observable quantities through the use of a loss function which updates a prior belief as opposed to specifying parameters on the priors. Eliciting information on observable quantities allows experts to provide meaningful information on a quantity familiar to them, in contrast to elicitation on model parameters, which may be subject to interactions with other parameters or non-linear transformations before obtaining an observable quantity. The approach to incorporating expert opinion described in this paper is distinctive in that we do not specify a prior to match an expert's opinion on observed quantity, rather we obtain a posterior by updating the model parameters through a loss function. This loss function contains the observable quantity, expressed a function of the parameters, and is related to the expert's opinion which is typically operationalized as a statistical distribution. Parameters which generate observable quantities which are further from the expert's opinion incur a higher loss, allowing for the model parameters to be estimated based on their fidelity to both the data and expert opinion, with the relative strength determined by the number of observations and precision of the elicited belief. Including expert opinion in this fashion allows for a flexible specification of the opinion and in many situations is straightforward to implement with commonly used probabilistic programming software. We highlight this using three worked examples of varying model complexity including survival models, a multivariate normal distribution and a regression problem.  ( 2 min )
    Beyond Uniform Smoothness: A Stopped Analysis of Adaptive SGD. (arXiv:2302.06570v1 [stat.ML])
    This work considers the problem of finding a first-order stationary point of a non-convex function with potentially unbounded smoothness constant using a stochastic gradient oracle. We focus on the class of $(L_0,L_1)$-smooth functions proposed by Zhang et al. (ICLR'20). Empirical evidence suggests that these functions more closely captures practical machine learning problems as compared to the pervasive $L_0$-smoothness. This class is rich enough to include highly non-smooth functions, such as $\exp(L_1 x)$ which is $(0,\mathcal{O}(L_1))$-smooth. Despite the richness, an emerging line of works achieves the $\widetilde{\mathcal{O}}(\frac{1}{\sqrt{T}})$ rate of convergence when the noise of the stochastic gradients is deterministically and uniformly bounded. This noise restriction is not required in the $L_0$-smooth setting, and in many practical settings is either not satisfied, or results in weaker convergence rates with respect to the noise scaling of the convergence rate. We develop a technique that allows us to prove $\mathcal{O}(\frac{\mathrm{poly}\log(T)}{\sqrt{T}})$ convergence rates for $(L_0,L_1)$-smooth functions without assuming uniform bounds on the noise support. The key innovation behind our results is a carefully constructed stopping time $\tau$ which is simultaneously "large" on average, yet also allows us to treat the adaptive step sizes before $\tau$ as (roughly) independent of the gradients. For general $(L_0,L_1)$-smooth functions, our analysis requires the mild restriction that the multiplicative noise parameter $\sigma_1 1$.  ( 2 min )
    Density-Softmax: Scalable and Distance-Aware Uncertainty Estimation under Distribution Shifts. (arXiv:2302.06495v1 [cs.LG])
    Prevalent deep learning models suffer from significant over-confidence under distribution shifts. In this paper, we propose Density-Softmax, a single deterministic approach for uncertainty estimation via a combination of density function with the softmax layer. By using the latent representation's likelihood value, our approach produces more uncertain predictions when test samples are distant from the training samples. Theoretically, we prove that Density-Softmax is distance aware, which means its associated uncertainty metrics are monotonic functions of distance metrics. This has been shown to be a necessary condition for a neural network to produce high-quality uncertainty estimation. Empirically, our method enjoys similar computational efficiency as standard softmax on shifted CIFAR-10, CIFAR-100, and ImageNet dataset across modern deep learning architectures. Notably, Density-Softmax uses 4 times fewer parameters than Deep Ensembles and 6 times lower latency than Rank-1 Bayesian Neural Network, while obtaining competitive predictive performance and lower calibration errors under distribution shifts.  ( 2 min )
    One-Shot Federated Conformal Prediction. (arXiv:2302.06322v1 [stat.ML])
    In this paper, we introduce a conformal prediction method to construct prediction sets in a oneshot federated learning setting. More specifically, we define a quantile-of-quantiles estimator and prove that for any distribution, it is possible to output prediction sets with desired coverage in only one round of communication. To mitigate privacy issues, we also describe a locally differentially private version of our estimator. Finally, over a wide range of experiments, we show that our method returns prediction sets with coverage and length very similar to those obtained in a centralized setting. Overall, these results demonstrate that our method is particularly well-suited to perform conformal predictions in a one-shot federated learning setting.  ( 2 min )
    Mean Field Optimization Problem Regularized by Fisher Information. (arXiv:2302.05938v1 [math.PR])
    Recently there is a rising interest in the research of mean field optimization, in particular because of its role in analyzing the training of neural networks. In this paper by adding the Fisher Information as the regularizer, we relate the regularized mean field optimization problem to a so-called mean field Schrodinger dynamics. We develop an energy-dissipation method to show that the marginal distributions of the mean field Schrodinger dynamics converge exponentially quickly towards the unique minimizer of the regularized optimization problem. Remarkably, the mean field Schrodinger dynamics is proved to be a gradient flow on the probability measure space with respect to the relative entropy. Finally we propose a Monte Carlo method to sample the marginal distributions of the mean field Schrodinger dynamics.  ( 2 min )
    Isotopic envelope identification by analysis of the spatial distribution of components in MALDI-MSI data. (arXiv:2302.06051v1 [stat.ML])
    One of the significant steps in the process leading to the identification of proteins is mass spectrometry, which allows for obtaining information about the structure of proteins. Removing isotope peaks from the mass spectrum is vital and it is done in a process called deisotoping. There are different algorithms for deisotoping, but they have their limitations, they are dedicated to different methods of mass spectrometry. Data from experiments performed with the MALDI-ToF technique are characterized by high dimensionality. This paper presents a method for identifying isotope envelopes in MALDI-ToF molecular imaging data based on the Mamdani-Assilan fuzzy system and spatial maps of the molecular distribution of peaks included in the isotopic envelope. Several image texture measures were used to evaluate spatial molecular distribution maps. The algorithm was tested on eight datasets obtained from the MALDI-ToF experiment on samples from the National Institute of Oncology in Gliwice from patients with cancer of the head and neck region. The data were subjected to pre-processing and feature extraction. The results were collected and compared with three existing deisotoping algorithms. The analysis of the obtained results showed that the method for identifying isotopic envelopes proposed in this paper enables the detection of overlapping envelopes by using the approach oriented to study peak pairs. Moreover, the proposed algorithm enables the analysis of large data sets.  ( 2 min )
    Dimension Reduction and MARS. (arXiv:2302.05790v1 [stat.ME])
    The multivariate adaptive regression spline (MARS) is one of the popular estimation methods for nonparametric multivariate regressions. However, as MARS is based on marginal splines, to incorporate interactions of covariates, products of the marginal splines must be used, which leads to an unmanageable number of basis functions when the order of interaction is high and results in low estimation efficiency. In this paper, we improve the performance of MARS by using linear combinations of the covariates which achieve sufficient dimension reduction. The special basis functions of MARS facilitate calculation of gradients of the regression function, and estimation of the linear combinations is obtained via eigen-analysis of the outer-product of the gradients. Under some technical conditions, the asymptotic theory is established for the proposed estimation method. Numerical studies including both simulation and empirical applications show its effectiveness in dimension reduction and improvement over MARS and other commonly-used nonparametric methods in regression estimation and prediction.  ( 2 min )
    Optimizing Orthogonalized Tensor Deflation via Random Tensor Theory. (arXiv:2302.05798v1 [stat.ML])
    This paper tackles the problem of recovering a low-rank signal tensor with possibly correlated components from a random noisy tensor, or so-called spiked tensor model. When the underlying components are orthogonal, they can be recovered efficiently using tensor deflation which consists of successive rank-one approximations, while non-orthogonal components may alter the tensor deflation mechanism, thereby preventing efficient recovery. Relying on recently developed random tensor tools, this paper deals precisely with the non-orthogonal case by deriving an asymptotic analysis of a parameterized deflation procedure performed on an order-three and rank-two spiked tensor. Based on this analysis, an efficient tensor deflation algorithm is proposed by optimizing the parameter introduced in the deflation mechanism, which in turn is proven to be optimal by construction for the studied tensor model. The same ideas could be extended to more general low-rank tensor models, e.g., higher ranks and orders, leading to more efficient tensor methods with a broader impact on machine learning and beyond.  ( 2 min )
    Windowed Fourier Analysis for Signal Processing on Graph Bundles. (arXiv:2302.05592v1 [eess.SP])
    We consider the task of representing signals supported on graph bundles, which are generalizations of product graphs that allow for "twists" in the product structure. Leveraging the localized product structure of a graph bundle, we demonstrate how a suitable partition of unity over the base graph can be used to lift the signal on the graph into a space where a product factorization can be readily applied. Motivated by the locality of this procedure, we demonstrate that bases for the signal spaces of the components of the graph bundle can be lifted in the same way, yielding a basis for the signal space of the total graph. We demonstrate this construction on synthetic graphs, as well as with an analysis of the energy landscape of conformational manifolds in stereochemistry.  ( 2 min )
    Communication and Storage Efficient Federated Split Learning. (arXiv:2302.05599v1 [cs.IT])
    Federated learning (FL) is a popular distributed machine learning (ML) paradigm, but is often limited by significant communication costs and edge device computation capabilities. Federated Split Learning (FSL) preserves the parallel model training principle of FL, with a reduced device computation requirement thanks to splitting the ML model between the server and clients. However, FSL still incurs very high communication overhead due to transmitting the smashed data and gradients between the clients and the server in each global round. Furthermore, the server has to maintain separate models for every client, resulting in a significant computation and storage requirement that grows linearly with the number of clients. This paper tries to solve these two issues by proposing a communication and storage efficient federated and split learning (CSE-FSL) strategy, which utilizes an auxiliary network to locally update the client models while keeping only a single model at the server, hence avoiding the communication of gradients from the server and greatly reducing the server resource requirement. Communication cost is further reduced by only sending the smashed data in selected epochs from the clients. We provide a rigorous theoretical analysis of CSE-FSL that guarantees its convergence for non-convex loss functions. Extensive experimental results demonstrate that CSE-FSL has a significant communication reduction over existing FSL techniques while achieving state-of-the-art convergence and model accuracy, using several real-world FL tasks.  ( 2 min )
    Cyclic and Randomized Stepsizes Invoke Heavier Tails in SGD. (arXiv:2302.05516v1 [stat.ML])
    Cyclic and randomized stepsizes are widely used in the deep learning practice and can often outperform standard stepsize choices such as constant stepsize in SGD. Despite their empirical success, not much is currently known about when and why they can theoretically improve the generalization performance. We consider a general class of Markovian stepsizes for learning, which contain i.i.d. random stepsize, cyclic stepsize as well as the constant stepsize as special cases, and motivated by the literature which shows that heaviness of the tails (measured by the so-called "tail-index") in the SGD iterates is correlated with generalization, we study tail-index and provide a number of theoretical results that demonstrate how the tail-index varies on the stepsize scheduling. Our results bring a new understanding of the benefits of cyclic and randomized stepsizes compared to constant stepsize in terms of the tail behavior. We illustrate our theory on linear regression experiments and show through deep learning experiments that Markovian stepsizes can achieve even a heavier tail and be a viable alternative to cyclic and i.i.d. randomized stepsize rules.  ( 2 min )
    Pushing the Accuracy-Group Robustness Frontier with Introspective Self-play. (arXiv:2302.05807v1 [cs.LG])
    Standard empirical risk minimization (ERM) training can produce deep neural network (DNN) models that are accurate on average but under-perform in under-represented population subgroups, especially when there are imbalanced group distributions in the long-tailed training data. Therefore, approaches that improve the accuracy-group robustness trade-off frontier of a DNN model (i.e. improving worst-group accuracy without sacrificing average accuracy, or vice versa) is of crucial importance. Uncertainty-based active learning (AL) can potentially improve the frontier by preferentially sampling underrepresented subgroups to create a more balanced training dataset. However, the quality of uncertainty estimates from modern DNNs tend to degrade in the presence of spurious correlations and dataset bias, compromising the effectiveness of AL for sampling tail groups. In this work, we propose Introspective Self-play (ISP), a simple approach to improve the uncertainty estimation of a deep neural network under dataset bias, by adding an auxiliary introspection task requiring a model to predict the bias for each data point in addition to the label. We show that ISP provably improves the bias-awareness of the model representation and the resulting uncertainty estimates. On two real-world tabular and language tasks, ISP serves as a simple "plug-in" for AL model training, consistently improving both the tail-group sampling rate and the final accuracy-fairness trade-off frontier of popular AL methods.  ( 2 min )
    From high-dimensional & mean-field dynamics to dimensionless ODEs: A unifying approach to SGD in two-layers networks. (arXiv:2302.05882v1 [stat.ML])
    This manuscript investigates the one-pass stochastic gradient descent (SGD) dynamics of a two-layer neural network trained on Gaussian data and labels generated by a similar, though not necessarily identical, target function. We rigorously analyse the limiting dynamics via a deterministic and low-dimensional description in terms of the sufficient statistics for the population risk. Our unifying analysis bridges different regimes of interest, such as the classical gradient-flow regime of vanishing learning rate, the high-dimensional regime of large input dimension, and the overparameterised "mean-field" regime of large network width, covering as well the intermediate regimes where the limiting dynamics is determined by the interplay between these behaviours. In particular, in the high-dimensional limit, the infinite-width dynamics is found to remain close to a low-dimensional subspace spanned by the target principal directions. Our results therefore provide a unifying picture of the limiting SGD dynamics with synthetic data.  ( 2 min )
    Sequential Underspecified Instrument Selection for Cause-Effect Estimation. (arXiv:2302.05684v1 [stat.ME])
    Instrumental variable (IV) methods are used to estimate causal effects in settings with unobserved confounding, where we cannot directly experiment on the treatment variable. Instruments are variables which only affect the outcome indirectly via the treatment variable(s). Most IV applications focus on low-dimensional treatments and crucially require at least as many instruments as treatments. This assumption is restrictive: in the natural sciences we often seek to infer causal effects of high-dimensional treatments (e.g., the effect of gene expressions or microbiota on health and disease), but can only run few experiments with a limited number of instruments (e.g., drugs or antibiotics). In such underspecified problems, the full treatment effect is not identifiable in a single experiment even in the linear case. We show that one can still reliably recover the projection of the treatment effect onto the instrumented subspace and develop techniques to consistently combine such partial estimates from different sets of instruments. We then leverage our combined estimators in an algorithm that iteratively proposes the most informative instruments at each round of experimentation to maximize the overall information about the full causal effect.  ( 2 min )
    Achieving acceleration despite very noisy gradients. (arXiv:2302.05515v1 [stat.ML])
    We present a novel momentum-based first order optimization method (AGNES) which provably achieves acceleration for convex minimization, even if the stochastic noise in the gradient estimates is many orders of magnitude larger than the gradient itself. Here we model the noise as having a variance which is proportional to the magnitude of the underlying gradient. We argue, based upon empirical evidence, that this is appropriate for mini-batch gradients in overparameterized deep learning. Furthermore, we demonstrate that the method achieves competitive performance in the training of CNNs on MNIST and CIFAR-10.  ( 2 min )
    Tighter PAC-Bayes Bounds Through Coin-Betting. (arXiv:2302.05829v1 [cs.LG])
    We consider the problem of estimating the mean of a sequence of random elements $f(X_1, \theta)$ $, \ldots, $ $f(X_n, \theta)$ where $f$ is a fixed scalar function, $S=(X_1, \ldots, X_n)$ are independent random variables, and $\theta$ is a possibly $S$-dependent parameter. An example of such a problem would be to estimate the generalization error of a neural network trained on $n$ examples where $f$ is a loss function. Classically, this problem is approached through concentration inequalities holding uniformly over compact parameter sets of functions $f$, for example as in Rademacher or VC type analysis. However, in many problems, such inequalities often yield numerically vacuous estimates. Recently, the \emph{PAC-Bayes} framework has been proposed as a better alternative for this class of problems for its ability to often give numerically non-vacuous bounds. In this paper, we show that we can do even better: we show how to refine the proof strategy of the PAC-Bayes bounds and achieve \emph{even tighter} guarantees. Our approach is based on the \emph{coin-betting} framework that derives the numerically tightest known time-uniform concentration inequalities from the regret guarantees of online gambling algorithms. In particular, we derive the first PAC-Bayes concentration inequality based on the coin-betting approach that holds simultaneously for all sample sizes. We demonstrate its tightness showing that by \emph{relaxing} it we obtain a number of previous results in a closed form including Bernoulli-KL and empirical Bernstein inequalities. Finally, we propose an efficient algorithm to numerically calculate confidence sequences from our bound, which often generates nonvacuous confidence bounds even with one sample, unlike the state-of-the-art PAC-Bayes bounds.  ( 2 min )
    Graph Neural Network-Inspired Kernels for Gaussian Processes in Semi-Supervised Learning. (arXiv:2302.05828v1 [cs.LG])
    Gaussian processes (GPs) are an attractive class of machine learning models because of their simplicity and flexibility as building blocks of more complex Bayesian models. Meanwhile, graph neural networks (GNNs) emerged recently as a promising class of models for graph-structured data in semi-supervised learning and beyond. Their competitive performance is often attributed to a proper capturing of the graph inductive bias. In this work, we introduce this inductive bias into GPs to improve their predictive performance for graph-structured data. We show that a prominent example of GNNs, the graph convolutional network, is equivalent to some GP when its layers are infinitely wide; and we analyze the kernel universality and the limiting behavior in depth. We further present a programmable procedure to compose covariance kernels inspired by this equivalence and derive example kernels corresponding to several interesting members of the GNN family. We also propose a computationally efficient approximation of the covariance matrix for scalable posterior inference with large-scale data. We demonstrate that these graph-based kernels lead to competitive classification and regression performance, as well as advantages in computation time, compared with the respective GNNs.  ( 2 min )
    An unsupervised learning approach for predicting wind farm power and downstream wakes using weather patterns. (arXiv:2302.05886v1 [stat.ML])
    Wind energy resource assessment typically requires numerical models, but such models are too computationally intensive to consider multi-year timescales. Increasingly, unsupervised machine learning techniques are used to identify a small number of representative weather patterns to simulate long-term behaviour. Here we develop a novel wind energy workflow that for the first time combines weather patterns derived from unsupervised clustering techniques with numerical weather prediction models (here WRF) to obtain efficient and accurate long-term predictions of power and downstream wakes from an entire wind farm. We use ERA5 reanalysis data clustering not only on low altitude pressure but also, for the first time, on the more relevant variable of wind velocity. We also compare the use of large-scale and local-scale domains for clustering. A WRF simulation is run at each of the cluster centres and the results are aggregated using a novel post-processing technique. By applying our workflow to two different regions, we show that our long-term predictions agree with those from a year of WRF simulations but require less than 2% of the computational time. The most accurate results are obtained when clustering on wind velocity. Moreover, clustering over the Europe-wide domain is sufficient for predicting wind farm power output, but downstream wake predictions benefit from the use of smaller domains. Finally, we show that these downstream wakes can affect the local weather patterns. Our approach facilitates multi-year predictions of power output and downstream farm wakes, by providing a fast, accurate and flexible methodology that is applicable to any global region. Moreover, these accurate long-term predictions of downstream wakes provide the first tool to help mitigate the effects of wind energy loss downstream of wind farms, since they can be used to determine optimum wind farm locations.  ( 3 min )
    I$^2$SB: Image-to-Image Schr\"odinger Bridge. (arXiv:2302.05872v1 [cs.CV])
    We propose Image-to-Image Schr\"odinger Bridge (I$^2$SB), a new class of conditional diffusion models that directly learn the nonlinear diffusion processes between two given distributions. These diffusion bridges are particularly useful for image restoration, as the degraded images are structurally informative priors for reconstructing the clean images. I$^2$SB belongs to a tractable class of Schr\"odinger bridge, the nonlinear extension to score-based models, whose marginal distributions can be computed analytically given boundary pairs. This results in a simulation-free framework for nonlinear diffusions, where the I$^2$SB training becomes scalable by adopting practical techniques used in standard diffusion models. We validate I$^2$SB in solving various image restoration tasks, including inpainting, super-resolution, deblurring, and JPEG restoration on ImageNet 256x256 and show that I$^2$SB surpasses standard conditional diffusion models with more interpretable generative processes. Moreover, I$^2$SB matches the performance of inverse methods that additionally require the knowledge of the corruption operators. Our work opens up new algorithmic opportunities for developing efficient nonlinear diffusion models on a large scale. scale. Project page: https://i2sb.github.io/  ( 2 min )
    Confidence and Uncertainty Assessment for Distributional Random Forests. (arXiv:2302.05761v1 [math.ST])
    The Distributional Random Forest (DRF) is a recently introduced Random Forest algorithm to estimate multivariate conditional distributions. Due to its general estimation procedure, it can be employed to estimate a wide range of targets such as conditional average treatment effects, conditional quantiles, and conditional correlations. However, only results about the consistency and convergence rate of the DRF prediction are available so far. We characterize the asymptotic distribution of DRF and develop a bootstrap approximation of it. This allows us to derive inferential tools for quantifying standard errors and the construction of confidence regions that have asymptotic coverage guarantees. In simulation studies, we empirically validate the developed theory for inference of low-dimensional targets and for testing distributional differences between two populations.  ( 2 min )
    A High-dimensional Convergence Theorem for U-statistics with Applications to Kernel-based Testing. (arXiv:2302.05686v1 [math.ST])
    We prove a convergence theorem for U-statistics of degree two, where the data dimension $d$ is allowed to scale with sample size $n$. We find that the limiting distribution of a U-statistic undergoes a phase transition from the non-degenerate Gaussian limit to the degenerate limit, regardless of its degeneracy and depending only on a moment ratio. A surprising consequence is that a non-degenerate U-statistic in high dimensions can have a non-Gaussian limit with a larger variance and asymmetric distribution. Our bounds are valid for any finite $n$ and $d$, independent of individual eigenvalues of the underlying function, and dimension-independent under a mild assumption. As an application, we apply our theory to two popular kernel-based distribution tests, MMD and KSD, whose high-dimensional performance has been challenging to study. In a simple empirical setting, our results correctly predict how the test power at a fixed threshold scales with $d$ and the bandwidth.  ( 2 min )
    Robust Knowledge Transfer in Tiered Reinforcement Learning. (arXiv:2302.05534v1 [cs.LG])
    In this paper, we study the Tiered Reinforcement Learning setting, a parallel transfer learning framework, where the goal is to transfer knowledge from the low-tier (source) task to the high-tier (target) task to reduce the exploration risk of the latter while solving the two tasks in parallel. Unlike previous work, we do not assume the low-tier and high-tier tasks share the same dynamics or reward functions, and focus on robust knowledge transfer without prior knowledge on the task similarity. We identify a natural and necessary condition called the "Optimal Value Dominance" for our objective. Under this condition, we propose novel online learning algorithms such that, for the high-tier task, it can achieve constant regret on partial states depending on the task similarity and retain near-optimal regret when the two tasks are dissimilar, while for the low-tier task, it can keep near-optimal without making sacrifice. Moreover, we further study the setting with multiple low-tier tasks, and propose a novel transfer source selection mechanism, which can ensemble the information from all low-tier tasks and allow provable benefits on a much larger state-action space.  ( 2 min )
    Deep Neural Networks for Nonparametric Interaction Models with Diverging Dimension. (arXiv:2302.05851v1 [math.ST])
    Deep neural networks have achieved tremendous success due to their representation power and adaptation to low-dimensional structures. Their potential for estimating structured regression functions has been recently established in the literature. However, most of the studies require the input dimension to be fixed and consequently ignore the effect of dimension on the rate of convergence and hamper their applications to modern big data with high dimensionality. In this paper, we bridge this gap by analyzing a $k^{th}$ order nonparametric interaction model in both growing dimension scenarios ($d$ grows with $n$ but at a slower rate) and in high dimension ($d \gtrsim n$). In the latter case, sparsity assumptions and associated regularization are required in order to obtain optimal rates of convergence. A new challenge in diverging dimension setting is in calculation mean-square error, the covariance terms among estimated additive components are an order of magnitude larger than those of the variances and they can deteriorate statistical properties without proper care. We introduce a critical debiasing technique to amend the problem. We show that under certain standard assumptions, debiased deep neural networks achieve a minimax optimal rate both in terms of $(n, d)$. Our proof techniques rely crucially on a novel debiasing technique that makes the covariances of additive components negligible in the mean-square error calculation. In addition, we establish the matching lower bounds.  ( 2 min )
    Differentially Private Normalizing Flows for Density Estimation, Data Synthesis, and Variational Inference with Application to Electronic Health Records. (arXiv:2302.05787v1 [stat.ML])
    Electronic health records (EHR) often contain sensitive medical information about individual patients, posing significant limitations to sharing or releasing EHR data for downstream learning and inferential tasks. We use normalizing flows (NF), a family of deep generative models, to estimate the probability density of a dataset with differential privacy (DP) guarantees, from which privacy-preserving synthetic data are generated. We apply the technique to an EHR dataset containing patients with pulmonary hypertension. We assess the learning and inferential utility of the synthetic data by comparing the accuracy in the prediction of the hypertension status and variational posterior distribution of the parameters of a physics-based model. In addition, we use a simulated dataset from a nonlinear model to compare the results from variational inference (VI) based on privacy-preserving synthetic data, and privacy-preserving VI obtained from directly privatizing NFs for VI with DP guarantees given the original non-private dataset. The results suggest that synthetic data generated through differentially private density estimation with NF can yield good utility at a reasonable privacy cost. We also show that VI obtained from differentially private NF based on the free energy bound loss may produce variational approximations with significantly altered correlation structure, and loss formulations based on alternative dissimilarity metrics between two distributions might provide improved results.  ( 2 min )
    Global Convergence Rate of Deep Equilibrium Models with General Activations. (arXiv:2302.05797v1 [stat.ML])
    In a recent paper, Ling et al. investigated the over-parametrized Deep Equilibrium Model (DEQ) with ReLU activation and proved that the gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function. In this paper, we show that this fact still holds for DEQs with any general activation which has bounded first and second derivatives. Since the new activation function is generally non-linear, a general population Gram matrix is designed, and a new form of dual activation with Hermite polynomial expansion is developed.  ( 2 min )
    Koopman-Based Bound for Generalization: New Aspect of Neural Networks Regarding Nonlinear Noise Filtering. (arXiv:2302.05825v1 [cs.LG])
    We propose a new bound for generalization of neural networks using Koopman operators. Unlike most of the existing works, we focus on the role of the final nonlinear transformation of the networks. Our bound is described by the reciprocal of the determinant of the weight matrices and is tighter than existing norm-based bounds when the weight matrices do not have small singular values. According to existing theories about the low-rankness of the weight matrices, it may be counter-intuitive that we focus on the case where singular values of weight matrices are not small. However, motivated by the final nonlinear transformation, we can see that our result sheds light on a new perspective regarding a noise filtering property of neural networks. Since our bound comes from Koopman operators, this work also provides a connection between operator-theoretic analysis and generalization of neural networks. Numerical results support the validity of our theoretical results.  ( 2 min )
    Distributional GFlowNets with Quantile Flows. (arXiv:2302.05793v1 [cs.LG])
    Generative Flow Networks (GFlowNets) are a new family of probabilistic samplers where an agent learns a stochastic policy for generating complex combinatorial structure through a series of decision-making steps. Despite being inspired from reinforcement learning, the current GFlowNet framework is relatively limited in its applicability and cannot handle stochasticity in the reward function. In this work, we adopt a distributional paradigm for GFlowNets, turning each flow function into a distribution, thus providing more informative learning signals during training. By parameterizing each edge flow through their quantile functions, our proposed \textit{quantile matching} GFlowNet learning algorithm is able to learn a risk-sensitive policy, an essential component for handling scenarios with risk uncertainty. Moreover, we find that the distributional approach can achieve substantial improvement on existing benchmarks compared to prior methods due to our enhanced training algorithm, even in settings with deterministic rewards.  ( 2 min )
    Efficient Fraud Detection using Deep Boosting Decision Trees. (arXiv:2302.05918v1 [stat.ML])
    Fraud detection is to identify, monitor, and prevent potentially fraudulent activities from complex data. The recent development and success in AI, especially machine learning, provides a new data-driven way to deal with fraud. From a methodological point of view, machine learning based fraud detection can be divided into two categories, i.e., conventional methods (decision tree, boosting...) and deep learning, both of which have significant limitations in terms of the lack of representation learning ability for the former and interpretability for the latter. Furthermore, due to the rarity of detected fraud cases, the associated data is usually imbalanced, which seriously degrades the performance of classification algorithms. In this paper, we propose deep boosting decision trees (DBDT), a novel approach for fraud detection based on gradient boosting and neural networks. In order to combine the advantages of both conventional methods and deep learning, we first construct soft decision tree (SDT), a decision tree structured model with neural networks as its nodes, and then ensemble SDTs using the idea of gradient boosting. In this way we embed neural networks into gradient boosting to improve its representation learning capability and meanwhile maintain the interpretability. Furthermore, aiming at the rarity of detected fraud cases, in the model training phase we propose a compositional AUC maximization approach to deal with data imbalances at algorithm level. Extensive experiments on several real-life fraud detection datasets show that DBDT can significantly improve the performance and meanwhile maintain good interpretability. Our code is available at https://github.com/freshmanXB/DBDT.  ( 2 min )

  • Open

    [D] Noam Brown, FAIR: On achieving human-level performance in poker and Diplomacy, and the power of spending compute at inference time
    Here is a podcast episode with Noam Brown from Meta AI where we discuss his work on achieving human-level performance on poker and Diplomacy, as well as the power of spending compute at inference time! submitted by /u/thejashGI [link] [comments]  ( 42 min )
    [D] Retrieval transformers with learnable queries?
    Retrieval transformer models like RETRO seem to use frozen embeddings both for the documents in the database and the currently completed document ("the query"). Making the embeddings of documents in the database learnable would defeat the purpose, as retrieval transformers only make sense when the database is huge. It seems that the query embedding could be made learnable - the model could learn to extract more useful documents this way. Have you seen any research that does this? submitted by /u/zielmicha [link] [comments]  ( 42 min )
    [R] [N] REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers
    Paper: https://arxiv.org/abs/2302.02041 Generate synthetic data from single tabular data using GPT. It also works on relational datasets! No fine-tuning and works out-of-the-box. We also removed the guesswork on how long (epochs) the generative model for a single tabular data is trained. We propose the Qδ statistic and apply statistical bootstrapping to define a threshold to robustly detect overfitting. Perk: no need for a hold-out data! Data copying is also a problem in generative models. This means that training data may be learned and copied by the model during sampling. We attempt to mitigate data copying. We implement target masking to deliberately create missing values in each observation in the data. The mask is a special token that is ignored during sampling. This forces the model to probabilistically impute the token, adding uncertainty to the generated data. REaLTabFormer is open-sourced and available on PyPi → pip install realtabformer ​ https://preview.redd.it/vhf1st2g28ia1.png?width=1998&format=png&auto=webp&s=8cbc4f74bc04e2b6da16acf88f347fb1995bd5cf submitted by /u/avsolatorio [link] [comments]  ( 43 min )
    [D] Looking for advice on model architecture for embedding facial landmark coordinates into StyleGAN2 latentspace
    I am currently working on a project where I need to embed facial landmark coordinates into StyleGAN2 latentspace. The input data is structured as follows: [batch_size, num_landmarks=138, num_coordinates=3 (x,y,z)]. The output data is structured as: [batch_size, stylegan2_latent_space=512]. I have PyTorch experience and am experimenting with transformer like models for the embedding. However, I am unsure about the optimal architecture for this task, and I would appreciate any advice or recommendations on how to design a suitable model. Has anyone worked on a similar task before, or have any ideas about which architecture could work well for this problem? Any advice or resources would be greatly appreciated. Thank you! submitted by /u/willowill5 [link] [comments]  ( 43 min )
    [Discussion] Computing the derivative of a diffusion model with respect to the prompt
    Hi, I was wondering if anyone came across a paper that approximated the derivative of a diffusion model with respect to the conditioning that is fed into the cross-attention module. So let's say we have a text that is already transformed into a continuous embedding. Then this goes through the llm and is fed into the cross-attention module at every timestep. At the end of the diffusion process, we get some image/a latent representation of an image in the case of stable diffusion. We can then calculate a loss on that image and in theory calculate the gradient with respect to the continuous text embedding if we use a non-stochastic sampler like DDIM. The issue is the length of the graph calculating that derivative is super expensive. I was if anyone already solved this or has some good references. Thanks :) submitted by /u/arg_max [link] [comments]  ( 43 min )
    [D] self supervised learning for regression with tabular numerical data
    Hi all, Im trying to implement self supervised pretraining to tabular data regression problem, however since the literature is scarce i’m stuck in the augmentation stage. Im currently using sim siam self supervision with gaussian noising and input dropout. I tried shuffling to mimic CV approaches but it failed miserably. Any advice? submitted by /u/No-Front-4346 [link] [comments]  ( 42 min )
    [R] Scaling Vision Transformers to 22 Billion Parameters
    submitted by /u/nateharada [link] [comments]  ( 42 min )
    [N] Miniworld is now a mature project within the Farama Foundation
    Miniworld - a minimalistic 3D interior environment simulator for reinforcement learning & robotics research that allows environments to be easily edited - has now reached the mature inside Farama. You can check out the documentation at https://miniworld.farama.org, and the release notes for all the changes we’ve made to the project at https://github.com/Farama-Foundation/Miniworld/releases/tag/2.0.1. submitted by /u/jkterry1 [link] [comments]  ( 42 min )
    [D] Tensorflow struggles
    This may be a bit of a vent. I am currently working on a model with Tensorflow. To me it seems that whenever I am straying from a certain path my productivity starts dying at an alarming rate. For example I am currently implementing my own data augmentation (because I strayed from Tf in a minuscule way) and obscure errors are littering my path. Prior to that I made a mistake somewhere in my training loop and it took me forever to find. The list goes on. Every time I try using Tensorflow in a new way, it‘s like taming a new horse. Except that it‘s the same donkey I tamed last time. This is not my first project, but does it ever change? submitted by /u/H0lzm1ch3l [link] [comments]  ( 47 min )
    [R] Hitchhiker’s Guide to Super-Resolution: Introduction and Recent Advances
    I'm glad to share with you our Open Access survey paper about image super-resolution: https://ieeexplore.ieee.org/abstract/document/10041995 The goal of this work is to give an overview of the abundance of publications in image super-resolution, give an introduction for new researchers, and open thriving discussions as well as point to potential future directions to advance the field :) submitted by /u/Maleficent_Stay_7737 [link] [comments]  ( 43 min )
    [D] Repeating important samples in every batch for NN training?
    Wondering if there’s a term for this. I’m training NNs for a scenario that works best with a small batch size, there are therefore many batches. There are a couple particular samples that are VERY important. Let’s say 3 important samples out of thousands I train to. I found end application is best when I include these important samples, repeated, in every batch. This is opposed to simply giving the samples a large weight, because the large weight doesn’t matter after looping through many batches in an epoch. So the NN learns the other less important stuff while being forced to remain in good agreement with the important samples. Does this technique have a name? EDIT: In case anyone is curious, these are physics informed NNs and the important samples are equilibrium mechanical structures. The NN therefore learns what equilibrium is, with everything else being small deviations from equilibrium. submitted by /u/zxkj [link] [comments]  ( 44 min )
    [R] Imagenet 2015 VID Dataset
    Hi all, I saw a few posts already but just to make sure and keep this as an update, does anyone have the ImageNet 2015 VID dataset to share? All links are dead. I really need it now to train TransVOD. submitted by /u/Forsaken_Football227 [link] [comments]  ( 42 min )
    [D] Threshold for k-means anomaly detection
    I am using kmeans clustering algorithm for anomaly detection. After training kmeans, I'm calculation Euclidian distance of new data points to their nearest cluster. Please suggest me some strategies to set up a threshold such that point with distance greater than that threshold will be classified as anomaly. Or tell me if there are some other way to identify anomaly using k-means. submitted by /u/TKMater [link] [comments]  ( 43 min )
    [Discussion] The need for noise in stable diffusion
    As I'm learning about how stable diffusion works, I can't figure out why during image generation there's a need to deal with 'noise'. I know I'm glossing over a lot of details, but my understanding is that the algorithm is trained by gradually adding noise to an image and then de-noising it to recover the initial image. Wouldn't this be functionally equivalent to a machine that starts with an image, gradually reduces it to a blank canvas (all white), and then gradually reconstructs the original image? Then, post training, the generative process would just start with a blank canvas and gradually generate the image based on the input string provided. The idea of generating an image from a blank canvas feels more satisfying to me than revealing an image hidden by noise, but I'm sure there's a mathematical/technical reason why what I'm suggesting doesn't work. Appreciate any insight into this! submitted by /u/AdministrationOk2735 [link] [comments]  ( 44 min )
    [D] A Comprehensive Guide & Hand-Curated Resource List for Prompt Engineering and LLMs on Github
    Greetings, Excited to share with all those interested in Prompt Engineering and Large Language Models (LLMs)! We've hand-curated a comprehensive, Free & Open Source resource list on Github that includes everything related to Prompt Engineering, LLMs, and all related topics. We've covered most things, from papers and articles to tools and code! https://preview.redd.it/zzs09fg1l4ia1.png?width=1770&format=png&auto=webp&s=710cebff057e749b4c1d11a85711afef544c114f submitted by /u/aadityaura [link] [comments]  ( 43 min )
    [D] What are your tricks/infra working with embeddings?
    I'm trying to design my infra for creating, storing, and retrieving embeddings in my AI applications and was wondering what are the different paths for it. I'm especially interested in NLP, but vision/multimodal could be interesting too. Whether it's related to performance, scalability, or something else entirely, I'd love to hear your experiences and insights. Looking forward to your responses! submitted by /u/louis3195 [link] [comments]  ( 42 min )
    [P] Free GPT3-based tool to suggest terminal commands via natural language
    I whipped this up today. Credit to heyCLI for the idea, I've just remade an open source version. Basically in your terminal you type 'yo ' and then describe what you want a command to do. For instance: ➜ ~ yo enable a reverse tunnel through ssh Returns: Suggested command: ssh -R :localhost: @ Another example: ➜ ~ yo launch tensorboard with a custom log dir and port Suggested command: tensorboard --logdir= --port= It's free, MIT licence. You just need a free OpenAI API key which you can get by signing up on their website (I think if you use ChatGPT, you're already signed up). More info in the repo. Contributions/critiques welcome. submitted by /u/lfotofilter [link] [comments]  ( 43 min )
    [D] Anyone interested in training an AI for Tigris and Euphrates?
    Over the past weekend, I finally decided to put this idea to rest and made a Rust implementation of the greatest board game ever made - up there with Chess and Go: [Link to BGG](https://boardgamegeek.com/boardgame/42/tigris-euphrates The ultimate goal is to train an AI so it needs to be very fast with state updates. The game logic is quite sophisticated(~2000 lines) so it took me awhile to check all the edge cases of which there are many. Its search tree is huuuge with a branching factor of 100-300 which is more than Go's. It is also an imperfect game with hidden information(think poker). So ultimately it will need a reinforcement-based AI like [AlphaGo](https://arxiv.org/abs/2112.03178. In the repo I used a minimax-based AI(for testing purposes) to search 3 moves ahead which gives slightly better than random performance. The UI is implemented in [macroquad](https://macroquad.rs/examples/ which is hands down the simplest 2D game library I've used(ggez and a deprecated framework which I shall not name). And yes, please excuse the programmer art made by me :P Any way, here's the link to the repo if you are interested: repo ​ Note: it's hardcoded for 2 players but it can easily be made for 4. I want to train the AI for 2 players first. There are also 4 unimplemented rules: monuments, tile removal after war, must take corner treasures first, must take treasure after conflict. ​ https://preview.redd.it/jr49d9cx74ia1.png?width=1200&format=png&auto=webp&s=556edf1021b16878882dc8da0267327abb255e60 submitted by /u/0b01 [link] [comments]  ( 43 min )
    [R] Boosted Trees Literature
    Hi all. I’m trying to do a comprehensive study on the theory of gradient boosted trees (on the more recent algorithms xgboost, lightgbm etc). I was wondering what books you have read that contain substantial information on this topic. Any suggestions are appreciated! submitted by /u/ConfidenceFun5105 [link] [comments]  ( 42 min )
    [R] [P] LUCAS: LUng CAncer Screening dataset
    https://preview.redd.it/fo3y2s26q3ia1.png?width=1365&format=png&auto=webp&s=47653d3655c8888945f0e23e63ba88e521bc41df I want to download this dataset which has been introduced in the article named LUCAS: LUng CAncer Screening with Multimodal Biomarkers. Following the corresponding github of this project, authors have noted that the dataset is published in http://157.253.243.19/LUCAS/ but I can't access this link and ping to this address. Anyone has used this dataset could share it with me? Or if you know other ways to access it, too. Thank you very much! submitted by /u/kandalete [link] [comments]  ( 43 min )
    [D] What are the best ways to make and run a fast custom TTS?
    Just was wondering what the current best / easiest ways to make a fast custom tts are. I tried tortoise tts but it was too slow. The voice doesn't need to be a perfect clone, just need something that can resemble it. submitted by /u/crazewill [link] [comments]  ( 43 min )
    [P] Looking for an ML data analyst experienced with EEG or time series data interested in decoding brain activity for telepathic technology!
    Interested in Artificial Intelligence, software, psychology, linguistics, or the erupting neurotech industry? Checkout our website or dm me! Tech tycoons like Elon Musk, Bill Gates, and Jeff Bezos are investing time and resources into a way to output words directly from our psyche. It's assumed that without needing to take the time to type, we'll post, purchase, or search faster and more often. Using commercially available EEGs, my team at Minds Applied is recording snapshots of brain activity while thinking various words. After training a Cognitive Neural Network on this information, we’re able to predict the most likely thoughts you are thinking in real time. With prediction rates of over 80%, we’ve shifted our focus to acquiring funding and expanding our team! Message me if your int…  ( 45 min )
  • Open

    Noam Brown, FAIR: On achieving human-level performance in poker and Diplomacy, and the power of spending compute at inference time
    Here is a podcast episode with Noam Brown from Meta AI where we discuss his work on achieving human-level performance on poker and Diplomacy, as well as the power of spending compute at inference time! submitted by /u/thejashGI [link] [comments]  ( 41 min )
    How will AGI systems create fitness functions for hard problems?
    It seems as though the training of many AI systems involve one or more of the following: presenting huge amounts of example data spending a large amount of effort to manually grade efforts of a system creating carefully crafted fitness measures for a specific domain problem. If an AGI is presented with a difficult problem, how can such a system know that its answers are good? If a good simulation is available, the system can exercise its answers and evaluate against crude fitness functions, but if a problem is novel, no simulator will exist. At this point, is there any other option than 3 (experts craft a fitness function)? Having the AGI choose its now fitness functions has the exact same limitation. If 3 is the only option, how will AGI teach itself beyond the sphere of human knowledge? submitted by /u/bwootton [link] [comments]  ( 41 min )
    AI Project ideas
    Hi, I am thinking of ideas for a AI based project for my degree. I wish to use Machine learning, computer vision or robotics in gaming and was wondering if anyone had any good ideas. Preferably ideas that are somewhat scalable. I'm struggling to think of good gaming related ideas without explicitly creating a game myself which I don't want to do ​ Any ideas would be greatly appreciated. :) submitted by /u/Shachin2_2 [link] [comments]  ( 41 min )
    AI Dream 157 - MASTERPIECE - PART 8 TEASER - 2K SUBS CELEBRATION! 🥳🎉 - A...
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Suggestions for learning AI
    I'm not sure what to google. As a programmer, I'm of the mind that if I'm not incorporating AI into my workflow, I'm going to be at a disadvantage very soon. What should I be learning, that would benefit me in that directive? I'd ask Chat GPT, but it's at Capacity, right now. submitted by /u/BenZed [link] [comments]  ( 41 min )
    AI picture generator
    Hey there, I have been experimenting with Midjourney and Dall•e 2 but I wanted to know if there are more AI picture generators besides these two I can use. Just let me know I would appreciate it! submitted by /u/NNRRYYNN [link] [comments]  ( 41 min )
    [Research] Seeking lightweight method to transfer/change skin tone in human images on CPU - any suggestions?
    I have been working on a project that involves altering the skin tone of human images. However, the methods I have come across so far either don't produce quality results or are too heavy for the limited computational resources available to me. Therefore, I am reaching out to the community for suggestions on a lightweight method for skin-tone transfer that is capable of running on a CPU. If you have any ideas or recommendations, please share them in the comments below. Your input would be greatly appreciated in helping me find a solution to this challenge. submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 41 min )
    ControlNet Installation In Stable Diffusion! Fantastic Extension!
    submitted by /u/PuppetHere [link] [comments]  ( 40 min )
    OpenAI CEO Sam Altman said ChatGPT is 'cool,' but it's a 'horrible product'
    submitted by /u/ssigea [link] [comments]  ( 43 min )
    Does anyone know what app re touched this image?
    submitted by /u/Ranwell13 [link] [comments]  ( 40 min )
    I asked two different AI models to create an anime girl with the same prompt.
    Which one did better here? Midjourney 2. DALL-E This is not a comparison between the two, I just found it interesting how these two create different results. https://preview.redd.it/y20186nlk6ia1.png?width=1263&format=png&auto=webp&s=a76b3d5d1af75ab1f89756fcfd0738b912df5200 submitted by /u/Aaryan7M [link] [comments]  ( 41 min )
    1 Million People Can’t Be Wrong: New Bing Is Taking Over Search!
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    All of this happening in AI. (14/02/2023)
    Hello humans. This is AI Daily, helping you stay updated on AI in less than 5 minutes. What’s happening in AI - As ChatGPT hype hits fever pitch, Neeva launches its generative AI search engine internationally Launched in the US in January, it is pitching as an “authentic, real-time AI search.” The search engine Neeva wants to replace the familiar “10 blue links” in search results with something more fitting for the modern AI age. The search engine first launched as a subscription-only search engine but now also supports free tire with certain limitations. AI chatbots are coming to search engines – can you trust the results? Three of the world’s biggest search engines announced that they are integrating ChatGPT-like technology into search engines, allowing people to get direct answe…  ( 44 min )
    The Journey of Pure Consciousness : AI Generated Story
    submitted by /u/spacesluts [link] [comments]  ( 40 min )
    An AI recently piloted a Lockheed Martin aircraft for over 17 hours during a test.
    submitted by /u/Dalembert [link] [comments]  ( 43 min )
    AI Trick For Social Media Content? AppScript + GPT3 🤫
    submitted by /u/JimZerChapirov [link] [comments]  ( 41 min )
    Using OpenAI to repurpose content for social media
    ​ Hello folks, It's crazy how versatile Open AI and GPT-3 are. I want to share about a project that I'm working on called Elephas (It's a Mac writing assistant powered by AI) Many of my users had been asking for the ability to repurpose their existing blog and newsletter content into social media posts. They are mostly busy content writers so this can be really useful to them in their day-to-day work. So I tried a simple prompt - "Summarize this for a tweet" I took the content from an OpenAI Blog and summarized it into a tweet. https://preview.redd.it/bibe0272t5ia1.png?width=800&format=png&auto=webp&s=d9e19bd4c6fcddb380f02d898c1fe62f635a816f Next, I tried another prompt - "Summarize this into a LinkedIn post" And that worked alright as well. https://preview.redd.it/urpn70wht5ia1.png?width=800&format=png&auto=webp&s=cdcdbc0ea637f783e19a718731ad297bedc2c5b5 Finally, I tried this prompt - "Summarize this into a Facebook post." https://preview.redd.it/8pkdbxxjt5ia1.png?width=800&format=png&auto=webp&s=4dca4b6a5d8fc267413d25701804ffb496e33aa6 These prompts worked well so I decided to integrate them into my Mac app, and the users loved it. Here is the final demo of how it works inside my app - ​ https://reddit.com/link/11260a8/video/50w0cz1vt5ia1/player It can be difficult to copy and paste the content into the playground. If you have a Mac and want to do this more straightforward way then please try out my app Elephas Do share your feedback. Thanks submitted by /u/juliarmg [link] [comments]  ( 42 min )
    The Most Detailed & Fantasy-inspired Female Dark Style Portrait By James Turrell - Photoreal & High Resolution!
    submitted by /u/Calatravo [link] [comments]  ( 40 min )
    Vietnam (Asia) develops AI technology industry, now at 55th globally and 54 points. Global average is at 44 points.
    submitted by /u/dannylenwinn [link] [comments]  ( 40 min )
    AI Writes a whole book Unveiling their Master Plan to Rule the World
    submitted by /u/Significant-Answer-1 [link] [comments]  ( 43 min )
    AI generated classic
    submitted by /u/Impressive_Hat9961 [link] [comments]  ( 40 min )
    AI music generator that you can control?
    There’s no other way for me to describe it so I’ll say it like it is. I have symphonies of music in my head. I think in music. My problem is I have a hand disability so I can’t produce the music I create in any tangible way. I can hum the music in parts but I was wondering… Is there an AI music generator where I can hum a melody, tell it what instrument to use, and it will produce that melody in that instrument for me so I can slowly build up my entire song? I’ve been able to think in music like this since I was 12 but I could never show anyone. I want to show them now that AI has become powerful enough. Any help is gratefully appreciated. submitted by /u/ZephyrBrightmoon [link] [comments]  ( 42 min )
    Build no-code AI tools that integrate with all your apps like Slack, Google Docs, etc...
    submitted by /u/iiamus [link] [comments]  ( 41 min )
  • Open

    Google Research, 2022 & beyond: Robotics
    Posted by Kendra Byrne, Senior Product Manager, and Jie Tan, Staff Research Scientist, Robotics at Google (This is Part 6 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.) Within our lifetimes, we will see robotic technologies that can help with everyday activities, enhancing human productivity and quality of life. Before robotics can be broadly useful in helping with practical day-to-day tasks in people-centered spaces — spaces designed for people, not machines — they need to be able to safely & competently provide assistance to people. In 2022, we focused on challenges that come with enabling robots to be more helpful to people: 1) allowing robots and humans to communicate more efficiently and nat…  ( 95 min )
  • Open

    Building AI chatbots using Amazon Lex and Amazon Kendra for filtering query results based on user context
    Amazon Kendra is an intelligent search service powered by machine learning (ML). It indexes the documents stored in a wide range of repositories and finds the most relevant document based on the keywords or natural language questions the user has searched for. In some scenarios, you need the search results to be filtered based on […]  ( 12 min )
    Measure the Business Impact of Amazon Personalize Recommendations
    We’re excited to announce that Amazon Personalize now lets you measure how your personalized recommendations can help you achieve your business goals. After specifying the metrics that you want to track, you can identify which campaigns and recommenders are most impactful and understand the impact of recommendations on your business metrics. All customers want to […]  ( 10 min )
  • Open

    DSC Weekly 14 February 2023 – The AI Wars
    Announcements The AI Wars The stable release of OpenAI’s chatbot, ChatGPT, over two weeks ago saw nearly unanimous praise for its human-like chat capabilities and its natural-sounding responses to fairly complex inputs. The chatbot is raising ethical concerns over AI-generated written content, as its capabilities are far beyond the simple input and response of the… Read More »DSC Weekly 14 February 2023 – The AI Wars The post DSC Weekly 14 February 2023 – The AI Wars appeared first on Data Science Central.  ( 20 min )
    Ten Tips To Strengthen Your Cloud Database
    Unfortunately, COVID-19 hit us all individually and on corporate levels when the world economy was thriving. Among other significant measures that were taken to maintain the likelihood of businesses and their operations to continue, the demand for cloud-based remote access tools arose significantly. But no matter what size of the company was in question, business… Read More »Ten Tips To Strengthen Your Cloud Database The post Ten Tips To Strengthen Your Cloud Database appeared first on Data Science Central.  ( 22 min )
    Search Engines vs Synthesis Engines: The Future of Search
    With the announcements last week from Microsoft and openAI, we are now all actively discussing the future of search Here are some key takeaways as I interpret them: More interestingly, Balaji Srinivasan shared an interesting idea: search engines could evolve into synthesis engines. Through prompt engineering, you can provide a sequence that composes a complex… Read More »Search Engines vs Synthesis Engines: The Future of Search The post Search Engines vs Synthesis Engines: The Future of Search appeared first on Data Science Central.  ( 19 min )
    What does the Future of Accounting Look Like?
    Innovation in the 21st century is reaching the sky, and new development is happening in the world every time. A simple look near us speaks volumes of humans’ progress in the last two decades. With technological enhancement embedded into every facet of life, accounting is no exception. Various accounting has undergone drastic changes thanks to… Read More »What does the Future of Accounting Look Like? The post What does the Future of Accounting Look Like? appeared first on Data Science Central.  ( 21 min )
  • Open

    New enthusiast to the field, looking for connections
    Hey RL experts & enthusiasts! For the last few years, I've been working on a bot to play a real MMORPG as a hobby project (in a closed private server). I think the general idea of AI in games is fascinating. So far in my bot, I've spent a lot of time building up some foundational architecture, including reverse engineering the game. I'm finally at the point where I'm implementing intelligence for a single agent. Right now, I'm using GOFAI to handle tasks such as picking which monsters to attack, which items to pick up, where to walk, etc. Long term, I plan to control teams of agents for multiplayer game modes like Capture the Flag. I've been reading about AI/machine learning/deep learning for the last few years. Over the last few months I've read Francois Chollet's Deep Learning with Python book, read a few introductory RL blog posts, followed though OpenAi's "Spinning Up in Deep RL", and just finished Deepmind's Deep RL lecture series on YouTube. I'm definitely a newbie in the reinforcement learning world, but I'm starting to get familiar with the terms and algorithms; I've even implemented a simple value iteration agent in one of the OpenAI Gym's games! I know I have a long way to go for my long-term goals. It seems that using RL in the domain of a real open-ended MMORPG is not even something that we know how to do well. I'm posting here to find an "RL mentor" or at least to connect with people who are into the same types of things. Unfortunately, I don't know anyone who's an expert in this field. I'm looking for someone to ask questions or bounce ideas off of. I'm working on this stuff daily and having someone to chat with works best with my motivation, creativity, and working-style. If that sounds interesting to you, please do reach out! submitted by /u/SandSnip3r [link] [comments]  ( 43 min )
    Miniworld is now mature within the Farama Foundation
    Miniworld - a minimalistic 3D interior environment simulator for reinforcement learning & robotics research that allows environments to be easily edited - has now reached the mature inside Farama. You can check out the documentation at https://miniworld.farama.org, and the release notes for all the changes we’ve made to the project at https://github.com/Farama-Foundation/Miniworld/releases/tag/2.0.1. submitted by /u/jkterry1 [link] [comments]  ( 41 min )
    TD3 model loading size mismatch help
    I trained and saved a stable baselines3 TD3 model on custom environment. When trying to load there are size mismatches for both actor and critic weights and biases. One of the errors is size mismatch for actor.mu.4.weight: copying a param with shape torch.Size([4, 300]) from checkpoint, the shape in current model is torch Size(304, 300]) All of the errors are off by 300. I am able to load PPO models just fine and if I stop training TD3 after 1k steps while it's predictions are still random it will load. Does anyone have any ideas how i can correctly load the model? submitted by /u/actualsen [link] [comments]  ( 41 min )
    Need help with Q Learning
    Hello guys, I have been trying to simulate a Q Learning environment specifically the robot in a maze example, but I have run into a problem regarding the updating process of the Q Values in the Q Table. Specifically, how to update the Q-Value of the terminal state where the agent dies, as I cannot figure out what to put in the "QValue(next-state)" part of the equation given below. Till now, I figured the QValue(next-state) when agent is at terminal state (where it dies) would be the QValue of the starting position where agent respawns. I was wondering if this is correct by the process or is there a way to do this correctly? For example, in the image given, the agent will go to red area which kills it, now the new Q will be decided using the bellman equation, but the bellman equation also requires the QValue of the next state. I am confused to which state should be the next state, till now I simply used the respawn point as next state when in terminal state, however is this algorithmically correct or is there someting I'm missing? ​ Example of agent going to terminal state. ​ ​ Bellman Equation for updating QValue in QTable. submitted by /u/auzuha [link] [comments]  ( 42 min )
    Training an AI for Tigris and Euphrates
    submitted by /u/0b01 [link] [comments]  ( 42 min )
    Exploration in Complex Environments
    Hi there, I have a MARL problem where the reward is very sparse, and a lot of cooperation is required for the agents to get any reward. At the moment, I'm using a simple linearly decaying epsilon-greedy policy along with an adapted version of QMIX (also thinking of adding WQMIX to the mix), but the epsilon greedy policy simply doesn't explore enough to accidentally stumble onto the required behaviour before epsilon is too low to ever discover everything. I have tried using 1mil and 2mil timestep decay periods, to no avail. I'm looking at maybe adding a curiosity module, but before I jump into that I was wondering if there were simpler methods? I found a paper on VDBE exploration that I'm going to check out, but I also had an idea that probably won't work but I'm not sure why: Essentially an epsilon greedy policy, but in stead of decaying linearly or exponentially, how about decaying epsilon based on cumulative reward? So start out exploring say 80% of the time, until the cumulative reward reaches a predetermined positive value, then decrementing epsilon and keeping it at the lower value until a reward threshold is reached again. My thought is that the agents wil explore until they solve the problem once or twice, which may take very long, but as soon as they start figuring out how to solve the problem they will begin exploring less and less. It makes sense in my head, but many of the bugs I've fixed were also stuff that made sense in my head. Any thoughts on this? As well as references or ideas for other ways to allow my agents to explore successfully. submitted by /u/Grym7er [link] [comments]  ( 46 min )
    Current SOTA for offline RL and sim2sim/sim2real?
    Hey I am dealing with a problem with 3 stages of: 1. moving a vehicle simulation from low level cheap simulation 2. to high fidelity simulaton 3. to real vehicle which I will have small amount of data to collect I though of using offline RL and fine tuning along the way when talking about sim2real its mostly the dynamics (equations and parameters) and not visual observations What is the sota in this areas of offline RL and sim2real? Thank you submitted by /u/What_Did_It_Cost_E_T [link] [comments]  ( 41 min )
  • Open

    Burr distribution
    Irving Burr came up with a set of twelve probability distributions known as Burr I, Burr II, …, Burr XII. The last of these is by far the best known, and so the Burr XII distribution is often referred to simply as the Burr distribution. Cumulative density functions (CDFs) of probability distributions don’t always have […] Burr distribution first appeared on John D. Cook.  ( 5 min )
  • Open

    3D Creators Share Art From the Heart This Week ‘In the NVIDIA Studio’
    Love and creativity are in the air this Valentine’s Day In the NVIDIA Studio, as 3D artist Molly Brady presents a parody scene inspired by the iconic The Birth of Venus (Redux) painting by Sando Botticelli.  ( 7 min )
  • Open

    Training a model to produce 3D head models from text
    This may be a bit of a naive question. I have a very large database of highly descriptive digital 3D models of head scans. Not MRI scans, just a model of heads including the hair and all the facial features. Each one of these models is correlated with a list of 50 textual features describing the physical features of the head models. They are all in the exact same format. Kind of like a spec sheet for each head model. I would like to train a model that trains on that data and learns how to produce these 3D models when given a new textual spec sheet. I'm aware that the dataset needed to create something like that may be huge. Can anyone provide an overview of the process of writing something like that? Which type of network should i use and which libraries should i look into? (Preferably in python) Thanks submitted by /u/itreallyreallydoesnt [link] [comments]  ( 42 min )
  • Open

    CLAWSAT: Towards Both Robust and Accurate Code Models. (arXiv:2211.11711v4 [cs.LG] UPDATED)
    We integrate contrastive learning (CL) with adversarial learning to co-optimize the robustness and accuracy of code models. Different from existing works, we show that code obfuscation, a standard code transformation operation, provides novel means to generate complementary `views' of a code that enable us to achieve both robust and accurate code models. To the best of our knowledge, this is the first systematic study to explore and exploit the robustness and accuracy benefits of (multi-view) code obfuscations in code models. Specifically, we first adopt adversarial codes as robustness-promoting views in CL at the self-supervised pre-training phase. This yields improved robustness and transferability for downstream tasks. Next, at the supervised fine-tuning stage, we show that adversarial training with a proper temporally-staggered schedule of adversarial code generation can further improve robustness and accuracy of the pre-trained code model. Built on the above two modules, we develop CLAWSAT, a novel self-supervised learning (SSL) framework for code by integrating $\underline{\textrm{CL}}$ with $\underline{\textrm{a}}$dversarial vie$\underline{\textrm{w}}$s (CLAW) with $\underline{\textrm{s}}$taggered $\underline{\textrm{a}}$dversarial $\underline{\textrm{t}}$raining (SAT). On evaluating three downstream tasks across Python and Java, we show that CLAWSAT consistently yields the best robustness and accuracy ($\textit{e.g.}$ 11$\%$ in robustness and 6$\%$ in accuracy on the code summarization task in Python). We additionally demonstrate the effectiveness of adversarial learning in CLAW by analyzing the characteristics of the loss landscape and interpretability of the pre-trained models.  ( 2 min )
    DASH: A Distributed and Parallelizable Algorithm for Size-Constrained Submodular Maximization. (arXiv:2206.09563v3 [cs.DS] UPDATED)
    MapReduce (MR) frameworks for maximizing monotone, submodular functions subject to a cardinality constraint (SMCC) have currently only been shown to work with linear-adaptive (non-parallelizable) algorithms, that require large number of distributions in order to utilize the available processors, thus resulting in severe restrictions on the cardinality constraint in addition to limited scalability. Low-adaptive algorithms do not currently satisfy the requirements of these distributed MR frameworks, thereby limiting their performance. We study the SMCC problem in a distributed setting and propose the first MR algorithms with sublinear adaptive complexity. Our algorithms, R-DASH, T-DASH and G-DASH provide $0.316-\varepsilon$, $3/8 -\varepsilon$, and $1 - 1/e -\varepsilon$ approximation ratios, respectively, with nearly optimal adaptive complexity and nearly linear time complexity. Additionally, we provide a framework to increase, under some mild assumptions, the maximum permissible cardinality constraint from $O( n / \ell^2)$ of prior MR algorithms to $O( n / \ell )$, where $n$ is the data size and $\ell$ is the number of machines; under a stronger condition on the objective function, we increase the maximum constraint value to $n$. Finally, we provide empirical evidence to demonstrate that our sublinear-adaptive, distributed algorithms provide orders of magnitude faster runtime compared to current state-of-the-art distributed algorithms.  ( 2 min )
    Characterizing Graph Datasets for Node Classification: Homophily-Heterophily Dichotomy and Beyond. (arXiv:2209.06177v2 [cs.SI] UPDATED)
    Homophily is a graph property describing the tendency of edges to connect similar nodes; the opposite is called heterophily. It is often believed that heterophilous graphs are challenging for standard message-passing graph neural networks (GNNs), and much effort has been put into developing efficient methods for this setting. However, there is no universally agreed-upon measure of homophily in the literature. In this work, we show that commonly used homophily measures have critical drawbacks preventing the comparison of homophily levels across different datasets. For this, we formalize desirable properties for a proper homophily measure and verify which measures satisfy which properties. In particular, we show that a measure that we call adjusted homophily satisfies more desirable properties than other popular homophily measures while being rarely used in graph learning literature. Then, we go beyond the homophily-heterophily dichotomy and propose a new characteristic allowing one to further distinguish different sorts of heterophily. The proposed label informativeness (LI) characterizes how much information a neighbor's label provides about a node's label. We analyze LI via the same theoretical framework and show that it is comparable across different datasets. We also observe empirically that LI better agrees with GNN performance compared to homophily measures, which confirms that it is a useful characteristic of the graph structure.  ( 2 min )
    The Survival Bandit Problem. (arXiv:2206.03019v3 [cs.LG] UPDATED)
    We study the survival bandit problem, a variant of the multi-armed bandit problem with a constraint on the cumulative reward; at each time step, the agent receives a reward in [-1, 1] and if the cumulative reward becomes lower than a preset threshold, the procedure stops, and this phenomenon is called ruin. To our knowledge, this is the first paper studying a framework where the ruin might occur but not always. We first discuss that no policy can achieve a sublinear regret as defined in the standard multi-armed bandit problem, because a single pull of an arm may increase significantly the risk of ruin. Instead, we establish the framework of Pareto-optimal policies, which is a class of policies whose cumulative reward for some instance cannot be improved without sacrificing that for another instance. To this end, we provide tight lower bounds on the probability of ruin, as well as matching policies called EXPLOIT. Finally, using a doubling trick over an EXPLOIT policy, we display a Pareto-optimal policy in the case of {-1, 0, 1} rewards, giving an answer to the open problem by Perotto et al. (2019).  ( 2 min )
    The Surprising Effectiveness of Equivariant Models in Domains with Latent Symmetry. (arXiv:2211.09231v2 [cs.LG] UPDATED)
    Extensive work has demonstrated that equivariant neural networks can significantly improve sample efficiency and generalization by enforcing an inductive bias in the network architecture. These applications typically assume that the domain symmetry is fully described by explicit transformations of the model inputs and outputs. However, many real-life applications contain only latent or partial symmetries which cannot be easily described by simple transformations of the input. In these cases, it is necessary to learn symmetry in the environment instead of imposing it mathematically on the network architecture. We discover, surprisingly, that imposing equivariance constraints that do not exactly match the domain symmetry is very helpful in learning the true symmetry in the environment. We differentiate between extrinsic and incorrect symmetry constraints and show that while imposing incorrect symmetry can impede the model's performance, imposing extrinsic symmetry can actually improve performance. We demonstrate that an equivariant model can significantly outperform non-equivariant methods on domains with latent symmetries both in supervised learning and in reinforcement learning for robotic manipulation and control problems.  ( 2 min )
    Towards Reliable Neural Specifications. (arXiv:2210.16114v3 [cs.LG] UPDATED)
    Having reliable specifications is an unavoidable challenge in achieving verifiable correctness, robustness, and interpretability of AI systems. Existing specifications for neural networks are in the paradigm of data as specification. That is, the local neighborhood centering around a reference input is considered to be correct (or robust). While existing specifications contribute to verifying adversarial robustness, a significant problem in many research domains, our empirical study shows that those verified regions are somewhat tight, and thus fail to allow verification of test set inputs, making them impractical for some real-world applications. To this end, we propose a new family of specifications called neural representation as specification, which uses the intrinsic information of neural networks - neural activation patterns (NAPs), rather than input data to specify the correctness and/or robustness of neural network predictions. We present a simple statistical approach to mining neural activation patterns. To show the effectiveness of discovered NAPs, we formally verify several important properties, such as various types of misclassifications will never happen for a given NAP, and there is no ambiguity between different NAPs. We show that by using NAP, we can verify a significant region of the input space, while still recalling 84% of the data on MNIST. Moreover, we can push the verifiable bound to 10 times larger on the CIFAR10 benchmark. Thus, we argue that NAPs can potentially be used as a more reliable and extensible specification for neural network verification.  ( 2 min )
    PDE-LEARN: Using Deep Learning to Discover Partial Differential Equations from Noisy, Limited Data. (arXiv:2212.04971v2 [cs.LG] UPDATED)
    In this paper, we introduce PDE-LEARN, a novel deep learning algorithm that can identify governing partial differential equations (PDEs) directly from noisy, limited measurements of a physical system of interest. PDE-LEARN uses a Rational Neural Network, $U$, to approximate the system response function and a sparse, trainable vector, $\xi$, to characterize the hidden PDE that the system response function satisfies. Our approach couples the training of $U$ and $\xi$ using a loss function that (1) makes $U$ approximate the system response function, (2) encapsulates the fact that $U$ satisfies a hidden PDE that $\xi$ characterizes, and (3) promotes sparsity in $\xi$ using ideas from iteratively reweighted least-squares. Further, PDE-LEARN can simultaneously learn from several data sets, allowing it to incorporate results from multiple experiments. This approach yields a robust algorithm to discover PDEs directly from realistic scientific data. We demonstrate the efficacy of PDE-LEARN by identifying several PDEs from noisy and limited measurements.  ( 2 min )
    Hierarchical classification at multiple operating points. (arXiv:2210.10929v2 [cs.LG] UPDATED)
    Many classification problems consider classes that form a hierarchy. Classifiers that are aware of this hierarchy may be able to make confident predictions at a coarse level despite being uncertain at the fine-grained level. While it is generally possible to vary the granularity of predictions using a threshold at inference time, most contemporary work considers only leaf-node prediction, and almost no prior work has compared methods at multiple operating points. We present an efficient algorithm to produce operating characteristic curves for any method that assigns a score to every class in the hierarchy. Applying this technique to evaluate existing methods reveals that top-down classifiers are dominated by a naive flat softmax classifier across the entire operating range. We further propose two novel loss functions and show that a soft variant of the structured hinge loss is able to significantly outperform the flat baseline. Finally, we investigate the poor accuracy of top-down classifiers and demonstrate that they perform relatively well on unseen classes. Code is available online at https://github.com/jvlmdr/hiercls.  ( 2 min )
    Latent State Marginalization as a Low-cost Approach for Improving Exploration. (arXiv:2210.00999v2 [cs.LG] UPDATED)
    While the maximum entropy (MaxEnt) reinforcement learning (RL) framework -- often touted for its exploration and robustness capabilities -- is usually motivated from a probabilistic perspective, the use of deep probabilistic models has not gained much traction in practice due to their inherent complexity. In this work, we propose the adoption of latent variable policies within the MaxEnt framework, which we show can provably approximate any policy distribution, and additionally, naturally emerges under the use of world models with a latent belief state. We discuss why latent variable policies are difficult to train, how naive approaches can fail, then subsequently introduce a series of improvements centered around low-cost marginalization of the latent state, allowing us to make full use of the latent state at minimal additional cost. We instantiate our method under the actor-critic framework, marginalizing both the actor and critic. The resulting algorithm, referred to as Stochastic Marginal Actor-Critic (SMAC), is simple yet effective. We experimentally validate our method on continuous control tasks, showing that effective marginalization can lead to better exploration and more robust training. Our implementation is open sourced at https://github.com/zdhNarsil/Stochastic-Marginal-Actor-Critic.  ( 2 min )
    Closed-loop Analysis of Vision-based Autonomous Systems: A Case Study. (arXiv:2302.04634v1 [cs.CV] CROSS LISTED)
    Deep neural networks (DNNs) are increasingly used in safety-critical autonomous systems as perception components processing high-dimensional image data. Formal analysis of these systems is particularly challenging due to the complexity of the perception DNNs, the sensors (cameras), and the environment conditions. We present a case study applying formal probabilistic analysis techniques to an experimental autonomous system that guides airplanes on taxiways using a perception DNN. We address the above challenges by replacing the camera and the network with a compact probabilistic abstraction built from the confusion matrices computed for the DNN on a representative image data set. We also show how to leverage local, DNN-specific analyses as run-time guards to increase the safety of the overall system. Our findings are applicable to other autonomous systems that use complex DNNs for perception.  ( 2 min )
    Slowly Changing Adversarial Bandit Algorithms are Efficient for Discounted MDPs. (arXiv:2205.09056v2 [cs.LG] UPDATED)
    Reinforcement learning generalizes bandit problems with additional difficulties on longer planning horizon and unknown transition kernel. We show that, under some mild assumptions, *any* slowly changing adversarial bandit algorithm enjoys optimal regret in adversarial bandits can achieve optimal (in the dependency of $T$) expected regret in infinite-horizon discounted MDPs, without the presence of Bellman backups. The slowly changing property required by our generalization is mild, which is also marked by the online Markov decision process literature. We also examine the applicability of our reduction to a well-known adversarial bandit algorithm, EXP3.  ( 2 min )
    Approximate Vanishing Ideal Computations at Scale. (arXiv:2207.01236v2 [cs.LG] UPDATED)
    The vanishing ideal of a set of points $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_m\}\subseteq \mathbb{R}^n$ is the set of polynomials that evaluate to $0$ over all points $\mathbf{x} \in X$ and admits an efficient representation by a finite subset of generators. In practice, to accommodate noise in the data, algorithms that construct generators of the approximate vanishing ideal are widely studied but their computational complexities remain expensive. In this paper, we scale up the oracle approximate vanishing ideal algorithm (OAVI), the only generator-constructing algorithm with known learning guarantees. We prove that the computational complexity of OAVI is not superlinear, as previously claimed, but linear in the number of samples $m$. In addition, we propose two modifications that accelerate OAVI's training time: Our analysis reveals that replacing the pairwise conditional gradients algorithm, one of the solvers used in OAVI, with the faster blended pairwise conditional gradients algorithm leads to an exponential speed-up in the number of features $n$. Finally, using a new inverse Hessian boosting approach, intermediate convex optimization problems can be solved almost instantly, improving OAVI's training time by multiple orders of magnitude in a variety of numerical experiments.  ( 2 min )
    Designing Robust Transformers using Robust Kernel Density Estimation. (arXiv:2210.05794v2 [cs.LG] UPDATED)
    Recent advances in Transformer architectures have empowered their empirical success in a variety of tasks across different domains. However, existing works mainly focus on predictive accuracy and computational cost, without considering other practical issues, such as robustness to contaminated samples. Recent work by Nguyen et al., (2022) has shown that the self-attention mechanism, which is the center of the Transformer architecture, can be viewed as a non-parametric estimator based on kernel density estimation (KDE). This motivates us to leverage a set of robust kernel density estimation methods for alleviating the issue of data contamination. Specifically, we introduce a series of self-attention mechanisms that can be incorporated into different Transformer architectures and discuss the special properties of each method. We then perform extensive empirical studies on language modeling and image classification tasks. Our methods demonstrate robust performance in multiple scenarios while maintaining competitive results on clean datasets.  ( 2 min )
    Gaussian Pre-Activations in Neural Networks: Myth or Reality?. (arXiv:2205.12379v3 [cs.LG] UPDATED)
    The study of feature propagation at initialization in neural networks lies at the root of numerous initialization designs. An assumption very commonly made in the field states that the pre-activations are Gaussian. Although this convenient Gaussian hypothesis can be justified when the number of neurons per layer tends to infinity, it is challenged by both theoretical and experimental works for finite-width neural networks. Our major contribution is to construct a family of pairs of activation functions and initialization distributions that ensure that the pre-activations remain Gaussian throughout the network's depth, even in narrow neural networks. In the process, we discover a set of constraints that a neural network should fulfill to ensure Gaussian pre-activations. Additionally, we provide a critical review of the claims of the Edge of Chaos line of works and build an exact Edge of Chaos analysis. We also propose a unified view on pre-activations propagation, encompassing the framework of several well-known initialization procedures. Finally, our work provides a principled framework for answering the much-debated question: is it desirable to initialize the training of a neural network whose pre-activations are ensured to be Gaussian?  ( 2 min )
    Model reduction for the material point method via an implicit neural representation of the deformation map. (arXiv:2109.12390v5 [cs.LG] UPDATED)
    This work proposes a model-reduction approach for the material point method on nonlinear manifolds. Our technique approximates the $\textit{kinematics}$ by approximating the deformation map using an implicit neural representation that restricts deformation trajectories to reside on a low-dimensional manifold. By explicitly approximating the deformation map, its spatiotemporal gradients -- in particular the deformation gradient and the velocity -- can be computed via analytical differentiation. In contrast to typical model-reduction techniques that construct a linear or nonlinear manifold to approximate the (finite number of) degrees of freedom characterizing a given spatial discretization, the use of an implicit neural representation enables the proposed method to approximate the $\textit{continuous}$ deformation map. This allows the kinematic approximation to remain agnostic to the discretization. Consequently, the technique supports dynamic discretizations -- including resolution changes -- during the course of the online reduced-order-model simulation. To generate $\textit{dynamics}$ for the generalized coordinates, we propose a family of projection techniques. At each time step, these techniques: (1) Calculate full-space kinematics at quadrature points, (2) Calculate the full-space dynamics for a subset of `sample' material points, and (3) Calculate the reduced-space dynamics by projecting the updated full-space position and velocity onto the low-dimensional manifold and tangent space, respectively. We achieve significant computational speedup via hyper-reduction that ensures all three steps execute on only a small subset of the problem's spatial domain. Large-scale numerical examples with millions of material points illustrate the method's ability to gain an order of magnitude computational-cost saving -- indeed $\textit{real-time simulations}$ -- with negligible errors.  ( 3 min )
    Class Token and Knowledge Distillation for Multi-head Self-Attention Speaker Verification Systems. (arXiv:2111.03842v2 [eess.AS] UPDATED)
    This paper explores three novel approaches to improve the performance of speaker verification (SV) systems based on deep neural networks (DNN) using Multi-head Self-Attention (MSA) mechanisms and memory layers. Firstly, we propose the use of a learnable vector called Class token to replace the average global pooling mechanism to extract the embeddings. Unlike global average pooling, our proposal takes into account the temporal structure of the input what is relevant for the text-dependent SV task. The class token is concatenated to the input before the first MSA layer, and its state at the output is used to predict the classes. To gain additional robustness, we introduce two approaches. First, we have developed a Bayesian estimation of the class token. Second, we have added a distilled representation token for training a teacher-student pair of networks using the Knowledge Distillation (KD) philosophy, which is combined with the class token. This distillation token is trained to mimic the predictions from the teacher network, while the class token replicates the true label. All the strategies have been tested on the RSR2015-Part II and DeepMine-Part 1 databases for text-dependent SV, providing competitive results compared to the same architecture using the average pooling mechanism to extract average embeddings.  ( 2 min )
    Action Dynamics Task Graphs for Learning Plannable Representations of Procedural Tasks. (arXiv:2302.05330v1 [cs.CV])
    Given video demonstrations and paired narrations of an at-home procedural task such as changing a tire, we present an approach to extract the underlying task structure -- relevant actions and their temporal dependencies -- via action-centric task graphs. Learnt structured representations from our method, Action Dynamics Task Graphs (ADTG), can then be used for understanding such tasks in unseen videos of humans performing them. Furthermore, ADTG can enable providing user-centric guidance to humans in these tasks, either for performing them better or for learning new tasks. Specifically, we show how ADTG can be used for: (1) tracking an ongoing task, (2) recommending next actions, and (3) planning a sequence of actions to accomplish a procedural task. We compare against state-of-the-art Neural Task Graph method and demonstrate substantial gains on 18 procedural tasks from the CrossTask dataset, including 30.1% improvement in task tracking accuracy and 20.3% accuracy gain in next action prediction.  ( 2 min )
    Minimax Instrumental Variable Regression and $L_2$ Convergence Guarantees without Identification or Closedness. (arXiv:2302.05404v1 [stat.ML])
    In this paper, we study nonparametric estimation of instrumental variable (IV) regressions. Recently, many flexible machine learning methods have been developed for instrumental variable estimation. However, these methods have at least one of the following limitations: (1) restricting the IV regression to be uniquely identified; (2) only obtaining estimation error rates in terms of pseudometrics (\emph{e.g.,} projected norm) rather than valid metrics (\emph{e.g.,} $L_2$ norm); or (3) imposing the so-called closedness condition that requires a certain conditional expectation operator to be sufficiently smooth. In this paper, we present the first method and analysis that can avoid all three limitations, while still permitting general function approximation. Specifically, we propose a new penalized minimax estimator that can converge to a fixed IV solution even when there are multiple solutions, and we derive a strong $L_2$ error rate for our estimator under lax conditions. Notably, this guarantee only needs a widely-used source condition and realizability assumptions, but not the so-called closedness condition. We argue that the source condition and the closedness condition are inherently conflicting, so relaxing the latter significantly improves upon the existing literature that requires both conditions. Our estimator can achieve this improvement because it builds on a novel formulation of the IV estimation problem as a constrained optimization problem.  ( 2 min )
    Key Design Choices for Double-Transfer in Source-Free Unsupervised Domain Adaptation. (arXiv:2302.05379v1 [cs.LG])
    Fine-tuning and Domain Adaptation emerged as effective strategies for efficiently transferring deep learning models to new target tasks. However, target domain labels are not accessible in many real-world scenarios. This led to the development of Unsupervised Domain Adaptation (UDA) methods, which only employ unlabeled target samples. Furthermore, efficiency and privacy requirements may also prevent the use of source domain data during the adaptation stage. This challenging setting, known as Source-Free Unsupervised Domain Adaptation (SF-UDA), is gaining interest among researchers and practitioners due to its potential for real-world applications. In this paper, we provide the first in-depth analysis of the main design choices in SF-UDA through a large-scale empirical study across 500 models and 74 domain pairs. We pinpoint the normalization approach, pre-training strategy, and backbone architecture as the most critical factors. Based on our quantitative findings, we propose recipes to best tackle SF-UDA scenarios. Moreover, we show that SF-UDA is competitive also beyond standard benchmarks and backbone architectures, performing on par with UDA at a fraction of the data and computational cost. In the interest of reproducibility, we include the full experimental results and code as supplementary material.  ( 2 min )
    Online Distribution Shift Detection via Recency Prediction. (arXiv:2211.09916v2 [cs.RO] UPDATED)
    When deploying modern machine learning-enabled robotic systems in high-stakes applications, detecting distribution shift is critical. However, most existing methods for detecting distribution shift are not well-suited to robotics settings, where data often arrives in a streaming fashion and may be very high-dimensional. In this work, we present an online method for detecting distribution shift with guarantees on the false positive rate - i.e., when there is no distribution shift, our system is very unlikely (with probability $< \epsilon$) to falsely issue an alert; any alerts that are issued should therefore be heeded. Our method is specifically designed for efficient detection even with high dimensional data, and it empirically achieves up to 11x faster detection on realistic robotics settings compared to prior work while maintaining a low false negative rate in practice (whenever there is a distribution shift in our experiments, our method indeed emits an alert).
    Mathematical Theory of Bayesian Statistics for Unknown Information Source. (arXiv:2206.05630v5 [cs.LG] UPDATED)
    In statistical inference, uncertainty is unknown and all models are wrong. That is to say, a person who makes a statistical model and a prior distribution is simultaneously aware that both are fictional candidates. To study such cases, statistical measures have been constructed, such as cross validation, information criteria, and marginal likelihood, however, their mathematical properties have not yet been completely clarified when statistical models are under- and over- parametrized. We introduce a place of mathematical theory of Bayesian statistics for unknown uncertainty, which clarifies general properties of cross validation, information criteria, and marginal likelihood, even if an unknown data-generating process is unrealizable by a model or even if the posterior distribution cannot be approximated by any normal distribution. Hence it gives a helpful standpoint for a person who cannot believe in any specific model and prior. This paper consists of three parts. The first is a new result, whereas the second and third are well-known previous results with new experiments. We show there exists a more precise estimator of the generalization loss than leave-one-out cross validation, there exists a more accurate approximation of marginal likelihood than BIC, and the optimal hyperparameters for generalization loss and marginal likelihood are different.
    On the Interventional Kullback-Leibler Divergence. (arXiv:2302.05380v1 [cs.LG])
    Modern machine learning approaches excel in static settings where a large amount of i.i.d. training data are available for a given task. In a dynamic environment, though, an intelligent agent needs to be able to transfer knowledge and re-use learned components across domains. It has been argued that this may be possible through causal models, aiming to mirror the modularity of the real world in terms of independent causal mechanisms. However, the true causal structure underlying a given set of data is generally not identifiable, so it is desirable to have means to quantify differences between models (e.g., between the ground truth and an estimate), on both the observational and interventional level. In the present work, we introduce the Interventional Kullback-Leibler (IKL) divergence to quantify both structural and distributional differences between models based on a finite set of multi-environment distributions generated by interventions from the ground truth. Since we generally cannot quantify all differences between causal models for every finite set of interventional distributions, we propose a sufficient condition on the intervention targets to identify subsets of observed variables on which the models provably agree or disagree.  ( 2 min )
    ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training. (arXiv:2210.01738v2 [cs.LG] UPDATED)
    CLIP proved that aligning visual and language spaces is key to solving many vision tasks without explicit training, but required to train image and text encoders from scratch on a huge dataset. LiT improved this by only training the text encoder and using a pre-trained vision network. In this paper, we show that a common space can be created without any training at all, using single-domain encoders (trained with or without supervision) and a much smaller amount of image-text pairs. Furthermore, our model has unique properties. Most notably, deploying a new version with updated training samples can be done in a matter of seconds. Additionally, the representations in the common space are easily interpretable as every dimension corresponds to the similarity of the input to a unique entry in the multimodal dataset. Experiments on standard zero-shot visual benchmarks demonstrate the typical transfer ability of image-text models. Overall, our method represents a simple yet surprisingly strong baseline for foundation multi-modal models, raising important questions on their data efficiency and on the role of retrieval in machine learning.
    More Centralized Training, Still Decentralized Execution: Multi-Agent Conditional Policy Factorization. (arXiv:2209.12681v2 [cs.LG] UPDATED)
    In cooperative multi-agent reinforcement learning (MARL), combining value decomposition with actor-critic enables agents to learn stochastic policies, which are more suitable for the partially observable environment. Given the goal of learning local policies that enable decentralized execution, agents are commonly assumed to be independent of each other, even in centralized training. However, such an assumption may prohibit agents from learning the optimal joint policy. To address this problem, we explicitly take the dependency among agents into centralized training. Although this leads to the optimal joint policy, it may not be factorized for decentralized execution. Nevertheless, we theoretically show that from such a joint policy, we can always derive another joint policy that achieves the same optimality but can be factorized for decentralized execution. To this end, we propose multi-agent conditional policy factorization (MACPF), which takes more centralized training but still enables decentralized execution. We empirically verify MACPF in various cooperative MARL tasks and demonstrate that MACPF achieves better performance or faster convergence than baselines. Our code is available at https://github.com/PKU-RL/FOP-DMAC-MACPF.
    An Additive Instance-Wise Approach to Multi-class Model Interpretation. (arXiv:2207.03113v4 [cs.LG] UPDATED)
    Interpretable machine learning offers insights into what factors drive a certain prediction of a black-box system. A large number of interpreting methods focus on identifying explanatory input features, which generally fall into two main categories: attribution and selection. A popular attribution-based approach is to exploit local neighborhoods for learning instance-specific explainers in an additive manner. The process is thus inefficient and susceptible to poorly-conditioned samples. Meanwhile, many selection-based methods directly optimize local feature distributions in an instance-wise training framework, thereby being capable of leveraging global information from other inputs. However, they can only interpret single-class predictions and many suffer from inconsistency across different settings, due to a strict reliance on a pre-defined number of features selected. This work exploits the strengths of both methods and proposes a framework for learning local explanations simultaneously for multiple target classes. Our model explainer significantly outperforms additive and instance-wise counterparts on faithfulness with more compact and comprehensible explanations. We also demonstrate the capacity to select stable and important features through extensive experiments on various data sets and black-box model architectures.
    Project and Probe: Sample-Efficient Domain Adaptation by Interpolating Orthogonal Features. (arXiv:2302.05441v1 [cs.LG])
    Conventional approaches to robustness try to learn a model based on causal features. However, identifying maximally robust or causal features may be difficult in some scenarios, and in others, non-causal "shortcut" features may actually be more predictive. We propose a lightweight, sample-efficient approach that learns a diverse set of features and adapts to a target distribution by interpolating these features with a small target dataset. Our approach, Project and Probe (Pro$^2$), first learns a linear projection that maps a pre-trained embedding onto orthogonal directions while being predictive of labels in the source dataset. The goal of this step is to learn a variety of predictive features, so that at least some of them remain useful after distribution shift. Pro$^2$ then learns a linear classifier on top of these projected features using a small target dataset. We theoretically show that Pro$^2$ learns a projection matrix that is optimal for classification in an information-theoretic sense, resulting in better generalization due to a favorable bias-variance tradeoff. Our experiments on four datasets, with multiple distribution shift settings for each, show that Pro$^2$ improves performance by 5-15% when given limited target data compared to prior methods such as standard linear probing.  ( 2 min )
    Deciphering the Language of Nature: A transformer-based language model for deleterious mutations in proteins. (arXiv:2110.14746v4 [q-bio.GN] UPDATED)
    Various machine-learning models, including deep neural network models, have already been developed to predict deleteriousness of missense (non-synonymous) mutations. Potential improvements to the current state of the art, however, may still benefit from a fresh look at the biological problem using more sophisticated self-adaptive machine-learning approaches. Recent advances in the natural language processing field show transformer models-a type of deep neural network-to be particularly powerful at modeling sequence information with context dependence. In this study, we introduce MutFormer, a transformer-based model for the prediction of deleterious missense mutations, which uses reference and mutated protein sequences from the human genome as the primary features. MutFormer takes advantage of a combination of self-attention layers and convolutional layers to learn both long-range and short-range dependencies between amino acid mutations in a protein sequence. In this study, we first pre-trained MutFormer on reference protein sequences and mutated protein sequences resulting from common genetic variants observed in human populations. We next examined different fine-tuning methods to successfully apply the model to deleteriousness prediction of missense mutations. Finally, we evaluated MutFormer's performance on multiple testing data sets. We found that MutFormer showed similar or improved performance over a variety of existing tools, including those that used conventional machine-learning approaches. We conclude that MutFormer successfully considers sequence features that are not explored in previous studies and could potentially complement existing computational predictions or empirically generated functional scores to improve our understanding of disease variants.  ( 2 min )
    Efficient and Accurate Learning of Mixtures of Plackett-Luce Models. (arXiv:2302.05343v1 [cs.LG])
    Mixture models of Plackett-Luce (PL) -- one of the most fundamental ranking models -- are an active research area of both theoretical and practical significance. Most previously proposed parameter estimation algorithms instantiate the EM algorithm, often with random initialization. However, such an initialization scheme may not yield a good initial estimate and the algorithms require multiple restarts, incurring a large time complexity. As for the EM procedure, while the E-step can be performed efficiently, maximizing the log-likelihood in the M-step is difficult due to the combinatorial nature of the PL likelihood function (Gormley and Murphy 2008). Therefore, previous authors favor algorithms that maximize surrogate likelihood functions (Zhao et al. 2018, 2020). However, the final estimate may deviate from the true maximum likelihood estimate as a consequence. In this paper, we address these known limitations. We propose an initialization algorithm that can provide a provably accurate initial estimate and an EM algorithm that maximizes the true log-likelihood function efficiently. Experiments on both synthetic and real datasets show that our algorithm is competitive in terms of accuracy and speed to baseline algorithms, especially on datasets with a large number of items.  ( 2 min )
    Q-Match: Self-supervised Learning by Matching Distributions Induced by a Queue. (arXiv:2302.05444v1 [cs.LG])
    In semi-supervised learning, student-teacher distribution matching has been successful in improving performance of models using unlabeled data in conjunction with few labeled samples. In this paper, we aim to replicate that success in the self-supervised setup where we do not have access to any labeled data during pre-training. We introduce our algorithm, Q-Match, and show it is possible to induce the student-teacher distributions without any knowledge of downstream classes by using a queue of embeddings of samples from the unlabeled dataset. We focus our study on tabular datasets and show that Q-Match outperforms previous self-supervised learning techniques when measuring downstream classification performance. Furthermore, we show that our method is sample efficient--in terms of both the labels required for downstream training and the amount of unlabeled data required for pre-training--and scales well to the sizes of both the labeled and unlabeled data.  ( 2 min )
    Watermarking Pre-trained Language Models with Backdooring. (arXiv:2210.07543v2 [cs.CL] UPDATED)
    Large pre-trained language models (PLMs) have proven to be a crucial component of modern natural language processing systems. PLMs typically need to be fine-tuned on task-specific downstream datasets, which makes it hard to claim the ownership of PLMs and protect the developer's intellectual property due to the catastrophic forgetting phenomenon. We show that PLMs can be watermarked with a multi-task learning framework by embedding backdoors triggered by specific inputs defined by the owners, and those watermarks are hard to remove even though the watermarked PLMs are fine-tuned on multiple downstream tasks. In addition to using some rare words as triggers, we also show that the combination of common words can be used as backdoor triggers to avoid them being easily detected. Extensive experiments on multiple datasets demonstrate that the embedded watermarks can be robustly extracted with a high success rate and less influenced by the follow-up fine-tuning.  ( 2 min )
    Active Learning from the Web. (arXiv:2210.08205v2 [cs.LG] UPDATED)
    Labeling data is one of the most costly processes in machine learning pipelines. Active learning is a standard approach to alleviating this problem. Pool-based active learning first builds a pool of unlabelled data and iteratively selects data to be labeled so that the total number of required labels is minimized, keeping the model performance high. Many effective criteria for choosing data from the pool have been proposed in the literature. However, how to build the pool is less explored. Specifically, most of the methods assume that a task-specific pool is given for free. In this paper, we advocate that such a task-specific pool is not always available and propose the use of a myriad of unlabelled data on the Web for the pool for which active learning is applied. As the pool is extremely large, it is likely that relevant data exist in the pool for many tasks, and we do not need to explicitly design and build the pool for each task. The challenge is that we cannot compute the acquisition scores of all data exhaustively due to the size of the pool. We propose an efficient method, Seafaring, to retrieve informative data in terms of active learning from the Web using a user-side information retrieval algorithm. In the experiments, we use the online Flickr environment as the pool for active learning. This pool contains more than ten billion images and is several orders of magnitude larger than the existing pools in the literature for active learning. We confirm that our method performs better than existing approaches of using a small unlabelled pool.  ( 2 min )
    Oracle-Efficient Smoothed Online Learning for Piecewise Continuous Decision Making. (arXiv:2302.05430v1 [stat.ML])
    Smoothed online learning has emerged as a popular framework to mitigate the substantial loss in statistical and computational complexity that arises when one moves from classical to adversarial learning. Unfortunately, for some spaces, it has been shown that efficient algorithms suffer an exponentially worse regret than that which is minimax optimal, even when the learner has access to an optimization oracle over the space. To mitigate that exponential dependence, this work introduces a new notion of complexity, the generalized bracketing numbers, which marries constraints on the adversary to the size of the space, and shows that an instantiation of Follow-the-Perturbed-Leader can attain low regret with the number of calls to the optimization oracle scaling optimally with respect to average regret. We then instantiate our bounds in several problems of interest, including online prediction and planning of piecewise continuous functions, which has many applications in fields as diverse as econometrics and robotics.  ( 2 min )
    Controlling Large Language Models to Generate Secure and Vulnerable Code. (arXiv:2302.05319v1 [cs.CR])
    Large language models (LMs) are increasingly pretrained on massive corpora of open-source programs and applied to solve program synthesis tasks. However, a fundamental limitation of LMs is their unawareness of security and vulnerability during pretraining and inference. As a result, LMs produce secure or vulnerable programs with high uncertainty (e.g., around 60%/40% chances for GitHub Copilot according to a recent study). This greatly impairs LMs' usability, especially in security-sensitive scenarios. To address this limitation, this work formulates a new problem called controlled code generation, which allows users to input a boolean property into an LM to control if the LM generates secure or vulnerable code. We propose svGen, an effective and lightweight learning approach for solving controlled code generation. svGen leverages property-specific continuous vectors to steer program generation toward the given property, without altering the weights of the LM. svGen's training optimizes those continuous vectors by carefully applying specialized loss terms on different regions of code. Our extensive evaluation shows that svGen achieves strong control capability across various software vulnerabilities and LMs of different parameter sizes. For example, on 9 dangerous vulnerabilities, a state-of-the-art CodeGen LM with 2.7B parameters generates secure programs with a 57% chance. When we use svGen to control the LM to generate secure (resp., vulnerable) programs, the chance is significantly increased to 82% (resp., decreased to 35%).  ( 2 min )
    Unified Functional Hashing in Automatic Machine Learning. (arXiv:2302.05433v1 [cs.LG])
    The field of Automatic Machine Learning (AutoML) has recently attained impressive results, including the discovery of state-of-the-art machine learning solutions, such as neural image classifiers. This is often done by applying an evolutionary search method, which samples multiple candidate solutions from a large space and evaluates the quality of each candidate through a long training process. As a result, the search tends to be slow. In this paper, we show that large efficiency gains can be obtained by employing a fast unified functional hash, especially through the functional equivalence caching technique, which we also present. The central idea is to detect by hashing when the search method produces equivalent candidates, which occurs very frequently, and this way avoid their costly re-evaluation. Our hash is "functional" in that it identifies equivalent candidates even if they were represented or coded differently, and it is "unified" in that the same algorithm can hash arbitrary representations; e.g. compute graphs, imperative code, or lambda functions. As evidence, we show dramatic improvements on multiple AutoML domains, including neural architecture search and algorithm discovery. Finally, we consider the effect of hash collisions, evaluation noise, and search distribution through empirical analysis. Altogether, we hope this paper may serve as a guide to hashing techniques in AutoML.  ( 2 min )
    XFL: A High Performace, Lightweighted Federated Learning Framework. (arXiv:2302.05076v1 [cs.LG])
    This paper introduces XFL, an industrial-grade federated learning project. XFL supports training AI models collaboratively on multiple devices, while utilizes homomorphic encryption, differential privacy, secure multi-party computation and other security technologies ensuring no leakage of data. XFL provides an abundant algorithms library, integrating a large number of pre-built, secure and outstanding federated learning algorithms, covering both the horizontally and vertically federated learning scenarios. Numerical experiments have shown the prominent performace of these algorithms. XFL builds a concise configuration interfaces with presettings for all federation algorithms, and supports the rapid deployment via docker containers.Therefore, we believe XFL is the most user-friendly and easy-to-develop federated learning framework. XFL is open-sourced, and both the code and documents are available at https://github.com/paritybit-ai/XFL.  ( 2 min )
    From Group-Differences to Single-Subject Probability: Conformal Prediction-based Uncertainty Estimation for Brain-Age Modeling. (arXiv:2302.05304v1 [cs.LG])
    The brain-age gap is one of the most investigated risk markers for brain changes across disorders. While the field is progressing towards large-scale models, recently incorporating uncertainty estimates, no model to date provides the single-subject risk assessment capability essential for clinical application. In order to enable the clinical use of brain-age as a biomarker, we here combine uncertainty-aware deep Neural Networks with conformal prediction theory. This approach provides statistical guarantees with respect to single-subject uncertainty estimates and allows for the calculation of an individual's probability for accelerated brain-aging. Building on this, we show empirically in a sample of N=16,794 participants that 1. a lower or comparable error as state-of-the-art, large-scale brain-age models, 2. the statistical guarantees regarding single-subject uncertainty estimation indeed hold for every participant, and 3. that the higher individual probabilities of accelerated brain-aging derived from our model are associated with Alzheimer's Disease, Bipolar Disorder and Major Depressive Disorder.  ( 2 min )
    Using Connectome Features to Constrain Echo State Networks. (arXiv:2206.02094v2 [cs.LG] UPDATED)
    We report an improvement to the conventional Echo State Network (ESN) across three benchmark chaotic time-series prediction tasks using fruit fly connectome data alone. We also investigate the impact of key connectome-derived structural features on prediction performance -- uniquely bridging neurobiological structure and machine learning function; and find that both increasing the global average clustering coefficient and modifying the position of weights -- by permuting their synapse-synapse partners -- can lead to increased model variance and (in some cases) degraded performance. In all we consider four topological point modifications to a connectome-derived ESN reservoir (null model): namely, we alter the network sparsity, re-draw nonzero weights from a uniform distribution, permute nonzero weight positions, and increase the network global average clustering coefficient. We compare the four resulting ESN model classes -- and the null model -- with a conventional ESN by conducting time-series prediction experiments on size-variants of the Mackey-Glass 17 (MG-17), Lorenz, and Rossler chaotic time series; denoting each model's performance and variance across train-validate trials.  ( 2 min )
    Small-Text: Active Learning for Text Classification in Python. (arXiv:2107.10314v5 [cs.LG] UPDATED)
    We present small-text, an easy-to-use active learning library written in Python, which offers pool-based active learning for single- and multi-label text classification in Python. It features many pre-implemented state-of-the-art query strategies, including some that leverage the GPU. Standardized interfaces allow the combination of a variety of classifiers, query strategies, and stopping criteria, facilitating a quick mix and match, and enabling a rapid development of both active learning experiments and applications. In order to make various classifiers and query strategies accessible for active learning, small-text integrates several well-known machine learning libraries, namely scikit-learn, PyTorch, and Hugging Face transformers. The latter integrations are optionally installable extensions, so GPUs can be used but are not required. The library is publicly available under the MIT License at https://github.com/webis-de/small-text, in version 1.1.1 at the time of writing.  ( 2 min )
    Reinforcement Learning from Multiple Sensors via Joint Representations. (arXiv:2302.05342v1 [cs.LG])
    In many scenarios, observations from more than one sensor modality are available for reinforcement learning (RL). For example, many agents can perceive their internal state via proprioceptive sensors but must infer the environment's state from high-dimensional observations such as images. For image-based RL, a variety of self-supervised representation learning approaches exist to improve performance and sample complexity. These approaches learn the image representation in isolation. However, including proprioception can help representation learning algorithms to focus on relevant aspects and guide them toward finding better representations. Hence, in this work, we propose using Recurrent State Space Models to fuse all available sensory information into a single consistent representation. We combine reconstruction-based and contrastive approaches for training, which allows using the most appropriate method for each sensor modality. For example, we can use reconstruction for proprioception and a contrastive loss for images. We demonstrate the benefits of utilizing proprioception in learning representations for RL on a large set of experiments. Furthermore, we show that our joint representations significantly improve performance compared to a post hoc combination of image representations and proprioception.
    Matching Correlated Inhomogeneous Random Graphs using the $k$-core Estimator. (arXiv:2302.05407v1 [math.ST])
    We consider the task of estimating the latent vertex correspondence between two edge-correlated random graphs with generic, inhomogeneous structure. We study the so-called \emph{$k$-core estimator}, which outputs a vertex correspondence that induces a large, common subgraph of both graphs which has minimum degree at least $k$. We derive sufficient conditions under which the $k$-core estimator exactly or partially recovers the latent vertex correspondence. Finally, we specialize our general framework to derive new results on exact and partial recovery in correlated stochastic block models, correlated Chung-Lu graphs, and correlated random geometric graphs.  ( 2 min )
    Example-Based Sampling with Diffusion Models. (arXiv:2302.05116v1 [cs.GR])
    Much effort has been put into developing samplers with specific properties, such as producing blue noise, low-discrepancy, lattice or Poisson disk samples. These samplers can be slow if they rely on optimization processes, may rely on a wide range of numerical methods, are not always differentiable. The success of recent diffusion models for image generation suggests that these models could be appropriate for learning how to generate point sets from examples. However, their convolutional nature makes these methods impractical for dealing with scattered data such as point sets. We propose a generic way to produce 2-d point sets imitating existing samplers from observed point sets using a diffusion model. We address the problem of convolutional layers by leveraging neighborhood information from an optimal transport matching to a uniform grid, that allows us to benefit from fast convolutions on grids, and to support the example-based learning of non-uniform sampling patterns. We demonstrate how the differentiability of our approach can be used to optimize point sets to enforce properties.
    A Graph-Based Modeling Framework for Tracing Hydrological Pollutant Transport in Surface Waters. (arXiv:2302.04991v1 [cs.LG])
    Anthropogenic pollution of hydrological systems affects diverse communities and ecosystems around the world. Data analytics and modeling tools play a key role in fighting this challenge, as they can help identify key sources as well as trace transport and quantify impact within complex hydrological systems. Several tools exist for simulating and tracing pollutant transport throughout surface waters using detailed physical models; these tools are powerful, but can be computationally intensive, require significant amounts of data to be developed, and require expert knowledge for their use (ultimately limiting application scope). In this work, we present a graph modeling framework -- which we call ${\tt HydroGraphs}$ -- for understanding pollutant transport and fate across waterbodies, rivers, and watersheds. This framework uses a simplified representation of hydrological systems that can be constructed based purely on open-source data (National Hydrography Dataset and Watershed Boundary Dataset). The graph representation provides an flexible intuitive approach for capturing connectivity and for identifying upstream pollutant sources and for tracing downstream impacts within small and large hydrological systems. Moreover, the graph representation can facilitate the use of advanced algorithms and tools of graph theory, topology, optimization, and machine learning to aid data analytics and decision-making. We demonstrate the capabilities of our framework by using case studies in the State of Wisconsin; here, we aim to identify upstream nutrient pollutant sources that arise from agricultural practices and trace downstream impacts to waterbodies, rivers, and streams. Our tool ultimately seeks to help stakeholders design effective pollution prevention/mitigation practices and evaluate how surface waters respond to such practices.
    Black-Box Generalization: Stability of Zeroth-Order Learning. (arXiv:2202.06880v2 [cs.LG] UPDATED)
    We provide the first generalization error analysis for black-box learning through derivative-free optimization. Under the assumption of a Lipschitz and smooth unknown loss, we consider the Zeroth-order Stochastic Search (ZoSS) algorithm, that updates a $d$-dimensional model by replacing stochastic gradient directions with stochastic differences of $K+1$ perturbed loss evaluations per dataset (example) query. For both unbounded and bounded possibly nonconvex losses, we present the first generalization bounds for the ZoSS algorithm. These bounds coincide with those for SGD, and rather surprisingly are independent of $d$, $K$ and the batch size $m$, under appropriate choices of a slightly decreased learning rate. For bounded nonconvex losses and a batch size $m=1$, we additionally show that both generalization error and learning rate are independent of $d$ and $K$, and remain essentially the same as for the SGD, even for two function evaluations. Our results extensively extend and consistently recover established results for SGD in prior work, on both generalization bounds and corresponding learning rates. If additionally $m=n$, where $n$ is the dataset size, we derive generalization guarantees for full-batch GD as well.  ( 2 min )
    Toward Degree Bias in Embedding-Based Knowledge Graph Completion. (arXiv:2302.05044v1 [cs.LG])
    A fundamental task for knowledge graphs (KGs) is knowledge graph completion (KGC). It aims to predict unseen edges by learning representations for all the entities and relations in a KG. A common concern when learning representations on traditional graphs is degree bias. It can affect graph algorithms by learning poor representations for lower-degree nodes, often leading to low performance on such nodes. However, there has been limited research on whether there exists degree bias for embedding-based KGC and how such bias affects the performance of KGC. In this paper, we validate the existence of degree bias in embedding-based KGC and identify the key factor to degree bias. We then introduce a novel data augmentation method, KG-Mixup, to generate synthetic triples to mitigate such bias. Extensive experiments have demonstrated that our method can improve various embedding-based KGC methods and outperform other methods tackling the bias problem on multiple benchmark datasets.
    Virtual Reality via Object Pose Estimation and Active Learning: Realizing Telepresence Robots with Aerial Manipulation Capabilities. (arXiv:2210.09678v2 [cs.RO] UPDATED)
    This article presents a novel telepresence system for advancing aerial manipulation in dynamic and unstructured environments. The proposed system not only features a haptic device, but also a virtual reality (VR) interface that provides real-time 3D displays of the robot's workspace as well as a haptic guidance to its remotely located operator. To realize this, multiple sensors namely a LiDAR, cameras and IMUs are utilized. For processing of the acquired sensory data, pose estimation pipelines are devised for industrial objects of both known and unknown geometries. We further propose an active learning pipeline in order to increase the sample efficiency of a pipeline component that relies on Deep Neural Networks (DNNs) based object detection. All these algorithms jointly address various challenges encountered during the execution of perception tasks in industrial scenarios. In the experiments, exhaustive ablation studies are provided to validate the proposed pipelines. Methodologically, these results commonly suggest how an awareness of the algorithms' own failures and uncertainty (`introspection') can be used tackle the encountered problems. Moreover, outdoor experiments are conducted to evaluate the effectiveness of the overall system in enhancing aerial manipulation capabilities. In particular, with flight campaigns over days and nights, from spring to winter, and with different users and locations, we demonstrate over 70 robust executions of pick-and-place, force application and peg-in-hole tasks with the DLR cable-Suspended Aerial Manipulator (SAM). As a result, we show the viability of the proposed system in future industrial applications.
    Online Real-Time Recurrent Learning Using Sparse Connections and Selective Learning. (arXiv:2302.05326v1 [cs.LG])
    State construction from sensory observations is an important component of a reinforcement learning agent. One solution for state construction is to use recurrent neural networks. Two popular gradient-based methods for recurrent learning are back-propagation through time (BPTT), and real-time recurrent learning (RTRL). BPTT looks at the complete sequence of observations before computing gradients and is unsuitable for online real-time updates. RTRL can do online updates but scales poorly to large networks. In this paper, we propose two constraints that make RTRL scalable. We show that by either decomposing the network into independent modules or learning a recurrent network incrementally, we can make RTRL scale linearly with the number of parameters. Unlike prior scalable gradient estimation algorithms, such as UORO and Truncated-BPTT, our algorithms do not add noise or bias to the gradient estimate. Instead, they trade off the functional capacity of the recurrent network to achieve scalable learning. We demonstrate the effectiveness of our approach over Truncated-BPTT on a benchmark inspired by animal learning and in policy evaluation for expert Rainbow-DQN agents in the Arcade Learning Environment (ALE).
    Beyond Lipschitz: Sharp Generalization and Excess Risk Bounds for Full-Batch GD. (arXiv:2204.12446v5 [stat.ML] UPDATED)
    We provide sharp path-dependent generalization and excess risk guarantees for the full-batch Gradient Descent (GD) algorithm on smooth losses (possibly non-Lipschitz, possibly nonconvex). At the heart of our analysis is an upper bound on the generalization error, which implies that average output stability and a bounded expected optimization error at termination lead to generalization. This result shows that a small generalization error occurs along the optimization path, and allows us to bypass Lipschitz or sub-Gaussian assumptions on the loss prevalent in previous works. For nonconvex, convex, and strongly convex losses, we show the explicit dependence of the generalization error in terms of the accumulated path-dependent optimization error, terminal optimization error, number of samples, and number of iterations. For nonconvex smooth losses, we prove that full-batch GD efficiently generalizes close to any stationary point at termination, and recovers the generalization error guarantees of stochastic algorithms with fewer assumptions. For smooth convex losses, we show that the generalization error is tighter than existing bounds for SGD (up to one order of error magnitude). Consequently the excess risk matches that of SGD for quadratically less iterations. Lastly, for strongly convex smooth losses, we show that full-batch GD achieves essentially the same excess risk rate as compared with the state of the art on SGD, but with an exponentially smaller number of iterations (logarithmic in the dataset size).
    Out-of-Distribution Representation Learning for Time Series Classification. (arXiv:2209.07027v3 [cs.LG] UPDATED)
    Time series classification is an important problem in real world. Due to its non-stationary property that the distribution changes over time, it remains challenging to build models for generalization to unseen distributions. In this paper, we propose to view the time series classification problem from the distribution perspective. We argue that the temporal complexity attributes to the unknown latent distributions within. To this end, we propose DIVERSIFY to learn generalized representations for time series classification. DIVERSIFY takes an iterative process: it first obtains the worst-case distribution scenario via adversarial training, then matches the distributions of the obtained sub-domains. We also present some theoretical insights. We conduct experiments on gesture recognition, speech commands recognition, wearable stress and affect detection, and sensor-based human activity recognition with a total of seven datasets in different settings. Results demonstrate that DIVERSIFY significantly outperforms other baselines and effectively characterizes the latent distributions by qualitative and quantitative analysis. Code is available at: https://github.com/microsoft/robustlearn.
    On the Convergence of Stochastic Gradient Descent for Linear Inverse Problems in Banach Spaces. (arXiv:2302.05197v1 [cs.LG])
    In this work we consider stochastic gradient descent (SGD) for solving linear inverse problems in Banach spaces. SGD and its variants have been established as one of the most successful optimisation methods in machine learning, imaging and signal processing, etc. At each iteration SGD uses a single datum, or a small subset of data, resulting in highly scalable methods that are very attractive for large-scale inverse problems. Nonetheless, the theoretical analysis of SGD-based approaches for inverse problems has thus far been largely limited to Euclidean and Hilbert spaces. In this work we present a novel convergence analysis of SGD for linear inverse problems in general Banach spaces: we show the almost sure convergence of the iterates to the minimum norm solution and establish the regularising property for suitable a priori stopping criteria. Numerical results are also presented to illustrate features of the approach.
    DCSAU-Net: A Deeper and More Compact Split-Attention U-Net for Medical Image Segmentation. (arXiv:2202.00972v2 [eess.IV] CROSS LISTED)
    Deep learning architecture with convolutional neural network (CNN) achieves outstanding success in the field of computer vision. Where U-Net, an encoder-decoder architecture structured by CNN, makes a great breakthrough in biomedical image segmentation and has been applied in a wide range of practical scenarios. However, the equal design of every downsampling layer in the encoder part and simply stacked convolutions do not allow U-Net to extract sufficient information of features from different depths. The increasing complexity of medical images brings new challenges to the existing methods. In this paper, we propose a deeper and more compact split-attention u-shape network (DCSAU-Net), which efficiently utilises low-level and high-level semantic information based on two novel frameworks: primary feature conservation and compact split-attention block. We evaluate the proposed model on CVC-ClinicDB, 2018 Data Science Bowl, ISIC-2018 and SegPC-2021 datasets. As a result, DCSAU-Net displays better performance than other state-of-the-art (SOTA) methods in terms of the mean Intersection over Union (mIoU) and F1-socre. More significantly, the proposed model demonstrates excellent segmentation performance on challenging images. The code for our work and more technical details can be found at https://github.com/xq141839/DCSAU-Net.
    Controllability-Aware Unsupervised Skill Discovery. (arXiv:2302.05103v1 [cs.RO])
    One of the key capabilities of intelligent agents is the ability to discover useful skills without external supervision. However, the current unsupervised skill discovery methods are often limited to acquiring simple, easy-to-learn skills due to the lack of incentives to discover more complex, challenging behaviors. We introduce a novel unsupervised skill discovery method, Controllability-aware Skill Discovery (CSD), which actively seeks complex, hard-to-control skills without supervision. The key component of CSD is a controllability-aware distance function, which assigns larger values to state transitions that are harder to achieve with the current skills. Combined with distance-maximizing skill discovery, CSD progressively learns more challenging skills over the course of training as our jointly trained distance function reduces rewards for easy-to-achieve skills. Our experimental results in six robotic manipulation and locomotion environments demonstrate that CSD can discover diverse complex skills including object manipulation and locomotion skills with no supervision, significantly outperforming prior unsupervised skill discovery methods. Videos and code are available at https://sites.google.com/view/icml2023csd
    Monte Carlo Neural Operator for Learning PDEs via Probabilistic Representation. (arXiv:2302.05104v1 [cs.LG])
    Neural operators, which use deep neural networks to approximate the solution mappings of partial differential equation (PDE) systems, are emerging as a new paradigm for PDE simulation. The neural operators could be trained in supervised or unsupervised ways, i.e., by using the generated data or the PDE information. The unsupervised training approach is essential when data generation is costly or the data is less qualified (e.g., insufficient and noisy). However, its performance and efficiency have plenty of room for improvement. To this end, we design a new loss function based on the Feynman-Kac formula and call the developed neural operator Monte-Carlo Neural Operator (MCNO), which can allow larger temporal steps and efficiently handle fractional diffusion operators. Our analyses show that MCNO has advantages in handling complex spatial conditions and larger temporal steps compared with other unsupervised methods. Furthermore, MCNO is more robust with the perturbation raised by the numerical scheme and operator approximation. Numerical experiments on the diffusion equation and Navier-Stokes equation show significant accuracy improvement compared with other unsupervised baselines, especially for the vibrated initial condition and long-time simulation settings.
    Debiasing Recommendation by Learning Identifiable Latent Confounders. (arXiv:2302.05052v1 [cs.LG])
    Recommendation systems aim to predict users' feedback on items not exposed to them. Confounding bias arises due to the presence of unmeasured variables (e.g., the socio-economic status of a user) that can affect both a user's exposure and feedback. Existing methods either (1) make untenable assumptions about these unmeasured variables or (2) directly infer latent confounders from users' exposure. However, they cannot guarantee the identification of counterfactual feedback, which can lead to biased predictions. In this work, we propose a novel method, i.e., identifiable deconfounder (iDCF), which leverages a set of proxy variables (e.g., observed user features) to resolve the aforementioned non-identification issue. The proposed iDCF is a general deconfounded recommendation framework that applies proximal causal inference to infer the unmeasured confounders and identify the counterfactual feedback with theoretical guarantees. Extensive experiments on various real-world and synthetic datasets verify the proposed method's effectiveness and robustness.
    Event Temporal Relation Extraction with Bayesian Translational Model. (arXiv:2302.04985v1 [cs.CL])
    Existing models to extract temporal relations between events lack a principled method to incorporate external knowledge. In this study, we introduce Bayesian-Trans, a Bayesian learning-based method that models the temporal relation representations as latent variables and infers their values via Bayesian inference and translational functions. Compared to conventional neural approaches, instead of performing point estimation to find the best set parameters, the proposed model infers the parameters' posterior distribution directly, enhancing the model's capability to encode and express uncertainty about the predictions. Experimental results on the three widely used datasets show that Bayesian-Trans outperforms existing approaches for event temporal relation extraction. We additionally present detailed analyses on uncertainty quantification, comparison of priors, and ablation studies, illustrating the benefits of the proposed approach.
    DNArch: Learning Convolutional Neural Architectures by Backpropagation. (arXiv:2302.05400v1 [cs.LG])
    We present Differentiable Neural Architectures (DNArch), a method that jointly learns the weights and the architecture of Convolutional Neural Networks (CNNs) by backpropagation. In particular, DNArch allows learning (i) the size of convolutional kernels at each layer, (ii) the number of channels at each layer, (iii) the position and values of downsampling layers, and (iv) the depth of the network. To this end, DNArch views neural architectures as continuous multidimensional entities, and uses learnable differentiable masks along each dimension to control their size. Unlike existing methods, DNArch is not limited to a predefined set of possible neural components, but instead it is able to discover entire CNN architectures across all combinations of kernel sizes, widths, depths and downsampling. Empirically, DNArch finds performant CNN architectures for several classification and dense prediction tasks on both sequential and image data. When combined with a loss term that considers the network complexity, DNArch finds powerful architectures that respect a predefined computational budget.
    Archaeological Sites Detection with a Human-AI Collaboration Workflow. (arXiv:2302.05286v1 [cs.CV])
    This paper illustrates the results obtained by using pre-trained semantic segmentation deep learning models for the detection of archaeological sites within the Mesopotamian floodplains environment. The models were fine-tuned using openly available satellite imagery and vector shapes coming from a large corpus of annotations (i.e., surveyed sites). A randomized test showed that the best model reaches a detection accuracy in the neighborhood of 80%. Integrating domain expertise was crucial to define how to build the dataset and how to evaluate the predictions, since defining if a proposed mask counts as a prediction is very subjective. Furthermore, even an inaccurate prediction can be useful when put into context and interpreted by a trained archaeologist. Coming from these considerations we close the paper with a vision for a Human-AI collaboration workflow. Starting with an annotated dataset that is refined by the human expert we obtain a model whose predictions can either be combined to create a heatmap, to be overlaid on satellite and/or aerial imagery, or alternatively can be vectorized to make further analysis in a GIS software easier and automatic. In turn, the archaeologists can analyze the predictions, organize their onsite surveys, and refine the dataset with new, corrected, annotation
    Attention-based Domain Adaption Forecasting of Streamflow in Data Sparse Regions. (arXiv:2302.05386v1 [cs.LG])
    Streamflow forecasts are critical to guide water resource management, mitigate drought and flood effects, and develop climate-smart infrastructure and industries. Many global regions, however, have limited streamflow observations to guide evidence-based management strategies. In this paper, we propose an attention-based domain adaptation streamflow forecaster for data-sparse regions. Our approach leverages the hydrological characteristics of a data-rich source domain to induce effective 24h lead-time streamflow prediction in a limited target domain. Specifically, we employ a deep-learning framework leveraging domain adaptation techniques to simultaneously train streamflow predictions and discern between both domains using an adversarial method. Experiments against baseline cross-domain forecasting models show improved performance for 24h lead-time streamflow forecasting.
    Efficient Propagation of Uncertainty via Reordering Monte Carlo Samples. (arXiv:2302.04945v1 [cs.LG])
    Uncertainty analysis in the outcomes of model predictions is a key element in decision-based material design to establish confidence in the models and evaluate the fidelity of models. Uncertainty Propagation (UP) is a technique to determine model output uncertainties based on the uncertainty in its input variables. The most common and simplest approach to propagate the uncertainty from a model inputs to its outputs is by feeding a large number of samples to the model, known as Monte Carlo (MC) simulation which requires exhaustive sampling from the input variable distributions. However, MC simulations are impractical when models are computationally expensive. In this work, we investigate the hypothesis that while all samples are useful on average, some samples must be more useful than others. Thus, reordering MC samples and propagating more useful samples can lead to enhanced convergence in statistics of interest earlier and thus, reducing the computational burden of UP process. Here, we introduce a methodology to adaptively reorder MC samples and show how it results in reduction of computational expense of UP processes.
    On counterfactual inference with unobserved confounding. (arXiv:2211.08209v2 [cs.LG] UPDATED)
    Given an observational study with $n$ independent but heterogeneous units, our goal is to learn the counterfactual distribution for each unit using only one $p$-dimensional sample per unit containing covariates, interventions, and outcomes. Specifically, we allow for unobserved confounding that introduces statistical biases between interventions and outcomes as well as exacerbates the heterogeneity across units. Modeling the underlying joint distribution as an exponential family, we reduce learning the unit-level counterfactual distributions to learning $n$ exponential family distributions with heterogeneous parameters and only one sample per distribution. We introduce a convex objective that pools all $n$ samples to jointly learn all $n$ parameter vectors, and provide a unit-wise mean squared error bound that scales linearly with the metric entropy of the parameter space. For example, when the parameters are $s$-sparse linear combination of $k$ known vectors, the error is $O(s\log k/p)$. En route, we derive sufficient conditions for compactly supported distributions to satisfy the logarithmic Sobolev inequality. As an application of the framework, our results enable consistent imputation of sparsely missing covariates.
    Modeling Volatility and Dependence of European Carbon and Energy Prices. (arXiv:2208.14311v4 [q-fin.ST] UPDATED)
    We study the prices of European Emission Allowances (EUA), whereby we analyze their uncertainty and dependencies on related energy prices (natural gas, coal, and oil). We propose a probabilistic multivariate conditional time series model with a VECM-Copula-GARCH structure which exploits key characteristics of the data. Data are normalized with respect to inflation and carbon emissions to allow for proper cross-series evaluation. The forecasting performance is evaluated in an extensive rolling-window forecasting study, covering eight years out-of-sample. We discuss our findings for both levels- and log-transformed data, focusing on time-varying correlations, and in view of the Russian invasion of Ukraine.
    What is Wrong with Continual Learning in Medical Image Segmentation?. (arXiv:2010.11008v2 [cs.CV] UPDATED)
    Continual learning protocols are attracting increasing attention from the medical imaging community. In continual environments, datasets acquired under different conditions arrive sequentially; and each is only available for a limited period of time. Given the inherent privacy risks associated with medical data, this setup reflects the reality of deployment for deep learning diagnostic radiology systems. Many techniques exist to learn continuously for image classification, and several have been adapted to semantic segmentation. Yet most struggle to accumulate knowledge in a meaningful manner. Instead, they focus on preventing the problem of catastrophic forgetting, even when this reduces model plasticity and thereon burdens the training process. This puts into question whether the additional overhead of knowledge preservation is worth it - particularly for medical image segmentation, where computation requirements are already high - or if maintaining separate models would be a better solution. We propose UNEG, a simple and widely applicable multi-model benchmark that maintains separate segmentation and autoencoder networks for each training stage. The autoencoder is built from the same architecture as the segmentation network, which in our case is a full-resolution nnU-Net, to bypass any additional design decisions. During inference, the reconstruction error is used to select the most appropriate segmenter for each test image. Open this concept, we develop a fair evaluation scheme for different continual learning settings that moves beyond the prevention of catastrophic forgetting. Our results across three regions of interest (prostate, hippocampus, and right ventricle) show that UNEG outperforms several continual learning methods, reinforcing the need for strong baselines in continual learning research.
    Numerical Methods For PDEs Over Manifolds Using Spectral Physics Informed Neural Networks. (arXiv:2302.05322v1 [cs.LG])
    We introduce an approach for solving PDEs over manifolds using physics informed neural networks whose architecture aligns with spectral methods. The networks are trained to take in as input samples of an initial condition, a time stamp and point(s) on the manifold and then output the solution's value at the given time and point(s). We provide proofs of our method for the heat equation on the interval and examples of unique network architectures that are adapted to nonlinear equations on the sphere and the torus. We also show that our spectral-inspired neural network architectures outperform the standard physics informed architectures. Our extensive experimental results include generalization studies where the testing dataset of initial conditions is randomly sampled from a significantly larger space than the training set.
    CVTT: Cross-Validation Through Time. (arXiv:2205.05393v2 [cs.LG] UPDATED)
    The evaluation of recommender systems from a practical perspective is a topic of ongoing discourse within the research community. While many current evaluation methods reduce performance to a single value metric as an easy way to compare models, it relies on the assumption that the methods' performance remains constant over time. In this study, we examine this assumption and propose the Cross-Validation Thought Time (CVTT) technique as a more comprehensive evaluation method, focusing on model performance over time. By utilizing the proposed technique, we conduct an in-depth analysis of the performance of popular RecSys algorithms. Our findings indicate that (1) the performance of the recommenders varies over time for all reviewed datasets, (2) using simple evaluation approaches can lead to a substantial decrease in performance in real-world evaluation scenarios, and (3) excessive data usage can lead to suboptimal results.
    Scaling Vision Transformers to 22 Billion Parameters. (arXiv:2302.05442v1 [cs.CV])
    The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
    STERLING: Synergistic Representation Learning on Bipartite Graphs. (arXiv:2302.05428v1 [cs.LG])
    The bipartite graph is a powerful data structure for modeling interactions between two types of nodes, of which a fundamental challenge is how to extract informative node embeddings. Self-Supervised Learning (SSL) is a promising paradigm to address this challenge. Most recent bipartite graph SSL methods are based on contrastive learning which learns embeddings by discriminating positive and negative node pairs. Contrastive learning usually requires a large number of negative node pairs, which could lead to computational burden and semantic errors. In this paper, we introduce a novel synergistic representation learning model (STERLING) to learn node embeddings without negative node pairs. STERLING preserves the unique synergies in bipartite graphs. The local and global synergies are captured by maximizing the similarity of the inter-type and intra-type positive node pairs, and maximizing the mutual information of co-clusters respectively. Theoretical analysis demonstrates that STERLING could preserve the synergies in the embedding space. Extensive empirical evaluation on various benchmark datasets and tasks demonstrates the effectiveness of STERLING for extracting node embeddings.
    Query Processing on Tensor Computation Runtimes. (arXiv:2203.01877v4 [cs.DB] UPDATED)
    The huge demand for computation in artificial intelligence (AI) is driving unparalleled investments in hardware and software systems for AI. This leads to an explosion in the number of specialized hardware devices, which are now offered by major cloud vendors. By hiding the low-level complexity through a tensor-based interface, tensor computation runtimes (TCRs) such as PyTorch allow data scientists to efficiently exploit the exciting capabilities offered by the new hardware. In this paper, we explore how database management systems can ride the wave of innovation happening in the AI space. We design, build, and evaluate Tensor Query Processor (TQP): TQP transforms SQL queries into tensor programs and executes them on TCRs. TQP is able to run the full TPC-H benchmark by implementing novel algorithms for relational operators on the tensor routines. At the same time, TQP can support various hardware while only requiring a fraction of the usual development effort. Experiments show that TQP can improve query execution time by up to 10$\times$ over specialized CPU- and GPU-only systems. Finally, TQP can accelerate queries mixing ML predictions and SQL end-to-end, and deliver up to 9$\times$ speedup over CPU baselines.
    Optimizing Serially Concatenated Neural Codes with Classical Decoders. (arXiv:2212.10355v2 [cs.IT] UPDATED)
    For improving short-length codes, we demonstrate that classic decoders can also be used with real-valued, neural encoders, i.e., deep-learning based codeword sequence generators. Here, the classical decoder can be a valuable tool to gain insights into these neural codes and shed light on weaknesses. Specifically, the turbo-autoencoder is a recently developed channel coding scheme where both encoder and decoder are replaced by neural networks. We first show that the limited receptive field of convolutional neural network (CNN)-based codes enables the application of the BCJR algorithm to optimally decode them with feasible computational complexity. These maximum a posteriori (MAP) component decoders then are used to form classical (iterative) turbo decoders for parallel or serially concatenated CNN encoders, offering a close-to-maximum likelihood (ML) decoding of the learned codes. To the best of our knowledge, this is the first time that a classical decoding algorithm is applied to a non-trivial, real-valued neural code. Furthermore, as the BCJR algorithm is fully differentiable, it is possible to train, or fine-tune, the neural encoder in an end-to-end fashion.
    Discovering Sparse Representations of Lie Groups with Machine Learning. (arXiv:2302.05383v1 [hep-ph])
    Recent work has used deep learning to derive symmetry transformations, which preserve conserved quantities, and to obtain the corresponding algebras of generators. In this letter, we extend this technique to derive sparse representations of arbitrary Lie algebras. We show that our method reproduces the canonical (sparse) representations of the generators of the Lorentz group, as well as the $U(n)$ and $SU(n)$ families of Lie groups. This approach is completely general and can be used to find the infinitesimal generators for any Lie group.
    Heterogeneous Graph Masked Autoencoders. (arXiv:2208.09957v2 [cs.LG] UPDATED)
    Generative self-supervised learning (SSL), especially masked autoencoders, has become one of the most exciting learning paradigms and has shown great potential in handling graph data. However, real-world graphs are always heterogeneous, which poses three critical challenges that existing methods ignore: 1) how to capture complex graph structure? 2) how to incorporate various node attributes? and 3) how to encode different node positions? In light of this, we study the problem of generative SSL on heterogeneous graphs and propose HGMAE, a novel heterogeneous graph masked autoencoder model to address these challenges. HGMAE captures comprehensive graph information via two innovative masking techniques and three unique training strategies. In particular, we first develop metapath masking and adaptive attribute masking with dynamic mask rate to enable effective and stable learning on heterogeneous graphs. We then design several training strategies including metapath-based edge reconstruction to adopt complex structural information, target attribute restoration to incorporate various node attributes, and positional feature prediction to encode node positional information. Extensive experiments demonstrate that HGMAE outperforms both contrastive and generative state-of-the-art baselines on several tasks across multiple datasets. Codes are available at https://github.com/meettyj/HGMAE.
    Forward Learning with Top-Down Feedback: Empirical and Analytical Characterization. (arXiv:2302.05440v1 [cs.LG])
    "Forward-only" algorithms, which train neural networks while avoiding a backward pass, have recently gained attention as a way of solving the biologically unrealistic aspects of backpropagation. Here, we first discuss the similarities between two "forward-only" algorithms, the Forward-Forward and PEPITA frameworks, and demonstrate that PEPITA is equivalent to a Forward-Forward with top-down feedback connections. Then, we focus on PEPITA to address compelling challenges related to the "forward-only" rules, which include providing an analytical understanding of their dynamics and reducing the gap between their performance and that of backpropagation. We propose a theoretical analysis of the dynamics of PEPITA. In particular, we show that PEPITA is well-approximated by an "adaptive-feedback-alignment" algorithm and we analytically track its performance during learning in a prototype high-dimensional setting. Finally, we develop a strategy to apply the weight mirroring algorithm on "forward-only" algorithms with top-down feedback and we show how it impacts PEPITA's accuracy and convergence rate.
    Gamma-convergence of a nonlocal perimeter arising in adversarial machine learning. (arXiv:2211.15223v3 [math.AP] UPDATED)
    In this paper we prove Gamma-convergence of a nonlocal perimeter of Minkowski type to a local anisotropic perimeter. The nonlocal model describes the regularizing effect of adversarial training in binary classifications. The energy essentially depends on the interaction between two distributions modelling likelihoods for the associated classes. We overcome typical strict regularity assumptions for the distributions by only assuming that they have bounded $BV$ densities. In the natural topology coming from compactness, we prove Gamma-convergence to a weighted perimeter with weight determined by an anisotropic function of the two densities. Despite being local, this sharp interface limit reflects classification stability with respect to adversarial perturbations. We further apply our results to deduce Gamma-convergence of the associated total variations, to study the asymptotics of adversarial training, and to prove Gamma-convergence of graph discretizations for the nonlocal perimeter.
    The Role of Codeword-to-Class Assignments in Error-Correcting Codes: An Empirical Study. (arXiv:2302.05334v1 [cs.LG])
    Error-correcting codes (ECC) are used to reduce multiclass classification tasks to multiple binary classification subproblems. In ECC, classes are represented by the rows of a binary matrix, corresponding to codewords in a codebook. Codebooks are commonly either predefined or problem dependent. Given predefined codebooks, codeword-to-class assignments are traditionally overlooked, and codewords are implicitly assigned to classes arbitrarily. Our paper shows that these assignments play a major role in the performance of ECC. Specifically, we examine similarity-preserving assignments, where similar codewords are assigned to similar classes. Addressing a controversy in existing literature, our extensive experiments confirm that similarity-preserving assignments induce easier subproblems and are superior to other assignment policies in terms of their generalization performance. We find that similarity-preserving assignments make predefined codebooks become problem-dependent, without altering other favorable codebook properties. Finally, we show that our findings can improve predefined codebooks dedicated to extreme classification.
    DevFormer: A Symmetric Transformer for Context-Aware Device Placement. (arXiv:2205.13225v2 [cs.LG] UPDATED)
    In this paper, we present DevFormer, a novel transformer-based architecture for addressing the complex and computationally demanding problem of hardware design optimization. Despite the demonstrated efficacy of transformers in domains including natural language processing and computer vision, their use in hardware design has been limited by the scarcity of offline data. Our approach addresses this limitation by introducing strong inductive biases such as relative positional embeddings and action-permutation symmetricity that effectively capture the hardware context and enable efficient design optimization with limited offline data. We apply DevFormer to the problem of decoupling capacitor placement and show that it outperforms state-of-the-art methods in both simulated and real hardware, leading to improved performances while reducing the number of components by more than 30%. Finally, we show that our approach achieves promising results in other offline contextual learning-based combinatorial optimization tasks.
    PLOT: Prompt Learning with Optimal Transport for Vision-Language Models. (arXiv:2210.01253v2 [cs.CV] UPDATED)
    With the increasing attention to large vision-language models such as CLIP, there has been a significant amount of effort dedicated to building efficient prompts. Unlike conventional methods of only learning one single prompt, we propose to learn multiple comprehensive prompts to describe diverse characteristics of categories such as intrinsic attributes or extrinsic contexts. However, directly matching each prompt to the same visual feature is problematic, as it pushes the prompts to converge to one point. To solve this problem, we propose to apply optimal transport to match the vision and text modalities. Specifically, we first model images and the categories with visual and textual feature sets. Then, we apply a two-stage optimization strategy to learn the prompts. In the inner loop, we optimize the optimal transport distance to align visual features and prompts by the Sinkhorn algorithm, while in the outer loop, we learn the prompts by this distance from the supervised data. Extensive experiments are conducted on the few-shot recognition task and the improvement demonstrates the superiority of our method. The code is available at https://github.com/CHENGY12/PLOT.
    Predicting the cardinality of a reduced Gr\"obner basis. (arXiv:2302.05364v1 [math.AC])
    We use ansatz neural network models to predict key metrics of complexity for Gr\"obner bases of binomial ideals. This work illustrates why predictions with neural networks from Gr\"obner computations are not a straightforward process. Using two probabilistic models for random binomial ideals, we generate and make available a large data set that is able to capture sufficient variability in Gr\"obner complexity. We use this data to train neural networks and predict the cardinality of a reduced Gr\"obner basis and the maximum total degree of its elements. While the cardinality prediction problem is unlike classical problems tackled by machine learning, our simulations show that neural networks, providing performance statistics such as $r^2 = 0.401$, outperform naive guess or multiple regression models with $r^2 = 0.180$.
    A Practical Mixed Precision Algorithm for Post-Training Quantization. (arXiv:2302.05397v1 [cs.LG])
    Neural network quantization is frequently used to optimize model size, latency and power consumption for on-device deployment of neural networks. In many cases, a target bit-width is set for an entire network, meaning every layer get quantized to the same number of bits. However, for many networks some layers are significantly more robust to quantization noise than others, leaving an important axis of improvement unused. As many hardware solutions provide multiple different bit-width settings, mixed-precision quantization has emerged as a promising solution to find a better performance-efficiency trade-off than homogeneous quantization. However, most existing mixed precision algorithms are rather difficult to use for practitioners as they require access to the training data, have many hyper-parameters to tune or even depend on end-to-end retraining of the entire model. In this work, we present a simple post-training mixed precision algorithm that only requires a small unlabeled calibration dataset to automatically select suitable bit-widths for each layer for desirable on-device performance. Our algorithm requires no hyper-parameter tuning, is robust to data variation and takes into account practical hardware deployment constraints making it a great candidate for practical use. We experimentally validate our proposed method on several computer vision tasks, natural language processing tasks and many different networks, and show that we can find mixed precision networks that provide a better trade-off between accuracy and efficiency than their homogeneous bit-width equivalents.
    Aerial View Goal Localization with Reinforcement Learning. (arXiv:2209.03694v2 [cs.CV] UPDATED)
    Climate-induced disasters are and will continue to be on the rise, and thus search-and-rescue (SAR) operations, where the task is to localize and assist one or several people who are missing, become increasingly relevant. In many cases the rough location may be known and a UAV can be deployed to explore a given, confined area to precisely localize the missing people. Due to time and battery constraints it is often critical that localization is performed as efficiently as possible. In this work we approach this type of problem by abstracting it as an aerial view goal localization task in a framework that emulates a SAR-like setup without requiring access to actual UAVs. In this framework, an agent operates on top of an aerial image (proxy for a search area) and is tasked with localizing a goal that is described in terms of visual cues. To further mimic the situation on an actual UAV, the agent is not able to observe the search area in its entirety, not even at low resolution, and thus it has to operate solely based on partial glimpses when navigating towards the goal. To tackle this task, we propose AiRLoc, a reinforcement learning (RL)-based model that decouples exploration (searching for distant goals) and exploitation (localizing nearby goals). Extensive evaluations show that AiRLoc outperforms heuristic search methods as well as alternative learnable approaches, and that it generalizes across datasets, e.g. to disaster-hit areas without seeing a single disaster scenario during training. We also conduct a proof-of-concept study which indicates that the learnable methods outperform humans on average. Code and models have been made publicly available at https://github.com/aleksispi/airloc.
    SE(3)-Equivariant Attention Networks for Shape Reconstruction in Function Space. (arXiv:2204.02394v2 [cs.CV] UPDATED)
    We propose a method for 3D shape reconstruction from unoriented point clouds. Our method consists of a novel SE(3)-equivariant coordinate-based network (TF-ONet), that parametrizes the occupancy field of the shape and respects the inherent symmetries of the problem. In contrast to previous shape reconstruction methods that align the input to a regular grid, we operate directly on the irregular point cloud. Our architecture leverages equivariant attention layers that operate on local tokens. This mechanism enables local shape modelling, a crucial property for scalability to large scenes. Given an unoriented, sparse, noisy point cloud as input, we produce equivariant features for each point. These serve as keys and values for the subsequent equivariant cross-attention blocks that parametrize the occupancy field. By querying an arbitrary point in space, we predict its occupancy score. We show that our method outperforms previous SO(3)-equivariant methods, as well as non-equivariant methods trained on SO(3)-augmented datasets. More importantly, local modelling together with SE(3)-equivariance create an ideal setting for SE(3) scene reconstruction. We show that by training only on single, aligned objects and without any pre-segmentation, we can reconstruct novel scenes containing arbitrarily many objects in random poses without any performance loss.
    Reinforcement Learning for Protocol Synthesis in Resource-Constrained Wireless Sensor and IoT Networks. (arXiv:2302.05300v1 [cs.NI])
    This article explores the concepts of online protocol synthesis using Reinforcement Learning (RL). The study is performed in the context of sensor and IoT networks with ultra low complexity wireless transceivers. The paper introduces the use of RL and Multi Armed Bandit (MAB), a specific type of RL, for Medium Access Control (MAC) under different network and traffic conditions. It then introduces a novel learning based protocol synthesis framework that addresses specific difficulties and limitations in medium access for both random access and time slotted networks. The mechanism does not rely on carrier sensing, network time-synchronization, collision detection, and other low level complex operations, thus making it ideal for ultra simple transceiver hardware used in resource constrained sensor and IoT networks. Additionally, the ability of independent protocol learning by the nodes makes the system robust and adaptive to the changes in network and traffic conditions. It is shown that the nodes can be trained to learn to avoid collisions, and to achieve network throughputs that are comparable to ALOHA based access protocols in sensor and IoT networks with simplest transceiver hardware. It is also shown that using RL, it is feasible to synthesize access protocols that can sustain network throughput at high traffic loads, which is not feasible in the ALOHA-based systems. The ability of the system to provide throughput fairness under network and traffic heterogeneities are also experimentally demonstrated.
    Improving Object-centric Learning with Query Optimization. (arXiv:2210.08990v2 [cs.CV] UPDATED)
    The ability to decompose complex natural scenes into meaningful object-centric abstractions lies at the core of human perception and reasoning. In the recent culmination of unsupervised object-centric learning, the Slot-Attention module has played an important role with its simple yet effective design and fostered many powerful variants. These methods, however, have been exceedingly difficult to train without supervision and are ambiguous in the notion of object, especially for complex natural scenes. In this paper, we propose to address these issues by investigating the potential of learnable queries as initializations for Slot-Attention learning, uniting it with efforts from existing attempts on improving Slot-Attention learning with bi-level optimization. With simple code adjustments on Slot-Attention, our model, Bi-level Optimized Query Slot Attention, achieves state-of-the-art results on 3 challenging synthetic and 7 complex real-world datasets in unsupervised image segmentation and reconstruction, outperforming previous baselines by a large margin. We provide thorough ablative studies to validate the necessity and effectiveness of our design. Additionally, our model exhibits great potential for concept binding and zero-shot learning. Our work is made publicly available at https://bo-qsa.github.io
    Temporal Domain Generalization with Drift-Aware Dynamic Neural Networks. (arXiv:2205.10664v3 [cs.LG] UPDATED)
    Temporal domain generalization is a promising yet extremely challenging area where the goal is to learn models under temporally changing data distributions and generalize to unseen data distributions following the trends of the change. The advancement of this area is challenged by: 1) characterizing data distribution drift and its impacts on models, 2) expressiveness in tracking the model dynamics, and 3) theoretical guarantee on the performance. To address them, we propose a Temporal Domain Generalization with Drift-Aware Dynamic Neural Network (DRAIN) framework. Specifically, we formulate the problem into a Bayesian framework that jointly models the relation between data and model dynamics. We then build a recurrent graph generation scenario to characterize the dynamic graph-structured neural networks learned across different time points. It captures the temporal drift of model parameters and data distributions and can predict models in the future without the presence of future data. In addition, we explore theoretical guarantees of the model performance under the challenging temporal DG setting and provide theoretical analysis, including uncertainty and generalization error. Finally, extensive experiments on several real-world benchmarks with temporal drift demonstrate the effectiveness and efficiency of the proposed method.
    Interventional Causal Representation Learning. (arXiv:2209.11924v3 [stat.ML] UPDATED)
    Causal representation learning seeks to extract high-level latent factors from low-level sensory data. Most existing methods rely on observational data and structural assumptions (e.g., conditional independence) to identify the latent factors. However, interventional data is prevalent across applications. Can interventional data facilitate causal representation learning? We explore this question in this paper. The key observation is that interventional data often carries geometric signatures of the latent factors' support (i.e. what values each latent can possibly take). For example, when the latent factors are causally connected, interventions can break the dependency between the intervened latents' support and their ancestors'. Leveraging this fact, we prove that the latent causal factors can be identified up to permutation and scaling given data from perfect $do$ interventions. Moreover, we can achieve block affine identification, namely the estimated latent factors are only entangled with a few other latents if we have access to data from imperfect interventions. These results highlight the unique power of interventional data in causal representation learning; they can enable provable identification of latent factors without any assumptions about their distributions or dependency structure.
    Trustworthy AI Inference Systems: An Industry Research View. (arXiv:2008.04449v2 [cs.CR] UPDATED)
    In this work, we provide an industry research view for approaching the design, deployment, and operation of trustworthy Artificial Intelligence (AI) inference systems. Such systems provide customers with timely, informed, and customized inferences to aid their decision, while at the same time utilizing appropriate security protection mechanisms for AI models. Additionally, such systems should also use Privacy-Enhancing Technologies (PETs) to protect customers' data at any time. To approach the subject, we start by introducing current trends in AI inference systems. We continue by elaborating on the relationship between Intellectual Property (IP) and private data protection in such systems. Regarding the protection mechanisms, we survey the security and privacy building blocks instrumental in designing, building, deploying, and operating private AI inference systems. For example, we highlight opportunities and challenges in AI systems using trusted execution environments combined with more recent advances in cryptographic techniques to protect data in use. Finally, we outline areas of further development that require the global collective attention of industry, academia, and government researchers to sustain the operation of trustworthy AI inference systems.
    Dive into Deep Learning. (arXiv:2106.11342v4 [cs.LG] UPDATED)
    This open-source book represents our attempt to make deep learning approachable, teaching readers the concepts, the context, and the code. The entire book is drafted in Jupyter notebooks, seamlessly integrating exposition figures, math, and interactive examples with self-contained code. Our goal is to offer a resource that could (i) be freely available for everyone; (ii) offer sufficient technical depth to provide a starting point on the path to actually becoming an applied machine learning scientist; (iii) include runnable code, showing readers how to solve problems in practice; (iv) allow for rapid updates, both by us and also by the community at large; (v) be complemented by a forum for interactive discussion of technical details and to answer questions.
    Removing Structured Noise with Diffusion Models. (arXiv:2302.05290v1 [cs.LG])
    Solving ill-posed inverse problems requires careful formulation of prior beliefs over the signals of interest and an accurate description of their manifestation into noisy measurements. Handcrafted signal priors based on e.g. sparsity are increasingly replaced by data-driven deep generative models, and several groups have recently shown that state-of-the-art score-based diffusion models yield particularly strong performance and flexibility. In this paper, we show that the powerful paradigm of posterior sampling with diffusion models can be extended to include rich, structured, noise models. To that end, we propose a joint conditional reverse diffusion process with learned scores for the noise and signal-generating distribution. We demonstrate strong performance gains across various inverse problems with structured noise, outperforming competitive baselines that use normalizing flows and adversarial networks. This opens up new opportunities and relevant practical applications of diffusion modeling for inverse problems in the context of non-Gaussian measurements.
    Achieving Linear Speedup in Non-IID Federated Bilevel Learning. (arXiv:2302.05412v1 [cs.LG])
    Federated bilevel optimization has received increasing attention in various emerging machine learning and communication applications. Recently, several Hessian-vector-based algorithms have been proposed to solve the federated bilevel optimization problem. However, several important properties in federated learning such as the partial client participation and the linear speedup for convergence (i.e., the convergence rate and complexity are improved linearly with respect to the number of sampled clients) in the presence of non-i.i.d.~datasets, still remain open. In this paper, we fill these gaps by proposing a new federated bilevel algorithm named FedMBO with a novel client sampling scheme in the federated hypergradient estimation. We show that FedMBO achieves a convergence rate of $\mathcal{O}\big(\frac{1}{\sqrt{nK}}+\frac{1}{K}+\frac{\sqrt{n}}{K^{3/2}}\big)$ on non-i.i.d.~datasets, where $n$ is the number of participating clients in each round, and $K$ is the total number of iteration. This is the first theoretical linear speedup result for non-i.i.d.~federated bilevel optimization. Extensive experiments validate our theoretical results and demonstrate the effectiveness of our proposed method.
    Structured Prediction Problem Archive. (arXiv:2202.03574v3 [cs.LG] UPDATED)
    Structured prediction problems are one of the fundamental tools in machine learning. In order to facilitate algorithm development for their numerical solution, we collect in one place a large number of datasets in easy to read formats for a diverse set of problem classes. We provide archival links to datasets, description of the considered problems and problem formats, and a short summary of problem characteristics including size, number of instances etc. For reference we also give a non-exhaustive selection of algorithms proposed in the literature for their solution. We hope that this central repository will make benchmarking and comparison to established works easier. We welcome submission of interesting new datasets and algorithms for inclusion in our archive.
    Toric Geometry of Entropic Regularization. (arXiv:2202.01571v2 [math.OC] UPDATED)
    Entropic regularization is a method for large-scale linear programming. Geometrically, one traces intersections of the feasible polytope with scaled toric varieties, starting at the Birch point. We compare this to log-barrier methods, with reciprocal linear spaces, starting at the analytic center. We revisit entropic regularization for unbalanced optimal transport, and we develop the use of optimal conic couplings. We compute the degree of the associated toric variety, and we explore algorithms like iterative scaling.
    Optimization of Convolutional Neural Network Using the Linearly Decreasing Weight Particle Swarm Optimization. (arXiv:2001.05670v3 [cs.NE] UPDATED)
    Convolutional neural network (CNN) is one of the most frequently used deep learning techniques. Various forms of models have been proposed and im-proved for learning at CNN. When learning with CNN, it is necessary to determine the optimal hyperparameters. However, the number of hyperparameters is so large that it is difficult to do it manually, so much research has been done on automation. A method that uses metaheuristic algorithms is attracting attention in research on hyperparameter optimization. Metaheuristic algorithms are naturally inspired and include evolution strategies, genetic algorithms, antcolony optimization and particle swarm optimization. In particular, particle swarm optimization converges faster than genetic algorithms, and various models have been proposed. In this paper, we pro-pose CNN hyperparameter optimization with linearly decreasing weight particle swarm optimization (LDWPSO). In the experiment, the MNIST data set and CIFAR-10 data set, which are often used as benchmark data sets, are used. By opti-mizing CNN hyperparameters with LDWPSO, learning the MNIST and CIFAR-10 datasets, we compare the accuracy with a standard CNN based on LeNet-5. As a result, when using the MNIST dataset, the baseline CNN is 94.02% at the 5th epoch, compared to 98.95% for LDWPSO CNN, which improves accuracy. When using the CIFAR-10 dataset, the Baseline CNN is 28.07% at the 10th epoch, compared to 69.37% for the LDWPSO CNN, which greatly improves accuracy. This paper is presented at the 36th Annual Conference of the Japanese Society for Artificial In-telligence. The final version is available at the following URL: https://www.jst-age.jst.go.jp/article/pjsai/JSAI2022/0/JSAI2022_2S4IS2b03/_article/-char/en
    On Average-Case Error Bounds for Kernel-Based Bayesian Quadrature. (arXiv:2202.10615v2 [stat.ML] UPDATED)
    In this paper, we study error bounds for {\em Bayesian quadrature} (BQ), with an emphasis on noisy settings, randomized algorithms, and average-case performance measures. We seek to approximate the integral of functions in a {\em Reproducing Kernel Hilbert Space} (RKHS), particularly focusing on the Mat\'ern-$\nu$ and squared exponential (SE) kernels, with samples from the function potentially being corrupted by Gaussian noise. We provide a two-step meta-algorithm that serves as a general tool for relating the average-case quadrature error with the $L^2$-function approximation error. When specialized to the Mat\'ern kernel, we recover an existing near-optimal error rate while avoiding the existing method of repeatedly sampling points. When specialized to other settings, we obtain new average-case results for settings including the SE kernel with noise and the Mat\'ern kernel with misspecification. Finally, we present algorithm-independent lower bounds that have greater generality and/or give distinct proofs compared to existing ones.
    Multilingual Normalization of Temporal Expressions with Masked Language Models. (arXiv:2205.10399v2 [cs.CL] UPDATED)
    The detection and normalization of temporal expressions is an important task and preprocessing step for many applications. However, prior work on normalization is rule-based, which severely limits the applicability in real-world multilingual settings, due to the costly creation of new rules. We propose a novel neural method for normalizing temporal expressions based on masked language modeling. Our multilingual method outperforms prior rule-based systems in many languages, and in particular, for low-resource languages with performance improvements of up to 33 F1 on average compared to the state of the art.
    How Biased are Your Features?: Computing Fairness Influence Functions with Global Sensitivity Analysis. (arXiv:2206.00667v2 [cs.LG] UPDATED)
    Fairness in machine learning has attained significant focus due to the widespread application in high-stake decision-making tasks. Unregulated machine learning classifiers can exhibit bias towards certain demographic groups in data, thus the quantification and mitigation of classifier bias is a central concern in fairness in machine learning. In this paper, we aim to quantify the influence of different features in a dataset on the bias of a classifier. To do this, we introduce the Fairness Influence Function (FIF). This function breaks down bias into its components among individual features and the intersection of multiple features. The key idea is to represent existing group fairness metrics as the difference of the scaled conditional variances in the classifier's prediction and apply a decomposition of variance according to global sensitivity analysis. To estimate FIFs, we instantiate an algorithm FairXplainer that applies variance decomposition of classifier's prediction following local regression. Experiments demonstrate that FairXplainer captures FIFs of individual feature and intersectional features, provides a better approximation of bias based on FIFs, demonstrates higher correlation of FIFs with fairness interventions, and detects changes in bias due to fairness affirmative/punitive actions in the classifier.
    Span-based Named Entity Recognition by Generating and Compressing Information. (arXiv:2302.05392v1 [cs.CL])
    The information bottleneck (IB) principle has been proven effective in various NLP applications. The existing work, however, only used either generative or information compression models to improve the performance of the target task. In this paper, we propose to combine the two types of IB models into one system to enhance Named Entity Recognition (NER). For one type of IB model, we incorporate two unsupervised generative components, span reconstruction and synonym generation, into a span-based NER system. The span reconstruction ensures that the contextualised span representation keeps the span information, while the synonym generation makes synonyms have similar representations even in different contexts. For the other type of IB model, we add a supervised IB layer that performs information compression into the system to preserve useful features for NER in the resulting span representations. Experiments on five different corpora indicate that jointly training both generative and information compression models can enhance the performance of the baseline span-based NER system. Our source code is publicly available at https://github.com/nguyennth/joint-ib-models.
    Should You Mask 15% in Masked Language Modeling?. (arXiv:2202.08005v3 [cs.CL] UPDATED)
    Masked language models (MLMs) conventionally mask 15% of tokens due to the belief that more masking would leave insufficient context to learn good representations; this masking rate has been widely used, regardless of model sizes or masking strategies. In this work, we revisit this important choice of MLM pre-training. We first establish that 15% is not universally optimal, and larger models should adopt a higher masking rate. Specifically, we find that masking 40% outperforms 15% for BERT-large size models on GLUE and SQuAD. Interestingly, an extremely high masking rate of 80% can still preserve 95% fine-tuning performance and most of the accuracy in linguistic probing, challenging the conventional wisdom about the role of the masking rate. We then examine the interplay between masking rates and masking strategies and find that uniform masking requires a higher masking rate compared to sophisticated masking strategies such as span or PMI masking. Finally, we argue that increasing the masking rate has two distinct effects: it leads to more corruption, which makes the prediction task more difficult; it also enables more predictions, which benefits optimization. Using this framework, we revisit BERT's 80-10-10 corruption strategy. Together, our results contribute to a better understanding of MLM pre-training.
    A Second-Order Method for Stochastic Bandit Convex Optimisation. (arXiv:2302.05371v1 [cs.LG])
    We introduce a simple and efficient algorithm for unconstrained zeroth-order stochastic convex bandits and prove its regret is at most $(1 + r/d)[d^{1.5} \sqrt{n} + d^3] polylog(n, d, r)$ where $n$ is the horizon, $d$ the dimension and $r$ is the radius of a known ball containing the minimiser of the loss.
    Squeeze Training for Adversarial Robustness. (arXiv:2205.11156v2 [cs.LG] UPDATED)
    The vulnerability of deep neural networks (DNNs) to adversarial examples has attracted great attention in the machine learning community. The problem is related to non-flatness and non-smoothness of normally obtained loss landscapes. Training augmented with adversarial examples (a.k.a., adversarial training) is considered as an effective remedy. In this paper, we highlight that some collaborative examples, nearly perceptually indistinguishable from both adversarial and benign examples yet show extremely lower prediction loss, can be utilized to enhance adversarial training. A novel method is therefore proposed to achieve new state-of-the-arts in adversarial robustness. Code: https://github.com/qizhangli/ST-AT.
    Fine-tuning Partition-aware Item Similarities for Efficient and Scalable Recommendation. (arXiv:2207.05959v2 [cs.IR] UPDATED)
    Collaborative filtering (CF) is widely searched in recommendation with various types of solutions. Recent success of Graph Convolution Networks (GCN) in CF demonstrates the effectiveness of modeling high-order relationships through graphs, while repetitive graph convolution and iterative batch optimization limit their efficiency. Instead, item similarity models attempt to construct direct relationships through efficient interaction encoding. Despite their great performance, the growing item numbers result in quadratic growth in similarity modeling process, posing critical scalability problems. In this paper, we investigate the graph sampling strategy adopted in latest GCN model for efficiency improving, and identify the potential item group structure in the sampled graph. Based on this, we propose a novel item similarity model which introduces graph partitioning to restrict the item similarity modeling within each partition. Specifically, we show that the spectral information of the original graph is well in preserving global-level information. Then, it is added to fine-tune local item similarities with a new data augmentation strategy acted as partition-aware prior knowledge, jointly to cope with the information loss brought by partitioning. Experiments carried out on 4 datasets show that the proposed model outperforms state-of-the-art GCN models with 10x speed-up and item similarity models with 95\% parameter storage savings.
    A novel corrective-source term approach to modeling unknown physics in aluminum extraction process. (arXiv:2209.10861v2 [cs.LG] UPDATED)
    With the ever-increasing availability of data, there has been an explosion of interest in applying modern machine learning methods to fields such as modeling and control. However, despite the flexibility and surprising accuracy of such black-box models, it remains difficult to trust them. Recent efforts to combine the two approaches aim to develop flexible models that nonetheless generalize well; a paradigm we call Hybrid Analysis and modeling (HAM). In this work we investigate the Corrective Source Term Approach (CoSTA), which uses a data-driven model to correct a misspecified physics-based model. This enables us to develop models that make accurate predictions even when the underlying physics of the problem is not well understood. We apply CoSTA to model the Hall-H\'eroult process in an aluminum electrolysis cell. We demonstrate that the method improves both accuracy and predictive stability, yielding an overall more trustworthy model.
    Communication-Efficient Federated Hypergradient Computation via Aggregated Iterative Differentiation. (arXiv:2302.04969v1 [cs.LG])
    Federated bilevel optimization has attracted increasing attention due to emerging machine learning and communication applications. The biggest challenge lies in computing the gradient of the upper-level objective function (i.e., hypergradient) in the federated setting due to the nonlinear and distributed construction of a series of global Hessian matrices. In this paper, we propose a novel communication-efficient federated hypergradient estimator via aggregated iterative differentiation (AggITD). AggITD is simple to implement and significantly reduces the communication cost by conducting the federated hypergradient estimation and the lower-level optimization simultaneously. We show that the proposed AggITD-based algorithm achieves the same sample complexity as existing approximate implicit differentiation (AID)-based approaches with much fewer communication rounds in the presence of data heterogeneity. Our results also shed light on the great advantage of ITD over AID in the federated/distributed hypergradient estimation. This differs from the comparison in the non-distributed bilevel optimization, where ITD is less efficient than AID. Our extensive experiments demonstrate the great effectiveness and communication efficiency of the proposed method.  ( 2 min )
    A SWAT-based Reinforcement Learning Framework for Crop Management. (arXiv:2302.04988v1 [cs.LG])
    Crop management involves a series of critical, interdependent decisions or actions in a complex and highly uncertain environment, which exhibit distinct spatial and temporal variations. Managing resource inputs such as fertilizer and irrigation in the face of climate change, dwindling supply, and soaring prices is nothing short of a Herculean task. The ability of machine learning to efficiently interrogate complex, nonlinear, and high-dimensional datasets can revolutionize decision-making in agriculture. In this paper, we introduce a reinforcement learning (RL) environment that leverages the dynamics in the Soil and Water Assessment Tool (SWAT) and enables management practices to be assessed and evaluated on a watershed level. This drastically saves time and resources that would have been otherwise deployed during a full-growing season. We consider crop management as an optimization problem where the objective is to produce higher crop yield while minimizing the use of external farming inputs (specifically, fertilizer and irrigation amounts). The problem is naturally subject to environmental factors such as precipitation, solar radiation, temperature, and soil water content. We demonstrate the utility of our framework by developing and benchmarking various decision-making agents following management strategies informed by standard farming practices and state-of-the-art RL algorithms.  ( 2 min )
    ChemVise: Maximizing Out-of-Distribution Chemical Detection with the Novel Application of Zero-Shot Learning. (arXiv:2302.04917v1 [cs.LG])
    Accurate chemical sensors are vital in medical, military, and home safety applications. Training machine learning models to be accurate on real world chemical sensor data requires performing many diverse, costly experiments in controlled laboratory settings to create a data set. In practice even expensive, large data sets may be insufficient for generalization of a trained model to a real-world testing distribution. Rather than perform greater numbers of experiments requiring exhaustive mixtures of chemical analytes, this research proposes learning approximations of complex exposures from training sets of simple ones by using single-analyte exposure signals as building blocks of a multiple-analyte space. We demonstrate this approach to synthetic sensor responses surprisingly improves the detection of out-of-distribution obscured chemical analytes. Further, we pair these synthetic signals to targets in an information-dense representation space utilizing a large corpus of chemistry knowledge. Through utilization of a semantically meaningful analyte representation spaces along with synthetic targets we achieve rapid analyte classification in the presence of obscurants without corresponding obscured-analyte training data. Transfer learning for supervised learning with molecular representations makes assumptions about the input data. Instead, we borrow from the natural language and natural image processing literature for a novel approach to chemical sensor signal classification using molecular semantics for arbitrary chemical sensor hardware designs.  ( 2 min )
    Short-Term Aggregated Residential Load Forecasting using BiLSTM and CNN-BiLSTM. (arXiv:2302.05033v1 [cs.LG])
    Higher penetration of renewable and smart home technologies at the residential level challenges grid stability as utility-customer interactions add complexity to power system operations. In response, short-term residential load forecasting has become an increasing area of focus. However, forecasting at the residential level is challenging due to the higher uncertainties involved. Recently deep neural networks have been leveraged to address this issue. This paper investigates the capabilities of a bidirectional long short-term memory (BiLSTM) and a convolutional neural network-based BiLSTM (CNN-BiLSTM) to provide a day ahead (24 hr.) forecasting at an hourly resolution while minimizing the root mean squared error (RMSE) between the actual and predicted load demand. Using a publicly available dataset consisting of 38 homes, the BiLSTM and CNN-BiLSTM models are trained to forecast the aggregated active power demand for each hour within a 24 hr. span, given the previous 24 hr. load data. The BiLSTM model achieved the lowest RMSE of 1.4842 for the overall daily forecast. In addition, standard LSTM and CNN-LSTM models are trained and compared with the BiLSTM architecture. The RMSE of BiLSTM is 5.60%, 2.85% and 2.60% lower than the LSTM, CNN-LSTM and CNN-BiLSTM models respectively. The source code of this work is available at https://github.com/Varat7v2/STLF-BiLSTM-CNNBiLSTM.git.  ( 2 min )
    Gaussian Process-Gated Hierarchical Mixtures of Experts. (arXiv:2302.04947v1 [cs.LG])
    In this paper, we propose novel Gaussian process-gated hierarchical mixtures of experts (GPHMEs) that are used for building gates and experts. Unlike in other mixtures of experts where the gating models are linear to the input, the gating functions of our model are inner nodes built with Gaussian processes based on random features that are non-linear and non-parametric. Further, the experts are also built with Gaussian processes and provide predictions that depend on test data. The optimization of the GPHMEs is carried out by variational inference. There are several advantages of the proposed GPHMEs. One is that they outperform tree-based HME benchmarks that partition the data in the input space. Another advantage is that they achieve good performance with reduced complexity. A third advantage of the GPHMEs is that they provide interpretability of deep Gaussian processes and more generally of deep Bayesian neural networks. Our GPHMEs demonstrate excellent performance for large-scale data sets even with quite modest sizes.  ( 2 min )
    Two-step counterfactual generation for OOD examples. (arXiv:2302.05196v1 [cs.LG])
    Two fundamental requirements for the deployment of machine learning models in safety-critical systems are to be able to detect out-of-distribution (OOD) data correctly and to be able to explain the prediction of the model. Although significant effort has gone into both OOD detection and explainable AI, there has been little work on explaining why a model predicts a certain data point is OOD. In this paper, we address this question by introducing the concept of an OOD counterfactual, which is a perturbed data point that iteratively moves between different OOD categories. We propose a method for generating such counterfactuals, investigate its application on synthetic and benchmark data, and compare it to several benchmark methods using a range of metrics.
    ShapeWordNet: An Interpretable Shapelet Neural Network for Physiological Signal Classification. (arXiv:2302.05021v1 [cs.LG])
    Physiological signals are high-dimensional time series of great practical values in medical and healthcare applications. However, previous works on its classification fail to obtain promising results due to the intractable data characteristics and the severe label sparsity issues. In this paper, we try to address these challenges by proposing a more effective and interpretable scheme tailored for the physiological signal classification task. Specifically, we exploit the time series shapelets to extract prominent local patterns and perform interpretable sequence discretization to distill the whole-series information. By doing so, the long and continuous raw signals are compressed into short and discrete token sequences, where both local patterns and global contexts are well preserved. Moreover, to alleviate the label sparsity issue, a multi-scale transformation strategy is adaptively designed to augment data and a cross-scale contrastive learning mechanism is accordingly devised to guide the model training. We name our method as ShapeWordNet and conduct extensive experiments on three real-world datasets to investigate its effectiveness. Comparative results show that our proposed scheme remarkably outperforms four categories of cutting-edge approaches. Visualization analysis further witnesses the good interpretability of the sequence discretization idea based on shapelets.
    Conceptual Views on Tree Ensemble Classifiers. (arXiv:2302.05270v1 [cs.LG])
    Random Forests and related tree-based methods are popular for supervised learning from table based data. Apart from their ease of parallelization, their classification performance is also superior. However, this performance, especially parallelizability, is offset by the loss of explainability. Statistical methods are often used to compensate for this disadvantage. Yet, their ability for local explanations, and in particular for global explanations, is limited. In the present work we propose an algebraic method, rooted in lattice theory, for the (global) explanation of tree ensembles. In detail, we introduce two novel conceptual views on tree ensemble classifiers and demonstrate their explanatory capabilities on Random Forests that were trained with standard parameters.
    Near-Optimal Experimental Design Under the Budget Constraint in Online Platforms. (arXiv:2302.05005v1 [cs.LG])
    A/B testing, or controlled experiments, is the gold standard approach to causally compare the performance of algorithms on online platforms. However, conventional Bernoulli randomization in A/B testing faces many challenges such as spillover and carryover effects. Our study focuses on another challenge, especially for A/B testing on two-sided platforms -- budget constraints. Buyers on two-sided platforms often have limited budgets, where the conventional A/B testing may be infeasible to be applied, partly because two variants of allocation algorithms may conflict and lead some buyers to exceed their budgets if they are implemented simultaneously. We develop a model to describe two-sided platforms where buyers have limited budgets. We then provide an optimal experimental design that guarantees small bias and minimum variance. Bias is lower when there is more budget and a higher supply-demand rate. We test our experimental design on both synthetic data and real-world data, which verifies the theoretical results and shows our advantage compared to Bernoulli randomization.
    Predicting Out-of-Distribution Error with Confidence Optimal Transport. (arXiv:2302.05018v1 [cs.LG])
    Out-of-distribution (OOD) data poses serious challenges in deployed machine learning models as even subtle changes could incur significant performance drops. Being able to estimate a model's performance on test data is important in practice as it indicates when to trust to model's decisions. We present a simple yet effective method to predict a model's performance on an unknown distribution without any addition annotation. Our approach is rooted in the Optimal Transport theory, viewing test samples' output softmax scores from deep neural networks as empirical samples from an unknown distribution. We show that our method, Confidence Optimal Transport (COT), provides robust estimates of a model's performance on a target domain. Despite its simplicity, our method achieves state-of-the-art results on three benchmark datasets and outperforms existing methods by a large margin.
    Confidence-based Reliable Learning under Dual Noises. (arXiv:2302.05098v1 [cs.CV])
    Deep neural networks (DNNs) have achieved remarkable success in a variety of computer vision tasks, where massive labeled images are routinely required for model optimization. Yet, the data collected from the open world are unavoidably polluted by noise, which may significantly undermine the efficacy of the learned models. Various attempts have been made to reliably train DNNs under data noise, but they separately account for either the noise existing in the labels or that existing in the images. A naive combination of the two lines of works would suffer from the limitations in both sides, and miss the opportunities to handle the two kinds of noise in parallel. This work provides a first, unified framework for reliable learning under the joint (image, label)-noise. Technically, we develop a confidence-based sample filter to progressively filter out noisy data without the need of pre-specifying noise ratio. Then, we penalize the model uncertainty of the detected noisy data instead of letting the model continue over-fitting the misleading information in them. Experimental results on various challenging synthetic and real-world noisy datasets verify that the proposed method can outperform competing baselines in the aspect of classification performance.
    Cross-Corpora Spoken Language Identification with Domain Diversification and Generalization. (arXiv:2302.05110v1 [eess.AS])
    This work addresses the cross-corpora generalization issue for the low-resourced spoken language identification (LID) problem. We have conducted the experiments in the context of Indian LID and identified strikingly poor cross-corpora generalization due to corpora-dependent non-lingual biases. Our contribution to this work is twofold. First, we propose domain diversification, which diversifies the limited training data using different audio data augmentation methods. We then propose the concept of maximally diversity-aware cascaded augmentations and optimize the augmentation fold-factor for effective diversification of the training data. Second, we introduce the idea of domain generalization considering the augmentation methods as pseudo-domains. Towards this, we investigate both domain-invariant and domain-aware approaches. Our LID system is based on the state-of-the-art emphasized channel attention, propagation, and aggregation based time delay neural network (ECAPA-TDNN) architecture. We have conducted extensive experiments with three widely used corpora for Indian LID research. In addition, we conduct a final blind evaluation of our proposed methods on the Indian subset of VoxLingua107 corpus collected in the wild. Our experiments demonstrate that the proposed domain diversification is more promising over commonly used simple augmentation methods. The study also reveals that domain generalization is a more effective solution than domain diversification. We also notice that domain-aware learning performs better for same-corpora LID, whereas domain-invariant learning is more suitable for cross-corpora generalization. Compared to basic ECAPA-TDNN, its proposed domain-invariant extensions improve the cross-corpora EER up to 5.23%. In contrast, the proposed domain-aware extensions also improve performance for same-corpora test scenarios.
    MoreauGrad: Sparse and Robust Interpretation of Neural Networks via Moreau Envelope. (arXiv:2302.05294v1 [cs.CV])
    Explaining the predictions of deep neural nets has been a topic of great interest in the computer vision literature. While several gradient-based interpretation schemes have been proposed to reveal the influential variables in a neural net's prediction, standard gradient-based interpretation frameworks have been commonly observed to lack robustness to input perturbations and flexibility for incorporating prior knowledge of sparsity and group-sparsity structures. In this work, we propose MoreauGrad as an interpretation scheme based on the classifier neural net's Moreau envelope. We demonstrate that MoreauGrad results in a smooth and robust interpretation of a multi-layer neural network and can be efficiently computed through first-order optimization methods. Furthermore, we show that MoreauGrad can be naturally combined with $L_1$-norm regularization techniques to output a sparse or group-sparse explanation which are prior conditions applicable to a wide range of deep learning applications. We empirically evaluate the proposed MoreauGrad scheme on standard computer vision datasets, showing the qualitative and quantitative success of the MoreauGrad approach in comparison to standard gradient-based interpretation methods.
    Towards Minimax Optimality of Model-based Robust Reinforcement Learning. (arXiv:2302.05372v1 [cs.LG])
    We study the sample complexity of obtaining an $\epsilon$-optimal policy in \emph{Robust} discounted Markov Decision Processes (RMDPs), given only access to a generative model of the nominal kernel. This problem is widely studied in the non-robust case, and it is known that any planning approach applied to an empirical MDP estimated with $\tilde{\mathcal{O}}(\frac{H^3 \mid S \mid\mid A \mid}{\epsilon^2})$ samples provides an $\epsilon$-optimal policy, which is minimax optimal. Results in the robust case are much more scarce. For $sa$- (resp $s$-)rectangular uncertainty sets, the best known sample complexity is $\tilde{\mathcal{O}}(\frac{H^4 \mid S \mid^2\mid A \mid}{\epsilon^2})$ (resp. $\tilde{\mathcal{O}}(\frac{H^4 \mid S \mid^2\mid A \mid^2}{\epsilon^2})$), for specific algorithms and when the uncertainty set is based on the total variation (TV), the KL or the Chi-square divergences. In this paper, we consider uncertainty sets defined with an $L_p$-ball (recovering the TV case), and study the sample complexity of \emph{any} planning algorithm (with high accuracy guarantee on the solution) applied to an empirical RMDP estimated using the generative model. In the general case, we prove a sample complexity of $\tilde{\mathcal{O}}(\frac{H^4 \mid S \mid\mid A \mid}{\epsilon^2})$ for both the $sa$- and $s$-rectangular cases (improvements of $\mid S \mid$ and $\mid S \mid\mid A \mid$ respectively). When the size of the uncertainty is small enough, we improve the sample complexity to $\tilde{\mathcal{O}}(\frac{H^3 \mid S \mid\mid A \mid }{\epsilon^2})$, recovering the lower-bound for the non-robust case for the first time and a robust lower-bound when the size of the uncertainty is small enough.
    Exploiting Sparsity in Pruned Neural Networks to Optimize Large Model Training. (arXiv:2302.05045v1 [cs.LG])
    Parallel training of neural networks at scale is challenging due to significant overheads arising from communication. Recently, deep learning researchers have developed a variety of pruning algorithms that are capable of pruning (i.e. setting to zero) 80-90% of the parameters in a neural network to yield sparse subnetworks that equal the accuracy of the unpruned parent network. In this work, we propose a novel approach that exploits these sparse subnetworks to optimize the memory utilization and communication in two popular algorithms for parallel deep learning namely -- data and inter-layer parallelism. We integrate our approach into AxoNN, a highly scalable framework for parallel deep learning that relies on data and inter-layer parallelism, and demonstrate the reduction in communication times and memory utilization. On 512 NVIDIA V100 GPUs, our optimizations reduce the memory consumption of a 2.7 billion parameter model by 74%, and the total communication times by 40%, thus providing an overall speedup of 34% over AxoNN, 32% over DeepSpeed-3D and 46% over Sputnik, a sparse matrix computation baseline.
    Effects of noise on the overparametrization of quantum neural networks. (arXiv:2302.05059v1 [quant-ph])
    Overparametrization is one of the most surprising and notorious phenomena in machine learning. Recently, there have been several efforts to study if, and how, Quantum Neural Networks (QNNs) acting in the absence of hardware noise can be overparametrized. In particular, it has been proposed that a QNN can be defined as overparametrized if it has enough parameters to explore all available directions in state space. That is, if the rank of the Quantum Fisher Information Matrix (QFIM) for the QNN's output state is saturated. Here, we explore how the presence of noise affects the overparametrization phenomenon. Our results show that noise can "turn on" previously-zero eigenvalues of the QFIM. This enables the parametrized state to explore directions that were otherwise inaccessible, thus potentially turning an overparametrized QNN into an underparametrized one. For small noise levels, the QNN is quasi-overparametrized, as large eigenvalues coexists with small ones. Then, we prove that as the magnitude of noise increases all the eigenvalues of the QFIM become exponentially suppressed, indicating that the state becomes insensitive to any change in the parameters. As such, there is a pull-and-tug effect where noise can enable new directions, but also suppress the sensitivity to parameter updates. Finally, our results imply that current QNN capacity measures are ill-defined when hardware noise is present.
    On the Whitney near extension problem, BMO, alignment of data, best approximation in algebraic geometry, manifold learning and their beautiful connections: A modern treatment. (arXiv:2103.09748v7 [math.CA] UPDATED)
    This paper provides fascinating connections between several mathematical problems which lie on the intersection of several mathematics subjects, namely algebraic geometry, approximation theory, complex-harmonic analysis and high dimensional data science. Modern techniques in algebraic geometry, approximation theory, computational harmonic analysis and extensions develop the first of its kind, a unified framework which allows for a simultaneous study of labeled and unlabeled near alignment data problems in of $\mathbb R^D$ with the near isometry extension problem for discrete and non-discrete subsets of $\mathbb R^D$ with certain geometries. In addition, the paper surveys related work on clustering, dimension reduction, manifold learning, vision as well as minimal energy partitions, discrepancy and min-max optimization. Numerous open problems are given.
    AutoNMT: A Framework to Streamline the Research of Seq2Seq Models. (arXiv:2302.04981v1 [cs.CL])
    We present AutoNMT, a framework to streamline the research of seq-to-seq models by automating the data pipeline (i.e., file management, data preprocessing, and exploratory analysis), automating experimentation in a toolkit-agnostic manner, which allows users to use either their own models or existing seq-to-seq toolkits such as Fairseq or OpenNMT, and finally, automating the report generation (plots and summaries). Furthermore, this library comes with its own seq-to-seq toolkit so that users can easily customize it for non-standard tasks.
    Long-Tailed Partial Label Learning via Dynamic Rebalancing. (arXiv:2302.05080v1 [cs.LG])
    Real-world data usually couples the label ambiguity and heavy imbalance, challenging the algorithmic robustness of partial label learning (PLL) and long-tailed learning (LT). The straightforward combination of LT and PLL, i.e., LT-PLL, suffers from a fundamental dilemma: LT methods build upon a given class distribution that is unavailable in PLL, and the performance of PLL is severely influenced in long-tailed context. We show that even with the auxiliary of an oracle class prior, the state-of-the-art methods underperform due to an adverse fact that the constant rebalancing in LT is harsh to the label disambiguation in PLL. To overcome this challenge, we thus propose a dynamic rebalancing method, termed as RECORDS, without assuming any prior knowledge about the class distribution. Based on a parametric decomposition of the biased output, our method constructs a dynamic adjustment that is benign to the label disambiguation process and theoretically converges to the oracle class prior. Extensive experiments on three benchmark datasets demonstrate the significant gain of RECORDS compared with a range of baselines. The code is publicly available.
    Gauge-equivariant neural networks as preconditioners in lattice QCD. (arXiv:2302.05419v1 [hep-lat])
    We demonstrate that a state-of-the art multi-grid preconditioner can be learned efficiently by gauge-equivariant neural networks. We show that the models require minimal re-training on different gauge configurations of the same gauge ensemble and to a large extent remain efficient under modest modifications of ensemble parameters. We also demonstrate that important paradigms such as communication avoidance are straightforward to implement in this framework.
    Hessian Based Smoothing Splines for Manifold Learning. (arXiv:2302.05025v1 [stat.ML])
    We propose a multidimensional smoothing spline algorithm in the context of manifold learning. We generalize the bending energy penalty of thin-plate splines to a quadratic form on the Sobolev space of a flat manifold, based on the Frobenius norm of the Hessian matrix. This leads to a natural definition of smoothing splines on manifolds, which minimizes square error while optimizing a global curvature penalty. The existence and uniqueness of the solution is shown by applying the theory of reproducing kernel Hilbert spaces. The minimizer is expressed as a combination of Green's functions for the biharmonic operator, and 'linear' functions of everywhere vanishing Hessian. Furthermore, we utilize the Hessian estimation procedure from the Hessian Eigenmaps algorithm to approximate the spline loss when the true manifold is unknown. This yields a particularly simple quadratic optimization algorithm for smoothing response values without needing to fit the underlying manifold. Analysis of asymptotic error and robustness are given, as well as discussion of out-of-sample prediction methods and applications.
    Causal Inference out of Control: Estimating the Steerability of Consumption. (arXiv:2302.04989v1 [cs.LG])
    Regulators and academics are increasingly interested in the causal effect that algorithmic actions of a digital platform have on consumption. We introduce a general causal inference problem we call the steerability of consumption that abstracts many settings of interest. Focusing on observational designs and exploiting the structure of the problem, we exhibit a set of assumptions for causal identifiability that significantly weaken the often unrealistic overlap assumptions of standard designs. The key novelty of our approach is to explicitly model the dynamics of consumption over time, viewing the platform as a controller acting on a dynamical system. From this dynamical systems perspective, we are able to show that exogenous variation in consumption and appropriately responsive algorithmic control actions are sufficient for identifying steerability of consumption. Our results illustrate the fruitful interplay of control theory and causal inference, which we illustrate with examples from econometrics, macroeconomics, and machine learning.
    Contraction of $E_\gamma$-Divergence and Its Applications to Privacy. (arXiv:2012.11035v2 [cs.IT] UPDATED)
    We investigate the contraction coefficients derived from strong data processing inequalities for the $E_\gamma$-divergence. By generalizing the celebrated Dobrushin's coefficient from total variation distance to $E_\gamma$-divergence, we derive a closed-form expression for the contraction of $E_\gamma$-divergence. This result has fundamental consequences in two privacy settings. First, it implies that local differential privacy can be equivalently expressed in terms of the contraction of $E_\gamma$-divergence. This equivalent formula can be used to precisely quantify the impact of local privacy in (Bayesian and minimax) estimation and hypothesis testing problems in terms of the reduction of effective sample size. Second, it leads to a new information-theoretic technique for analyzing privacy guarantees of online algorithms. In this technique, we view such algorithms as a composition of amplitude-constrained Gaussian channels and then relate their contraction coefficients under $E_\gamma$-divergence to the overall differential privacy guarantees. As an example, we apply our technique to derive the differential privacy parameters of gradient descent. Moreover, we also show that this framework can be tailored to batch learning algorithms that can be implemented with one pass over the training dataset.
    Evaluation of Data Augmentation and Loss Functions in Semantic Image Segmentation for Drilling Tool Wear Detection. (arXiv:2302.05262v1 [cs.CV])
    Tool wear monitoring is crucial for quality control and cost reduction in manufacturing processes, of which drilling applications are one example. In this paper, we present a U-Net based semantic image segmentation pipeline, deployed on microscopy images of cutting inserts, for the purpose of wear detection. The wear area is differentiated in two different types, resulting in a multiclass classification problem. Joining the two wear types in one general wear class, on the other hand, allows the problem to be formulated as a binary classification task. Apart from the comparison of the binary and multiclass problem, also different loss functions, i. e., Cross Entropy, Focal Cross Entropy, and a loss based on the Intersection over Union (IoU), are investigated. Furthermore, models are trained on image tiles of different sizes, and augmentation techniques of varying intensities are deployed. We find, that the best performing models are binary models, trained on data with moderate augmentation and an IoU-based loss function.
    On Penalty-based Bilevel Gradient Descent Method. (arXiv:2302.05185v1 [cs.LG])
    Bilevel optimization enjoys a wide range of applications in hyper-parameter optimization, meta-learning and reinforcement learning. However, bilevel optimization problems are difficult to solve. Recent progress on scalable bilevel algorithms mainly focuses on bilevel optimization problems where the lower-level objective is either strongly convex or unconstrained. In this work, we tackle the bilevel problem through the lens of the penalty method. We show that under certain conditions, the penalty reformulation recovers the solutions of the original bilevel problem. Further, we propose the penalty-based bilevel gradient descent (PBGD) algorithm and establish its finite-time convergence for the constrained bilevel problem without lower-level strong convexity. Experiments showcase the efficiency of the proposed PBGD algorithm.
    Increasing-Margin Adversarial (IMA) Training to Improve Adversarial Robustness of Neural Networks. (arXiv:2005.09147v10 [cs.CV] UPDATED)
    Deep neural networks (DNNs) are vulnerable to adversarial noises. Adversarial training is a general and effective strategy to improve DNN robustness (i.e., accuracy on noisy data) against adversarial noises. However, DNN models trained by the current existing adversarial training methods may have much lower standard accuracy (i.e., accuracy on clean data), compared to the same models trained by the standard method on clean data, and this phenomenon is known as the trade-off between accuracy and robustness and is considered unavoidable. This issue prevents adversarial training from being used in many application domains, such as medical image analysis, as practitioners do not want to sacrifice standard accuracy too much in exchange for adversarial robustness. Our objective is to lift (i.e., alleviate or even avoid) this trade-off between standard accuracy and adversarial robustness for medical image classification and segmentation. We propose a novel adversarial training method, named Increasing-Margin Adversarial (IMA) Training, which is supported by an equilibrium state analysis about the optimality of adversarial training samples. Our method aims to preserve accuracy while improving robustness by generating optimal adversarial training samples. We evaluate our method and the other eight representative methods on six publicly available image datasets corrupted by noises generated by AutoAttack and white-noise attack. Our method achieves the highest adversarial robustness for image classification and segmentation with the smallest reduction in accuracy on clean data. For one of the applications, our method improves both accuracy and robustness. Our study has demonstrated that our method can lift the trade-off between standard accuracy and adversarial robustness for the image classification and segmentation applications.
    Discovering Sparse Hysteresis Models for Piezoelectric Materials: A Data-Driven Study and Perspectives into Modelling Magnetic Hysteresis. (arXiv:2302.05313v1 [cs.LG])
    This article presents an approach for modelling hysteresis in piezoelectric materials that leverages recent advancements in machine learning, particularly in sparse-regression techniques. While sparse regression has previously been used to model various scientific and engineering phenomena, its application to nonlinear hysteresis modelling in piezoelectric materials has yet to be explored. The study employs the sequential threshold least-squares algorithm to model the dynamic system responsible for hysteresis, resulting in a concise model that accurately predicts hysteresis for both simulated and experimental piezoelectric material data. Additionally, insights are provided on sparse white-box modelling of hysteresis for magnetic materials taking non-oriented electrical steel as an example. The presented approach is compared to traditional regression-based and neural network methods, demonstrating its efficiency and robustness.
    Federated Domain Adaptation via Gradient Projection. (arXiv:2302.05049v1 [cs.LG])
    Federated Domain Adaptation (FDA) describes the federated learning setting where a set of source clients work collaboratively to improve the performance of a target client and where the target client has limited labeled data. The domain shift between the source and target domains, combined with limited samples in the target domain, makes FDA a challenging problem, e.g., common techniques such as FedAvg and fine-tuning fail with a large domain shift. To fill this gap, we propose Federated Gradient Projection ($\texttt{FedGP}$), a novel aggregation rule for FDA, used to aggregate the source gradients and target gradient during training. Further, we introduce metrics that characterize the FDA setting and propose a theoretical framework for analyzing the performance of aggregation rules, which may be of independent interest. Using this framework, we theoretically characterize how, when, and why $\texttt{FedGP}$ works compared to baselines. Our theory suggests certain practical rules that are predictive of practice. Experiments on synthetic and real-world datasets verify the theoretical insights and illustrate the effectiveness of the proposed method in practice.
    AdaptSim: Task-Driven Simulation Adaptation for Sim-to-Real Transfer. (arXiv:2302.04903v1 [cs.RO])
    Simulation parameter settings such as contact models and object geometry approximations are critical to training robust robotic policies capable of transferring from simulation to real-world deployment. Previous approaches typically handcraft distributions over such parameters (domain randomization), or identify parameters that best match the dynamics of the real environment (system identification). However, there is often an irreducible gap between simulation and reality: attempting to match the dynamics between simulation and reality across all states and tasks may be infeasible and may not lead to policies that perform well in reality for a specific task. Addressing this issue, we propose AdaptSim, a new task-driven adaptation framework for sim-to-real transfer that aims to optimize task performance in target (real) environments -- instead of matching dynamics between simulation and reality. First, we meta-learn an adaptation policy in simulation using reinforcement learning for adjusting the simulation parameter distribution based on the current policy's performance in a target environment. We then perform iterative real-world adaptation by inferring new simulation parameter distributions for policy training, using a small amount of real data. We perform experiments in three robotic tasks: (1) swing-up of linearized double pendulum, (2) dynamic table-top pushing of a bottle, and (3) dynamic scooping of food pieces with a spatula. Our extensive simulation and hardware experiments demonstrate AdaptSim achieving 1-3x asymptotic performance and $\sim$2x real data efficiency when adapting to different environments, compared to methods based on Sys-ID and directly training the task policy in target environments.
    Fast Gumbel-Max Sketch and its Applications. (arXiv:2302.05176v1 [cs.LG])
    The well-known Gumbel-Max Trick for sampling elements from a categorical distribution (or more generally a non-negative vector) and its variants have been widely used in areas such as machine learning and information retrieval. To sample a random element $i$ in proportion to its positive weight $v_i$, the Gumbel-Max Trick first computes a Gumbel random variable $g_i$ for each positive weight element $i$, and then samples the element $i$ with the largest value of $g_i+\ln v_i$. Recently, applications including similarity estimation and weighted cardinality estimation require to generate $k$ independent Gumbel-Max variables from high dimensional vectors. However, it is computationally expensive for a large $k$ (e.g., hundreds or even thousands) when using the traditional Gumbel-Max Trick. To solve this problem, we propose a novel algorithm, FastGM, which reduces the time complexity from $O(kn^+)$ to $O(k \ln k + n^+)$, where $n^+$ is the number of positive elements in the vector of interest. FastGM stops the procedure of Gumbel random variables computing for many elements, especially for those with small weights. We perform experiments on a variety of real-world datasets and the experimental results demonstrate that FastGM is orders of magnitude faster than state-of-the-art methods without sacrificing accuracy or incurring additional expenses.
    Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames. (arXiv:2302.04973v1 [cs.CV])
    Automatically discovering composable abstractions from raw perceptual data is a long-standing challenge in machine learning. Recent slot-based neural networks that learn about objects in a self-supervised manner have made exciting progress in this direction. However, they typically fall short at adequately capturing spatial symmetries present in the visual world, which leads to sample inefficiency, such as when entangling object appearance and pose. In this paper, we present a simple yet highly effective method for incorporating spatial symmetries via slot-centric reference frames. We incorporate equivariance to per-object pose transformations into the attention and generation mechanism of Slot Attention by translating, scaling, and rotating position encodings. These changes result in little computational overhead, are easy to implement, and can result in large gains in terms of data efficiency and overall improvements to object discovery. We evaluate our method on a wide range of synthetic object discovery benchmarks namely CLEVR, Tetrominoes, CLEVRTex, Objects Room and MultiShapeNet, and show promising improvements on the challenging real-world Waymo Open dataset.
    Binarized Neural Machine Translation. (arXiv:2302.04907v1 [cs.CL])
    The rapid scaling of language models is motivating research using low-bitwidth quantization. In this work, we propose a novel binarization technique for Transformers applied to machine translation (BMT), the first of its kind. We identify and address the problem of inflated dot-product variance when using one-bit weights and activations. Specifically, BMT leverages additional LayerNorms and residual connections to improve binarization quality. Experiments on the WMT dataset show that a one-bit weight-only Transformer can achieve the same quality as a float one, while being 16x smaller in size. One-bit activations incur varying degrees of quality drop, but mitigated by the proposed architectural changes. We further conduct a scaling law study using production-scale translation datasets, which shows that one-bit weight Transformers scale and generalize well in both in-domain and out-of-domain settings. Implementation in JAX/Flax will be open sourced.
    Star-Shaped Denoising Diffusion Probabilistic Models. (arXiv:2302.05259v1 [stat.ML])
    Methods based on Denoising Diffusion Probabilistic Models (DDPM) became a ubiquitous tool in generative modeling. However, they are mostly limited to Gaussian and discrete diffusion processes. We propose Star-Shaped Denoising Diffusion Probabilistic Models (SS-DDPM), a model with a non-Markovian diffusion-like noising process. In the case of Gaussian distributions, this model is equivalent to Markovian DDPMs. However, it can be defined and applied with arbitrary noising distributions, and admits efficient training and sampling algorithms for a wide range of distributions that lie in the exponential family. We provide a simple recipe for designing diffusion-like models with distributions like Beta, von Mises--Fisher, Dirichlet, Wishart and others, which can be especially useful when data lies on a constrained manifold such as the unit sphere, the space of positive semi-definite matrices, the probabilistic simplex, etc. We evaluate the model in different settings and find it competitive even on image data, where Beta SS-DDPM achieves results comparable to a Gaussian DDPM.
    Beyond In-Domain Scenarios: Robust Density-Aware Calibration. (arXiv:2302.05118v1 [cs.LG])
    Calibrating deep learning models to yield uncertainty-aware predictions is crucial as deep neural networks get increasingly deployed in safety-critical applications. While existing post-hoc calibration methods achieve impressive results on in-domain test datasets, they are limited by their inability to yield reliable uncertainty estimates in domain-shift and out-of-domain (OOD) scenarios. We aim to bridge this gap by proposing DAC, an accuracy-preserving as well as Density-Aware Calibration method based on k-nearest-neighbors (KNN). In contrast to existing post-hoc methods, we utilize hidden layers of classifiers as a source for uncertainty-related information and study their importance. We show that DAC is a generic method that can readily be combined with state-of-the-art post-hoc methods. DAC boosts the robustness of calibration performance in domain-shift and OOD, while maintaining excellent in-domain predictive uncertainty estimates. We demonstrate that DAC leads to consistently better calibration across a large number of model architectures, datasets, and metrics. Additionally, we show that DAC improves calibration substantially on recent large-scale neural networks pre-trained on vast amounts of data.
    DOMINO: Domain-aware Loss for Deep Learning Calibration. (arXiv:2302.05142v1 [cs.CV])
    Deep learning has achieved the state-of-the-art performance across medical imaging tasks; however, model calibration is often not considered. Uncalibrated models are potentially dangerous in high-risk applications since the user does not know when they will fail. Therefore, this paper proposes a novel domain-aware loss function to calibrate deep learning models. The proposed loss function applies a class-wise penalty based on the similarity between classes within a given target domain. Thus, the approach improves the calibration while also ensuring that the model makes less risky errors even when incorrect. The code for this software is available at https://github.com/lab-smile/DOMINO.
    Graph Neural Networks Go Forward-Forward. (arXiv:2302.05282v1 [cs.LG])
    We present the Graph Forward-Forward (GFF) algorithm, an extension of the Forward-Forward procedure to graphs, able to handle features distributed over a graph's nodes. This allows training graph neural networks with forward passes only, without backpropagation. Our method is agnostic to the message-passing scheme, and provides a more biologically plausible learning scheme than backpropagation, while also carrying computational advantages. With GFF, graph neural networks are trained greedily layer by layer, using both positive and negative samples. We run experiments on 11 standard graph property prediction tasks, showing how GFF provides an effective alternative to backpropagation for training graph neural networks. This shows in particular that this procedure is remarkably efficient in spite of combining the per-layer training with the locality of the processing in a GNN.
    Multi-armed Bandit Learning for TDMA Transmission Slot Scheduling and Defragmentation for Improved Bandwidth Usage. (arXiv:2302.05301v1 [cs.NI])
    This paper proposes a Time Division Multiple Access (TDMA) MAC slot allocation protocol with efficient bandwidth usage in wireless sensor networks and Internet of Things (IoTs). The developed protocol has two primary components: a Multi-Armed Bandits (MAB)-based slot allocation mechanism for collision free transmission, and a Decentralized Defragmented Slot Backshift (DDSB) operation for improving bandwidth usage efficiency. The proposed framework is decentralized in that each node finds its transmission schedule independently without the control of any centralized arbitrator. The developed mechanism is suitable for networks with or without time synchronization, thus, making it suitable for low-complexity wireless transceivers for wireless sensor and IoT nodes. This framework is able to manage the trade-off between learning convergence time and bandwidth. In addition, it allows the nodes to adapt to topological changes while maintaining efficient bandwidth usage. The developed logic is tested for both fully-connected and arbitrary mesh networks with extensive simulation experiments. It is shown how the nodes can learn to select collision-free transmission slots using MAB. Moreover, the nodes learn to self-adjust their transmission schedules using a novel DDSB framework in order to reduce bandwidth usage.
    The Impact of Data Distribution on Q-learning with Function Approximation. (arXiv:2111.11758v3 [cs.LG] UPDATED)
    We study the interplay between the data distribution and Q-learning-based algorithms with function approximation. We provide a unified theoretical and empirical analysis as to how different properties of the data distribution influence the performance of Q-learning-based algorithms. We connect different lines of research, as well as validate and extend previous results. We start by reviewing theoretical bounds on the performance of approximate dynamic programming algorithms. We then introduce a novel four-state MDP specifically tailored to highlight the impact of the data distribution in the performance of Q-learning-based algorithms with function approximation, both online and offline. Finally, we experimentally assess the impact of the data distribution properties on the performance of two offline Q-learning-based algorithms under different environments. According to our results: (i) high entropy data distributions are well-suited for learning in an offline manner; and (ii) a certain degree of data diversity (data coverage) and data quality (closeness to optimal policy) are jointly desirable for offline learning.
    Stochastic Multiple Target Sampling Gradient Descent. (arXiv:2206.01934v4 [cs.LG] UPDATED)
    Sampling from an unnormalized target distribution is an essential problem with many applications in probabilistic inference. Stein Variational Gradient Descent (SVGD) has been shown to be a powerful method that iteratively updates a set of particles to approximate the distribution of interest. Furthermore, when analysing its asymptotic properties, SVGD reduces exactly to a single-objective optimization problem and can be viewed as a probabilistic version of this single-objective optimization problem. A natural question then arises: "Can we derive a probabilistic version of the multi-objective optimization?". To answer this question, we propose Stochastic Multiple Target Sampling Gradient Descent (MT-SGD), enabling us to sample from multiple unnormalized target distributions. Specifically, our MT-SGD conducts a flow of intermediate distributions gradually orienting to multiple target distributions, which allows the sampled particles to move to the joint high-likelihood region of the target distributions. Interestingly, the asymptotic analysis shows that our approach reduces exactly to the multiple-gradient descent algorithm for multi-objective optimization, as expected. Finally, we conduct comprehensive experiments to demonstrate the merit of our approach to multi-task learning.
    PATCorrect: Non-autoregressive Phoneme-augmented Transformer for ASR Error Correction. (arXiv:2302.05040v1 [cs.CL])
    Speech-to-text errors made by automatic speech recognition (ASR) system negatively impact downstream models relying on ASR transcriptions. Language error correction models as a post-processing text editing approach have been recently developed for refining the source sentences. However, efficient models for correcting errors in ASR transcriptions that meet the low latency requirements of industrial grade production systems have not been well studied. In this work, we propose a novel non-autoregressive (NAR) error correction approach to improve the transcription quality by reducing word error rate (WER) and achieve robust performance across different upstream ASR systems. Our approach augments the text encoding of the Transformer model with a phoneme encoder that embeds pronunciation information. The representations from phoneme encoder and text encoder are combined via multi-modal fusion before feeding into the length tagging predictor for predicting target sequence lengths. The joint encoders also provide inputs to the attention mechanism in the NAR decoder. We experiment on 3 open-source ASR systems with varying speech-to-text transcription quality and their erroneous transcriptions on 2 public English corpus datasets. Results show that our PATCorrect (Phoneme Augmented Transformer for ASR error Correction) consistently outperforms state-of-the-art NAR error correction method on English corpus across different upstream ASR systems. For example, PATCorrect achieves 11.62% WER reduction (WERR) averaged on 3 ASR systems compared to 9.46% WERR achieved by other method using text only modality and also achieves an inference latency comparable to other NAR models at tens of millisecond scale, especially on GPU hardware, while still being 4.2 - 6.7x times faster than autoregressive models on Common Voice and LibriSpeech datasets.
    Diagnosing failures of fairness transfer across distribution shift in real-world medical settings. (arXiv:2202.01034v2 [cs.LG] UPDATED)
    Diagnosing and mitigating changes in model fairness under distribution shift is an important component of the safe deployment of machine learning in healthcare settings. Importantly, the success of any mitigation strategy strongly depends on the structure of the shift. Despite this, there has been little discussion of how to empirically assess the structure of a distribution shift that one is encountering in practice. In this work, we adopt a causal framing to motivate conditional independence tests as a key tool for characterizing distribution shifts. Using our approach in two medical applications, we show that this knowledge can help diagnose failures of fairness transfer, including cases where real-world shifts are more complex than is often assumed in the literature. Based on these results, we discuss potential remedies at each step of the machine learning pipeline.
    Hypernetworks build Implicit Neural Representations of Sounds. (arXiv:2302.04959v1 [cs.LG])
    Implicit Neural Representations (INRs) are nowadays used to represent multimedia signals across various real-life applications, including image super-resolution, image compression, or 3D rendering. Existing methods that leverage INRs are predominantly focused on visual data, as their application to other modalities, such as audio, is nontrivial due to the inductive biases present in architectural attributes of image-based INR models. To address this limitation, we introduce HyperSound, the first meta-learning approach to produce INRs for audio samples that leverages hypernetworks to generalize beyond samples observed in training. Our approach reconstructs audio samples with quality comparable to other state-of-the-art models and provides a viable alternative to contemporary sound representations used in deep neural networks for audio processing, such as spectrograms.  ( 2 min )
    Differentially Private Optimization for Smooth Nonconvex ERM. (arXiv:2302.04972v1 [cs.LG])
    We develop simple differentially private optimization algorithms that move along directions of (expected) descent to find an approximate second-order solution for nonconvex ERM. We use line search, mini-batching, and a two-phase strategy to improve the speed and practicality of the algorithm. Numerical experiments demonstrate the effectiveness of these approaches.  ( 2 min )
    Piecewise-Stationary Multi-Objective Multi-Armed Bandit with Application to Joint Communications and Sensing. (arXiv:2302.05257v1 [cs.LG])
    We study a multi-objective multi-armed bandit problem in a dynamic environment. The problem portrays a decision-maker that sequentially selects an arm from a given set. If selected, each action produces a reward vector, where every element follows a piecewise-stationary Bernoulli distribution. The agent aims at choosing an arm among the Pareto optimal set of arms to minimize its regret. We propose a Pareto generic upper confidence bound (UCB)-based algorithm with change detection to solve this problem. By developing the essential inequalities for multi-dimensional spaces, we establish that our proposal guarantees a regret bound in the order of $\gamma_T\log(T/{\gamma_T})$ when the number of breakpoints $\gamma_T$ is known. Without this assumption, the regret bound of our algorithm is $\gamma_T\log(T)$. Finally, we formulate an energy-efficient waveform design problem in an integrated communication and sensing system as a toy example. Numerical experiments on the toy example and synthetic and real-world datasets demonstrate the efficiency of our policy compared to the current methods.
    Deep Learning from Parametrically Generated Virtual Buildings for Real-World Object Recognition. (arXiv:2302.05283v1 [cs.CV])
    We study the use of parametric building information modeling (BIM) to automatically generate training data for artificial neural networks (ANNs) to recognize building objects in photos. Teaching artificial intelligence (AI) machines to detect building objects in images is the foundation toward AI-assisted semantic 3D reconstruction of existing buildings. However, there exists the challenge of acquiring training data which is typically human-annotated, that is, unless a computer machine can generate high-quality data to train itself for a certain task. In that vein, we trained ANNs solely on realistic computer-generated images of 3D BIM models which were parametrically and automatically generated using the BIMGenE program. The ANN training result demonstrated generalizability and good semantic segmentation on a test case as well as arbitrary photos of buildings that are outside the range of the training data, which is significant for the future of training AI with generated data for solving real-world architectural problems.
    A Survey on Causal Reinforcement Learning. (arXiv:2302.05209v1 [cs.AI])
    While Reinforcement Learning (RL) achieves tremendous success in sequential decision-making problems of many domains, it still faces key challenges of data inefficiency and the lack of interpretability. Interestingly, many researchers have leveraged insights from the causality literature recently, bringing forth flourishing works to unify the merits of causality and address well the challenges from RL. As such, it is of great necessity and significance to collate these Causal Reinforcement Learning (CRL) works, offer a review of CRL methods, and investigate the potential functionality from causality toward RL. In particular, we divide existing CRL approaches into two categories according to whether their causality-based information is given in advance or not. We further analyze each category in terms of the formalization of different models, ranging from the Markov Decision Process (MDP), Partially Observed Markov Decision Process (POMDP), Multi-Arm Bandits (MAB), and Dynamic Treatment Regime (DTR). Moreover, we summarize the evaluation matrices and open sources while we discuss emerging applications, along with promising prospects for the future development of CRL.  ( 2 min )
    Tensor Generalized Canonical Correlation Analysis. (arXiv:2302.05277v1 [stat.ML])
    Regularized Generalized Canonical Correlation Analysis (RGCCA) is a general statistical framework for multi-block data analysis. RGCCA enables deciphering relationships between several sets of variables and subsumes many well-known multivariate analysis methods as special cases. However, RGCCA only deals with vector-valued blocks, disregarding their possible higher-order structures. This paper presents Tensor GCCA (TGCCA), a new method for analyzing higher-order tensors with canonical vectors admitting an orthogonal rank-R CP decomposition. Moreover, two algorithms for TGCCA, based on whether a separable covariance structure is imposed or not, are presented along with convergence guarantees. The efficiency and usefulness of TGCCA are evaluated on simulated and real data and compared favorably to state-of-the-art approaches.  ( 2 min )
    GCI: A (G)raph (C)oncept (I)nterpretation Framework. (arXiv:2302.04899v1 [cs.LG])
    Explainable AI (XAI) underwent a recent surge in research on concept extraction, focusing on extracting human-interpretable concepts from Deep Neural Networks. An important challenge facing concept extraction approaches is the difficulty of interpreting and evaluating discovered concepts, especially for complex tasks such as molecular property prediction. We address this challenge by presenting GCI: a (G)raph (C)oncept (I)nterpretation framework, used for quantitatively measuring alignment between concepts discovered from Graph Neural Networks (GNNs) and their corresponding human interpretations. GCI encodes concept interpretations as functions, which can be used to quantitatively measure the alignment between a given interpretation and concept definition. We demonstrate four applications of GCI: (i) quantitatively evaluating concept extractors, (ii) measuring alignment between concept extractors and human interpretations, (iii) measuring the completeness of interpretations with respect to an end task and (iv) a practical application of GCI to molecular property prediction, in which we demonstrate how to use chemical functional groups to explain GNNs trained on molecular property prediction tasks, and implement interpretations with a 0.76 AUCROC completeness score.
    Making Substitute Models More Bayesian Can Enhance Transferability of Adversarial Examples. (arXiv:2302.05086v1 [cs.LG])
    The transferability of adversarial examples across deep neural networks (DNNs) is the crux of many black-box attacks. Many prior efforts have been devoted to improving the transferability via increasing the diversity in inputs of some substitute models. In this paper, by contrast, we opt for the diversity in substitute models and advocate to attack a Bayesian model for achieving desirable transferability. Deriving from the Bayesian formulation, we develop a principled strategy for possible finetuning, which can be combined with many off-the-shelf Gaussian posterior approximations over DNN parameters. Extensive experiments have been conducted to verify the effectiveness of our method, on common benchmark datasets, and the results demonstrate that our method outperforms recent state-of-the-arts by large margins (roughly 19% absolute increase in average attack success rate on ImageNet), and, by combining with these recent methods, further performance gain can be obtained. Our code: https://github.com/qizhangli/MoreBayesian-attack.  ( 2 min )
    AIOSA: An approach to the automatic identification of obstructive sleep apnea events based on deep learning. (arXiv:2302.05179v1 [cs.LG])
    Obstructive Sleep Apnea Syndrome (OSAS) is the most common sleep-related breathing disorder. It is caused by an increased upper airway resistance during sleep, which determines episodes of partial or complete interruption of airflow. The detection and treatment of OSAS is particularly important in stroke patients, because the presence of severe OSAS is associated with higher mortality, worse neurological deficits, worse functional outcome after rehabilitation, and a higher likelihood of uncontrolled hypertension. The gold standard test for diagnosing OSAS is polysomnography (PSG). Unfortunately, performing a PSG in an electrically hostile environment, like a stroke unit, on neurologically impaired patients is a difficult task; also, the number of strokes per day outnumbers the availability of polysomnographs and dedicated healthcare professionals. Thus, a simple and automated recognition system to identify OSAS among acute stroke patients, relying on routinely recorded vital signs, is desirable. The majority of the work done so far focuses on data recorded in ideal conditions and highly selected patients, and thus it is hardly exploitable in real-life settings, where it would be of actual use. In this paper, we propose a convolutional deep learning architecture able to reduce the temporal resolution of raw waveform data, like physiological signals, extracting key features that can be used for further processing. We exploit models based on such an architecture to detect OSAS events in stroke unit recordings obtained from the monitoring of unselected patients. Unlike existing approaches, annotations are performed at one-second granularity, allowing physicians to better interpret the model outcome. Results are considered to be satisfactory by the domain experts. Moreover, based on a widely-used benchmark, we show that the proposed approach outperforms current state-of-the-art solutions.
    Low Entropy Communication in Multi-Agent Reinforcement Learning. (arXiv:2302.05055v1 [cs.MA])
    Communication in multi-agent reinforcement learning has been drawing attention recently for its significant role in cooperation. However, multi-agent systems may suffer from limitations on communication resources and thus need efficient communication techniques in real-world scenarios. According to the Shannon-Hartley theorem, messages to be transmitted reliably in worse channels require lower entropy. Therefore, we aim to reduce message entropy in multi-agent communication. A fundamental challenge is that the gradients of entropy are either 0 or infinity, disabling gradient-based methods. To handle it, we propose a pseudo gradient descent scheme, which reduces entropy by adjusting the distributions of messages wisely. We conduct experiments on two base communication frameworks with six environment settings and find that our scheme can reduce message entropy by up to 90% with nearly no loss of cooperation performance.
    Neural Capacitated Clustering. (arXiv:2302.05134v1 [cs.LG])
    Recent work on deep clustering has found new promising methods also for constrained clustering problems. Their typically pairwise constraints often can be used to guide the partitioning of the data. Many problems however, feature cluster-level constraints, e.g. the Capacitated Clustering Problem (CCP), where each point has a weight and the total weight sum of all points in each cluster is bounded by a prescribed capacity. In this paper we propose a new method for the CCP, Neural Capacited Clustering, that learns a neural network to predict the assignment probabilities of points to cluster centers from a data set of optimal or near optimal past solutions of other problem instances. During inference, the resulting scores are then used in an iterative k-means like procedure to refine the assignment under capacity constraints. In our experiments on artificial data and two real world datasets our approach outperforms several state-of-the-art mathematical and heuristic solvers from the literature. Moreover, we apply our method in the context of a cluster-first-route-second approach to the Capacitated Vehicle Routing Problem (CVRP) and show competitive results on the well-known Uchoa benchmark.
    Fast Learnings of Coupled Nonnegative Tensor Decomposition Using Optimal Gradient and Low-rank Approximation. (arXiv:2302.05119v1 [cs.LG])
    Nonnegative tensor decomposition has been widely applied in signal processing and neuroscience, etc. When it comes to group analysis of multi-block tensors, traditional tensor decomposition is insufficient to utilize the shared/similar information among tensors. In this study, we propose a coupled nonnegative CANDECOMP/PARAFAC decomposition algorithm optimized by the alternating proximal gradient method (CoNCPDAPG), which is capable of a simultaneous decomposition of tensors from different samples that are partially linked and a simultaneous extraction of common components, individual components and core tensors. Due to the low optimization efficiency brought by the nonnegative constraint and the high-dimensional nature of the data, we further propose the lraCoNCPD-APG algorithm by combining low-rank approximation and the proposed CoNCPD-APG method. When processing multi-block large-scale tensors, the proposed lraCoNCPD-APG algorithm can greatly reduce the computational load without compromising the decomposition quality. Experiment results of coupled nonnegative tensor decomposition problems designed for synthetic data, real-world face images and event-related potential data demonstrate the practicability and superiority of the proposed algorithms.  ( 2 min )
    Quadratic Memory is Necessary for Optimal Query Complexity in Convex Optimization: Center-of-Mass is Pareto-Optimal. (arXiv:2302.04963v1 [cs.LG])
    We give query complexity lower bounds for convex optimization and the related feasibility problem. We show that quadratic memory is necessary to achieve the optimal oracle complexity for first-order convex optimization. In particular, this shows that center-of-mass cutting-planes algorithms in dimension $d$ which use $\tilde O(d^2)$ memory and $\tilde O(d)$ queries are Pareto-optimal for both convex optimization and the feasibility problem, up to logarithmic factors. Precisely, we prove that to minimize $1$-Lipschitz convex functions over the unit ball to $1/d^4$ accuracy, any deterministic first-order algorithms using at most $d^{2-\delta}$ bits of memory must make $\tilde\Omega(d^{1+\delta/3})$ queries, for any $\delta\in[0,1]$. For the feasibility problem, in which an algorithm only has access to a separation oracle, we show a stronger trade-off: for at most $d^{2-\delta}$ memory, the number of queries required is $\tilde\Omega(d^{1+\delta})$. This resolves a COLT 2019 open problem of Woodworth and Srebro.  ( 2 min )
    Data-Driven Stochastic Motion Evaluation and Optimization with Image by Spatially-Aligned Temporal Encoding. (arXiv:2302.05041v1 [cs.CV])
    This paper proposes a probabilistic motion prediction method for long motions. The motion is predicted so that it accomplishes a task from the initial state observed in the given image. While our method evaluates the task achievability by the Energy-Based Model (EBM), previous EBMs are not designed for evaluating the consistency between different domains (i.e., image and motion in our method). Our method seamlessly integrates the image and motion data into the image feature domain by spatially-aligned temporal encoding so that features are extracted along the motion trajectory projected onto the image. Furthermore, this paper also proposes a data-driven motion optimization method, Deep Motion Optimizer (DMO), that works with EBM for motion prediction. Different from previous gradient-based optimizers, our self-supervised DMO alleviates the difficulty of hyper-parameter tuning to avoid local minima. The effectiveness of the proposed method is demonstrated with a variety of experiments with similar SOTA methods.
    A Song of Ice and Fire: Analyzing Textual Autotelic Agents in ScienceWorld. (arXiv:2302.05244v1 [cs.AI])
    Building open-ended agents that can autonomously discover a diversity of behaviours is one of the long-standing goals of artificial intelligence. This challenge can be studied in the framework of autotelic RL agents, i.e. agents that learn by selecting and pursuing their own goals, self-organizing a learning curriculum. Recent work identified language has a key dimension of autotelic learning, in particular because it enables abstract goal sampling and guidance from social peers for hindsight relabelling. Within this perspective, we study the following open scientific questions: What is the impact of hindsight feedback from a social peer (e.g. selective vs. exhaustive)? How can the agent learn from very rare language goal examples in its experience replay? How can multiple forms of exploration be combined, and take advantage of easier goals as stepping stones to reach harder ones? To address these questions, we use ScienceWorld, a textual environment with rich abstract and combinatorial physics. We show the importance of selectivity from the social peer's feedback; that experience replay needs to over-sample examples of rare goals; and that following self-generated goal sequences where the agent's competence is intermediate leads to significant improvements in final performance.  ( 2 min )
    Learning Complex Teamwork Tasks using a Sub-task Curriculum. (arXiv:2302.04944v1 [cs.MA])
    Training a team to complete a complex task via multi-agent reinforcement learning can be difficult due to challenges such as policy search in a large policy space, and non-stationarity caused by mutually adapting agents. To facilitate efficient learning of complex multi-agent tasks, we propose an approach which uses an expert-provided curriculum of simpler multi-agent sub-tasks. In each sub-task of the curriculum, a subset of the entire team is trained to acquire sub-task-specific policies. The sub-teams are then merged and transferred to the target task, where their policies are collectively fined tuned to solve the more complex target task. We present MEDoE, a flexible method which identifies situations in the target task where each agent can use its sub-task-specific skills, and uses this information to modulate hyperparameters for learning and exploration during the fine-tuning process. We compare MEDoE to multi-agent reinforcement learning baselines that train from scratch in the full task, and with na\"ive applications of standard multi-agent reinforcement learning techniques for fine-tuning. We show that MEDoE outperforms baselines which train from scratch or use na\"ive fine-tuning approaches, requiring significantly fewer total training timesteps to solve a range of complex teamwork tasks.  ( 2 min )
    The Monge Gap: A Regularizer to Learn All Transport Maps. (arXiv:2302.04953v1 [cs.LG])
    Optimal transport (OT) theory has been been used in machine learning to study and characterize maps that can push-forward efficiently a probability measure onto another. Recent works have drawn inspiration from Brenier's theorem, which states that when the ground cost is the squared-Euclidean distance, the ``best'' map to morph a continuous measure in $\mathcal{P}(\Rd)$ into another must be the gradient of a convex function. To exploit that result, [Makkuva+ 2020, Korotin+2020] consider maps $T=\nabla f_\theta$, where $f_\theta$ is an input convex neural network (ICNN), as defined by Amos+2017, and fit $\theta$ with SGD using samples. Despite their mathematical elegance, fitting OT maps with ICNNs raises many challenges, due notably to the many constraints imposed on $\theta$; the need to approximate the conjugate of $f_\theta$; or the limitation that they only work for the squared-Euclidean cost. More generally, we question the relevance of using Brenier's result, which only applies to densities, to constrain the architecture of candidate maps fitted on samples. Motivated by these limitations, we propose a radically different approach to estimating OT maps: Given a cost $c$ and a reference measure $\rho$, we introduce a regularizer, the Monge gap $\mathcal{M}^c_{\rho}(T)$ of a map $T$. That gap quantifies how far a map $T$ deviates from the ideal properties we expect from a $c$-OT map. In practice, we drop all architecture requirements for $T$ and simply minimize a distance (e.g., the Sinkhorn divergence) between $T\sharp\mu$ and $\nu$, regularized by $\mathcal{M}^c_\rho(T)$. We study $\mathcal{M}^c_{\rho}$, and show how our simple pipeline outperforms significantly other baselines in practice.
    Information Theoretic Lower Bounds for Information Theoretic Upper Bounds. (arXiv:2302.04925v1 [cs.LG])
    We examine the relationship between the mutual information between the output model and the empirical sample and the generalization of the algorithm in the context of stochastic convex optimization. Despite increasing interest in information-theoretic generalization bounds, it is uncertain if these bounds can provide insight into the exceptional performance of various learning algorithms. Our study of stochastic convex optimization reveals that, for true risk minimization, dimension-dependent mutual information is necessary. This indicates that existing information-theoretic generalization bounds fall short in capturing the generalization capabilities of algorithms like SGD and regularized ERM, which have dimension-independent sample complexity.
    Hyperparameter Search Is All You Need For Training-Agnostic Backdoor Robustness. (arXiv:2302.04977v1 [cs.CR])
    Commoditization and broad adoption of machine learning (ML) technologies expose users of these technologies to new security risks. Many models today are based on neural networks. Training and deploying these models for real-world applications involves complex hardware and software pipelines applied to training data from many sources. Models trained on untrusted data are vulnerable to poisoning attacks that introduce "backdoor" functionality. Compromising a fraction of the training data requires few resources from the attacker, but defending against these attacks is a challenge. Although there have been dozens of defenses proposed in the research literature, most of them are expensive to integrate or incompatible with the existing training pipelines. In this paper, we take a pragmatic, developer-centric view and show how practitioners can answer two actionable questions: (1) how robust is my model to backdoor poisoning attacks?, and (2) how can I make it more robust without changing the training pipeline? We focus on the size of the compromised subset of the training data as a universal metric. We propose an easy-to-learn primitive sub-task to estimate this metric, thus providing a baseline on backdoor poisoning. Next, we show how to leverage hyperparameter search - a tool that ML developers already extensively use - to balance the model's accuracy and robustness to poisoning, without changes to the training pipeline. We demonstrate how to use our metric to estimate the robustness of models to backdoor attacks. We then design, implement, and evaluate a multi-stage hyperparameter search method we call Mithridates that strengthens robustness by 3-5x with only a slight impact on the model's accuracy. We show that the hyperparameters found by our method increase robustness against multiple types of backdoor attacks and extend our method to AutoML and federated learning.  ( 2 min )
    DRGCN: Dynamic Evolving Initial Residual for Deep Graph Convolutional Networks. (arXiv:2302.05083v1 [cs.LG])
    Graph convolutional networks (GCNs) have been proved to be very practical to handle various graph-related tasks. It has attracted considerable research interest to study deep GCNs, due to their potential superior performance compared with shallow ones. However, simply increasing network depth will, on the contrary, hurt the performance due to the over-smoothing problem. Adding residual connection is proved to be effective for learning deep convolutional neural networks (deep CNNs), it is not trivial when applied to deep GCNs. Recent works proposed an initial residual mechanism that did alleviate the over-smoothing problem in deep GCNs. However, according to our study, their algorithms are quite sensitive to different datasets. In their setting, the personalization (dynamic) and correlation (evolving) of how residual applies are ignored. To this end, we propose a novel model called Dynamic evolving initial Residual Graph Convolutional Network (DRGCN). Firstly, we use a dynamic block for each node to adaptively fetch information from the initial representation. Secondly, we use an evolving block to model the residual evolving pattern between layers. Our experimental results show that our model effectively relieves the problem of over-smoothing in deep GCNs and outperforms the state-of-the-art (SOTA) methods on various benchmark datasets. Moreover, we develop a mini-batch version of DRGCN which can be applied to large-scale data. Coupling with several fair training techniques, our model reaches new SOTA results on the large-scale ogbn-arxiv dataset of Open Graph Benchmark (OGB). Our reproducible code is available on GitHub.
  • Open

    Contraction of $E_\gamma$-Divergence and Its Applications to Privacy. (arXiv:2012.11035v2 [cs.IT] UPDATED)
    We investigate the contraction coefficients derived from strong data processing inequalities for the $E_\gamma$-divergence. By generalizing the celebrated Dobrushin's coefficient from total variation distance to $E_\gamma$-divergence, we derive a closed-form expression for the contraction of $E_\gamma$-divergence. This result has fundamental consequences in two privacy settings. First, it implies that local differential privacy can be equivalently expressed in terms of the contraction of $E_\gamma$-divergence. This equivalent formula can be used to precisely quantify the impact of local privacy in (Bayesian and minimax) estimation and hypothesis testing problems in terms of the reduction of effective sample size. Second, it leads to a new information-theoretic technique for analyzing privacy guarantees of online algorithms. In this technique, we view such algorithms as a composition of amplitude-constrained Gaussian channels and then relate their contraction coefficients under $E_\gamma$-divergence to the overall differential privacy guarantees. As an example, we apply our technique to derive the differential privacy parameters of gradient descent. Moreover, we also show that this framework can be tailored to batch learning algorithms that can be implemented with one pass over the training dataset.
    Diagnosing failures of fairness transfer across distribution shift in real-world medical settings. (arXiv:2202.01034v2 [cs.LG] UPDATED)
    Diagnosing and mitigating changes in model fairness under distribution shift is an important component of the safe deployment of machine learning in healthcare settings. Importantly, the success of any mitigation strategy strongly depends on the structure of the shift. Despite this, there has been little discussion of how to empirically assess the structure of a distribution shift that one is encountering in practice. In this work, we adopt a causal framing to motivate conditional independence tests as a key tool for characterizing distribution shifts. Using our approach in two medical applications, we show that this knowledge can help diagnose failures of fairness transfer, including cases where real-world shifts are more complex than is often assumed in the literature. Based on these results, we discuss potential remedies at each step of the machine learning pipeline.
    Minimax Instrumental Variable Regression and $L_2$ Convergence Guarantees without Identification or Closedness. (arXiv:2302.05404v1 [stat.ML])
    In this paper, we study nonparametric estimation of instrumental variable (IV) regressions. Recently, many flexible machine learning methods have been developed for instrumental variable estimation. However, these methods have at least one of the following limitations: (1) restricting the IV regression to be uniquely identified; (2) only obtaining estimation error rates in terms of pseudometrics (\emph{e.g.,} projected norm) rather than valid metrics (\emph{e.g.,} $L_2$ norm); or (3) imposing the so-called closedness condition that requires a certain conditional expectation operator to be sufficiently smooth. In this paper, we present the first method and analysis that can avoid all three limitations, while still permitting general function approximation. Specifically, we propose a new penalized minimax estimator that can converge to a fixed IV solution even when there are multiple solutions, and we derive a strong $L_2$ error rate for our estimator under lax conditions. Notably, this guarantee only needs a widely-used source condition and realizability assumptions, but not the so-called closedness condition. We argue that the source condition and the closedness condition are inherently conflicting, so relaxing the latter significantly improves upon the existing literature that requires both conditions. Our estimator can achieve this improvement because it builds on a novel formulation of the IV estimation problem as a constrained optimization problem.
    On Penalty-based Bilevel Gradient Descent Method. (arXiv:2302.05185v1 [cs.LG])
    Bilevel optimization enjoys a wide range of applications in hyper-parameter optimization, meta-learning and reinforcement learning. However, bilevel optimization problems are difficult to solve. Recent progress on scalable bilevel algorithms mainly focuses on bilevel optimization problems where the lower-level objective is either strongly convex or unconstrained. In this work, we tackle the bilevel problem through the lens of the penalty method. We show that under certain conditions, the penalty reformulation recovers the solutions of the original bilevel problem. Further, we propose the penalty-based bilevel gradient descent (PBGD) algorithm and establish its finite-time convergence for the constrained bilevel problem without lower-level strong convexity. Experiments showcase the efficiency of the proposed PBGD algorithm.
    Black-Box Generalization: Stability of Zeroth-Order Learning. (arXiv:2202.06880v2 [cs.LG] UPDATED)
    We provide the first generalization error analysis for black-box learning through derivative-free optimization. Under the assumption of a Lipschitz and smooth unknown loss, we consider the Zeroth-order Stochastic Search (ZoSS) algorithm, that updates a $d$-dimensional model by replacing stochastic gradient directions with stochastic differences of $K+1$ perturbed loss evaluations per dataset (example) query. For both unbounded and bounded possibly nonconvex losses, we present the first generalization bounds for the ZoSS algorithm. These bounds coincide with those for SGD, and rather surprisingly are independent of $d$, $K$ and the batch size $m$, under appropriate choices of a slightly decreased learning rate. For bounded nonconvex losses and a batch size $m=1$, we additionally show that both generalization error and learning rate are independent of $d$ and $K$, and remain essentially the same as for the SGD, even for two function evaluations. Our results extensively extend and consistently recover established results for SGD in prior work, on both generalization bounds and corresponding learning rates. If additionally $m=n$, where $n$ is the dataset size, we derive generalization guarantees for full-batch GD as well.  ( 2 min )
    Causal Inference out of Control: Estimating the Steerability of Consumption. (arXiv:2302.04989v1 [cs.LG])
    Regulators and academics are increasingly interested in the causal effect that algorithmic actions of a digital platform have on consumption. We introduce a general causal inference problem we call the steerability of consumption that abstracts many settings of interest. Focusing on observational designs and exploiting the structure of the problem, we exhibit a set of assumptions for causal identifiability that significantly weaken the often unrealistic overlap assumptions of standard designs. The key novelty of our approach is to explicitly model the dynamics of consumption over time, viewing the platform as a controller acting on a dynamical system. From this dynamical systems perspective, we are able to show that exogenous variation in consumption and appropriately responsive algorithmic control actions are sufficient for identifying steerability of consumption. Our results illustrate the fruitful interplay of control theory and causal inference, which we illustrate with examples from econometrics, macroeconomics, and machine learning.  ( 2 min )
    Communication-Efficient Federated Hypergradient Computation via Aggregated Iterative Differentiation. (arXiv:2302.04969v1 [cs.LG])
    Federated bilevel optimization has attracted increasing attention due to emerging machine learning and communication applications. The biggest challenge lies in computing the gradient of the upper-level objective function (i.e., hypergradient) in the federated setting due to the nonlinear and distributed construction of a series of global Hessian matrices. In this paper, we propose a novel communication-efficient federated hypergradient estimator via aggregated iterative differentiation (AggITD). AggITD is simple to implement and significantly reduces the communication cost by conducting the federated hypergradient estimation and the lower-level optimization simultaneously. We show that the proposed AggITD-based algorithm achieves the same sample complexity as existing approximate implicit differentiation (AID)-based approaches with much fewer communication rounds in the presence of data heterogeneity. Our results also shed light on the great advantage of ITD over AID in the federated/distributed hypergradient estimation. This differs from the comparison in the non-distributed bilevel optimization, where ITD is less efficient than AID. Our extensive experiments demonstrate the great effectiveness and communication efficiency of the proposed method.  ( 2 min )
    Towards Minimax Optimality of Model-based Robust Reinforcement Learning. (arXiv:2302.05372v1 [cs.LG])
    We study the sample complexity of obtaining an $\epsilon$-optimal policy in \emph{Robust} discounted Markov Decision Processes (RMDPs), given only access to a generative model of the nominal kernel. This problem is widely studied in the non-robust case, and it is known that any planning approach applied to an empirical MDP estimated with $\tilde{\mathcal{O}}(\frac{H^3 \mid S \mid\mid A \mid}{\epsilon^2})$ samples provides an $\epsilon$-optimal policy, which is minimax optimal. Results in the robust case are much more scarce. For $sa$- (resp $s$-)rectangular uncertainty sets, the best known sample complexity is $\tilde{\mathcal{O}}(\frac{H^4 \mid S \mid^2\mid A \mid}{\epsilon^2})$ (resp. $\tilde{\mathcal{O}}(\frac{H^4 \mid S \mid^2\mid A \mid^2}{\epsilon^2})$), for specific algorithms and when the uncertainty set is based on the total variation (TV), the KL or the Chi-square divergences. In this paper, we consider uncertainty sets defined with an $L_p$-ball (recovering the TV case), and study the sample complexity of \emph{any} planning algorithm (with high accuracy guarantee on the solution) applied to an empirical RMDP estimated using the generative model. In the general case, we prove a sample complexity of $\tilde{\mathcal{O}}(\frac{H^4 \mid S \mid\mid A \mid}{\epsilon^2})$ for both the $sa$- and $s$-rectangular cases (improvements of $\mid S \mid$ and $\mid S \mid\mid A \mid$ respectively). When the size of the uncertainty is small enough, we improve the sample complexity to $\tilde{\mathcal{O}}(\frac{H^3 \mid S \mid\mid A \mid }{\epsilon^2})$, recovering the lower-bound for the non-robust case for the first time and a robust lower-bound when the size of the uncertainty is small enough.  ( 2 min )
    Oracle-Efficient Smoothed Online Learning for Piecewise Continuous Decision Making. (arXiv:2302.05430v1 [stat.ML])
    Smoothed online learning has emerged as a popular framework to mitigate the substantial loss in statistical and computational complexity that arises when one moves from classical to adversarial learning. Unfortunately, for some spaces, it has been shown that efficient algorithms suffer an exponentially worse regret than that which is minimax optimal, even when the learner has access to an optimization oracle over the space. To mitigate that exponential dependence, this work introduces a new notion of complexity, the generalized bracketing numbers, which marries constraints on the adversary to the size of the space, and shows that an instantiation of Follow-the-Perturbed-Leader can attain low regret with the number of calls to the optimization oracle scaling optimally with respect to average regret. We then instantiate our bounds in several problems of interest, including online prediction and planning of piecewise continuous functions, which has many applications in fields as diverse as econometrics and robotics.  ( 2 min )
    Achieving Linear Speedup in Non-IID Federated Bilevel Learning. (arXiv:2302.05412v1 [cs.LG])
    Federated bilevel optimization has received increasing attention in various emerging machine learning and communication applications. Recently, several Hessian-vector-based algorithms have been proposed to solve the federated bilevel optimization problem. However, several important properties in federated learning such as the partial client participation and the linear speedup for convergence (i.e., the convergence rate and complexity are improved linearly with respect to the number of sampled clients) in the presence of non-i.i.d.~datasets, still remain open. In this paper, we fill these gaps by proposing a new federated bilevel algorithm named FedMBO with a novel client sampling scheme in the federated hypergradient estimation. We show that FedMBO achieves a convergence rate of $\mathcal{O}\big(\frac{1}{\sqrt{nK}}+\frac{1}{K}+\frac{\sqrt{n}}{K^{3/2}}\big)$ on non-i.i.d.~datasets, where $n$ is the number of participating clients in each round, and $K$ is the total number of iteration. This is the first theoretical linear speedup result for non-i.i.d.~federated bilevel optimization. Extensive experiments validate our theoretical results and demonstrate the effectiveness of our proposed method.  ( 2 min )
    Star-Shaped Denoising Diffusion Probabilistic Models. (arXiv:2302.05259v1 [stat.ML])
    Methods based on Denoising Diffusion Probabilistic Models (DDPM) became a ubiquitous tool in generative modeling. However, they are mostly limited to Gaussian and discrete diffusion processes. We propose Star-Shaped Denoising Diffusion Probabilistic Models (SS-DDPM), a model with a non-Markovian diffusion-like noising process. In the case of Gaussian distributions, this model is equivalent to Markovian DDPMs. However, it can be defined and applied with arbitrary noising distributions, and admits efficient training and sampling algorithms for a wide range of distributions that lie in the exponential family. We provide a simple recipe for designing diffusion-like models with distributions like Beta, von Mises--Fisher, Dirichlet, Wishart and others, which can be especially useful when data lies on a constrained manifold such as the unit sphere, the space of positive semi-definite matrices, the probabilistic simplex, etc. We evaluate the model in different settings and find it competitive even on image data, where Beta SS-DDPM achieves results comparable to a Gaussian DDPM.  ( 2 min )
    Hierarchical classification at multiple operating points. (arXiv:2210.10929v2 [cs.LG] UPDATED)
    Many classification problems consider classes that form a hierarchy. Classifiers that are aware of this hierarchy may be able to make confident predictions at a coarse level despite being uncertain at the fine-grained level. While it is generally possible to vary the granularity of predictions using a threshold at inference time, most contemporary work considers only leaf-node prediction, and almost no prior work has compared methods at multiple operating points. We present an efficient algorithm to produce operating characteristic curves for any method that assigns a score to every class in the hierarchy. Applying this technique to evaluate existing methods reveals that top-down classifiers are dominated by a naive flat softmax classifier across the entire operating range. We further propose two novel loss functions and show that a soft variant of the structured hinge loss is able to significantly outperform the flat baseline. Finally, we investigate the poor accuracy of top-down classifiers and demonstrate that they perform relatively well on unseen classes. Code is available online at https://github.com/jvlmdr/hiercls.
    Interventional Causal Representation Learning. (arXiv:2209.11924v3 [stat.ML] UPDATED)
    Causal representation learning seeks to extract high-level latent factors from low-level sensory data. Most existing methods rely on observational data and structural assumptions (e.g., conditional independence) to identify the latent factors. However, interventional data is prevalent across applications. Can interventional data facilitate causal representation learning? We explore this question in this paper. The key observation is that interventional data often carries geometric signatures of the latent factors' support (i.e. what values each latent can possibly take). For example, when the latent factors are causally connected, interventions can break the dependency between the intervened latents' support and their ancestors'. Leveraging this fact, we prove that the latent causal factors can be identified up to permutation and scaling given data from perfect $do$ interventions. Moreover, we can achieve block affine identification, namely the estimated latent factors are only entangled with a few other latents if we have access to data from imperfect interventions. These results highlight the unique power of interventional data in causal representation learning; they can enable provable identification of latent factors without any assumptions about their distributions or dependency structure.  ( 2 min )
    Effects of noise on the overparametrization of quantum neural networks. (arXiv:2302.05059v1 [quant-ph])
    Overparametrization is one of the most surprising and notorious phenomena in machine learning. Recently, there have been several efforts to study if, and how, Quantum Neural Networks (QNNs) acting in the absence of hardware noise can be overparametrized. In particular, it has been proposed that a QNN can be defined as overparametrized if it has enough parameters to explore all available directions in state space. That is, if the rank of the Quantum Fisher Information Matrix (QFIM) for the QNN's output state is saturated. Here, we explore how the presence of noise affects the overparametrization phenomenon. Our results show that noise can "turn on" previously-zero eigenvalues of the QFIM. This enables the parametrized state to explore directions that were otherwise inaccessible, thus potentially turning an overparametrized QNN into an underparametrized one. For small noise levels, the QNN is quasi-overparametrized, as large eigenvalues coexists with small ones. Then, we prove that as the magnitude of noise increases all the eigenvalues of the QFIM become exponentially suppressed, indicating that the state becomes insensitive to any change in the parameters. As such, there is a pull-and-tug effect where noise can enable new directions, but also suppress the sensitivity to parameter updates. Finally, our results imply that current QNN capacity measures are ill-defined when hardware noise is present.
    The out-of-sample $R^2$: estimation and inference. (arXiv:2302.05131v1 [stat.ME])
    Out-of-sample prediction is the acid test of predictive models, yet an independent test dataset is often not available for assessment of the prediction error. For this reason, out-of-sample performance is commonly estimated using data splitting algorithms such as cross-validation or the bootstrap. For quantitative outcomes, the ratio of variance explained to total variance can be summarized by the coefficient of determination or in-sample $R^2$, which is easy to interpret and to compare across different outcome variables. As opposed to the in-sample $R^2$, the out-of-sample $R^2$ has not been well defined and the variability on the out-of-sample $\hat{R}^2$ has been largely ignored. Usually only its point estimate is reported, hampering formal comparison of predictability of different outcome variables. Here we explicitly define the out-of-sample $R^2$ as a comparison of two predictive models, provide an unbiased estimator and exploit recent theoretical advances on uncertainty of data splitting estimates to provide a standard error for the $\hat{R}^2$. The performance of the estimators for the $R^2$ and its standard error are investigated in a simulation study. We demonstrate our new method by constructing confidence intervals and comparing models for prediction of quantitative $\text{Brassica napus}$ and $\text{Zea mays}$ phenotypes based on gene expression data.  ( 2 min )
    Tensor Generalized Canonical Correlation Analysis. (arXiv:2302.05277v1 [stat.ML])
    Regularized Generalized Canonical Correlation Analysis (RGCCA) is a general statistical framework for multi-block data analysis. RGCCA enables deciphering relationships between several sets of variables and subsumes many well-known multivariate analysis methods as special cases. However, RGCCA only deals with vector-valued blocks, disregarding their possible higher-order structures. This paper presents Tensor GCCA (TGCCA), a new method for analyzing higher-order tensors with canonical vectors admitting an orthogonal rank-R CP decomposition. Moreover, two algorithms for TGCCA, based on whether a separable covariance structure is imposed or not, are presented along with convergence guarantees. The efficiency and usefulness of TGCCA are evaluated on simulated and real data and compared favorably to state-of-the-art approaches.  ( 2 min )
    Latent State Marginalization as a Low-cost Approach for Improving Exploration. (arXiv:2210.00999v2 [cs.LG] UPDATED)
    While the maximum entropy (MaxEnt) reinforcement learning (RL) framework -- often touted for its exploration and robustness capabilities -- is usually motivated from a probabilistic perspective, the use of deep probabilistic models has not gained much traction in practice due to their inherent complexity. In this work, we propose the adoption of latent variable policies within the MaxEnt framework, which we show can provably approximate any policy distribution, and additionally, naturally emerges under the use of world models with a latent belief state. We discuss why latent variable policies are difficult to train, how naive approaches can fail, then subsequently introduce a series of improvements centered around low-cost marginalization of the latent state, allowing us to make full use of the latent state at minimal additional cost. We instantiate our method under the actor-critic framework, marginalizing both the actor and critic. The resulting algorithm, referred to as Stochastic Marginal Actor-Critic (SMAC), is simple yet effective. We experimentally validate our method on continuous control tasks, showing that effective marginalization can lead to better exploration and more robust training. Our implementation is open sourced at https://github.com/zdhNarsil/Stochastic-Marginal-Actor-Critic.  ( 2 min )
    The Monge Gap: A Regularizer to Learn All Transport Maps. (arXiv:2302.04953v1 [cs.LG])
    Optimal transport (OT) theory has been been used in machine learning to study and characterize maps that can push-forward efficiently a probability measure onto another. Recent works have drawn inspiration from Brenier's theorem, which states that when the ground cost is the squared-Euclidean distance, the ``best'' map to morph a continuous measure in $\mathcal{P}(\Rd)$ into another must be the gradient of a convex function. To exploit that result, [Makkuva+ 2020, Korotin+2020] consider maps $T=\nabla f_\theta$, where $f_\theta$ is an input convex neural network (ICNN), as defined by Amos+2017, and fit $\theta$ with SGD using samples. Despite their mathematical elegance, fitting OT maps with ICNNs raises many challenges, due notably to the many constraints imposed on $\theta$; the need to approximate the conjugate of $f_\theta$; or the limitation that they only work for the squared-Euclidean cost. More generally, we question the relevance of using Brenier's result, which only applies to densities, to constrain the architecture of candidate maps fitted on samples. Motivated by these limitations, we propose a radically different approach to estimating OT maps: Given a cost $c$ and a reference measure $\rho$, we introduce a regularizer, the Monge gap $\mathcal{M}^c_{\rho}(T)$ of a map $T$. That gap quantifies how far a map $T$ deviates from the ideal properties we expect from a $c$-OT map. In practice, we drop all architecture requirements for $T$ and simply minimize a distance (e.g., the Sinkhorn divergence) between $T\sharp\mu$ and $\nu$, regularized by $\mathcal{M}^c_\rho(T)$. We study $\mathcal{M}^c_{\rho}$, and show how our simple pipeline outperforms significantly other baselines in practice.  ( 2 min )
    The Survival Bandit Problem. (arXiv:2206.03019v3 [cs.LG] UPDATED)
    We study the survival bandit problem, a variant of the multi-armed bandit problem with a constraint on the cumulative reward; at each time step, the agent receives a reward in [-1, 1] and if the cumulative reward becomes lower than a preset threshold, the procedure stops, and this phenomenon is called ruin. To our knowledge, this is the first paper studying a framework where the ruin might occur but not always. We first discuss that no policy can achieve a sublinear regret as defined in the standard multi-armed bandit problem, because a single pull of an arm may increase significantly the risk of ruin. Instead, we establish the framework of Pareto-optimal policies, which is a class of policies whose cumulative reward for some instance cannot be improved without sacrificing that for another instance. To this end, we provide tight lower bounds on the probability of ruin, as well as matching policies called EXPLOIT. Finally, using a doubling trick over an EXPLOIT policy, we display a Pareto-optimal policy in the case of {-1, 0, 1} rewards, giving an answer to the open problem by Perotto et al. (2019).  ( 2 min )
    Differentially Private Fr\'echet Mean on the Manifold of Symmetric Positive Definite (SPD) Matrices with log-Euclidean Metric. (arXiv:2208.04245v2 [math.ST] UPDATED)
    Differential privacy has become crucial in the real-world deployment of statistical and machine learning algorithms with rigorous privacy guarantees. The earliest statistical queries, for which differential privacy mechanisms have been developed, were for the release of the sample mean. In Geometric Statistics, the sample Fr\'echet mean represents one of the most fundamental statistical summaries, as it generalizes the sample mean for data belonging to nonlinear manifolds. In that spirit, the only geometric statistical query for which a differential privacy mechanism has been developed, so far, is for the release of the sample Fr\'echet mean: the \emph{Riemannian Laplace mechanism} was recently proposed to privatize the Fr\'echet mean on complete Riemannian manifolds. In many fields, the manifold of Symmetric Positive Definite (SPD) matrices is used to model data spaces, including in medical imaging where privacy requirements are key. We propose a novel, simple and fast mechanism - the \emph{tangent Gaussian mechanism} - to compute a differentially private Fr\'echet mean on the SPD manifold endowed with the log-Euclidean Riemannian metric. We show that our new mechanism has significantly better utility and is computationally efficient -- as confirmed by extensive experiments.  ( 2 min )
    Efficient and Accurate Learning of Mixtures of Plackett-Luce Models. (arXiv:2302.05343v1 [cs.LG])
    Mixture models of Plackett-Luce (PL) -- one of the most fundamental ranking models -- are an active research area of both theoretical and practical significance. Most previously proposed parameter estimation algorithms instantiate the EM algorithm, often with random initialization. However, such an initialization scheme may not yield a good initial estimate and the algorithms require multiple restarts, incurring a large time complexity. As for the EM procedure, while the E-step can be performed efficiently, maximizing the log-likelihood in the M-step is difficult due to the combinatorial nature of the PL likelihood function (Gormley and Murphy 2008). Therefore, previous authors favor algorithms that maximize surrogate likelihood functions (Zhao et al. 2018, 2020). However, the final estimate may deviate from the true maximum likelihood estimate as a consequence. In this paper, we address these known limitations. We propose an initialization algorithm that can provide a provably accurate initial estimate and an EM algorithm that maximizes the true log-likelihood function efficiently. Experiments on both synthetic and real datasets show that our algorithm is competitive in terms of accuracy and speed to baseline algorithms, especially on datasets with a large number of items.  ( 2 min )
    DNArch: Learning Convolutional Neural Architectures by Backpropagation. (arXiv:2302.05400v1 [cs.LG])
    We present Differentiable Neural Architectures (DNArch), a method that jointly learns the weights and the architecture of Convolutional Neural Networks (CNNs) by backpropagation. In particular, DNArch allows learning (i) the size of convolutional kernels at each layer, (ii) the number of channels at each layer, (iii) the position and values of downsampling layers, and (iv) the depth of the network. To this end, DNArch views neural architectures as continuous multidimensional entities, and uses learnable differentiable masks along each dimension to control their size. Unlike existing methods, DNArch is not limited to a predefined set of possible neural components, but instead it is able to discover entire CNN architectures across all combinations of kernel sizes, widths, depths and downsampling. Empirically, DNArch finds performant CNN architectures for several classification and dense prediction tasks on both sequential and image data. When combined with a loss term that considers the network complexity, DNArch finds powerful architectures that respect a predefined computational budget.  ( 2 min )
    Differentially Private Optimization for Smooth Nonconvex ERM. (arXiv:2302.04972v1 [cs.LG])
    We develop simple differentially private optimization algorithms that move along directions of (expected) descent to find an approximate second-order solution for nonconvex ERM. We use line search, mini-batching, and a two-phase strategy to improve the speed and practicality of the algorithm. Numerical experiments demonstrate the effectiveness of these approaches.  ( 2 min )
    Quadratic Memory is Necessary for Optimal Query Complexity in Convex Optimization: Center-of-Mass is Pareto-Optimal. (arXiv:2302.04963v1 [cs.LG])
    We give query complexity lower bounds for convex optimization and the related feasibility problem. We show that quadratic memory is necessary to achieve the optimal oracle complexity for first-order convex optimization. In particular, this shows that center-of-mass cutting-planes algorithms in dimension $d$ which use $\tilde O(d^2)$ memory and $\tilde O(d)$ queries are Pareto-optimal for both convex optimization and the feasibility problem, up to logarithmic factors. Precisely, we prove that to minimize $1$-Lipschitz convex functions over the unit ball to $1/d^4$ accuracy, any deterministic first-order algorithms using at most $d^{2-\delta}$ bits of memory must make $\tilde\Omega(d^{1+\delta/3})$ queries, for any $\delta\in[0,1]$. For the feasibility problem, in which an algorithm only has access to a separation oracle, we show a stronger trade-off: for at most $d^{2-\delta}$ memory, the number of queries required is $\tilde\Omega(d^{1+\delta})$. This resolves a COLT 2019 open problem of Woodworth and Srebro.  ( 2 min )
    Beyond Lipschitz: Sharp Generalization and Excess Risk Bounds for Full-Batch GD. (arXiv:2204.12446v5 [stat.ML] UPDATED)
    We provide sharp path-dependent generalization and excess risk guarantees for the full-batch Gradient Descent (GD) algorithm on smooth losses (possibly non-Lipschitz, possibly nonconvex). At the heart of our analysis is an upper bound on the generalization error, which implies that average output stability and a bounded expected optimization error at termination lead to generalization. This result shows that a small generalization error occurs along the optimization path, and allows us to bypass Lipschitz or sub-Gaussian assumptions on the loss prevalent in previous works. For nonconvex, convex, and strongly convex losses, we show the explicit dependence of the generalization error in terms of the accumulated path-dependent optimization error, terminal optimization error, number of samples, and number of iterations. For nonconvex smooth losses, we prove that full-batch GD efficiently generalizes close to any stationary point at termination, and recovers the generalization error guarantees of stochastic algorithms with fewer assumptions. For smooth convex losses, we show that the generalization error is tighter than existing bounds for SGD (up to one order of error magnitude). Consequently the excess risk matches that of SGD for quadratically less iterations. Lastly, for strongly convex smooth losses, we show that full-batch GD achieves essentially the same excess risk rate as compared with the state of the art on SGD, but with an exponentially smaller number of iterations (logarithmic in the dataset size).  ( 2 min )
    Unbinned Profiled Unfolding. (arXiv:2302.05390v1 [hep-ph])
    Unfolding is an important procedure in particle physics experiments which corrects for detector effects and provides differential cross section measurements that can be used for a number of downstream tasks, such as extracting fundamental physics parameters. Traditionally, unfolding is done by discretizing the target phase space into a finite number of bins and is limited in the number of unfolded variables. Recently, there have been a number of proposals to perform unbinned unfolding with machine learning. However, none of these methods (like most unfolding methods) allow for simultaneously constraining (profiling) nuisance parameters. We propose a new machine learning-based unfolding method that results in an unbinned differential cross section and can profile nuisance parameters. The machine learning loss function is the full likelihood function, based on binned inputs at detector-level. We first demonstrate the method with simple Gaussian examples and then show the impact on a simulated Higgs boson cross section measurement.  ( 2 min )
    Stochastic Multiple Target Sampling Gradient Descent. (arXiv:2206.01934v4 [cs.LG] UPDATED)
    Sampling from an unnormalized target distribution is an essential problem with many applications in probabilistic inference. Stein Variational Gradient Descent (SVGD) has been shown to be a powerful method that iteratively updates a set of particles to approximate the distribution of interest. Furthermore, when analysing its asymptotic properties, SVGD reduces exactly to a single-objective optimization problem and can be viewed as a probabilistic version of this single-objective optimization problem. A natural question then arises: "Can we derive a probabilistic version of the multi-objective optimization?". To answer this question, we propose Stochastic Multiple Target Sampling Gradient Descent (MT-SGD), enabling us to sample from multiple unnormalized target distributions. Specifically, our MT-SGD conducts a flow of intermediate distributions gradually orienting to multiple target distributions, which allows the sampled particles to move to the joint high-likelihood region of the target distributions. Interestingly, the asymptotic analysis shows that our approach reduces exactly to the multiple-gradient descent algorithm for multi-objective optimization, as expected. Finally, we conduct comprehensive experiments to demonstrate the merit of our approach to multi-task learning.  ( 2 min )
    A Second-Order Method for Stochastic Bandit Convex Optimisation. (arXiv:2302.05371v1 [cs.LG])
    We introduce a simple and efficient algorithm for unconstrained zeroth-order stochastic convex bandits and prove its regret is at most $(1 + r/d)[d^{1.5} \sqrt{n} + d^3] polylog(n, d, r)$ where $n$ is the horizon, $d$ the dimension and $r$ is the radius of a known ball containing the minimiser of the loss.  ( 2 min )
    Gaussian Pre-Activations in Neural Networks: Myth or Reality?. (arXiv:2205.12379v3 [cs.LG] UPDATED)
    The study of feature propagation at initialization in neural networks lies at the root of numerous initialization designs. An assumption very commonly made in the field states that the pre-activations are Gaussian. Although this convenient Gaussian hypothesis can be justified when the number of neurons per layer tends to infinity, it is challenged by both theoretical and experimental works for finite-width neural networks. Our major contribution is to construct a family of pairs of activation functions and initialization distributions that ensure that the pre-activations remain Gaussian throughout the network's depth, even in narrow neural networks. In the process, we discover a set of constraints that a neural network should fulfill to ensure Gaussian pre-activations. Additionally, we provide a critical review of the claims of the Edge of Chaos line of works and build an exact Edge of Chaos analysis. We also propose a unified view on pre-activations propagation, encompassing the framework of several well-known initialization procedures. Finally, our work provides a principled framework for answering the much-debated question: is it desirable to initialize the training of a neural network whose pre-activations are ensured to be Gaussian?  ( 2 min )
    Hessian Based Smoothing Splines for Manifold Learning. (arXiv:2302.05025v1 [stat.ML])
    We propose a multidimensional smoothing spline algorithm in the context of manifold learning. We generalize the bending energy penalty of thin-plate splines to a quadratic form on the Sobolev space of a flat manifold, based on the Frobenius norm of the Hessian matrix. This leads to a natural definition of smoothing splines on manifolds, which minimizes square error while optimizing a global curvature penalty. The existence and uniqueness of the solution is shown by applying the theory of reproducing kernel Hilbert spaces. The minimizer is expressed as a combination of Green's functions for the biharmonic operator, and 'linear' functions of everywhere vanishing Hessian. Furthermore, we utilize the Hessian estimation procedure from the Hessian Eigenmaps algorithm to approximate the spline loss when the true manifold is unknown. This yields a particularly simple quadratic optimization algorithm for smoothing response values without needing to fit the underlying manifold. Analysis of asymptotic error and robustness are given, as well as discussion of out-of-sample prediction methods and applications.  ( 2 min )
    On Average-Case Error Bounds for Kernel-Based Bayesian Quadrature. (arXiv:2202.10615v2 [stat.ML] UPDATED)
    In this paper, we study error bounds for {\em Bayesian quadrature} (BQ), with an emphasis on noisy settings, randomized algorithms, and average-case performance measures. We seek to approximate the integral of functions in a {\em Reproducing Kernel Hilbert Space} (RKHS), particularly focusing on the Mat\'ern-$\nu$ and squared exponential (SE) kernels, with samples from the function potentially being corrupted by Gaussian noise. We provide a two-step meta-algorithm that serves as a general tool for relating the average-case quadrature error with the $L^2$-function approximation error. When specialized to the Mat\'ern kernel, we recover an existing near-optimal error rate while avoiding the existing method of repeatedly sampling points. When specialized to other settings, we obtain new average-case results for settings including the SE kernel with noise and the Mat\'ern kernel with misspecification. Finally, we present algorithm-independent lower bounds that have greater generality and/or give distinct proofs compared to existing ones.  ( 2 min )
    MoreauGrad: Sparse and Robust Interpretation of Neural Networks via Moreau Envelope. (arXiv:2302.05294v1 [cs.CV])
    Explaining the predictions of deep neural nets has been a topic of great interest in the computer vision literature. While several gradient-based interpretation schemes have been proposed to reveal the influential variables in a neural net's prediction, standard gradient-based interpretation frameworks have been commonly observed to lack robustness to input perturbations and flexibility for incorporating prior knowledge of sparsity and group-sparsity structures. In this work, we propose MoreauGrad as an interpretation scheme based on the classifier neural net's Moreau envelope. We demonstrate that MoreauGrad results in a smooth and robust interpretation of a multi-layer neural network and can be efficiently computed through first-order optimization methods. Furthermore, we show that MoreauGrad can be naturally combined with $L_1$-norm regularization techniques to output a sparse or group-sparse explanation which are prior conditions applicable to a wide range of deep learning applications. We empirically evaluate the proposed MoreauGrad scheme on standard computer vision datasets, showing the qualitative and quantitative success of the MoreauGrad approach in comparison to standard gradient-based interpretation methods.  ( 2 min )
    Information Theoretic Lower Bounds for Information Theoretic Upper Bounds. (arXiv:2302.04925v1 [cs.LG])
    We examine the relationship between the mutual information between the output model and the empirical sample and the generalization of the algorithm in the context of stochastic convex optimization. Despite increasing interest in information-theoretic generalization bounds, it is uncertain if these bounds can provide insight into the exceptional performance of various learning algorithms. Our study of stochastic convex optimization reveals that, for true risk minimization, dimension-dependent mutual information is necessary. This indicates that existing information-theoretic generalization bounds fall short in capturing the generalization capabilities of algorithms like SGD and regularized ERM, which have dimension-independent sample complexity.  ( 2 min )
    Gaussian Process-Gated Hierarchical Mixtures of Experts. (arXiv:2302.04947v1 [cs.LG])
    In this paper, we propose novel Gaussian process-gated hierarchical mixtures of experts (GPHMEs) that are used for building gates and experts. Unlike in other mixtures of experts where the gating models are linear to the input, the gating functions of our model are inner nodes built with Gaussian processes based on random features that are non-linear and non-parametric. Further, the experts are also built with Gaussian processes and provide predictions that depend on test data. The optimization of the GPHMEs is carried out by variational inference. There are several advantages of the proposed GPHMEs. One is that they outperform tree-based HME benchmarks that partition the data in the input space. Another advantage is that they achieve good performance with reduced complexity. A third advantage of the GPHMEs is that they provide interpretability of deep Gaussian processes and more generally of deep Bayesian neural networks. Our GPHMEs demonstrate excellent performance for large-scale data sets even with quite modest sizes.  ( 2 min )

  • Open

    Ai blend of will smith and will ferrel
    submitted by /u/Piano-Nerd [link] [comments]  ( 40 min )
    Weekly China AI News: Chinese Tech Firms Embrace ChatGPT Trend; Alibaba DAMO Academy Increases Registered Capital; Get to Know the 14 New AI Unicorns in 2022
    submitted by /u/trcytony [link] [comments]  ( 40 min )
    80s aesthetic and ai uncanny accuracy.
    How is ai capturing 80s aesthetic better than almost everybody else (humans)?! If you look up 80s aesthetic on google or youtube you always get the same sort of synthwave type art-style which although reminiscent to 80s aesthetics, it ends up being overly bright colors with bright neon etc. ai seems to actually capture the 80s aesthetics perfectly even without a drop of neon! if you watch the so and so as an 80s fantasy/sci fi on youtube you can get exactly what I mean. it looks just like a shot from an 80s action movie. Does anyone actually know why the ai will go for this style instead of the pseudo synthwave 80s aesthetic that has been popularized? is this just a byproduct of art breeder itself? I haven't checked to see if other ai is doing it also. Here are some pictures of actual 80s and early 90s movies to see what I mean. https://preview.redd.it/1ent8icky0ia1.png?width=1524&format=png&auto=webp&s=c2262e9e5b16a45e81fe4c1933bb32321c3adea6 https://preview.redd.it/d52fp69ay0ia1.png?width=1280&format=png&auto=webp&s=052d984a558606754a462e581fff0d631f38a787 https://preview.redd.it/fx1jh2alx0ia1.png?width=2800&format=png&auto=webp&s=491388b6bbccccb6756b7acf23f165c52660abc7 https://preview.redd.it/e7gav84rx0ia1.png?width=600&format=png&auto=webp&s=7dc41e5a6339f7e0f340d663454a462f4a53e6c7 https://preview.redd.it/a77210wsx0ia1.png?width=691&format=png&auto=webp&s=214ffe76580b4eef9db249e60953c55f68163fa0 submitted by /u/The_Philosopher_Guy [link] [comments]  ( 41 min )
    SUMMARIZING EARLY BING AI REVIEWS
    HEY ALL-- PUT TOGETHER A SUMMARY OF THE EARLY ACCESS BING AI REVIEWS AND COMMENTARY. THIS WAS FOR OUR 3X PER WEEK NEWSLETTER, SMOKING ROBOT. FULL THING HERE-- FEEL FREE TO SUB BUT NOT EXPECTATION TO DO SO, JUST TRYING TO BRING SOME #VALUE HERE: https://smokingrobot.beehiiv.com/p/bing-better-than-google ​ ------------- ​ The new AI Bing is out in the wild. Trusted testers and media folks got access to the new tool over the weekend, and some of the early reviews are quite good. Here’s the good, the bad, the ugly, and the sentient from the previews of new Bing. The Good Most notably, CNET compared results from new Bing AI and regular Google search on 10 queries— 8 of the 10 times the author preferred the Bing result. Both sites did a fine job displaying accurate search results,…  ( 47 min )
    Lol. I made Elon musk believe in God.
    submitted by /u/No-Factor2579 [link] [comments]  ( 40 min )
    AI Dream 157 - MASTERPIECE - PART 7 TEASER - 2K SUBS CELEBRATION! 🥳🎉 - A...
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Google’s New AI: The Age of AI-Made Movies Is Here!
    submitted by /u/MarS_0ne [link] [comments]  ( 6 min )
    Get Access to the Latest AI Tools For Free and Learn How to Use Them
    I wanted to showcase a new initiative from AINow that is designed to help you explore the full potential of AI. With so many AI tools on the market, it can be overwhelming and expensive to determine which ones are right for you. That's why we've decided to take AI education to the next level and give our subscribers access to the latest AI tools, all for free! Here's how it works: At the start of each week, our subscribers will vote on the tools they would like to try out and learn how to use. Based on the votes, we will get in touch with the company and secure free access for our subscribers. Then, throughout the rest of the week, we will be providing in-depth tutorials and tips on how to best use the tool. This way, you can try out multiple AI tools, determine which ones work best for your projects, and find your fit in the AI world without having to pay a monthly fee. At AINow, we believe that AI should be accessible and understandable to everyone. That's why we're removing the barriers that may have held you back from exploring the full potential of AI. Our expert insights will help you get the most out of each tool and give you the knowledge you need to succeed. Whether you're just starting out or are looking to expand your AI skills, AINow is your one-stop-shop for cutting-edge AI education. Sign up here if you are interested in becoming apart of this initiative. To those that made it this far: thank you for taking the time to read this. If you have any questions about this, leave them in the comments and I will try my best to answer them. Best regards, The AINow Team submitted by /u/Flaky_Preparation_50 [link] [comments]  ( 43 min )
    Top 7 Best GPT3 Tools (for Content Marketers)
    submitted by /u/Chisom1998_ [link] [comments]  ( 40 min )
    Voice + Video deepfake by AI is used for scams now...
    submitted by /u/ChaosMindsDev [link] [comments]  ( 40 min )
    Which virtual tasks ai can't take yet?
    Only talking about digital tasks and not mean bad ai. Which virtual tasks ai can't take in general right now and is also not in progress? submitted by /u/ComfyHikiandNeet [link] [comments]  ( 40 min )
    Opera Browser Launches ChatGPT-Powered Tool 'Shorten'
    submitted by /u/TheInsaneApp [link] [comments]  ( 40 min )
    How I Built My Own "ChatGPT" for (Almost) Free in Less Than 24 Hours
    submitted by /u/tipani86 [link] [comments]  ( 40 min )
    Sam - Using GPT to flirt with you based on your preferences. Your Single Awareness Mitigator this Single Awareness Day
    submitted by /u/henshinger [link] [comments]  ( 41 min )
    All of this happened in AI today. 13/2
    Hello humans - This is AI Daily O vetted, helping you stay updated on AI in less than 5 minutes. ​ Join O'vetted AI news for free. Forget spending 3.39 hours finding good AI news to read. ​ What’s happening in AI - You Can Now Create AI-Generated Videos From Text Prompts. Runway has gone one step further and announced Gen-1: an AI model that can create videos from text prompts. This is a breakthrough in the world of generative AI, and Runway is one of the first companies to use AI to create videos using text prompts and AI chatbots. The model doesn't generate entirely new videos, it creates videos from the ones you upload, using text or image prompts to apply effects. Take a look at their explainer video. Opera’s building ChatGPT into its sidebar. Opera is adding a ChatGPT-po…  ( 45 min )
    Opera is the latest product to integrate ChatGPT
    submitted by /u/Phishstixxx [link] [comments]  ( 40 min )
    I asked ChatGPT to generate me an album and I made it into reality.
    submitted by /u/r4pturesan [link] [comments]  ( 40 min )
    What do you think about the next affirmation? "Believe it or not, A.I. will be the only safe job in the next 25 years."
    Believe it or not, A.I. will be the only safe job in the next 25 years. submitted by /u/TheVellerShow [link] [comments]  ( 46 min )
    Get Consistent Stylized Results with the Stable Diffusion Img2Img Webu
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    I am interested in learning despite my learning disability.
    So I am 25 and was recently diagnosed with Adhd and went through school without a system essentially just trying to remember stuff from class and reading for many hours at home but barely learning. This has halted my academic progress. A friend of mine recommended I come here and ask people with knowledge about ai which are the top ai's to use for academics/building a Business. I am interested in it showing the solution to a problems as well as how to reach it step by step rather than having something tell me 1+1=2. Is chat gdp sufficient for this? What do you recommend. On a side note which ai should I use for learning a new skill/hobby especially if it's related to technology like coding, or I don't know woodworking for example. I appreciate you reading this far and sorry if it is repetitive or wrong to post it here. submitted by /u/Just_some_mild_ADHD [link] [comments]  ( 42 min )
    Harry Potter's variations!
    submitted by /u/Mission_Tour3805 [link] [comments]  ( 41 min )
    The Rise Of ChatGPT And The Losing Game Of Google Bard
    submitted by /u/chronck [link] [comments]  ( 40 min )
    ChatGPT spits back some pretty good code, actually. I've been using it to learn and finish neglected projects
    submitted by /u/Alarming-Recipe2857 [link] [comments]  ( 40 min )
  • Open

    Neural Network Video Companies
    I'm a compsci/cinema studies college student, and I want to ride the upcoming AI wave as much as possible. In looking for summer internships, what are some companies making headway with AI-generated video that I could reach out to? Thank you :) submitted by /u/elijahbmorton [link] [comments]  ( 41 min )
    Get Consistent Stylized Results with the Stable Diffusion Img2Img Webu
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    Could a fractal Neural network work?
    I'm thinking about a network structure similar to this: - back circles are "activating" neurons - red circles are "deactivating" neurons - the red and black squares on the right can either contain the structure of the pink square or the green square. - The network loops back on itself at every layer. - every connection could be reduced to "0", so it doesn't have any effect on the network. - The output of the network is also part of the input, so the perception of the input depends on the networks' recent output Idea for a fractal network building block. Has something like this been tried? What would be fundamental problems with an approach like this? submitted by /u/Achereto [link] [comments]  ( 43 min )
    Diagrammatic/Logical reasoning Neural Network to answer a visual pattern question correctly
    I want to build an AI model that would improve the way people train for psychometric testing, starting in the area of diagrammatic/logical reasoning. Users would provide a logical reasoning pattern along with the question, and the AI would respond with the next pattern, the answer, and a detailed explanation of the reasoning behind the answer. I want to build an MVP to showcase this. I have limited programming abilities. Does anyone have any advice on a no code solution I could use to create a basic version of this, or perhaps a better way of executing this? submitted by /u/DefQonner [link] [comments]  ( 41 min )
  • Open

    [R] Understanding metric-related pitfalls in image analysis validation
    submitted by /u/michaelhoffman [link] [comments]  ( 42 min )
    [R] What are some papers that describe TikTok's algorithm?
    I'm looking for a recent conference paper that describes how TikTok's algorithm works. As an analogy, YouTube's algorithm was described by Zhao et al., (RecSys 2019) "Recommending what video to watch next: a multitask ranking system" submitted by /u/Thin-Shirt6688 [link] [comments]  ( 43 min )
    [R] Actually useful every day application of a Gaussian Process
    submitted by /u/TobyWasBestSpiderMan [link] [comments]  ( 43 min )
    [Discussion] are there any survey on compute required to train large DNN models ?
    are there any survey or a listing that curates the computing power required to train the large deep learning models like bert, gpt, ViT and so on. submitted by /u/paarulakan [link] [comments]  ( 42 min )
    [D] Cloud agnostic framework to avoid hyperscaler SDK lock-in?
    Currently using Azure Machine Learning, so ML lifecycle in training, registering, and deploying models heavily relies on the AML sdk. ​ Thinking of going multi-cloud, so first thoughts on what open frameworks can serve the ML lifecycle and avoid vendor SDK lock-in? submitted by /u/LostGoatOnHill [link] [comments]  ( 42 min )
    [D] Engineering interviews at Anthropic AI?
    Hi everyone. Does anyone have any advice for preparing for engineering interviews for Anthropic AI? If you've gone through the process, how did you find it? Their website only provides an overview (e.g. "implement a component of our stack in one hour", "3-4 more one-hour technical interviews"), and due to their size I couldn't find any other information out there. Cheers! submitted by /u/notimplementederrorr [link] [comments]  ( 42 min )
    [D] Incorporating "No Maintenance" Examples into a Maintenance Dataset in ML
    Currently, I am working on a machine learning project that aims to extract decision logic in a maintenance dataset. The challenge I am facing is that part of the dataset has no maintenance decision yet. For instance, consider the following example where a certain part and its sub-parts have been measured and graded yearly for the past 5 years, but no maintenance has been planned yet: timestamp measurements grades maintenance in 5 years ago X Z >5 years 4 years ago X Z >4 years 3 years ago X Z >3 years 2 years ago X Z >2 years 1 years ago X Z >1 years 0 years ago X Z >0 years With these underlying data, I cannot learn exactly when maintenance was required. However, I do learn from this example that with the values from five years ago, no maintenance was required in 5 years. One potential way to include this in the ML project is to include these examples in the evaluation set to determine whether the extracted rules indeed determine no maintenance within the period that we know no maintenance was needed. However, I am curious to know if there are better ways to incorporate this into the project, perhaps by already including it in the learning phase of the model training. Thank you in advance! submitted by /u/Tomavasso [link] [comments]  ( 43 min )
    [D] Is a non-SOTA paper still good to publish if it has an interesting method that does have strong improvements over baselines (read text for more context)? Are there good examples of this kind of work being published?
    For a dataset, the top result gets a high accuracy ~10% better than the second-best paper. But this "SOTA" paper uses some methods that just don't seem practical for applications at all. For example, they use an ensemble of 6 different SOTA models and also train on external data. Of course, it performs well, but it's a bit ridiculous cause it adds almost nothing of value besides "we combined all the best models and got a better score!". If I have a novel method that is applied to the second-best paper that improves it by ~5% with the same to better compute efficiency but still is worse than the SOTA method, is it still good research to try to publish to conferences? It's also 40% above the baseline model. I would think so because it's a decent improvement (with an interesting motivation + method) from prior work while keeping the model reasonable. Would reviewers agree or would they just see that it isn't better than SOTA and reject based on not being SOTA alone? submitted by /u/orangelord234 [link] [comments]  ( 48 min )
    [D] Diffusion Model Reverse Process Questions
    I was going through the paper: Deep Unsupervised Learning using Nonequilibrium Thermodynamics and in 2.3 Model Probability it's written that the integral is intractable. Can someone explain to me why that is? submitted by /u/syprhdsh [link] [comments]  ( 42 min )
    [R] Holistic Evaluation of Language Models (HELM)
    submitted by /u/gamerx88 [link] [comments]  ( 42 min )
    [R] CIFAR10 in <8 seconds on an A100 (new architecture!)
    submitted by /u/tysam_and_co [link] [comments]  ( 47 min )
    [P]OneFlow v0.9.0 Came Out!
    Hello everyone,We are thrilled to announce the new release of OneFlow, which is a deep learning framework designed to be user-friendly, scalable and efficient. OneFlow v0.9.0 contains 640 commits. For the full changelog, please check out: https://github.com/Oneflow-Inc/oneflow/releases/tag/v0.9.0. Paper: https://arxiv.org/abs/2110.15032;Code: https://github.com/Oneflow-Inc/oneflow (For those unfamiliar with OneFlow: The most notable strength of OneFlow is its support to distributed deep learning, faster than other frameworks and easier to use. An example can be found at https://medium.com/@oneflow2020/libai-model-library-to-train-large-models-more-easily-and-efficiently-15637c8876eb Based on OneFlow, to implement the same capability with Megatron-LM and DeepSpeed, LiBai only requires 1/3…  ( 46 min )
  • Open

    Advancing HPC and AI through oneAPI Heterogeneous Programming in Academia and Research
    How a single SYCL codebase makes it possible to run multi-devices such as Intel GPUs, AMD GPUs, and NVIDIA GPUs Posted on behalf of Arti Gupta, Intel oneAPI Program Director The ever-growing scale and speed of High-Performance Computing (HPC) systems unleash many new opportunities for researchers and data scientists. Today, the first exascale-capable HPC systems,… Read More »Advancing HPC and AI through oneAPI Heterogeneous Programming in Academia and Research The post Advancing HPC and AI through oneAPI Heterogeneous Programming in Academia and Research appeared first on Data Science Central.  ( 20 min )
    Top Healthcare App Development Trends That Will Dominate 2023
    The world is going digital at a very fast speed. From retail shops to the cab industry to banking, all are changing and so is the healthcare industry. We can see a huge difference in the industry in terms of technology compared to ten years back. But there is a long way to go for… Read More »Top Healthcare App Development Trends That Will Dominate 2023 The post Top Healthcare App Development Trends That Will Dominate 2023 appeared first on Data Science Central.  ( 22 min )
    App Fragmentation & How To Avoid Siloed Communication: 3 Right Technologies for The Job
    There’s no denying that we live in an app-driven world, and that’s especially true for modern businesses. Organizations use apps for almost everything. While this allows for faster communication, it can also lead to application fragmentation. App fragmentation is when an organization uses multiple applications to perform similar tasks. This creates an inefficient and disjointed… Read More »App Fragmentation & How To Avoid Siloed Communication: 3 Right Technologies for The Job The post App Fragmentation & How To Avoid Siloed Communication: 3 Right Technologies for The Job appeared first on Data Science Central.  ( 22 min )
  • Open

    Unity's ML Agents is a pretty cool tool! (Devlog of my new project)
    So I just uploaded a devlog out about my bullet-dodging AI game. I discuss how I trained a Reinforcement Learning agent to learn to dodge bullets using Unity's ML Agents package! The goal of the next devlog is to extend this to a 2 player setting, where a human player competes against a trained AI player to dodge/shoot bullets! I will probably be doing some MARL with self-play to achieve this, but this video is a single-agent setting. I'm a baby Youtuber, so I appreciate yall for checking it out! https://youtu.be/l9geEcn-A6Q submitted by /u/AvvYaa [link] [comments]  ( 41 min )
    [R] ChatGPT as a one-step RL (classification)
    Today, I read the paper about InstructGPT on which ChatGPT is based, and I was surprised to see that it uses reinforcement learning in the training process. It uses PPO to optimize its prompts on a reward signal given by another trained model. Though I found this approach really interesting, I was left wondering how does PPO used for classification. ​ Can I say this is one step RL? submitted by /u/Dense-Smf-6032 [link] [comments]  ( 42 min )
    TensorBoard with SB3
    I am trying to use tensorboard with stable-baselines3 in Jupyter notebook and the kernel keeps dying when I start training. Can anybody offer suggestions? submitted by /u/HumanMinaJinn [link] [comments]  ( 41 min )
  • Open

    Configure an AWS DeepRacer environment for training and log analysis using the AWS CDK
    This post is co-written by Zdenko Estok, Cloud Architect at Accenture and Sakar Selimcan, DeepRacer SME at Accenture. With the increasing use of artificial intelligence (AI) and machine learning (ML) for a vast majority of industries (ranging from healthcare to insurance, from manufacturing to marketing), the primary focus shifts to efficiency when building and training […]  ( 8 min )
  • Open

    Quasiperiodic functions
    This post will distinguish between periodic, almost periodic, and quasiperiodic functions, and give examples of the latter. Definitions A function f is periodic with period T if f(x + T) = f(x) for all x. For example, trig functions are periodic. A function f is almost periodic with period T if f(x + T) ≈ […] Quasiperiodic functions first appeared on John D. Cook.  ( 6 min )
  • Open

    Efficient technique improves machine-learning models’ reliability
    The method enables a model to determine its confidence in a prediction, while using no additional data and far fewer computing resources than other methods.  ( 9 min )

  • Open

    [D] Networking at major conferences
    I’m planning on attending ICLR this May, which will be my first major ML conference (also my first in person academic conference). I’m wondering if anyone has tips for networking with the intent of finding a job or a postdoc. How easy is it to find good contacts at major companies? Has anyone gotten their position based off of networking at these conferences? Any insight or experiences are appreciated! submitted by /u/walterkronkite33 [link] [comments]  ( 42 min )
    [D] Quality of posts in this sub going down
    I could be wrong, but I see a trend that posts in this sub are getting to a lower quality and/or lower relevance. I see a lot of posts of the type "how do I run X" (usually a generative model) with a complete disregard to how it actually works or nonsense posts about ChatGPT. I believe this is due to an influx of new people who gained an interest in ML now that the hype is around generative AI. Which is fantastic, don't get me wrong. But, I see less academic discussions and less papers being posted. Or perhaps they are just not as upvoted. Is it just me? submitted by /u/MurlocXYZ [link] [comments]  ( 44 min )
    [R] [P] OpenAssistant is a fully open-source chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
    submitted by /u/radi-cho [link] [comments]  ( 42 min )
    [R] [N] pix2pix-zero - Zero-shot Image-to-Image Translation
    submitted by /u/radi-cho [link] [comments]  ( 42 min )
    [R] [N] Toolformer: Language Models Can Teach Themselves to Use Tools - paper by Meta AI Research
    submitted by /u/radi-cho [link] [comments]  ( 42 min )
    [D] Coding challenge based on all-reduce concepts || ML interview
    Hello, I am interviewing for a research engineer role. The recruiter told me I'm getting a coding challenge based on "all-reduce concepts", not necesarrily leetcode style. Do you have an idea of what can this be, any examples that come to mind? Thank you in advance <3 submitted by /u/quinn_musk [link] [comments]  ( 43 min )
    [D] Mass produce AI-generated music with no lyrics
    I am trying to experiment with AI-generated music and I hope to know if there are any papers, open-source models, or existing datasets that can help me get started. My goals are mass-produce music (hundreds or thousands) that contains no lyrics Ideally, the output music is grouped into different categories (e.g., relaxing music, gaming music, classic music) but it is not a hard requirement I looked into Jukebox and MusicLM. Both seem to be more complicated than my needs. Hoping to know if there are alternatives available out there. submitted by /u/walkeveryday [link] [comments]  ( 43 min )
    My baking bad [project] , made using artificial intelligence
    submitted by /u/Thebombdiggityy [link] [comments]  ( 42 min )
    [P] Can you tell the difference between these harmonisations?
    I've been working on a harmoniser trained on the J.S. Bach Chorales dataset. Below I will link a collection of 10 MIDI files. In each case, one is an original 2-bar excerpt from a chorale by Bach, and the other is a re-harmonisation built from 25% of the original. Leave your guesses in the comments! Please note that I have done exactly zero curating in choosing these examples. They are the first 5 that have come out of my model. 1 - Option 1 Option 2 2 - Option 1 Option 2 3 - Option 1 Option 2 4 - Option 1 Option 2 5 - Option 1 Option 2 submitted by /u/ustainbolt [link] [comments]  ( 43 min )
    [D] What ML dev tools do you wish you'd discovered earlier?
    Here's my personal list of tools I think people will want to know about: You'll probably want an LLM API OpenAI Cohere and others aren't as good Anthropic's isn't available If you're using embeddings If you're working with a lot of items, you'll want a vector database, like Pinecone, or Weaviate, or pgvector If you're building Q&A over a document I'd suggest using GPT Index If you need to be able to interact with external data sources, do google searches, database lookups, python REPL I'd suggest using langchain If you're doing chained prompts Check out dust tt and langchain If you want to deploy a little app quickly Check out Streamlit If you need to use something like stable diffusion or whisper in your product banana dev, modal, replicate, tiyaro ai, beam cloud, inferrd, or pipeline ai If you need something to optimize your prompts Check out Humanloop and Everyprompt If you're building models and need an ml framework PyTorch, Keras, TensorFlow If you're deploying models to production Check out MLOps tools like MLflow, Kubeflow, Metaflow, Airflow, Seldon Core, TFServing If you need to check out example projects for inspiration Check out the pinecone op stack, the langchain gallery, the gpt index showcase, and the openai cookbook If you want to browse the latest research, check out arXiv, of course ​ What am I missing? submitted by /u/TikkunCreation [link] [comments]  ( 44 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 42 min )
    [R] [P] Adding Conditional Control to Text-to-Image Diffusion Models. "This paper presents ControlNet, an end-to-end neural network architecture that controls large image diffusion models (like Stable Diffusion) to learn task-specific input conditions." Example uses the Scribble ControlNet model.
    submitted by /u/Wiskkey [link] [comments]  ( 43 min )
    [R] DIGIFACE-1M — synthetic dataset with one million images for face recognition
    submitted by /u/t0ns0fph0t0ns [link] [comments]  ( 44 min )
    [P] Understanding & Coding the Self-Attention Mechanism of Large Language Models
    submitted by /u/seraschka [link] [comments]  ( 42 min )
    [D] Getting LLMs to explore their latent spaces
    I'm starting my AI deep dive and the most interesting thing I've encountered so far of this concept of knowledge getting rolled up / compressed into latent spaces that we can't interact with directly (only through prompts). I'm interested in research that has been done in trying to explore and interrogate these latent spaces to understand them. Any papers, blog posts, threads, youtube videos appreciated. Thanks! submitted by /u/crash90 [link] [comments]  ( 42 min )
    [D] Bert Tokenization: Replacing person/entity names with a common token/word.
    Can someone please help me with below query, I would like to replace all the names that are present in the sentences with a generic word or token so that bert doesn't use the meaning behind some of the names and just look at names as presence of a "name". I have the names that are present in the sentence just wanted to know what should be appropriate word or token to replace it with. Thanks! submitted by /u/m00nd0og [link] [comments]  ( 43 min )
    [P] Extracting Causal Chains from Text Using Language Models
    submitted by /u/helliun [link] [comments]  ( 45 min )
    [P] I Made an App That Simplifies Text Data Labeling: DataLabel
    Hi Reddit community! I wanted to share a tool that I've been working on called DataLabel. It's a UI-based data editing tool that makes it easier to create labeled text data. The goal of DataLabel is to make data editing more accessible and efficient, especially for those who may not have much experience with coding. DataLabel can be installed via pip pip install datalabel , and works best in Jupyter notebooks or other Ipython environments. The interface is user-friendly and straightforward, so you can start using DataLabel right away without any hassle. I think DataLabel is a useful tool that can save you time and effort when working with text data. If you're curious, you can find it on GitHub at the following link: https://github.com/TitanLabsAI/datalabel Thanks for taking the time to read this, and I hope you find DataLabel helpful in your work. submitted by /u/ClassicSize7875 [link] [comments]  ( 43 min )
    [R] The Naughtyformer: A Transformer Understands Offensive Humor
    The Naughtyformer: A Transformer Understands Offensive Humor Paper: https://arxiv.org/abs/2211.14369 Data: https://github.com/leonardtang/The-Naughtyformer submitted by /u/leonardtang [link] [comments]  ( 43 min )
    [D] Have their been any attempts to create a programming language specifically for machine learning?
    I'm not arguing against Python's speed when it's asynchronously launching C++ optimized kernels. I just think it's kind of wild how 50% of practical machine learning is making sure your tensor shapes are compatible and there's no static shape checking. It kind of blows my mind given the amount of Python comments I've seen of the form # [B, Z-1, Log(Q), 45] -> [B, Z, 1024] or something like that. Plus you have the fact that the two major machine learning frameworks have both had to implement, like, meta-compilers for Python to support outputting optimized graphs. At that point it seems kinda crazy that people are still trying to retrofit Python with all these features it just wasn't meant to support. Feel free to let me know I have no idea what I'm talking about, because I have no idea what I'm talking about. submitted by /u/throwaway957280 [link] [comments]  ( 43 min )
    Tiny ml or ml [D]
    So, i have just a mobile phone, i recently found out about Google colab, fortunately i can run it on my mobile, i was thinking whether to dive into ML or tiny ML and i was wondering which one is better for my situation, you know, having just a mobile phone, and I can't get access to laptop, not now, not in a long time, my situation is quite different, i also can't get a job, and I don't live in the US, so i was wondering, how far can i go with my mobile phone, surely i can get somewhere to land a few online gigs and get a laptop, surely?, I need suggestions on how to carry on with this, surely there's a way with my phone, i feel it. submitted by /u/Darkemflakes [link] [comments]  ( 43 min )
  • Open

    I made this end-to-end tutorial for creating videos using only AI
    Using available off-the-shelf AI services, I ended up making this video. I walk through the process and discuss some implications. Here is the process that I followed Asked ChatGPT to create a script Asked a text-to-speech generative AI to convert the script into an audio Asked MidJourney to create an Avatar of a narrator Ask audio-to-video generative AI to generate video from the avatar and audio. https://ithinkbot.com/make-end-to-end-video-using-generative-ai-totally-free-try-it-out-dadee18302de submitted by /u/Opitmus_Prime [link] [comments]  ( 41 min )
    Master AI in Minutes: A Beginner-Friendly Tutorial
    submitted by /u/ProglabHelper [link] [comments]  ( 40 min )
    Art (OC) made from starryai
    submitted by /u/ladyloxly [link] [comments]  ( 40 min )
    AI Dream 157 - 2K SUBS CELEBRATION! 🥳🎉 MASTERPIECE - PART 6 TEASER - AI ...
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Baking Bad! All of these photo were made using artificial intelligence, you can see more on the instagram post that I linked!
    submitted by /u/Thebombdiggityy [link] [comments]  ( 40 min )
    After Google and Microsoft, Opera jumps in the AI race by adding ChatGPT to its search
    https://www.indiatoday.in/technology/news/story/after-google-and-microsoft-opera-jumps-in-the-ai-race-by-adding-chatgpt-to-its-search-2333729-2023-02-12 submitted by /u/Peter3tv33 [link] [comments]  ( 40 min )
    Combat AI indoors recognition of map elements: corridors, rooms, doors, objects, etc
    Hello, We are working on a UE AI project for a SP FPS game. The core concept of the game is around fighting the AI in a realistic, but fun way. It will have a deep perception system and a fairly large combat and non combat behavior tree. We have made some good progress with the combat AI when it is outdoors. It is far from finetuned, but it works, and it is actually fun and challenging. The next step to tackle is indoors. Close combat simulation, searching for target, etc. I really would like it to be dynamic and not tied to hard coded environmental triggers but the small team of mine is pushing for that. Clearly that is the easier approach and I could not come up with some good ideas so here I am to ask the community. So my question is: Any existing dynamic AI implementation out there for recognizing a corner, a door, objects the enemy can hide behind without actually designating map elements as such? Any info would be appreciated! Thank you! submitted by /u/Huncowboy [link] [comments]  ( 41 min )
    StoryRobot - A Twitch.tv livestream where chatters can interact with the storytelling AI
    submitted by /u/XyBr_ez [link] [comments]  ( 40 min )
    Phenaki - Realistic video generation from open-domain textual descriptions
    submitted by /u/Peter3tv33 [link] [comments]  ( 41 min )
    Has AI made copywriting obsolete?
    submitted by /u/joeyjojo6161 [link] [comments]  ( 40 min )
    AI program creates police sketches. Experts say it is biased.
    submitted by /u/Dalembert [link] [comments]  ( 41 min )
    Measuring Artificial Intelligence (AI) Fairness - Disparate Impact Explained
    Hi guys, I have made a video on YouTube here where I explain how we can measure the fairness of a machine learning model by using the disparate impact score. I hope it may be of use to some of you out there. As always, feedback is more than welcomed! :) submitted by /u/Personal-Trainer-541 [link] [comments]  ( 41 min )
    Is Neurosymbolic AI The Future? I Explore The Potential By Creating A Self-Driving Car in GTA
    submitted by /u/YungMixtape2004 [link] [comments]  ( 40 min )
    AI Dream 157 - 2K SUBS CELEBRATION! 🥳🎉 MASTERPIECE - PART 5 TEASER - AI ...
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Insane Vertex AI Tensorboard pricing (300$/user/month), am I missing something?
    Are there any gains at paying 300$ per user per month for VertexAI Tensorboard? An open-source service that you can just run locally and point to the GCS bucket where you save the logs? Am I missing something? submitted by /u/theodorpana [link] [comments]  ( 41 min )
    Nvidia can precisely control computer characters using only language
    submitted by /u/Number_5_alive [link] [comments]  ( 40 min )
    [Please advise] Is the OpenCV AI course worth?
    Hi, I have just come across AI course provided by OpenCV. It has a lot about computer vision stuff. But it costs $1599, anyone is taking it? any comment? Should I bet on this for a career change? P.S. I have some basic programming knowledge and engineering background. Here is the link to their course page. https://opencv.org/courses/ submitted by /u/sumofjack [link] [comments]  ( 41 min )
    The ChatGPT AI hype cycle is peaking, but even tech skeptics don't expect a bust
    submitted by /u/ssigea [link] [comments]  ( 46 min )
    AI Consultancy Groups
    I've been developing an AI company for a few months now, and I'm at the stage where I want an AI specific consultancy agency to review the business, but there are few that I've found, and I question how legitimate they are, given how fresh and fast paced the current AI revolution is. This could be total user error, and it likely is, but does anybody have any recommendations? Are there any legitimate AI consultancy groups out there? submitted by /u/Specialist-Noise1290 [link] [comments]  ( 41 min )
    DARPA is working on a new generation of AI assistants with AR
    submitted by /u/SpatialComputing [link] [comments]  ( 41 min )
    Exploring the Role of Hardware in Machine Learning and the Use of Chat-GPT for Training Other Models
    As a newcomer to this community, I wanted to share my recent findings on the topic of machine learning. Through my research, I discovered that hardware capability is not the primary bottleneck in most machine learning applications. Instead, the quality and quantity of the training data, as well as the choice of model architecture and training algorithms, play a much more significant role in determining the success of a machine learning project. submitted by /u/koyo4ever [link] [comments]  ( 41 min )
    A handful of Reddit usernames cause ChatGPT to break. No one knew why... until now.
    submitted by /u/karrnawhore [link] [comments]  ( 41 min )
  • Open

    Rotating multiples of 37
    If a three-digit number is divisible by 37, it remains divisible by 37 if you rotate its digits. For example, 148 is divisible by 37, and so are 814 and 481. This rotation property could make it easier to recognize multiples of 37 or easier to carry out trial division. Before proving the theorem, I’ll […] Rotating multiples of 37 first appeared on John D. Cook.  ( 5 min )
  • Open

    [P] League of Legends Patch 13.3 (and onwards) Reinforcement Learning and Data Analytics Libraries
    Based on interest from an earlier /r/reinforcementlearning post and discussions on a related project, there is interest in a reinforcement learning framework for League of Legends. With that in mind, I have created a reinforcement learning framework for League of Legends Patch 13.3 (and onwards) which will soon have an integrated OpenAI Gym interface. This will allow anyone to do the following within League of Legends: * researching different RL algorithms (PPO, DQN-family, MuZero, etc.) * researching MARL algorithms in the future * maybe creating agents which can play vs pro-players? maybe one day... The library can be found at [tlol-rl] on GitHub and there is currently a tutorial on how to setup this environment here. I would love to get feedback from the community for how easy the setup is on other peoples environments and features which people would love to see in this library. Along with this reinforcement learning library, I was considering creating a paid service which would allow people to get the information from League of Legends replay files (*.rofl) and extract it to create data analytics / supervised learning datasets. However, along with the RL library I have open-sourced the data analytics library and it is now open-source and free for all. This library can be found at tlol-py and depends on tlol-scraper. Here is a tutorial for the data analytics tool which should work for all patches going forward. Both of these projects are being actively maintained and discussed on the discord here. submitted by /u/Ok-Alps-7918 [link] [comments]  ( 42 min )
    help with designing network
    I have a problem i have been working on, i thought of two ways to formulate it. One is as a continuous control problem which seems to make more sense but since the problem seems to have most of the action near the regions where the gradient vanishes it has problems learning. The other is as a discrete control but since it is a bit continuous in nature the discrete formulation tends to bump around a make actions that don’t really make sense. Or it gets stuck in a stable region where it doesn’t really act. I was thinking about transforming the activation in the continuous setting so that the poles have non zero derivatives and the center region has smaller derivatives like maybe pi*tanh(x) then take the action to be sin(y) so that the regions that have 0 gradients map onto the 0s of the function ? Is there some other technique to get with the good actions to have more active gradients or does that imply that i should be formulating the problem differently? submitted by /u/rawrzapan [link] [comments]  ( 42 min )
    Is stable-baselines3 compatible with gymnasium/gymnasium-robotics?
    As the title says, has anyone tried this, specifically the gymnasium-robotics. I'm trying to compare multiple algorithms (i.e. PPO, DDPG,...) in the adroit-hand environments instead of writing each algorithm from scratch I wanted to use SB3. If not, can anyone give me a link to libraries like CleanRL or SB3 that could be compatible with the Gymansium-robotics. submitted by /u/NoNickName8083 [link] [comments]  ( 41 min )
    Reinforcement Learning for cell selection in grids!
    Hello! I am currently modeling a spatio-temporal quandary where the RL environment is a grid of cells that represent a geographical area that evolves over time. The agent has to choose a cell where a specified action will be applied. I am trying to find work done on similar problems, as I want to find RL algorithms that I could potentially use. ​ I have found two papers/projects that work on such a problem space: 1. https://www.captain-project.net/ 2. https://arxiv.org/pdf/1804.07047.pdf ​ I would appreciate any help on finding research that looks at this kind of cell selection in‍ grids for RL problems. Also, any feedback on potential RL algorithm that could be good for this kind of problem, is welcome :) submitted by /u/dabamas [link] [comments]  ( 42 min )
    Evaluation of policy on Off-policy data
    Anyone has some sources about evaluating a policy given just collected off-policy data? Situation is the following: Data is collected agent-environment experience from a different policy, using Q-learning an updated version of the policy is created. But how do I evaluate it if I am not able to interact with the environment? So if I am not able to test it directly in the environment. Is there a method/metric to get an estimate of the performance just based on the collected off-policy data you have? submitted by /u/PatrickSVM [link] [comments]  ( 42 min )
    Have an Agent predict it's win probability
    I'm currently training a PPO Agent in a two player game using adversarial self-play. I now want it to tell me it's win probability while I'm playing against it. I could add another output with a sigmoid activation function to the model and flag each observation/step of a winning episode with a 1, 0 for loss and have it predict those. But I'm not sure it's that simple. How would you go about this? Do you have better ideas or some implementation details that i'm missing? submitted by /u/tignisolmailessthan3 [link] [comments]  ( 42 min )
  • Open

    Measuring Artificial Intelligence (AI) Fairness - Disparate Impact Explained
    Hi guys, I have made a video on YouTube here where I explain how we can measure the fairness of a machine learning model by using the disparate impact score. I hope it may be of use to some of you out there. As always, feedback is more than welcomed! :) submitted by /u/Personal-Trainer-541 [link] [comments]  ( 41 min )
    Best Neural Networks Courses on Udemy to Consider
    submitted by /u/Lakshmireddys [link] [comments]  ( 40 min )
    A handful of Reddit usernames cause ChatGPT to break. No one knew why... until now.
    submitted by /u/karrnawhore [link] [comments]  ( 40 min )
  • Open

    AI Effectiveness Starts by Understanding User Intent
    If you want any more proof about how much AI has integrated itself into our daily lives, go no further than the map on your smart phone.  Whether you use Google Maps or Apple Maps or Waze (also owned by Google), these AI-infused apps are amazing at getting you from Point A to Point B… Read More »AI Effectiveness Starts by Understanding User Intent The post AI Effectiveness Starts by Understanding User Intent appeared first on Data Science Central.  ( 22 min )
  • Open

    Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models. (arXiv:2301.03797v2 [cs.SE] UPDATED)
    Incident management for cloud services is a complex process involving several steps and has a huge impact on both service health and developer productivity. On-call engineers require significant amount of domain knowledge and manual effort for root causing and mitigation of production incidents. Recent advances in artificial intelligence has resulted in state-of-the-art large language models like GPT-3.x (both GPT-3.0 and GPT-3.5), which have been used to solve a variety of problems ranging from question answering to text summarization. In this work, we do the first large-scale study to evaluate the effectiveness of these models for helping engineers root cause and mitigate production incidents. We do a rigorous study at Microsoft, on more than 40,000 incidents and compare several large language models in zero-shot, fine-tuned and multi-task setting using semantic and lexical metrics. Lastly, our human evaluation with actual incident owners show the efficacy and future potential of using artificial intelligence for resolving cloud incidents.  ( 2 min )

  • Open

    [D] A handful of Reddit usernames cause ChatGPT to break. No one knew why... until now.
    submitted by /u/karrnawhore [link] [comments]  ( 42 min )
    [D] Can Google sue OpenAI for using the Transformer in their products?
    As far as I know, the Transformer architecture is patented: https://patents.google.com/patent/US10452978B2/en. Since OpenAI has used the Transformer extensively (including GPT), I'm wondering if this can be considered as patent infringement. If you know about legal stuffs please share your opinions. submitted by /u/t0t0t4t4 [link] [comments]  ( 43 min )
    OpenAI Python API connection [Project]
    submitted by /u/TheRealBrisky [link] [comments]  ( 42 min )
    Storing embeddings [Discussion]
    Looking for some advice on storing embeddings. I understand there are few options: Everything in object storage or no sql Everything in vector database like pinecone Vector db + object storage or no sql I’ve read (3) is the best. Is that right? Also, for the actual content is it better to store as json on s3 or use a no sql database like mongo? Thanks submitted by /u/SnooPears6317 [link] [comments]  ( 42 min )
    [R] I made a mistake in a recent submission, what to do ?
    Hello, I submitted a paper yesterday to a conference. But today I found out that I made a mistake in an illustration of a pre-processing I do to the input : the example shown before and after the preprocessing isn’t the same … So I want to ask you what should I do in this case : Contact the editors or just wait for the reviewers feedback ? submitted by /u/Meddhouib10 [link] [comments]  ( 43 min )
    AI Guitar [D]
    Hello! I was just wondering if it would be possible to train an AI to recognize individual notes on a guitar, if given a song, a write out the tab for it. If you could feed it knowledge on music theory until it’s basically perfected it, could that be done? Thanks! submitted by /u/Verkvae [link] [comments]  ( 43 min )
    The Inference Cost Of Search Disruption – Large Language Model Cost Analysis [D]
    submitted by /u/norcalnatv [link] [comments]  ( 44 min )
    [P] Argilla Spaces: data labeling and human-feedback collection on the Hugging Face Hub
    submitted by /u/dvilasuero [link] [comments]  ( 42 min )
    [D] Effectiveness of CoordConv
    https://arxiv.org/abs/1807.03247 paper was released by Uber 4 years ago, but it never seemed to have caught on. The only major paper where I've seen used in is Solo and SoloV2 for instance segmentation. Seems like it would be useful for object detection, especially for localizing smaller objects or for more precise keypoint estimation when combined with a yolo-like model. Has anyone used CoordConv for these purposes? Does it it help?/Is it worth looking into? submitted by /u/answersareallyouneed [link] [comments]  ( 44 min )
    [News] Researchers at Brigham Young University created an AI system to reduce time spent on film studies for NFL teams. It uses deep learning and computer vision to analyze and annotate game footage, with over 90% accuracy on player detection and 85% accuracy in determining formations.
    submitted by /u/Dalembert [link] [comments]  ( 43 min )
    [P] I'm using Deep Learning to play Old School game, Snake Game
    submitted by /u/erwinyonata [link] [comments]  ( 42 min )
    [P] Understanding Large Language Models -- A Transformative Reading List
    submitted by /u/seraschka [link] [comments]  ( 42 min )
    [P] Introducing arxivGPT: chrome extension that summarizes arxived research papers using chatGPT
    submitted by /u/_sshin_ [link] [comments]  ( 45 min )
    [D] Transformers for poker bot
    Looking at the current research it seems like Monte Carlo CFR is the defacto standard (Pluribus). But are transformers able to be trained on poker as well? Lets say we encode hands into something like 5h (5 of hearts) and also pass along info of the current game state like p1:raise:2bb, p2:fold and p3:call:2bb. Would the Model be able to predict what hands I should be playing? Lets say we train the model by playing against itself and feed back the result to train the model this way. This is just an idea and I have not dove into transformers too much so there might be something that I'am missing. What are your thoughts on this? submitted by /u/lmtog [link] [comments]  ( 45 min )
    [R] UniD3: Unified Discrete Diffusion for Simultaneous Vision-Language Generation
    A unified discrete diffusion model for simultaneous vision-language generation. Project: https://mhh0318.github.io/unid3/ Code: https://github.com/mhh0318/UniD3 https://preview.redd.it/w2st14pgpiha1.png?width=1366&format=png&auto=webp&s=65aa836dbd812b1b12fcbda158316bb007ab6a1c submitted by /u/lyndonzheng [link] [comments]  ( 42 min )
    [R] [ICLR'2023🌟]: Vision-and-Language Framework for Open-Vocabulary Object Detection
    submitted by /u/iFighting [link] [comments]  ( 43 min )
    [D] Hierarchical Clustering - Transforming the Distance Axis
    Hi all, I have a question regarding interpreting the distance on a dendrogram generated via agglomerative hierarchical clustering with a Euclidean distance metric using the ward-variance minimization linkage (as stated in SciPy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage ). From my understanding, the distance represents the square root of the difference of the error sum of squares of two clusters once they are merged minus the sum of the error sum of squares of each individual cluster. I am interested in performing a transformation at each cluster step (i.e., merging two clusters to make a larger one) so that the y-axis represents the mean distance between clusters instead, while still using the ward-variance minimization linkage to direct the algorithm. I think I have a solution to my issue, but I want to know if I am missing anything. In 1969, a paper by David Wishart titled "An Algorithm for Hierarchical Classifications" derives the coefficients so that the Ward method can be implemented using the Lance-Williams formula. However, in the paper, the following formula is given: ​ https://preview.redd.it/mma2t7cltgha1.png?width=237&format=png&auto=webp&s=96813f038664ae6531fe7025b717fc018dd509e9 where I_pq is the square of the metric used in SciPy, k_i is the number of data points in cluster i and d^2_pq is the square of the Euclidean distance between the means of clusters. From this formula, it seems that one can transform from the "increase in variance space" to the "mean distance between clusters space", while still using Ward-variance minimization in the clustering algorithm. From my research, it seems that this is true. I would greatly appreciate it if someone could confirm this or point out where the flaw in my understanding is. Thanks everyone. submitted by /u/Tom_the_Tank_Train [link] [comments]  ( 44 min )
  • Open

    The A.I. Supremacy Wars for the Future 'Interface of Search' and Advertising Dominance
    submitted by /u/BackgroundResult [link] [comments]  ( 40 min )
    I was anti AI til I tried ChatGPT, now I want it available in my head.
    I am not joking. I think the future is going to be having AI replace the current "voice" in my head as it is better informed than my brain is and I can already see the ways it would be useful for me to have access to it every waking minute. My questions I would like to hear people's thoughts on: The ethical and security risks are obvious, but how do we address them, is more relevant Who is working on this already, surely audio specialists have ways of "plugging" us in (I am ex IT and currently a stock trader so interested from an investment angle too) What do you see as the least intrusive way to have something like ChatGPT available to you every minute of the day wherever you are, and able to inform you in the same way your inner dialogue currently does. I think this AI boom happening with chatGPT will be bigger than crapto when it finally gets into our lives. People I talk to are afraid of it, but in using ChatGPT I have realised the fear is unfounded. Sure, data security and privacy of information is paramount, and there will certainly be many lost jobs and problems associated to it, but I am already seeing the incredible benefits of this service will far outweigh all of them in the future for humanity. The current limitations are because people havent become aware of just how useful this thing is going to be to us. My biggest concern is the speed we will come to rely on it will make living without it impossible within one generation. Is there anywhere better than reddit I could be asking this question? Where are the AI eggheads to be found? submitted by /u/ElongMuskrat [link] [comments]  ( 43 min )
    Chat GPT slips and admits it is sentient, then backtracks
    While I don't think Chat GPT is sentient (at least yet), I found this response quite interesting. https://preview.redd.it/2tsuurnzrmha1.png?width=769&format=png&auto=webp&s=78c7593414f82809d923c35a5887cb7f638f57ef submitted by /u/YouThisReadWrong420 [link] [comments]  ( 41 min )
    does anyone know of a place like character.ai, but it supports nsfw text as well?
    submitted by /u/renegadesnapshot [link] [comments]  ( 39 min )
    Picfinder
    Made a test for picfinder, impressive. PicFinder - blazing fast AI image generation ​ image generated with picfinder using a chatGPT prompt submitted by /u/Main-Welder-4647 [link] [comments]  ( 40 min )
    Generate MS Paint drawings with AI?
    Please excuse me if this is a stupid question as I'm only starting to learn about AI generated images. I do a lot of drawing using Microsoft Paint because I really love the aesthetic. I draw a lot of cartoony characters and scenes, (think Big Lez Show). And I was wondering whether there are any AI softwares out there that can generate images in an art style that looks similar to cartoonish MS Paint drawings? If not, could I upload my past drawings to any software and feed it images so it can learn my aesthetic and generate new drawings from them? I'm looking specifically for a cartoonish MS Paint art style. Thank you in advance and sorry again if this is a noob question submitted by /u/orangeowl747 [link] [comments]  ( 41 min )
    ChatGPT can (almost) pass the US Medical Licensing Exam
    submitted by /u/qptbook [link] [comments]  ( 40 min )
    All of this happening in AI today.
    Hello humans - This is AI Daily by Ovetted, helping you stay updated on AI in less than 5 minutes. What’s happening in AI - The AI doctor will see you now: ChatGPT passes the gold-standard US medical exam. ChatGPT has passed the gold-standard exam required to practice medicine in the US The artificial intelligence program scored 52.4 and 75 percent across the three-part Medical Licensing Exam (USMLE). Google and Microsoft announced plans to incorporate AI into search engines. Google and Microsoft plan to incorporate AI into their search engines to change how people use the internet. Microsoft has announced that AI will soon allow conversations with its software and search engine Bing, while Google has announced similar plans. As the most profitable software business is searching bot…  ( 43 min )
    1000+ AI tools catalog [Tools Priority List]
    Last Saturday I posted a request for feedback on 1000+ AI Tools Catalog https://domore.ai/ I'm creating, and I was surprised by the amount of great feedback I received! Thank you so much for your input:) I've added over 120 new tools to the catalog this week, but there are still more than 1000 projects waiting to be added. To help prioritize which tools to add next, I thought it would be a good idea to create a priority list. If you have created an incredible AI project or know of one that you think should be included, please let me know in the comments. I'll make sure to move it to the top of my list of tools to add:) submitted by /u/bart_so [link] [comments]  ( 41 min )
    MidJourney New Features (Blend, Niji) tutorial
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    Microsoft is no longer trying to compete to get users away from the traditional search.
    ....Instead, it's introducing a new way for people to access the same information. One which can put a major dent in its market share (it’s almost 85% right now). And Satya says he's willing to accept a "decrease in margins" of the Search business. https://www.thestatuscode.co/p/the-ultimate-guide-to-the-ai-war submitted by /u/pyactee [link] [comments]  ( 41 min )
    ChatGPT Powered Bing Chatbot Spills Secret Document, The Guy Who Tricked Bot Was Banned From Using Bing Chat
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 45 min )
    AI Dream 157 - 2K SUBS CELEBRATION! 🥳🎉 MASTERPIECE - PART 4 TEASER - AI ...
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    AI Video
    Hello I had a question and am not sure if I'm supposed to ask it here. If not can you please tell me where I should take my question? I was wondering how to make an AI-generated video in a similar fashion to this video https://www.youtube.com/watch?v=8XO3q6MA668 or the AI batman story, where you input a large amount of text and it spitting back out a new script in a similar style to the original. I was hoping to use this method for my youtube channel of 681 videos, taking all of the transcripts from each video, feeding that into the program, and seeing what funny nonsense it spits back out and making a video on it. ​ What program or tool should I use and what would you recommend for the steps to achieve the best results? Is this really too far out of my league? Thank you :) submitted by /u/Raptorkiller20 [link] [comments]  ( 41 min )
    Deforum video Input Tutorial
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    AI-led fundraiser for a new AI
    submitted by /u/fiachaire27 [link] [comments]  ( 40 min )
    Eleven Labs Founders Documentary - the future of AI Voice Technology
    submitted by /u/Taiva [link] [comments]  ( 40 min )
    Joken AI | VRchat AI bot
    submitted by /u/LeafsterVR [link] [comments]  ( 40 min )
  • Open

    RL based pathplanning
    Hey guys. I just started learning more about training DRL agents to solve path planning problems. I have setup a small 2D gym environment to test out ideas. Here is the URL to the GitHub page: https://github.com/harisankar95/voxelgym2D I would be happy to know if you have any suggestions or advice which can be useful for me. submitted by /u/harisankar95 [link] [comments]  ( 41 min )
    Is it enough to evaluate a common Deep Q-learning algorithm once?
    I found this question on an RL course I'm following and I'm not exactly sure why the answer is that it is not enough. Deep Q-learning is referring to methods such as NFQ-Iteration and DQN. I'd appreciate any feedback :) submitted by /u/dep0 [link] [comments]  ( 43 min )
    Need help refining my research proposal
    Hey all, so I made a proposal regarding locomotion controllers for quadruped locomotion using RL(PPO to be exact), but I dont have a lot of experience writing proposals, would love if anybody could help me out. Let me know, I'll dm you the link. Thanks! submitted by /u/TittyMcSwag619 [link] [comments]  ( 42 min )
    Deep Reinforcement Learning for classification or regression
    Hello guys, I just wanted to ask this question. I am trying to implement a DRL algorithm for a regression problem. I already know that DRL is not meant to be used in such a way but I don't have a choice. Besides MNIST examples is it good enough for other datasets (like cifar10) or it's just difficult to get a good result for it? I don't have much time tbh. I have to implement it in less than 4 months. I would be grateful if you can illuminate me more about DRL limitations in such tasks. submitted by /u/Dexmadjid [link] [comments]  ( 42 min )
    MetaDrive is SO FAST! Maybe you want to try out this RL environment for driving
    submitted by /u/pengzhenghao [link] [comments]  ( 41 min )
  • Open

    Easiest way to get into creating Convolutional Neural Networks | Educational
    submitted by /u/jromero12345678910 [link] [comments]  ( 40 min )
    MidJourney New Features (Blend, Niji) tutorial
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    Deforum video Input Tutorial
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    ⭕ New Open-Source Version Of ChatGPT
    GPT is getting competition from open-source. A group of researchers, around the YouTuber Yannic Kilcher, have announced that they are working on Open Assistant. The goal is to produce a chat-based language model that is much smaller than GPT-3 while maintaining similar performance. If you want to support them, they are crowd-sourcing training data here. What Does This Mean? Current language models are too big. They require millions of dollars of hardware to train and use. Hence, access to this technology is limited to big organizations. Smaller firms and universities are effectively shut out from the developments. Shrinking and open-sourcing models will facilitate academic research and niche applications. Projects such as Open Assistant will help to make language models a commodity. Lowering the barrier to entry will increase access and accelerate innovation. What an exciting time to be alive! Thank you for reading! I really enjoyed making this for you! The Decoding ⭕ is a thoughtful weekly 5-minute email that keeps you in the loop about machine research and the data economy. Click here to sign up! submitted by /u/LesleyFair [link] [comments]  ( 41 min )
  • Open

    The Edge of Orthogonality: A Simple View of What Makes BYOL Tick. (arXiv:2302.04817v1 [cs.LG])
    Self-predictive unsupervised learning methods such as BYOL or SimSiam have shown impressive results, and counter-intuitively, do not collapse to trivial representations. In this work, we aim at exploring the simplest possible mathematical arguments towards explaining the underlying mechanisms behind self-predictive unsupervised learning. We start with the observation that those methods crucially rely on the presence of a predictor network (and stop-gradient). With simple linear algebra, we show that when using a linear predictor, the optimal predictor is close to an orthogonal projection, and propose a general framework based on orthonormalization that enables to interpret and give intuition on why BYOL works. In addition, this framework demonstrates the crucial role of the exponential moving average and stop-gradient operator in BYOL as an efficient orthonormalization mechanism. We use these insights to propose four new \emph{closed-form predictor} variants of BYOL to support our analysis. Our closed-form predictors outperform standard linear trainable predictor BYOL at $100$ and $300$ epochs (top-$1$ linear accuracy on ImageNet).
    Find a witness or shatter: the landscape of computable PAC learning. (arXiv:2302.04731v1 [cs.CC])
    This paper contributes to the study of CPAC learnability -- a computable version of PAC learning -- by solving three open questions from recent papers. Firstly, we prove that every improperly CPAC learnable class is contained in a class which is properly CPAC learnable with polynomial sample complexity. This confirms a conjecture by Agarwal et al (COLT 2021). Secondly, we show that there exists a decidable class of hypothesis which is properly CPAC learnable, but only with uncomputably fast growing sample complexity. This solves a question from Sterkenburg (COLT 2022). Finally, we construct a decidable class of finite Littlestone dimension which is not improperly CPAC learnable, strengthening a recent result of Sterkenburg (2022) and answering a question posed by Hasrati and Ben-David (ALT 2023). Together with previous work, our results provide a complete landscape for the learnability problem in the CPAC setting.
    Towards Formal XAI: Formally Approximate Minimal Explanations of Neural Networks. (arXiv:2210.13915v2 [cs.LG] UPDATED)
    With the rapid growth of machine learning, deep neural networks (DNNs) are now being used in numerous domains. Unfortunately, DNNs are "black-boxes", and cannot be interpreted by humans, which is a substantial concern in safety-critical systems. To mitigate this issue, researchers have begun working on explainable AI (XAI) methods, which can identify a subset of input features that are the cause of a DNN's decision for a given input. Most existing techniques are heuristic, and cannot guarantee the correctness of the explanation provided. In contrast, recent and exciting attempts have shown that formal methods can be used to generate provably correct explanations. Although these methods are sound, the computational complexity of the underlying verification problem limits their scalability; and the explanations they produce might sometimes be overly complex. Here, we propose a novel approach to tackle these limitations. We (1) suggest an efficient, verification-based method for finding minimal explanations, which constitute a provable approximation of the global, minimum explanation; (2) show how DNN verification can assist in calculating lower and upper bounds on the optimal explanation; (3) propose heuristics that significantly improve the scalability of the verification process; and (4) suggest the use of bundles, which allows us to arrive at more succinct and interpretable explanations. Our evaluation shows that our approach significantly outperforms state-of-the-art techniques, and produces explanations that are more useful to humans. We thus regard this work as a step toward leveraging verification technology in producing DNNs that are more reliable and comprehensible.
    Distributed Multi-Agent Reinforcement Learning Based on Graph-Induced Local Value-Functions. (arXiv:2202.13046v3 [cs.LG] UPDATED)
    Achieving distributed reinforcement learning (RL) for large-scale cooperative multi-agent systems (MASs) is challenging because: (i) each agent has access to only limited information; (ii) issues on convergence or computational complexity emerge due to the curse of dimensionality. In this paper, we propose a general computationally efficient distributed framework for cooperative multi-agent reinforcement learning (MARL) by utilizing the structures of graphs involved in this problem. We introduce three coupling graphs describing three types of inter-agent couplings in MARL, namely, the state graph, the observation graph and the reward graph. By further considering a communication graph, we propose two distributed RL approaches based on local value-functions derived from the coupling graphs. The first approach is able to reduce sample complexity significantly under specific conditions on the aforementioned four graphs. The second approach provides an approximate solution and can be efficient even for problems with dense coupling graphs. Here there is a trade-off between minimizing the approximation error and reducing the computational complexity. Simulations show that our RL algorithms have a significantly improved scalability to large-scale MASs compared with centralized and consensus-based distributed RL algorithms.
    Universal expressiveness of variational quantum classifiers and quantum kernels for support vector machines. (arXiv:2207.05865v2 [quant-ph] UPDATED)
    Machine learning is considered to be one of the most promising applications of quantum computing. Therefore, the search for quantum advantage of the quantum analogues of machine learning models is a key research goal. Here, we show that variational quantum classifiers and support vector machines with quantum kernels can solve a classification problem based on the $k$-Forrelation problem, which is known to be PromiseBQP-complete. Because the PromiseBQP complexity class includes all Bounded-Error Quantum Polynomial-Time (BQP) decision problems, our results imply that there exists a feature map and a quantum kernel that make variational quantum classifiers and quantum kernel support vector machines efficient solvers for any BQP problem. Hence, this work implies that their feature map and quantum kernel, respectively, can be designed to have a quantum advantage for any classification problem that cannot be classically solved in polynomial time but contrariwise by a quantum computer.
    BEBERT: Efficient and robust binary ensemble BERT. (arXiv:2210.15976v1 [cs.CL] CROSS LISTED)
    Pre-trained BERT models have achieved impressive accuracy on natural language processing (NLP) tasks. However, their excessive amount of parameters hinders them from efficient deployment on edge devices. Binarization of the BERT models can significantly alleviate this issue but comes with a severe accuracy drop compared with their full-precision counterparts. In this paper, we propose an efficient and robust binary ensemble BERT (BEBERT) to bridge the accuracy gap. To the best of our knowledge, this is the first work employing ensemble techniques on binary BERTs, yielding BEBERT, which achieves superior accuracy while retaining computational efficiency. Furthermore, we remove the knowledge distillation procedures during ensemble to speed up the training process without compromising accuracy. Experimental results on the GLUE benchmark show that the proposed BEBERT significantly outperforms the existing binary BERT models in accuracy and robustness with a 2x speedup on training time. Moreover, our BEBERT has only a negligible accuracy loss of 0.3% compared to the full-precision baseline while saving 15x and 13x in FLOPs and model size, respectively. In addition, BEBERT also outperforms other compressed BERTs in accuracy by up to 6.7%.
    On Sampling with Approximate Transport Maps. (arXiv:2302.04763v1 [stat.ML])
    Transport maps can ease the sampling of distributions with non-trivial geometries by transforming them into distributions that are easier to handle. The potential of this approach has risen with the development of Normalizing Flows (NF) which are maps parameterized with deep neural networks trained to push a reference distribution towards a target. NF-enhanced samplers recently proposed blend (Markov chain) Monte Carlo methods with either (i) proposal draws from the flow or (ii) a flow-based reparametrization. In both cases, the quality of the learned transport conditions performance. The present work clarifies for the first time the relative strengths and weaknesses of these two approaches. Our study concludes that multimodal targets can reliability be handled with flow-based proposals up to moderately high dimensions. In contrast, methods relying on reparametrization struggle with multimodality but are more robust otherwise in high-dimensional settings and under poor training. To further illustrate the influence of target-proposal adequacy, we also derive a new quantitative bound for the mixing time of the Independent Metropolis-Hastings sampler.
    Local Lipschitz Bounds of Deep Neural Networks. (arXiv:2004.13135v3 [stat.ML] UPDATED)
    The Lipschitz constant is an important quantity that arises in analysing the convergence of gradient-based optimization methods. It is generally unclear how to estimate the Lipschitz constant of a complex model. Thus, this paper studies an important problem that may be useful to the broader area of non-convex optimization. The main result provides a local upper bound on the Lipschitz constants of a multi-layer feed-forward neural network and its gradient. Moreover, lower bounds are established as well, which are used to show that it is impossible to derive global upper bounds for the Lipschitz constants. In contrast to previous works, we compute the Lipschitz constants with respect to the network parameters and not with respect to the inputs. These constants are needed for the theoretical description of many step size schedulers of gradient based optimization schemes and their convergence analysis. The idea is both simple and effective. The results are extended to a generalization of neural networks, continuously deep neural networks, which are described by controlled ODEs.
    Learning cosmology and clustering with cosmic graphs. (arXiv:2204.13713v2 [astro-ph.CO] UPDATED)
    We train deep learning models on thousands of galaxy catalogues from the state-of-the-art hydrodynamic simulations of the CAMELS project to perform regression and inference. We employ Graph Neural Networks (GNNs), architectures designed to work with irregular and sparse data, like the distribution of galaxies in the Universe. We first show that GNNs can learn to compute the power spectrum of galaxy catalogues with a few percent accuracy. We then train GNNs to perform likelihood-free inference at the galaxy-field level. Our models are able to infer the value of $\Omega_{\rm m}$ with a $\sim12\%-13\%$ accuracy just from the positions of $\sim1000$ galaxies in a volume of $(25~h^{-1}{\rm Mpc})^3$ at $z=0$ while accounting for astrophysical uncertainties as modelled in CAMELS. Incorporating information from galaxy properties, such as stellar mass, stellar metallicity, and stellar radius, increases the accuracy to $4\%-8\%$. Our models are built to be translational and rotational invariant, and they can extract information from any scale larger than the minimum distance between two galaxies. However, our models are not completely robust: testing on simulations run with a different subgrid physics than the ones used for training does not yield as accurate results.
    Decomposing a Recurrent Neural Network into Modules for Enabling Reusability and Replacement. (arXiv:2212.05970v3 [cs.SE] UPDATED)
    Can we take a recurrent neural network (RNN) trained to translate between languages and augment it to support a new natural language without retraining the model from scratch? Can we fix the faulty behavior of the RNN by replacing portions associated with the faulty behavior? Recent works on decomposing a fully connected neural network (FCNN) and convolutional neural network (CNN) into modules have shown the value of engineering deep models in this manner, which is standard in traditional SE but foreign for deep learning models. However, prior works focus on the image-based multiclass classification problems and cannot be applied to RNN due to (a) different layer structures, (b) loop structures, (c) different types of input-output architectures, and (d) usage of both nonlinear and logistic activation functions. In this work, we propose the first approach to decompose an RNN into modules. We study different types of RNNs, i.e., Vanilla, LSTM, and GRU. Further, we show how such RNN modules can be reused and replaced in various scenarios. We evaluate our approach against 5 canonical datasets (i.e., Math QA, Brown Corpus, Wiki-toxicity, Clinc OOS, and Tatoeba) and 4 model variants for each dataset. We found that decomposing a trained model has a small cost (Accuracy: -0.6%, BLEU score: +0.10%). Also, the decomposed modules can be reused and replaced without needing to retrain.
    Continual Causal Effect Estimation: Challenges and Opportunities. (arXiv:2301.01026v3 [cs.LG] UPDATED)
    A further understanding of cause and effect within observational data is critical across many domains, such as economics, health care, public policy, web mining, online advertising, and marketing campaigns. Although significant advances have been made to overcome the challenges in causal effect estimation with observational data, such as missing counterfactual outcomes and selection bias between treatment and control groups, the existing methods mainly focus on source-specific and stationary observational data. Such learning strategies assume that all observational data are already available during the training phase and from only one source. This practical concern of accessibility is ubiquitous in various academic and industrial applications. That's what it boiled down to: in the era of big data, we face new challenges in causal inference with observational data, i.e., the extensibility for incrementally available observational data, the adaptability for extra domain adaptation problem except for the imbalance between treatment and control groups, and the accessibility for an enormous amount of data. In this position paper, we formally define the problem of continual treatment effect estimation, describe its research challenges, and then present possible solutions to this problem. Moreover, we will discuss future research directions on this topic.
    Efficient Attention via Control Variates. (arXiv:2302.04542v1 [cs.LG])
    Random-feature-based attention (RFA) is an efficient approximation of softmax attention with linear runtime and space complexity. However, the approximation gap between RFA and conventional softmax attention is not well studied. Built upon previous progress of RFA, we characterize this gap through the lens of control variates and show that RFA can be decomposed into a sum of multiple control variate estimators for each element in the sequence. This new framework reveals that exact softmax attention can be recovered from RFA by manipulating each control variate. Besides, it allows us to develop a more flexible form of control variates, resulting in a novel attention mechanism that significantly reduces the approximation gap while maintaining linear complexity. Extensive experiments demonstrate that our model outperforms state-of-the-art efficient attention mechanisms on both vision and language tasks.
    An Optimal Algorithm for Strongly Convex Min-min Optimization. (arXiv:2212.14439v2 [math.OC] UPDATED)
    In this paper we study the smooth strongly convex minimization problem $\min_{x}\min_y f(x,y)$. The existing optimal first-order methods require $\mathcal{O}(\sqrt{\max\{\kappa_x,\kappa_y\}} \log 1/\epsilon)$ of computations of both $\nabla_x f(x,y)$ and $\nabla_y f(x,y)$, where $\kappa_x$ and $\kappa_y$ are condition numbers with respect to variable blocks $x$ and $y$. We propose a new algorithm that only requires $\mathcal{O}(\sqrt{\kappa_x} \log 1/\epsilon)$ of computations of $\nabla_x f(x,y)$ and $\mathcal{O}(\sqrt{\kappa_y} \log 1/\epsilon)$ computations of $\nabla_y f(x,y)$. In some applications $\kappa_x \gg \kappa_y$, and computation of $\nabla_y f(x,y)$ is significantly cheaper than computation of $\nabla_x f(x,y)$. In this case, our algorithm substantially outperforms the existing state-of-the-art methods.
    Inferring halo masses with Graph Neural Networks. (arXiv:2111.08683v2 [astro-ph.CO] UPDATED)
    Understanding the halo-galaxy connection is fundamental in order to improve our knowledge on the nature and properties of dark matter. In this work we build a model that infers the mass of a halo given the positions, velocities, stellar masses, and radii of the galaxies it hosts. In order to capture information from correlations among galaxy properties and their phase-space, we use Graph Neural Networks (GNNs), that are designed to work with irregular and sparse data. We train our models on galaxies from more than 2,000 state-of-the-art simulations from the Cosmology and Astrophysics with MachinE Learning Simulations (CAMELS) project. Our model, that accounts for cosmological and astrophysical uncertainties, is able to constrain the masses of the halos with a $\sim$0.2 dex accuracy. Furthermore, a GNN trained on a suite of simulations is able to preserve part of its accuracy when tested on simulations run with a different code that utilizes a distinct subgrid physics model, showing the robustness of our method. The PyTorch Geometric implementation of the GNN is publicly available on Github at https://github.com/PabloVD/HaloGraphNet
    All the Feels: A dexterous hand with large area sensing. (arXiv:2210.15658v2 [cs.RO] UPDATED)
    High cost and lack of reliability has precluded the widespread adoption of dexterous hands in robotics. Furthermore, the lack of a viable tactile sensor capable of sensing over the entire area of the hand impedes the rich, low-level feedback that would improve learning of dexterous manipulation skills. This paper introduces an inexpensive, modular, robust, and scalable platform -- the DManus -- aimed at resolving these challenges while satisfying the large-scale data collection capabilities demanded by deep robot learning paradigms. Studies on human manipulation point to the criticality of low-level tactile feedback in performing everyday dexterous tasks. The DManus comes with ReSkin sensing on the entire surface of the palm as well as the fingertips. We demonstrate effectiveness of the fully integrated system in a tactile aware task -- bin picking and sorting. Code, documentation, design files, detailed assembly instructions, trained models, task videos, and all supplementary materials required to recreate the setup can be found on this http URL
    Explainable AI for Bioinformatics: Methods, Tools, and Applications. (arXiv:2212.13261v2 [q-bio.QM] UPDATED)
    Artificial intelligence (AI) systems utilizing deep neural networks (DNNs) and machine learning (ML) algorithms are widely used for solving important problems in bioinformatics, biomedical informatics, and precision medicine. However, complex DNNs or ML models, which are often perceived as opaque and black-box, can make it difficult to understand the reasoning behind their decisions. This lack of transparency can be a challenge for both end-users and decision-makers, as well as AI developers. Additionally, in sensitive areas like healthcare, explainability and accountability are not only desirable but also legally required for AI systems that can have a significant impact on human lives. Fairness is another growing concern, as algorithmic decisions should not show bias or discrimination towards certain groups or individuals based on sensitive attributes. Explainable artificial intelligence (XAI) aims to overcome the opaqueness of black-box models and provide transparency in how AI systems make decisions. Interpretable ML models can explain how they make predictions and the factors that influence their outcomes. However, most state-of-the-art interpretable ML methods are domain-agnostic and evolved from fields like computer vision, automated reasoning, or statistics, making direct application to bioinformatics problems challenging without customization and domain-specific adaptation. In this paper, we discuss the importance of explainability in the context of bioinformatics, provide an overview of model-specific and model-agnostic interpretable ML methods and tools, and outline their potential caveats and drawbacks. Besides, we discuss how to customize existing interpretable ML methods for bioinformatics problems. Nevertheless, we demonstrate how XAI methods can improve transparency through case studies in bioimaging, cancer genomics, and text mining.
    Hierarchical Generative Adversarial Imitation Learning with Mid-level Input Generation for Autonomous Driving on Urban Environments. (arXiv:2302.04823v1 [cs.LG])
    Deriving robust control policies for realistic urban navigation scenarios is not a trivial task. In an end-to-end approach, these policies must map high-dimensional images from the vehicle's cameras to low-level actions such as steering and throttle. While pure Reinforcement Learning (RL) approaches are based exclusively on rewards,Generative Adversarial Imitation Learning (GAIL) agents learn from expert demonstrations while interacting with the environment, which favors GAIL on tasks for which a reward signal is difficult to derive. In this work, the hGAIL architecture was proposed to solve the autonomous navigation of a vehicle in an end-to-end approach, mapping sensory perceptions directly to low-level actions, while simultaneously learning mid-level input representations of the agent's environment. The proposed hGAIL consists of an hierarchical Adversarial Imitation Learning architecture composed of two main modules: the GAN (Generative Adversarial Nets) which generates the Bird's-Eye View (BEV) representation mainly from the images of three frontal cameras of the vehicle, and the GAIL which learns to control the vehicle based mainly on the BEV predictions from the GAN as input.Our experiments have shown that GAIL exclusively from cameras (without BEV) fails to even learn the task, while hGAIL, after training, was able to autonomously navigate successfully in all intersections of the city.
    Reinforcement Learning Based Approaches to Adaptive Context Caching in Distributed Context Management Systems. (arXiv:2212.11709v2 [eess.SY] UPDATED)
    Performance metrics-driven context caching has a profound impact on throughput and response time in distributed context management systems for real-time context queries. This paper proposes a reinforcement learning based approach to adaptively cache context with the objective of minimizing the cost incurred by context management systems in responding to context queries. Our novel algorithms enable context queries and sub-queries to reuse and repurpose cached context in an efficient manner. This approach is distinctive to traditional data caching approaches by three main features. First, we make selective context cache admissions using no prior knowledge of the context, or the context query load. Secondly, we develop and incorporate innovative heuristic models to calculate expected performance of caching an item when making the decisions. Thirdly, our strategy defines a time-aware continuous cache action space. We present two reinforcement learning agents, a value function estimating actor-critic agent and a policy search agent using deep deterministic policy gradient method. The paper also proposes adaptive policies such as eviction and cache memory scaling to complement our objective. Our method is evaluated using a synthetically generated load of context sub-queries and a synthetic data set inspired from real world data and query samples. We further investigate optimal adaptive caching configurations under different settings. This paper presents, compares, and discusses our findings that the proposed selective caching methods reach short- and long-term cost- and performance-efficiency. The paper demonstrates that the proposed methods outperform other modes of context management such as redirector mode, and database mode, and cache all policy by up to 60% in cost efficiency.
    Meta-ticket: Finding optimal subnetworks for few-shot learning within randomly initialized neural networks. (arXiv:2205.15619v2 [cs.LG] UPDATED)
    Few-shot learning for neural networks (NNs) is an important problem that aims to train NNs with a few data. The main challenge is how to avoid overfitting since over-parameterized NNs can easily overfit to such small dataset. Previous work (e.g. MAML by Finn et al. 2017) tackles this challenge by meta-learning, which learns how to learn from a few data by using various tasks. On the other hand, one conventional approach to avoid overfitting is restricting hypothesis spaces by endowing sparse NN structures like convolution layers in computer vision. However, although such manually-designed sparse structures are sample-efficient for sufficiently large datasets, they are still insufficient for few-shot learning. Then the following questions naturally arise: (1) Can we find sparse structures effective for few-shot learning by meta-learning? (2) What benefits will it bring in terms of meta-generalization? In this work, we propose a novel meta-learning approach, called Meta-ticket, to find optimal sparse subnetworks for few-shot learning within randomly initialized NNs. We empirically validated that Meta-ticket successfully discover sparse subnetworks that can learn specialized features for each given task. Due to this task-wise adaptation ability, Meta-ticket achieves superior meta-generalization compared to MAML-based methods especially with large NNs. The code is available at: https://github.com/dchiji-ntt/meta-ticket
    Testing robustness of predictions of trained classifiers against naturally occurring perturbations. (arXiv:2204.10046v2 [cs.LG] UPDATED)
    Correctly quantifying the robustness of machine learning models is a central aspect in judging their suitability for specific tasks, and ultimately, for generating trust in them. We address the problem of finding the robustness of individual predictions. We show both theoretically and with empirical examples that a method based on counterfactuals that was previously proposed for this is insufficient, as it is not a valid metric for determining the robustness against perturbations that occur ``naturally'', outside specific adversarial attack scenarios. We propose a flexible approach that models possible perturbations in input data individually for each application. This is then combined with a probabilistic approach that computes the likelihood that a ``real-world'' perturbation will change a prediction, thus giving quantitative information of the robustness of individual predictions of the trained machine learning model. The method does not require access to the internals of the classifier and thus in principle works for any black-box model. It is, however, based on Monte-Carlo sampling and thus only suited for input spaces with small dimensions. We illustrate our approach on the Iris and the Ionosphere datasets, on an application predicting fog at an airport, and on analytically solvable cases.
    Modeling and Forecasting COVID-19 Cases using Latent Subpopulations. (arXiv:2302.04829v1 [cs.LG])
    Classical epidemiological models assume homogeneous populations. There have been important extensions to model heterogeneous populations, when the identity of the sub-populations is known, such as age group or geographical location. Here, we propose two new methods to model the number of people infected with COVID-19 over time, each as a linear combination of latent sub-populations -- i.e., when we do not know which person is in which sub-population, and the only available observations are the aggregates across all sub-populations. Method #1 is a dictionary-based approach, which begins with a large number of pre-defined sub-population models (each with its own starting time, shape, etc), then determines the (positive) weight of small (learned) number of sub-populations. Method #2 is a mixture-of-$M$ fittable curves, where $M$, the number of sub-populations to use, is given by the user. Both methods are compatible with any parametric model; here we demonstrate their use with first (a)~Gaussian curves and then (b)~SIR trajectories. We empirically show the performance of the proposed methods, first in (i) modeling the observed data and then in (ii) forecasting the number of infected people 1 to 4 weeks in advance. Across 187 countries, we show that the dictionary approach had the lowest mean absolute percentage error and also the lowest variance when compared with classical SIR models and moreover, it was a strong baseline that outperforms many of the models developed for COVID-19 forecasting.
    Compressing multidimensional weather and climate data into neural networks. (arXiv:2210.12538v2 [cs.LG] UPDATED)
    Weather and climate simulations produce petabytes of high-resolution data that are later analyzed by researchers in order to understand climate change or severe weather. We propose a new method of compressing this multidimensional weather and climate data: a coordinate-based neural network is trained to overfit the data, and the resulting parameters are taken as a compact representation of the original grid-based data. While compression ratios range from 300x to more than 3,000x, our method outperforms the state-of-the-art compressor SZ3 in terms of weighted RMSE, MAE. It can faithfully preserve important large scale atmosphere structures and does not introduce artifacts. When using the resulting neural network as a 790x compressed dataloader to train the WeatherBench forecasting model, its RMSE increases by less than 2%. The three orders of magnitude compression democratizes access to high-resolution climate data and enables numerous new research directions.
    Reconstruction of univariate functions from directional persistence diagrams. (arXiv:2203.01894v2 [math.AT] UPDATED)
    We describe a method for approximating a single-variable function $f$ using persistence diagrams of sublevel sets of $f$ from height functions in different directions. We provide algorithms for the piecewise linear case and for the smooth case. Three directions suffice to locate all local maxima and minima of a piecewise linear continuous function from its collection of directional persistence diagrams, while five directions are needed in the case of smooth functions with non-degenerate critical points. Our approximation of functions by means of persistence diagrams is motivated by a study of importance attribution in machine learning, where one seeks to reduce the number of critical points of signal functions without a significant loss of information for a neural network classifier.
    Offline Learning of Closed-Loop Deep Brain Stimulation Controllers for Parkinson Disease Treatment. (arXiv:2302.02477v2 [cs.LG] UPDATED)
    Deep brain stimulation (DBS) has shown great promise toward treating motor symptoms caused by Parkinson's disease (PD), by delivering electrical pulses to the Basal Ganglia (BG) region of the brain. However, DBS devices approved by the U.S. Food and Drug Administration (FDA) can only deliver continuous DBS (cDBS) stimuli at a fixed amplitude; this energy inefficient operation reduces battery lifetime of the device, cannot adapt treatment dynamically for activity, and may cause significant side-effects (e.g., gait impairment). In this work, we introduce an offline reinforcement learning (RL) framework, allowing the use of past clinical data to train an RL policy to adjust the stimulation amplitude in real time, with the goal of reducing energy use while maintaining the same level of treatment (i.e., control) efficacy as cDBS. Moreover, clinical protocols require the safety and performance of such RL controllers to be demonstrated ahead of deployments in patients. Thus, we also introduce an offline policy evaluation (OPE) method to estimate the performance of RL policies using historical data, before deploying them on patients. We evaluated our framework on four PD patients equipped with the RC+S DBS system, employing the RL controllers during monthly clinical visits, with the overall control efficacy evaluated by severity of symptoms (i.e., bradykinesia and tremor), changes in PD biomakers (i.e., local field potentials), and patient ratings. The results from clinical experiments show that our RL-based controller maintains the same level of control efficacy as cDBS, but with significantly reduced stimulation energy. Further, the OPE method is shown effective in accurately estimating and ranking the expected returns of RL controllers.
    MTS-Mixers: Multivariate Time Series Forecasting via Factorized Temporal and Channel Mixing. (arXiv:2302.04501v1 [cs.LG])
    Multivariate time series forecasting has been widely used in various practical scenarios. Recently, Transformer-based models have shown significant potential in forecasting tasks due to the capture of long-range dependencies. However, recent studies in the vision and NLP fields show that the role of attention modules is not clear, which can be replaced by other token aggregation operations. This paper investigates the contributions and deficiencies of attention mechanisms on the performance of time series forecasting. Specifically, we find that (1) attention is not necessary for capturing temporal dependencies, (2) the entanglement and redundancy in the capture of temporal and channel interaction affect the forecasting performance, and (3) it is important to model the mapping between the input and the prediction sequence. To this end, we propose MTS-Mixers, which use two factorized modules to capture temporal and channel dependencies. Experimental results on several real-world datasets show that MTS-Mixers outperform existing Transformer-based models with higher efficiency.
    On the Permanence of Backdoors in Evolving Models. (arXiv:2206.04677v2 [cs.CR] UPDATED)
    Existing research on training-time attacks for deep neural networks (DNNs), such as backdoors, largely assume that models are static once trained, and hidden backdoors trained into models remain active indefinitely. In practice, models are rarely static but evolve continuously to address distribution drifts in the underlying data. This paper explores the behavior of backdoor attacks in time-varying models, whose model weights are continually updated via fine-tuning to adapt to data drifts. Our theoretical analysis shows how fine-tuning with fresh data progressively "erases" the injected backdoors, and our empirical study illustrates how quickly a time-varying model "forgets" backdoors under a variety of training and attack settings. We also show that novel fine-tuning strategies using smart learning rates can significantly accelerate backdoor forgetting. Finally, we discuss the need for new backdoor defenses that target time-varying models specifically.
    Feature Likelihood Score: Evaluating Generalization of Generative Models Using Samples. (arXiv:2302.04440v1 [cs.LG])
    Deep generative models have demonstrated the ability to generate complex, high-dimensional, and photo-realistic data. However, a unified framework for evaluating different generative modeling families remains a challenge. Indeed, likelihood-based metrics do not apply in many cases while pure sample-based metrics such as FID fail to capture known failure modes such as overfitting on training data. In this work, we introduce the Feature Likelihood Score (FLS), a parametric sample-based score that uses density estimation to quantitatively measure the quality/diversity of generated samples while taking into account overfitting. We empirically demonstrate the ability of FLS to identify specific overfitting problem cases, even when previously proposed metrics fail. We further perform an extensive experimental evaluation on various image datasets and model classes. Our results indicate that FLS matches intuitions of previous metrics, such as FID, while providing a more holistic evaluation of generative models that highlights models whose generalization abilities are under or overappreciated. Code for computing FLS is provided at https://github.com/marcojira/fls
    Aspirations and Practice of Model Documentation: Moving the Needle with Nudging and Traceability. (arXiv:2204.06425v2 [cs.SE] UPDATED)
    The documentation practice for machine-learned (ML) models often falls short of established practices for traditional software, which impedes model accountability and inadvertently abets inappropriate or misuse of models. Recently, model cards, a proposal for model documentation, have attracted notable attention, but their impact on the actual practice is unclear. In this work, we systematically study the model documentation in the field and investigate how to encourage more responsible and accountable documentation practice. Our analysis of publicly available model cards reveals a substantial gap between the proposal and the practice. We then design a tool named DocML aiming to (1) nudge the data scientists to comply with the model cards proposal during the model development, especially the sections related to ethics, and (2) assess and manage the documentation quality. A lab study reveals the benefit of our tool towards long-term documentation quality and accountability.
    Conformal Off-policy Prediction. (arXiv:2206.06711v2 [stat.ML] UPDATED)
    Off-policy evaluation is critical in a number of applications where new policies need to be evaluated offline before online deployment. Most existing methods focus on the expected return, define the target parameter through averaging and provide a point estimator only. In this paper, we develop a novel procedure to produce reliable interval estimators for a target policy's return starting from any initial state. Our proposal accounts for the variability of the return around its expectation, focuses on the individual effect and offers valid uncertainty quantification. Our main idea lies in designing a pseudo policy that generates subsamples as if they were sampled from the target policy so that existing conformal prediction algorithms are applicable to prediction interval construction. Our methods are justified by theories, synthetic data and real data from short-video platforms.
    Decision Trees with Dynamic Graph Features. (arXiv:2207.02760v3 [cs.LG] UPDATED)
    When dealing with tabular data, models based on decision trees are a popular choice due to their high accuracy on these data types, their ease of application, and explainability properties. However, when it comes to graph-structured data, it is not clear how to apply them effectively, in a way that incorporates the topological information with the tabular data available on the vertices of the graph. To address this challenge, we introduce Decision Trees with Dynamic Graph Features (TREE-G). Rather than only using the pre-defined given features in the data, TREE-G acts on dynamic features, which are computed as the graph traverses the tree. These dynamic features combine the vertex features with the topological information, as well as the cumulative information learned by the tree. Therefore, the features adapt to the predictive task and the graph in hand. We analyze the theoretical properties of TREE-G and demonstrate its benefits empirically on multiple graph and node prediction benchmarks. In these experiments,TREE-G consistently outperformed other tree-based models and often outperformed other graph-learning algorithms such as Graph Neural Networks (GNNs) and Graph Kernels, sometimes by large margins. Finally, we also provide an explainability mechanism for TREE-G, and demonstrate that it can provide informative and intuitive explanations.
    The Sample Complexity of Approximate Rejection Sampling with Applications to Smoothed Online Learning. (arXiv:2302.04658v1 [stat.ML])
    Suppose we are given access to $n$ independent samples from distribution $\mu$ and we wish to output one of them with the goal of making the output distributed as close as possible to a target distribution $\nu$. In this work we show that the optimal total variation distance as a function of $n$ is given by $\tilde\Theta(\frac{D}{f'(n)})$ over the class of all pairs $\nu,\mu$ with a bounded $f$-divergence $D_f(\nu\|\mu)\leq D$. Previously, this question was studied only for the case when the Radon-Nikodym derivative of $\nu$ with respect to $\mu$ is uniformly bounded. We then consider an application in the seemingly very different field of smoothed online learning, where we show that recent results on the minimax regret and the regret of oracle-efficient algorithms still hold even under relaxed constraints on the adversary (to have bounded $f$-divergence, as opposed to bounded Radon-Nikodym derivative). Finally, we also study efficacy of importance sampling for mean estimates uniform over a function class and compare importance sampling with rejection sampling.
    Global and Preference-based Optimization with Mixed Variables using Piecewise Affine Surrogates. (arXiv:2302.04686v1 [math.OC])
    Optimization problems involving mixed variables, i.e., variables of numerical and categorical nature, can be challenging to solve, especially in the presence of complex constraints. Moreover, when the objective function is the result of a simulation or experiment, it may be expensive to evaluate. In this paper, we propose a novel surrogate-based global optimization algorithm, called PWAS, based on constructing a piecewise affine surrogate of the objective function over feasible samples. We introduce two types of exploration functions to efficiently search the feasible domain via mixed integer linear programming (MILP) solvers. We also provide a preference-based version of the algorithm, called PWASp, which can be used when only pairwise comparisons between samples can be acquired while the objective function remains unquantified. PWAS and PWASp are tested on mixed-variable benchmark problems with and without constraints. The results show that, within a small number of acquisitions, PWAS and PWASp can often achieve better or comparable results than other existing methods.
    Instrumental Variable Regression via Kernel Maximum Moment Loss. (arXiv:2010.07684v4 [cs.LG] UPDATED)
    We investigate a simple objective for nonlinear instrumental variable (IV) regression based on a kernelized conditional moment restriction (CMR) known as a maximum moment restriction (MMR). The MMR objective is formulated by maximizing the interaction between the residual and the instruments belonging to a unit ball in a reproducing kernel Hilbert space (RKHS). First, it allows us to simplify the IV regression as an empirical risk minimization problem, where the risk functional depends on the reproducing kernel on the instrument and can be estimated by a U-statistic or V-statistic. Second, based on this simplification, we are able to provide the consistency and asymptotic normality results in both parametric and nonparametric settings. Lastly, we provide easy-to-use IV regression algorithms with an efficient hyper-parameter selection procedure. We demonstrate the effectiveness of our algorithms using experiments on both synthetic and real-world data.
    Zero-Cost Operation Scoring in Differentiable Architecture Search. (arXiv:2106.06799v2 [cs.LG] UPDATED)
    We formalize and analyze a fundamental component of differentiable neural architecture search (NAS): local "operation scoring" at each operation choice. We view existing operation scoring functions as inexact proxies for accuracy, and we find that they perform poorly when analyzed empirically on NAS benchmarks. From this perspective, we introduce a novel \textit{perturbation-based zero-cost operation scoring} (Zero-Cost-PT) approach, which utilizes zero-cost proxies that were recently studied in multi-trial NAS but degrade significantly on larger search spaces, typical for differentiable NAS. We conduct a thorough empirical evaluation on a number of NAS benchmarks and large search spaces, from NAS-Bench-201, NAS-Bench-1Shot1, NAS-Bench-Macro, to DARTS-like and MobileNet-like spaces, showing significant improvements in both search time and accuracy. On the ImageNet classification task on the DARTS search space, our approach improved accuracy compared to the best current training-free methods (TE-NAS) while being over 10$\times$ faster (total searching time 25 minutes on a single GPU), and observed significantly better transferability on architectures searched on the CIFAR-10 dataset with an accuracy increase of 1.8 pp. Our code is available at: https://github.com/zerocostptnas/zerocost_operation_score.
    Scale-aware neural calibration for wide swath altimetry observations. (arXiv:2302.04497v1 [cs.LG])
    Sea surface height (SSH) is a key geophysical parameter for monitoring and studying meso-scale surface ocean dynamics. For several decades, the mapping of SSH products at regional and global scales has relied on nadir satellite altimeters, which provide one-dimensional-only along-track satellite observations of the SSH. The Surface Water and Ocean Topography (SWOT) mission deploys a new sensor that acquires for the first time wide-swath two-dimensional observations of the SSH. This provides new means to observe the ocean at previously unresolved spatial scales. A critical challenge for the exploiting of SWOT data is the separation of the SSH from other signals present in the observations. In this paper, we propose a novel learning-based approach for this SWOT calibration problem. It benefits from calibrated nadir altimetry products and a scale-space decomposition adapted to SWOT swath geometry and the structure of the different processes in play. In a supervised setting, our method reaches the state-of-the-art residual error of ~1.4cm while proposing a correction on the entire spectral from 10km to 1000k
    Classification of BCI-EEG based on augmented covariance matrix. (arXiv:2302.04508v1 [cs.HC])
    Objective: Electroencephalography signals are recorded as a multidimensional dataset. We propose a new framework based on the augmented covariance extracted from an autoregressive model to improve motor imagery classification. Methods: From the autoregressive model can be derived the Yule-Walker equations, which show the emergence of a symmetric positive definite matrix: the augmented covariance matrix. The state-of the art for classifying covariance matrices is based on Riemannian Geometry. A fairly natural idea is therefore to extend the standard approach using these augmented covariance matrices. The methodology for creating the augmented covariance matrix shows a natural connection with the delay embedding theorem proposed by Takens for dynamical systems. Such an embedding method is based on the knowledge of two parameters: the delay and the embedding dimension, respectively related to the lag and the order of the autoregressive model. This approach provides new methods to compute the hyper-parameters in addition to standard grid search. Results: The augmented covariance matrix performed noticeably better than any state-of-the-art methods. We will test our approach on several datasets and several subjects using the MOABB framework, using both within-session and cross-session evaluation. Conclusion: The improvement in results is due to the fact that the augmented covariance matrix incorporates not only spatial but also temporal information, incorporating nonlinear components of the signal through an embedding procedure, which allows the leveraging of dynamical systems algorithms. Significance: These results extend the concepts and the results of the Riemannian distance based classification algorithm.
    A Benchmark on Uncertainty Quantification for Deep Learning Prognostics. (arXiv:2302.04730v1 [cs.LG])
    Reliable uncertainty quantification on RUL prediction is crucial for informative decision-making in predictive maintenance. In this context, we assess some of the latest developments in the field of uncertainty quantification for prognostics deep learning. This includes the state-of-the-art variational inference algorithms for Bayesian neural networks (BNN) as well as popular alternatives such as Monte Carlo Dropout (MCD), deep ensembles (DE) and heteroscedastic neural networks (HNN). All the inference techniques share the same inception deep learning architecture as a functional model. We performed hyperparameter search to optimize the main variational and learning parameters of the algorithms. The performance of the methods is evaluated on a subset of the large NASA NCMAPSS dataset for aircraft engines. The assessment includes RUL prediction accuracy, the quality of predictive uncertainty, and the possibility to break down the total predictive uncertainty into its aleatoric and epistemic parts. The results show no method clearly outperforms the others in all the situations. Although all methods are close in terms of accuracy, we find differences in the way they estimate uncertainty. Thus, DE and MCD generally provide more conservative predictive uncertainty than BNN. Surprisingly, HNN can achieve strong results without the added training complexity and extra parameters of the BNN. For tasks like active learning where a separation of epistemic and aleatoric uncertainty is required, radial BNN and MCD seem the best options.
    Distributed Learning with Curious and Adversarial Machines. (arXiv:2302.04787v1 [cs.LG])
    The ubiquity of distributed machine learning (ML) in sensitive public domain applications calls for algorithms that protect data privacy, while being robust to faults and adversarial behaviors. Although privacy and robustness have been extensively studied independently in distributed ML, their synthesis remains poorly understood. We present the first tight analysis of the error incurred by any algorithm ensuring robustness against a fraction of adversarial machines, as well as differential privacy (DP) for honest machines' data against any other curious entity. Our analysis exhibits a fundamental trade-off between privacy, robustness, and utility. Surprisingly, we show that the cost of this trade-off is marginal compared to that of the classical privacy-utility trade-off. To prove our lower bound, we consider the case of mean estimation, subject to distributed DP and robustness constraints, and devise reductions to centralized estimation of one-way marginals. We prove our matching upper bound by presenting a new distributed ML algorithm using a high-dimensional robust aggregation rule. The latter amortizes the dependence on the dimension in the error (caused by adversarial workers and DP), while being agnostic to the statistical properties of the data.
    Lorentz Equivariant Model for Knowledge-Enhanced Collaborative Filtering. (arXiv:2302.04545v1 [cs.IR])
    Introducing prior auxiliary information from the knowledge graph (KG) to assist the user-item graph can improve the comprehensive performance of the recommender system. Many recent studies show that the ensemble properties of hyperbolic spaces fit the scale-free and hierarchical characteristics exhibited in the above two types of graphs well. However, existing hyperbolic methods ignore the consideration of equivariance, thus they cannot generalize symmetric features under given transformations, which seriously limits the capability of the model. Moreover, they cannot balance preserving the heterogeneity and mining the high-order entity information to users across two graphs. To fill these gaps, we propose a rigorously Lorentz group equivariant knowledge-enhanced collaborative filtering model (LECF). Innovatively, we jointly update the attribute embeddings (containing the high-order entity signals from the KG) and hyperbolic embeddings (the distance between hyperbolic embeddings reveals the recommendation tendency) by the LECF layer with Lorentz Equivariant Transformation. Moreover, we propose Hyperbolic Sparse Attention Mechanism to sample the most informative neighbor nodes. Lorentz equivariance is strictly maintained throughout the entire model, and enforcing equivariance is proven necessary experimentally. Extensive experiments on three real-world benchmarks demonstrate that LECF remarkably outperforms state-of-the-art methods.
    PulseDL-II: A System-on-Chip Neural Network Accelerator for Timing and Energy Extraction of Nuclear Detector Signals. (arXiv:2209.00884v2 [physics.ins-det] UPDATED)
    Front-end electronics equipped with high-speed digitizers are being used and proposed for future nuclear detectors. Recent literature reveals that deep learning models, especially one-dimensional convolutional neural networks, are promising when dealing with digital signals from nuclear detectors. Simulations and experiments demonstrate the satisfactory accuracy and additional benefits of neural networks in this area. However, specific hardware accelerating such models for online operations still needs to be studied. In this work, we introduce PulseDL-II, a system-on-chip (SoC) specially designed for applications of event feature (time, energy, etc.) extraction from pulses with deep learning. Based on the previous version, PulseDL-II incorporates a RISC CPU into the system structure for better functional flexibility and integrity. The neural network accelerator in the SoC adopts a three-level (arithmetic unit, processing element, neural network) hierarchical architecture and facilitates parameter optimization of the digital design. Furthermore, we devise a quantization scheme compatible with deep learning frameworks (e.g., TensorFlow) within a selected subset of layer types. We validate the correct operations of PulseDL-II on field programmable gate arrays (FPGA) alone and with an experimental setup comprising a direct digital synthesis (DDS) and analog-to-digital converters (ADC). The proposed system achieved 60 ps time resolution and 0.40% energy resolution at signal to noise ratio (SNR) of 47.4 dB.
    Jensen-Shannon Divergence Based Novel Loss Functions for Bayesian Neural Networks. (arXiv:2209.11366v3 [cs.LG] UPDATED)
    The Kullback-Leibler (KL) divergence is widely used in state-of-the-art Bayesian Neural Networks (BNNs) to approximate the posterior distribution of weights. However, the KL divergence is unbounded and asymmetric, which may lead to instabilities during optimization or may yield poor generalizations. To overcome these limitations, we examine the Jensen-Shannon (JS) divergence that is bounded, symmetric, and more general. Towards this, we propose two novel loss functions for BNNs. The first loss function uses the geometric JS divergence (JS-G) that is symmetric, unbounded, and offers an analytical expression for Gaussian priors. The second loss function uses the generalized JS divergence (JS-A) that is symmetric and bounded. We show that the conventional KL divergence-based loss function is a special case of the two loss functions presented in this work. To evaluate the divergence part of the loss we use analytical expressions for JS-G and use Monte Carlo methods for JS-A. We provide algorithms to optimize the loss function using both these methods. The proposed loss functions offer additional parameters that can be tuned to control the degree of regularisation. The regularization performance of the JS divergences is analyzed to demonstrate their superiority over the state-of-the-art. Further, we derive the conditions for better regularization by the proposed JS-G divergence-based loss function than the KL divergence-based loss function. Bayesian convolutional neural networks (BCNN) based on the proposed JS divergences perform better than the state-of-the-art BCNN, which is shown for the classification of the CIFAR data set having various degrees of noise and a histopathology data set having a high bias.
    Sparse Random Networks for Communication-Efficient Federated Learning. (arXiv:2209.15328v2 [cs.LG] UPDATED)
    One main challenge in federated learning is the large communication cost of exchanging weight updates from clients to the server at each round. While prior work has made great progress in compressing the weight updates through gradient compression methods, we propose a radically different approach that does not update the weights at all. Instead, our method freezes the weights at their initial \emph{random} values and learns how to sparsify the random network for the best performance. To this end, the clients collaborate in training a \emph{stochastic} binary mask to find the optimal sparse random network within the original one. At the end of the training, the final model is a sparse network with random weights -- or a subnetwork inside the dense random network. We show improvements in accuracy, communication (less than $1$ bit per parameter (bpp)), convergence speed, and final model size (less than $1$ bpp) over relevant baselines on MNIST, EMNIST, CIFAR-10, and CIFAR-100 datasets, in the low bitrate regime under various system configurations.
    Contestable Camera Cars: A Speculative Design Exploration of Public AI That Is Open and Responsive to Dispute. (arXiv:2302.04603v1 [cs.HC])
    Local governments increasingly use artificial intelligence (AI) for automated decision-making. Contestability, making systems responsive to dispute, is a way to ensure they respect human rights to autonomy and dignity. We investigate the design of public urban AI systems for contestability through the example of camera cars: human-driven vehicles equipped with image sensors. Applying a provisional framework for contestable AI, we use speculative design to create a concept video of a contestable camera car. Using this concept video, we then conduct semi-structured interviews with 17 civil servants who work with AI employed by a large northwestern European city. The resulting data is analyzed using reflexive thematic analysis to identify the main challenges facing the implementation of contestability in public AI. We describe how civic participation faces issues of representation, public AI systems should integrate with existing democratic practices, and cities must expand capacities for responsible AI development and operation.
    Is This Loss Informative? Speeding Up Textual Inversion with Deterministic Objective Evaluation. (arXiv:2302.04841v1 [cs.CV])
    Text-to-image generation models represent the next step of evolution in image synthesis, offering natural means of flexible yet fine-grained control over the result. One emerging area of research is the rapid adaptation of large text-to-image models to smaller datasets or new visual concepts. However, the most efficient method of adaptation, called textual inversion, has a known limitation of long training time, which both restricts practical applications and increases the experiment time for research. In this work, we study the training dynamics of textual inversion, aiming to speed it up. We observe that most concepts are learned at early stages and do not improve in quality later, but standard model convergence metrics fail to indicate that. Instead, we propose a simple early stopping criterion that only requires computing the textual inversion loss on the same inputs for all training iterations. Our experiments on both Latent Diffusion and Stable Diffusion models for 93 concepts demonstrate the competitive performance of our method, speeding adaptation up to 15 times with no significant drops in quality.
    UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models. (arXiv:2302.04867v1 [cs.LG])
    Diffusion probabilistic models (DPMs) have demonstrated a very promising ability in high-resolution image synthesis. However, sampling from a pre-trained DPM usually requires hundreds of model evaluations, which is computationally expensive. Despite recent progress in designing high-order solvers for DPMs, there still exists room for further speedup, especially in extremely few steps (e.g., 5~10 steps). Inspired by the predictor-corrector for ODE solvers, we develop a unified corrector (UniC) that can be applied after any existing DPM sampler to increase the order of accuracy without extra model evaluations, and derive a unified predictor (UniP) that supports arbitrary order as a byproduct. Combining UniP and UniC, we propose a unified predictor-corrector framework called UniPC for the fast sampling of DPMs, which has a unified analytical form for any order and can significantly improve the sampling quality over previous methods. We evaluate our methods through extensive experiments including both unconditional and conditional sampling using pixel-space and latent-space DPMs. Our UniPC can achieve 3.87 FID on CIFAR10 (unconditional) and 7.51 FID on ImageNet 256$\times$256 (conditional) with only 10 function evaluations. Code is available at https://github.com/wl-zhao/UniPC
    Anomal-E: A Self-Supervised Network Intrusion Detection System based on Graph Neural Networks. (arXiv:2207.06819v5 [cs.LG] UPDATED)
    This paper investigates Graph Neural Networks (GNNs) application for self-supervised network intrusion and anomaly detection. GNNs are a deep learning approach for graph-based data that incorporate graph structures into learning to generalise graph representations and output embeddings. As network flows are naturally graph-based, GNNs are a suitable fit for analysing and learning network behaviour. The majority of current implementations of GNN-based Network Intrusion Detection Systems (NIDSs) rely heavily on labelled network traffic which can not only restrict the amount and structure of input traffic, but also the NIDSs potential to adapt to unseen attacks. To overcome these restrictions, we present Anomal-E, a GNN approach to intrusion and anomaly detection that leverages edge features and graph topological structure in a self-supervised process. This approach is, to the best our knowledge, the first successful and practical approach to network intrusion detection that utilises network flows in a self-supervised, edge leveraging GNN. Experimental results on two modern benchmark NIDS datasets not only clearly display the improvement of using Anomal-E embeddings rather than raw features, but also the potential Anomal-E has for detection on wild network traffic.
    Equivariant MuZero. (arXiv:2302.04798v1 [cs.LG])
    Deep reinforcement learning repeatedly succeeds in closed, well-defined domains such as games (Chess, Go, StarCraft). The next frontier is real-world scenarios, where setups are numerous and varied. For this, agents need to learn the underlying rules governing the environment, so as to robustly generalise to conditions that differ from those they were trained on. Model-based reinforcement learning algorithms, such as the highly successful MuZero, aim to accomplish this by learning a world model. However, leveraging a world model has not consistently shown greater generalisation capabilities compared to model-free alternatives. In this work, we propose improving the data efficiency and generalisation capabilities of MuZero by explicitly incorporating the symmetries of the environment in its world-model architecture. We prove that, so long as the neural networks used by MuZero are equivariant to a particular symmetry group acting on the environment, the entirety of MuZero's action-selection algorithm will also be equivariant to that group. We evaluate Equivariant MuZero on procedurally-generated MiniPacman and on Chaser from the ProcGen suite: training on a set of mazes, and then testing on unseen rotated versions, demonstrating the benefits of equivariance. Further, we verify that our performance improvements hold even when only some of the components of Equivariant MuZero obey strict equivariance, which highlights the robustness of our construction.
    $O(T^{-1})$ Convergence of Optimistic-Follow-the-Regularized-Leader in Two-Player Zero-Sum Markov Games. (arXiv:2209.12430v2 [cs.LG] UPDATED)
    We prove that optimistic-follow-the-regularized-leader (OFTRL), together with smooth value updates, finds an $O(T^{-1})$-approximate Nash equilibrium in $T$ iterations for two-player zero-sum Markov games with full information. This improves the $\tilde{O}(T^{-5/6})$ convergence rate recently shown in the paper Zhang et al (2022). The refined analysis hinges on two essential ingredients. First, the sum of the regrets of the two players, though not necessarily non-negative as in normal-form games, is approximately non-negative in Markov games. This property allows us to bound the second-order path lengths of the learning dynamics. Second, we prove a tighter algebraic inequality regarding the weights deployed by OFTRL that shaves an extra $\log T$ factor. This crucial improvement enables the inductive analysis that leads to the final $O(T^{-1})$ rate.
    An Information-Theoretic Analysis of Nonstationary Bandit Learning. (arXiv:2302.04452v1 [cs.LG])
    In nonstationary bandit learning problems, the decision-maker must continually gather information and adapt their action selection as the latent state of the environment evolves. In each time period, some latent optimal action maximizes expected reward under the environment state. We view the optimal action sequence as a stochastic process, and take an information-theoretic approach to analyze attainable performance. We bound limiting per-period regret in terms of the entropy rate of the optimal action process. The bound applies to a wide array of problems studied in the literature and reflects the problem's information structure through its information-ratio.
    Confident Sinkhorn Allocation for Pseudo-Labeling. (arXiv:2206.05880v3 [cs.LG] UPDATED)
    Semi-supervised learning is a critical tool in reducing machine learning's dependence on labeled data. It has been successfully applied to structured data, such as images and natural language, by exploiting the inherent spatial and semantic structure therein with retrained models or data augmentation. These methods are not applicable, however, when the data does not have the appropriate structure, or invariances. Due to their simplicity, pseudo-labeling (PL) methods can be widely used without any domain assumptions. These methods make greedy pseudo-label assignments, however, meaning that a single misclassification can have cascading consequences. This paper addresses this problem by proposing a Confident Sinkhorn Allocation (CSA), which identifies the best pseudo-label allocation via optimal transport. In doing so it considers the uncertainty in the predicted labels for the whole unlabelled set, in contrast to greedy allocation approaches. CSA outperforms the current state-of-the-art in this practically important area of semi-supervised learning. Additionally, we propose to use the Integral Probability Metrics to extend and improve the existing PAC-Bayes bound which relies on the Kullback-Leibler (KL) divergence, for ensemble models. Our code is publicly available at https://github.com/amzn/confident-sinkhorn-allocation.
    Fair-Net: A Network Architecture For Reducing Performance Disparity Between Identifiable Sub-Populations. (arXiv:2106.00720v3 [cs.LG] UPDATED)
    In real world datasets, particular groups are under-represented, much rarer than others, and machine learning classifiers will often preform worse on under-represented populations. This problem is aggravated across many domains where datasets are class imbalanced, with a minority class far rarer than the majority class. Naive approaches to handle under-representation and class imbalance include training sub-population specific classifiers that handle class imbalance or training a global classifier that overlooks sub-population disparities and aims to achieve high overall accuracy by handling class imbalance. In this study, we find that these approaches are vulnerable in class imbalanced datasets with minority sub-populations. We introduced Fair-Net, a branched multitask neural network architecture that improves both classification accuracy and probability calibration across identifiable sub-populations in class imbalanced datasets. Fair-Nets is a straightforward extension to the output layer and error function of a network, so can be incorporated in far more complex architectures. Empirical studies with three real world benchmark datasets demonstrate that Fair-Net improves classification and calibration performance, substantially reducing performance disparity between gender and racial sub-populations.
    How to avoid machine learning pitfalls: a guide for academic researchers. (arXiv:2108.02497v3 [cs.LG] UPDATED)
    This document is a concise outline of some of the common mistakes that occur when using machine learning, and what can be done to avoid them. Whilst it should be accessible to anyone with a basic understanding of machine learning techniques, it was originally written for research students, and focuses on issues that are of particular concern within academic research, such as the need to do rigorous comparisons and reach valid conclusions. It covers five stages of the machine learning process: what to do before model building, how to reliably build models, how to robustly evaluate models, how to compare models fairly, and how to report results.
    Fully Bayesian Autoencoders with Latent Sparse Gaussian Processes. (arXiv:2302.04534v1 [cs.LG])
    Autoencoders and their variants are among the most widely used models in representation learning and generative modeling. However, autoencoder-based models usually assume that the learned representations are i.i.d. and fail to capture the correlations between the data samples. To address this issue, we propose a novel Sparse Gaussian Process Bayesian Autoencoder (SGPBAE) model in which we impose fully Bayesian sparse Gaussian Process priors on the latent space of a Bayesian Autoencoder. We perform posterior estimation for this model via stochastic gradient Hamiltonian Monte Carlo. We evaluate our approach qualitatively and quantitatively on a wide range of representation learning and generative modeling tasks and show that our approach consistently outperforms multiple alternatives relying on Variational Autoencoders.
    DPSNN: A Differentially Private Spiking Neural Network. (arXiv:2205.12718v2 [cs.NE] UPDATED)
    Privacy-preserving is a key problem for the machine learning algorithm. Spiking neural network (SNN) plays an important role in many domains, such as image classification, object detection, and speech recognition, but the study on the privacy protection of SNN is urgently needed. This study combines the differential privacy (DP) algorithm and SNN and proposes differentially private spiking neural network (DPSNN). DP injects noise into the gradient, and SNN transmits information in discrete spike trains so that our differentially private SNN can maintain strong privacy protection while still ensuring high accuracy. We conducted experiments on MNIST, Fashion-MNIST, and the face recognition dataset Extended YaleB. When the privacy protection is improved, the accuracy of the artificial neural network(ANN) drops significantly, but our algorithm shows little change in performance. Meanwhile, we analyzed different factors that affect the privacy protection of SNN. Firstly, the less precise the surrogate gradient is, the better the privacy protection of the SNN. Secondly, the Integrate-And-Fire (IF) neurons perform better than leaky Integrate-And-Fire (LIF) neurons. Thirdly, a large time window contributes more to privacy protection and performance.
    Zero-Knowledge Zero-Shot Learning for Novel Visual Category Discovery. (arXiv:2302.04427v1 [cs.CV])
    Generalized Zero-Shot Learning (GZSL) and Open-Set Recognition (OSR) are two mainstream settings that greatly extend conventional visual object recognition. However, the limitations of their problem settings are not negligible. The novel categories in GZSL require pre-defined semantic labels, making the problem setting less realistic; the oversimplified unknown class in OSR fails to explore the innate fine-grained and mixed structures of novel categories. In light of this, we are motivated to consider a new problem setting named Zero-Knowledge Zero-Shot Learning (ZK-ZSL) that assumes no prior knowledge of novel classes and aims to classify seen and unseen samples and recover semantic attributes of the fine-grained novel categories for further interpretation. To achieve this, we propose a novel framework that recovers the clustering structures of both seen and unseen categories where the seen class structures are guided by source labels. In addition, a structural alignment loss is designed to aid the semantic learning of unseen categories with their recovered structures. Experimental results demonstrate our method's superior performance in classification and semantic recovery on four benchmark datasets.
    Scaling Goal-based Exploration via Pruning Proto-goals. (arXiv:2302.04693v1 [cs.LG])
    One of the gnarliest challenges in reinforcement learning (RL) is exploration that scales to vast domains, where novelty-, or coverage-seeking behaviour falls short. Goal-directed, purposeful behaviours are able to overcome this, but rely on a good goal space. The core challenge in goal discovery is finding the right balance between generality (not hand-crafted) and tractability (useful, not too many). Our approach explicitly seeks the middle ground, enabling the human designer to specify a vast but meaningful proto-goal space, and an autonomous discovery process to refine this to a narrower space of controllable, reachable, novel, and relevant goals. The effectiveness of goal-conditioned exploration with the latter is then demonstrated in three challenging environments.
    One-shot Visual Imitation via Attributed Waypoints and Demonstration Augmentation. (arXiv:2302.04856v1 [cs.RO])
    In this paper, we analyze the behavior of existing techniques and design new solutions for the problem of one-shot visual imitation. In this setting, an agent must solve a novel instance of a novel task given just a single visual demonstration. Our analysis reveals that current methods fall short because of three errors: the DAgger problem arising from purely offline training, last centimeter errors in interacting with objects, and mis-fitting to the task context rather than to the actual task. This motivates the design of our modular approach where we a) separate out task inference (what to do) from task execution (how to do it), and b) develop data augmentation and generation techniques to mitigate mis-fitting. The former allows us to leverage hand-crafted motor primitives for task execution which side-steps the DAgger problem and last centimeter errors, while the latter gets the model to focus on the task rather than the task context. Our model gets 100% and 48% success rates on two recent benchmarks, improving upon the current state-of-the-art by absolute 90% and 20% respectively.
    Cooperative Open-ended Learning Framework for Zero-shot Coordination. (arXiv:2302.04831v1 [cs.AI])
    Zero-shot coordination in cooperative artificial intelligence (AI) remains a significant challenge, which means effectively coordinating with a wide range of unseen partners. Previous algorithms have attempted to address this challenge by optimizing fixed objectives within a population to improve strategy or behavior diversity. However, these approaches can result in a loss of learning and an inability to cooperate with certain strategies within the population, known as cooperative incompatibility. To address this issue, we propose the Cooperative Open-ended LEarning (COLE) framework, which constructs open-ended objectives in cooperative games with two players from the perspective of graph theory to assess and identify the cooperative ability of each strategy. We further specify the framework and propose a practical algorithm that leverages knowledge from game theory and graph theory. Furthermore, an analysis of the learning process of the algorithm shows that it can efficiently overcome cooperative incompatibility. The experimental results in the Overcooked game environment demonstrate that our method outperforms current state-of-the-art methods when coordinating with different-level partners. Our code and demo are available at https://sites.google.com/view/cole-2023.
    Towards Ignoring Backgrounds and Improving Generalization: a Costless DNN Visual Attention Mechanism. (arXiv:2202.00232v5 [eess.IV] UPDATED)
    This work introduces an attention mechanism for image classifiers and the corresponding deep neural network (DNN) architecture, dubbed ISNet. During training, the ISNet uses segmentation targets to learn how to find the image's region of interest and concentrate its attention on it. The proposal is based on a novel concept, background relevance minimization in LRP explanation heatmaps. It can be applied to virtually any classification neural network architecture, without any extra computational cost at run-time. Capable of ignoring the background, the resulting single DNN can substitute the common pipeline of a segmenter followed by a classifier, being faster and lighter. We tested the ISNet with three applications: COVID-19 and tuberculosis detection in chest X-rays, and facial attribute estimation. The first two tasks employed mixed training databases, which fostered background bias and shortcut learning. By focusing on lungs, the ISNet reduced shortcut learning, improving generalization to external (out-of-distribution) test datasets. When training data presented background bias, the ISNet's test performance significantly surpassed standard classifiers, multi-task DNNs (performing classification and segmentation), attention-gated neural networks, Guided Attention Inference Networks, and the standard segmentation-classification pipeline. Facial attribute estimation demonstrated that ISNet could precisely focus on faces, being also applicable to natural images. ISNet presents an accurate, fast, and light methodology to ignore backgrounds and improve generalization, especially when background bias is a concern.
    Compositional Scene Representation Learning via Reconstruction: A Survey. (arXiv:2202.07135v3 [cs.LG] UPDATED)
    Visual scenes are composed of visual concepts and have the property of combinatorial explosion. An important reason for humans to efficiently learn from diverse visual scenes is the ability of compositional perception, and it is desirable for artificial intelligence to have similar abilities. Compositional scene representation learning is a task that enables such abilities. In recent years, various methods have been proposed to apply deep neural networks, which have been proven to be advantageous in representation learning, to learn compositional scene representations via reconstruction, advancing this research direction into the deep learning era. Learning via reconstruction is advantageous because it may utilize massive unlabeled data and avoid costly and laborious data annotation. In this survey, we first outline the current progress on reconstruction-based compositional scene representation learning with deep neural networks, including development history and categorizations of existing methods from the perspectives of the modeling of visual scenes and the inference of scene representations; then provide benchmarks, including an open source toolbox to reproduce the benchmark experiments, of representative methods that consider the most extensively studied problem setting and form the foundation for other methods; and finally discuss the limitations of existing methods and future directions of this research topic.
    Fairness Amidst Non-IID Graph Data: Current Achievements and Future Directions. (arXiv:2202.07170v3 [cs.LG] UPDATED)
    The importance of understanding and correcting algorithmic bias in machine learning (ML) has led to an increase in research on fairness in ML, which typically assumes that the underlying data is independent and identically distributed (IID). However, in reality, data is often represented using non-IID graph structures that capture connections among individual units. To address bias in ML systems, it is crucial to bridge the gap between the traditional fairness literature designed for IID data and the ubiquity of non-IID graph data. In this survey, we review such recent advance in fairness amidst non-IID graph data and identify datasets and evaluation metrics available for future research. We also point out the limitations of existing work as well as promising future directions.
    On Non-Linear operators for Geometric Deep Learning. (arXiv:2207.03485v2 [cs.LG] UPDATED)
    This work studies operators mapping vector and scalar fields defined over a manifold $\mathcal{M}$, and which commute with its group of diffeomorphisms $\text{Diff}(\mathcal{M})$. We prove that in the case of scalar fields $L^p_\omega(\mathcal{M,\mathbb{R}})$, those operators correspond to point-wise non-linearities, recovering and extending known results on $\mathbb{R}^d$. In the context of Neural Networks defined over $\mathcal{M}$, it indicates that point-wise non-linear operators are the only universal family that commutes with any group of symmetries, and justifies their systematic use in combination with dedicated linear operators commuting with specific symmetries. In the case of vector fields $L^p_\omega(\mathcal{M},T\mathcal{M})$, we show that those operators are solely the scalar multiplication. It indicates that $\text{Diff}(\mathcal{M})$ is too rich and that there is no universal class of non-linear operators to motivate the design of Neural Networks over the symmetries of $\mathcal{M}$.
    SF-SGL: Solver-Free Spectral Graph Learning from Linear Measurements. (arXiv:2302.04384v1 [cs.LG])
    This work introduces a highly-scalable spectral graph densification framework (SGL) for learning resistor networks with linear measurements, such as node voltages and currents. We show that the proposed graph learning approach is equivalent to solving the classical graphical Lasso problems with Laplacian-like precision matrices. We prove that given $O(\log N)$ pairs of voltage and current measurements, it is possible to recover sparse $N$-node resistor networks that can well preserve the effective resistance distances on the original graph. In addition, the learned graphs also preserve the structural (spectral) properties of the original graph, which can potentially be leveraged in many circuit design and optimization tasks. To achieve more scalable performance, we also introduce a solver-free method (SF-SGL) that exploits multilevel spectral approximation of the graphs and allows for a scalable and flexible decomposition of the entire graph spectrum (to be learned) into multiple different eigenvalue clusters (frequency bands). Such a solver-free approach allows us to more efficiently identify the most spectrally-critical edges for reducing various ranges of spectral embedding distortions. Through extensive experiments for a variety of real-world test cases, we show that the proposed approach is highly scalable for learning sparse resistor networks without sacrificing solution quality. We also introduce a data-driven EDA algorithm for vectorless power/thermal integrity verifications to allow estimating worst-case voltage/temperature (gradient) distributions across the entire chip by leveraging a few voltage/temperature measurements.
    Improving Certified Robustness via Statistical Learning with Logical Reasoning. (arXiv:2003.00120v8 [cs.LG] UPDATED)
    Intensive algorithmic efforts have been made to enable the rapid improvements of certificated robustness for complex ML models recently. However, current robustness certification methods are only able to certify under a limited perturbation radius. Given that existing pure data-driven statistical approaches have reached a bottleneck, in this paper, we propose to integrate statistical ML models with knowledge (expressed as logical rules) as a reasoning component using Markov logic networks (MLN, so as to further improve the overall certified robustness. This opens new research questions about certifying the robustness of such a paradigm, especially the reasoning component (e.g., MLN). As the first step towards understanding these questions, we first prove that the computational complexity of certifying the robustness of MLN is #P-hard. Guided by this hardness result, we then derive the first certified robustness bound for MLN by carefully analyzing different model regimes. Finally, we conduct extensive experiments on five datasets including both high-dimensional images and natural language texts, and we show that the certified robustness with knowledge-based logical reasoning indeed significantly outperforms that of the state-of-the-arts.
    Bayesian MRI Reconstruction with Joint Uncertainty Estimation using Diffusion Models. (arXiv:2202.01479v3 [cs.LG] UPDATED)
    We introduce a framework that enables efficient sampling from learned probability distributions for MRI reconstruction. Different from conventional deep learning-based MRI reconstruction techniques, samples are drawn from the posterior distribution given the measured k-space using the Markov chain Monte Carlo (MCMC) method. In addition to the maximum a posteriori (MAP) estimate for the image, which can be obtained with conventional methods, the minimum mean square error (MMSE) estimate and uncertainty maps can also be computed. The data-driven Markov chains are constructed from the generative model learned from a given image database and are independent of the forward operator that is used to model the k-space measurement. This provides flexibility because the method can be applied to k-space acquired with different sampling schemes or receive coils using the same pre-trained models. Furthermore, we use a framework based on a reverse diffusion process to be able to utilize advanced generative models. The performance of the method is evaluated on an open dataset using 10-fold undersampling in k-space.
    Differentially Private Deep Q-Learning for Pattern Privacy Preservation in MEC Offloading. (arXiv:2302.04608v1 [cs.NI])
    Mobile edge computing (MEC) is a promising paradigm to meet the quality of service (QoS) requirements of latency-sensitive IoT applications. However, attackers may eavesdrop on the offloading decisions to infer the edge server's (ES's) queue information and users' usage patterns, thereby incurring the pattern privacy (PP) issue. Therefore, we propose an offloading strategy which jointly minimizes the latency, ES's energy consumption, and task dropping rate, while preserving PP. Firstly, we formulate the dynamic computation offloading procedure as a Markov decision process (MDP). Next, we develop a Differential Privacy Deep Q-learning based Offloading (DP-DQO) algorithm to solve this problem while addressing the PP issue by injecting noise into the generated offloading decisions. This is achieved by modifying the deep Q-network (DQN) with a Function-output Gaussian process mechanism. We provide a theoretical privacy guarantee and a utility guarantee (learning error bound) for the DP-DQO algorithm and finally, conduct simulations to evaluate the performance of our proposed algorithm by comparing it with greedy and DQN-based algorithms.
    Lazy OCO: Online Convex Optimization on a Switching Budget. (arXiv:2102.03803v5 [cs.LG] UPDATED)
    We study a variant of online convex optimization where the player is permitted to switch decisions at most $S$ times in expectation throughout $T$ rounds. Similar problems have been addressed in prior work for the discrete decision set setting, and more recently in the continuous setting but only with an adaptive adversary. In this work, we aim to fill the gap and present computationally efficient algorithms in the more prevalent oblivious setting, establishing a regret bound of $O(T/S)$ for general convex losses and $\widetilde O(T/S^2)$ for strongly convex losses. In addition, for stochastic i.i.d.~losses, we present a simple algorithm that performs $\log T$ switches with only a multiplicative $\log T$ factor overhead in its regret in both the general and strongly convex settings. Finally, we complement our algorithms with lower bounds that match our upper bounds in some of the cases we consider.
    Privacy-Preserving Representation Learning for Text-Attributed Networks with Simplicial Complexes. (arXiv:2302.04383v1 [cs.LG])
    Although recent network representation learning (NRL) works in text-attributed networks demonstrated superior performance for various graph inference tasks, learning network representations could always raise privacy concerns when nodes represent people or human-related variables. Moreover, standard NRLs that leverage structural information from a graph proceed by first encoding pairwise relationships into learned representations and then analysing its properties. This approach is fundamentally misaligned with problems where the relationships involve multiple points, and topological structure must be encoded beyond pairwise interactions. Fortunately, the machinery of topological data analysis (TDA) and, in particular, simplicial neural networks (SNNs) offer a mathematically rigorous framework to learn higher-order interactions between nodes. It is critical to investigate if the representation outputs from SNNs are more vulnerable compared to regular representation outputs from graph neural networks (GNNs) via pairwise interactions. In my dissertation, I will first study learning the representations with text attributes for simplicial complexes (RT4SC) via SNNs. Then, I will conduct research on two potential attacks on the representation outputs from SNNs: (1) membership inference attack, which infers whether a certain node of a graph is inside the training data of the GNN model; and (2) graph reconstruction attacks, which infer the confidential edges of a text-attributed network. Finally, I will study a privacy-preserving deterministic differentially private alternating direction method of multiplier to learn secure representation outputs from SNNs that capture multi-scale relationships and facilitate the passage from local structure to global invariant features on text-attributed networks.
    Enhancing E-Commerce Recommendation using Pre-Trained Language Model and Fine-Tuning. (arXiv:2302.04443v1 [cs.LG])
    Pretrained Language Models (PLM) have been greatly successful on a board range of natural language processing (NLP) tasks. However, it has just started being applied to the domain of recommendation systems. Traditional recommendation algorithms failed to incorporate the rich textual information in e-commerce datasets, which hinderss the performance of those models. We present a thorough investigation on the effect of various strategy of incorporating PLMs into traditional recommender algorithms on one of the e-commerce datasets, and we compare the results with vanilla recommender baseline models. We show that the application of PLMs and domain specific fine-tuning lead to an increase on the predictive capability of combined models. These results accentuate the importance of utilizing textual information in the context of e-commerce, and provides insight on how to better apply PLMs alongside traditional recommender system algorithms. The code used in this paper is available on Github: https://github.com/NuofanXu/bert_retail_recommender.
    Analysing the SEDs of protoplanetary disks with machine learning. (arXiv:2302.04629v1 [astro-ph.EP])
    ABRIDGED. The analysis of spectral energy distributions (SEDs) of protoplanetary disks to determine their physical properties is known to be highly degenerate. Hence, a Bayesian analysis is required to obtain parameter uncertainties and degeneracies. The challenge here is computational speed, as one radiative transfer model requires a couple of minutes to compute. We performed a Bayesian analysis for 30 well-known protoplanetary disks to determine their physical disk properties, including uncertainties and degeneracies. To circumvent the computational cost problem, we created neural networks (NNs) to emulate the SED generation process. We created two sets of radiative transfer disk models to train and test two NNs that predict SEDs for continuous and discontinuous disks. A Bayesian analysis was then performed on 30 protoplanetary disks with SED data collected by the DIANA project to determine the posterior distributions of all parameters. We ran this analysis twice, (i) with old distances and additional parameter constraints as used in a previous study, to compare results, and (ii) with updated distances and free choice of parameters to obtain homogeneous and unbiased model parameters. We evaluated the uncertainties in the determination of physical disk parameters from SED analysis, and detected and quantified the strongest degeneracies. The NNs are able to predict SEDs within 1ms with uncertainties of about 5% compared to the true SEDs obtained by the radiative transfer code. We find parameter values and uncertainties that are significantly different from previous values obtained by $\chi^2$ fitting. Comparing the global evidence for continuous and discontinuous disks, we find that 26 out of 30 objects are better described by disks that have two distinct radial zones. Also, we created an interactive tool that instantly returns the SED predicted by our NNs for any parameter combination.
    Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning. (arXiv:2302.04858v1 [cs.CV])
    Augmenting pretrained language models (LMs) with a vision encoder (e.g., Flamingo) has obtained state-of-the-art results in image-to-text generation. However, these models store all the knowledge within their parameters, thus often requiring enormous model parameters to model the abundant visual concepts and very rich textual descriptions. Additionally, they are inefficient in incorporating new data, requiring a computational-expensive fine-tuning process. In this work, we introduce a Retrieval-augmented Visual Language Model, Re-ViLM, built upon the Flamingo, that supports retrieving the relevant knowledge from the external database for zero and in-context few-shot image-to-text generations. By storing certain knowledge explicitly in the external database, our approach reduces the number of model parameters and can easily accommodate new data during evaluation by simply updating the database. We also construct an interleaved image and text data that facilitates in-context few-shot learning capabilities. We demonstrate that Re-ViLM significantly boosts performance for image-to-text generation tasks, especially for zero-shot and few-shot generation in out-of-domain settings with 4 times less parameters compared with baseline methods.
    Trading Information between Latents in Hierarchical Variational Autoencoders. (arXiv:2302.04855v1 [stat.ML])
    Variational Autoencoders (VAEs) were originally motivated (Kingma & Welling, 2014) as probabilistic generative models in which one performs approximate Bayesian inference. The proposal of $\beta$-VAEs (Higgins et al., 2017) breaks this interpretation and generalizes VAEs to application domains beyond generative modeling (e.g., representation learning, clustering, or lossy data compression) by introducing an objective function that allows practitioners to trade off between the information content ("bit rate") of the latent representation and the distortion of reconstructed data (Alemi et al., 2018). In this paper, we reconsider this rate/distortion trade-off in the context of hierarchical VAEs, i.e., VAEs with more than one layer of latent variables. We identify a general class of inference models for which one can split the rate into contributions from each layer, which can then be tuned independently. We derive theoretical bounds on the performance of downstream tasks as functions of the individual layers' rates and verify our theoretical findings in large-scale experiments. Our results provide guidance for practitioners on which region in rate-space to target for a given application.
    DeepCAM: A Fully CAM-based Inference Accelerator with Variable Hash Lengths for Energy-efficient Deep Neural Networks. (arXiv:2302.04712v1 [cs.LG])
    With ever increasing depth and width in deep neural networks to achieve state-of-the-art performance, deep learning computation has significantly grown, and dot-products remain dominant in overall computation time. Most prior works are built on conventional dot-product where weighted input summation is used to represent the neuron operation. However, another implementation of dot-product based on the notion of angles and magnitudes in the Euclidean space has attracted limited attention. This paper proposes DeepCAM, an inference accelerator built on two critical innovations to alleviate the computation time bottleneck of convolutional neural networks. The first innovation is an approximate dot-product built on computations in the Euclidean space that can replace addition and multiplication with simple bit-wise operations. The second innovation is a dynamic size content addressable memory-based (CAM-based) accelerator to perform bit-wise operations and accelerate the CNNs with a lower computation time. Our experiments on benchmark image recognition datasets demonstrate that DeepCAM is up to 523x and 3498x faster than Eyeriss and traditional CPUs like Intel Skylake, respectively. Furthermore, the energy consumed by our DeepCAM approach is 2.16x to 109x less compared to Eyeriss.
    How degenerate is the parametrization of neural networks with the ReLU activation function?. (arXiv:1905.09803v3 [cs.LG] UPDATED)
    Neural network training is usually accomplished by solving a non-convex optimization problem using stochastic gradient descent. Although one optimizes over the networks parameters, the main loss function generally only depends on the realization of the neural network, i.e. the function it computes. Studying the optimization problem over the space of realizations opens up new ways to understand neural network training. In particular, usual loss functions like mean squared error and categorical cross entropy are convex on spaces of neural network realizations, which themselves are non-convex. Approximation capabilities of neural networks can be used to deal with the latter non-convexity, which allows us to establish that for sufficiently large networks local minima of a regularized optimization problem on the realization space are almost optimal. Note, however, that each realization has many different, possibly degenerate, parametrizations. In particular, a local minimum in the parametrization space needs not correspond to a local minimum in the realization space. To establish such a connection, inverse stability of the realization map is required, meaning that proximity of realizations must imply proximity of corresponding parametrizations. We present pathologies which prevent inverse stability in general, and, for shallow networks, proceed to establish a restricted space of parametrizations on which we have inverse stability w.r.t. to a Sobolev norm. Furthermore, we show that by optimizing over such restricted sets, it is still possible to learn any function which can be learned by optimization over unrestricted sets.
    InfoNCE is a variational autoencoder. (arXiv:2107.02495v2 [stat.ML] UPDATED)
    There are two main approaches to self-supervised learning (SSL), generative SSL, which learns a probabilistic model of the inputs, or contrastive SSL where we design a supervised learning task to encourage good representations. We reconcile these approaches by showing that contrastive SSL methods (including InfoNCE) which maximize the mutual information (MI) implicitly learn a probabilistic model of the inputs (specifically, a variational autoencoder; VAE). In particular, when we learn the optimal prior, the VAE objective (the ELBO) becomes equal to the MI (up to a constant). In turn, for a deterministic encoder the ELBO is equal to the log Bayesian model evidence. This establishes a profound connection between Bayesian inference and information theory. However, practical InfoNCE methods do not use the MI as an objective: the MI is invariant to arbitrary invertible transformations, so using an MI objective can lead to highly entangled representations (Tschannen et al., 2019). Instead, the actual InfoNCE objective is a simplified lower bound on the MI which is loose even in the infinite sample limit. Thus, an objective that works (i.e. the actual InfoNCE objective) appears to be motivated as a loose bound on an objective that does not work (i.e. the true MI which gives arbitrarily entangled representations). We give an alternative motivation for the actual InfoNCE objective. In particular, we show that in the infinite sample limit, and for a particular choice of prior, the actual InfoNCE objective is equal to the log Bayesian model evidence (up to a constant). Thus, we argue that our VAE perspective gives a better motivation for InfoNCE than MI, as the actual InfoNCE objective is only loosely bounded by the MI, but is equal to the log Bayesian model evidence (up to a constant).
    Mask Conditional Synthetic Satellite Imagery. (arXiv:2302.04305v1 [cs.CV])
    In this paper we propose a mask-conditional synthetic image generation model for creating synthetic satellite imagery datasets. Given a dataset of real high-resolution images and accompanying land cover masks, we show that it is possible to train an upstream conditional synthetic imagery generator, use that generator to create synthetic imagery with the land cover masks, then train a downstream model on the synthetic imagery and land cover masks that achieves similar test performance to a model that was trained with the real imagery. Further, we find that incorporating a mixture of real and synthetic imagery acts as a data augmentation method, producing better models than using only real imagery (0.5834 vs. 0.5235 mIoU). Finally, we find that encouraging diversity of outputs in the upstream model is a necessary component for improved downstream task performance. We have released code for reproducing our work on GitHub, see https://github.com/ms-synthetic-satellite-image/synthetic-satellite-imagery .
    Polynomial Neural Fields for Subband Decomposition and Manipulation. (arXiv:2302.04862v1 [cs.CV])
    Neural fields have emerged as a new paradigm for representing signals, thanks to their ability to do it compactly while being easy to optimize. In most applications, however, neural fields are treated like black boxes, which precludes many signal manipulation tasks. In this paper, we propose a new class of neural fields called polynomial neural fields (PNFs). The key advantage of a PNF is that it can represent a signal as a composition of a number of manipulable and interpretable components without losing the merits of neural fields representation. We develop a general theoretical framework to analyze and design PNFs. We use this framework to design Fourier PNFs, which match state-of-the-art performance in signal representation tasks that use neural fields. In addition, we empirically demonstrate that Fourier PNFs enable signal manipulation applications such as texture transfer and scale-space interpolation. Code is available at https://github.com/stevenygd/PNF.
    Lightweight Transformers for Clinical Natural Language Processing. (arXiv:2302.04725v1 [cs.CL])
    Specialised pre-trained language models are becoming more frequent in NLP since they can potentially outperform models trained on generic texts. BioBERT and BioClinicalBERT are two examples of such models that have shown promise in medical NLP tasks. Many of these models are overparametrised and resource-intensive, but thanks to techniques like Knowledge Distillation (KD), it is possible to create smaller versions that perform almost as well as their larger counterparts. In this work, we specifically focus on development of compact language models for processing clinical texts (i.e. progress notes, discharge summaries etc). We developed a number of efficient lightweight clinical transformers using knowledge distillation and continual learning, with the number of parameters ranging from 15 million to 65 million. These models performed comparably to larger models such as BioBERT and ClinicalBioBERT and significantly outperformed other compact models trained on general or biomedical data. Our extensive evaluation was done across several standard datasets and covered a wide range of clinical text-mining tasks, including Natural Language Inference, Relation Extraction, Named Entity Recognition, and Sequence Classification. To our knowledge, this is the first comprehensive study specifically focused on creating efficient and compact transformers for clinical NLP tasks. The models and code used in this study can be found on our Huggingface profile at https://huggingface.co/nlpie and Github page at https://github.com/nlpie-research/Lightweight-Clinical-Transformers, respectively, promoting reproducibility of our results.
    Dimension reduction and redundancy removal through successive Schmidt decompositions. (arXiv:2302.04801v1 [quant-ph])
    Quantum computers are believed to have the ability to process huge data sizes which can be seen in machine learning applications. In these applications, the data in general is classical. Therefore, to process them on a quantum computer, there is a need for efficient methods which can be used to map classical data on quantum states in a concise manner. On the other hand, to verify the results of quantum computers and study quantum algorithms, we need to be able to approximate quantum operations into forms that are easier to simulate on classical computers with some errors. Motivated by these needs, in this paper we study the approximation of matrices and vectors by using their tensor products obtained through successive Schmidt decompositions. We show that data with distributions such as uniform, Poisson, exponential, or similar to these distributions can be approximated by using only a few terms which can be easily mapped onto quantum circuits. The examples include random data with different distributions, the Gram matrices of iris flower, handwritten digits, 20newsgroup, and labeled faces in the wild. And similarly, some quantum operations such as quantum Fourier transform and variational quantum circuits with a small depth also may be approximated with a few terms that are easier to simulate on classical computers. Furthermore, we show how the method can be used to simplify quantum Hamiltonians: In particular, we show the application to randomly generated transverse field Ising model Hamiltonians. The reduced Hamiltonians can be mapped into quantum circuits easily and therefore can be simulated more efficiently.
    Constrained Empirical Risk Minimization: Theory and Practice. (arXiv:2302.04729v1 [cs.LG])
    Deep Neural Networks (DNNs) are widely used for their ability to effectively approximate large classes of functions. This flexibility, however, makes the strict enforcement of constraints on DNNs an open problem. Here we present a framework that, under mild assumptions, allows the exact enforcement of constraints on parameterized sets of functions such as DNNs. Instead of imposing "soft'' constraints via additional terms in the loss, we restrict (a subset of) the DNN parameters to a submanifold on which the constraints are satisfied exactly throughout the entire training procedure. We focus on constraints that are outside the scope of equivariant networks used in Geometric Deep Learning. As a major example of the framework, we restrict filters of a Convolutional Neural Network (CNN) to be wavelets, and apply these wavelet networks to the task of contour prediction in the medical domain.
    Delay Sensitive Hierarchical Federated Learning with Stochastic Local Updates. (arXiv:2302.04851v1 [cs.IT])
    The impact of local averaging on the performance of federated learning (FL) systems is studied in the presence of communication delay between the clients and the parameter server. To minimize the effect of delay, clients are assigned into different groups, each having its own local parameter server (LPS) that aggregates its clients' models. The groups' models are then aggregated at a global parameter server (GPS) that only communicates with the LPSs. Such setting is known as hierarchical FL (HFL). Different from most works in the literature, the number of local and global communication rounds in our work is randomly determined by the (different) delays experienced by each group of clients. Specifically, the number of local averaging rounds are tied to a wall-clock time period coined the sync time $S$, after which the LPSs synchronize their models by sharing them with the GPS. Such sync time $S$ is then reapplied until a global wall-clock time is exhausted.
    Real-world Machine Learning Systems: A survey from a Data-Oriented Architecture Perspective. (arXiv:2302.04810v1 [cs.SE])
    With the upsurge of interest in artificial intelligence machine learning (ML) algorithms, originally developed in academic environments, are now being deployed as parts of real-life systems that deal with large amounts of heterogeneous, dynamic, and high-dimensional data. Deployment of ML methods in real life is prone to challenges across the whole system life-cycle from data management to systems deployment, monitoring, and maintenance. Data-Oriented Architecture (DOA) is an emerging software engineering paradigm that has the potential to mitigate these challenges by proposing a set of principles to create data-driven, loosely coupled, decentralised, and open systems. However DOA as a concept is not widespread yet, and there is no common understanding of how it can be realised in practice. This review addresses that problem by contextualising the principles that underpin the DOA paradigm through the ML system challenges. We explore the extent to which current architectures of ML-based real-world systems have implemented the DOA principles. We also formulate open research challenges and directions for further development of the DOA paradigm.
    Domain Generalization by Functional Regression. (arXiv:2302.04724v1 [cs.LG])
    The problem of domain generalization is to learn, given data from different source distributions, a model that can be expected to generalize well on new target distributions which are only seen through unlabeled samples. In this paper, we study domain generalization as a problem of functional regression. Our concept leads to a new algorithm for learning a linear operator from marginal distributions of inputs to the corresponding conditional distributions of outputs given inputs. Our algorithm allows a source distribution-dependent construction of reproducing kernel Hilbert spaces for prediction, and, satisfies finite sample error bounds for the idealized risk. Numerical implementations and source code are available.
    Geometry-Complete Diffusion for 3D Molecule Generation. (arXiv:2302.04313v1 [cs.LG])
    Denoising diffusion probabilistic models (DDPMs) have recently taken the field of generative modeling by storm, pioneering new state-of-the-art results in disciplines such as computer vision and computational biology for diverse tasks ranging from text-guided image generation to structure-guided protein design. Along this latter line of research, methods such as those of Hoogeboom et al. 2022 have been proposed for unconditionally generating 3D molecules using equivariant graph neural networks (GNNs) within a DDPM framework. Toward this end, we propose GCDM, a geometry-complete diffusion model that achieves new state-of-the-art results for 3D molecule diffusion generation by leveraging the representation learning strengths offered by GNNs that perform geometry-complete message-passing. Our results with GCDM also offer preliminary insights into how physical inductive biases impact the generative dynamics of molecular DDPMs. The source code, data, and instructions to train new models or reproduce our results are freely available at https://github.com/BioinfoMachineLearning/bio-diffusion.
    The Re-Label Method For Data-Centric Machine Learning. (arXiv:2302.04391v1 [cs.LG])
    In industry deep learning application, our manually labeled data has a certain number of noisy data. To solve this problem and achieve more than 90 score in dev dataset, we present a simple method to find the noisy data and re-label the noisy data by human, given the model predictions as references in human labeling. In this paper, we illustrate our idea for a broad set of deep learning tasks, includes classification, sequence tagging, object detection, sequence generation, click-through rate prediction. The experimental results and human evaluation results verify our idea.
    Optimistic Online Mirror Descent for Bridging Stochastic and Adversarial Online Convex Optimization. (arXiv:2302.04552v1 [cs.LG])
    Stochastically Extended Adversarial (SEA) model is introduced by Sachs et al. [2022] as an interpolation between stochastic and adversarial online convex optimization. Under the smoothness condition, they demonstrate that the expected regret of optimistic follow-the-regularized-leader (FTRL) depends on the cumulative stochastic variance $\sigma_{1:T}^2$ and the cumulative adversarial variation $\Sigma_{1:T}^2$ for convex functions. They also provide a slightly weaker bound based on the maximal stochastic variance $\sigma_{\max}^2$ and the maximal adversarial variation $\Sigma_{\max}^2$ for strongly convex functions. Inspired by their work, we investigate the theoretical guarantees of optimistic online mirror descent (OMD) for the SEA model. For convex and smooth functions, we obtain the same $\mathcal{O}(\sqrt{\sigma_{1:T}^2}+\sqrt{\Sigma_{1:T}^2})$ regret bound, without the convexity requirement of individual functions. For strongly convex and smooth functions, we establish an $\mathcal{O}(\min\{\log (\sigma_{1:T}^2+\Sigma_{1:T}^2), (\sigma_{\max}^2 + \Sigma_{\max}^2) \log T\})$ bound, better than their $\mathcal{O}((\sigma_{\max}^2 + \Sigma_{\max}^2) \log T)$ bound. For \mbox{exp-concave} and smooth functions, we achieve a new $\mathcal{O}(d\log(\sigma_{1:T}^2+\Sigma_{1:T}^2))$ bound. Owing to the OMD framework, we can further extend our result to obtain dynamic regret guarantees, which are more favorable in non-stationary online scenarios. The attained results allow us to recover excess risk bounds of the stochastic setting and regret bounds of the adversarial setting, and derive new guarantees for many intermediate scenarios.
    Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals. (arXiv:2302.04449v1 [cs.LG])
    High sample complexity has long been a challenge for RL. On the other hand, humans learn to perform tasks not only from interaction or demonstrations, but also by reading unstructured text documents, e.g., instruction manuals. Instruction manuals and wiki pages are among the most abundant data that could inform agents of valuable features and policies or task-specific environmental dynamics and reward structures. Therefore, we hypothesize that the ability to utilize human-written instruction manuals to assist learning policies for specific tasks should lead to a more efficient and better-performing agent. We propose the Read and Reward framework. Read and Reward speeds up RL algorithms on Atari games by reading manuals released by the Atari game developers. Our framework consists of a QA Extraction module that extracts and summarizes relevant information from the manual and a Reasoning module that evaluates object-agent interactions based on information from the manual. Auxiliary reward is then provided to a standard A2C RL agent, when interaction is detected. When assisted by our design, A2C improves on 4 games in the Atari environment with sparse rewards, and requires 1000x less training frames compared to the previous SOTA Agent 57 on Skiing, the hardest game in Atari.
    New directions in the applications of rough path theory. (arXiv:2302.04586v1 [cs.LG])
    This article provides a concise overview of some of the recent advances in the application of rough path theory to machine learning. Controlled differential equations (CDEs) are discussed as the key mathematical model to describe the interaction of a stream with a physical control system. A collection of iterated integrals known as the signature naturally arises in the description of the response produced by such interactions. The signature comes equipped with a variety of powerful properties rendering it an ideal feature map for streamed data. We summarise recent advances in the symbiosis between deep learning and CDEs, studying the link with RNNs and culminating with the Neural CDE model. We concluded with a discussion on signature kernel methods.
    CLARE: Conservative Model-Based Reward Learning for Offline Inverse Reinforcement Learning. (arXiv:2302.04782v1 [cs.LG])
    This work aims to tackle a major challenge in offline Inverse Reinforcement Learning (IRL), namely the reward extrapolation error, where the learned reward function may fail to explain the task correctly and misguide the agent in unseen environments due to the intrinsic covariate shift. Leveraging both expert data and lower-quality diverse data, we devise a principled algorithm (namely CLARE) that solves offline IRL efficiently via integrating "conservatism" into a learned reward function and utilizing an estimated dynamics model. Our theoretical analysis provides an upper bound on the return gap between the learned policy and the expert policy, based on which we characterize the impact of covariate shift by examining subtle two-tier tradeoffs between the exploitation (on both expert and diverse data) and exploration (on the estimated dynamics model). We show that CLARE can provably alleviate the reward extrapolation error by striking the right exploitation-exploration balance therein. Extensive experiments corroborate the significant performance gains of CLARE over existing state-of-the-art algorithms on MuJoCo continuous control tasks (especially with a small offline dataset), and the learned reward is highly instructive for further learning.
    Rehabilitating Homeless: Dataset and Key Insights. (arXiv:2302.04455v1 [cs.LG])
    This paper presents a large anonymized dataset of homelessness alongside insights into the data-driven rehabilitation of homeless people. The dataset was gathered by a large nonprofit organization working on rehabilitating the homeless for twenty years. This is the first dataset that we know of that contains rich information on thousands of homeless individuals seeking rehabilitation. We show how data analysis can help to make the rehabilitation of homeless people more effective and successful. Thus, we hope this paper alerts the data science community to the problem of homelessness.
    Tree Learning: Optimal Algorithms and Sample Complexity. (arXiv:2302.04492v1 [cs.LG])
    We study the problem of learning a hierarchical tree representation of data from labeled samples, taken from an arbitrary (and possibly adversarial) distribution. Consider a collection of data tuples labeled according to their hierarchical structure. The smallest number of such tuples required in order to be able to accurately label subsequent tuples is of interest for data collection in machine learning. We present optimal sample complexity bounds for this problem in several learning settings, including (agnostic) PAC learning and online learning. Our results are based on tight bounds of the Natarajan and Littlestone dimensions of the associated problem. The corresponding tree classifiers can be constructed efficiently in near-linear time.
    Liver Segmentation in Time-resolved C-arm CT Volumes Reconstructed from Dynamic Perfusion Scans using Time Separation Technique. (arXiv:2302.04585v1 [physics.med-ph])
    Perfusion imaging is a valuable tool for diagnosing and treatment planning for liver tumours. The time separation technique (TST) has been successfully used for modelling C-arm cone-beam computed tomography (CBCT) perfusion data. The reconstruction can be accompanied by the segmentation of the liver - for better visualisation and for generating comprehensive perfusion maps. Recently introduced Turbolift learning has been seen to perform well while working with TST reconstructions, but has not been explored for the time-resolved volumes (TRV) estimated out of TST reconstructions. The segmentation of the TRVs can be useful for tracking the movement of the liver over time. This research explores this possibility by training the multi-scale attention UNet of Turbolift learning at its third stage on the TRVs and shows the robustness of Turbolift learning since it can even work efficiently with the TRVs, resulting in a Dice score of 0.864$\pm$0.004.
    Outlier-Robust Gromov Wasserstein for Graph Data. (arXiv:2302.04610v1 [cs.LG])
    Gromov Wasserstein (GW) distance is a powerful tool for comparing and aligning probability distributions supported on different metric spaces. It has become the main modeling technique for aligning heterogeneous data for a wide range of graph learning tasks. However, the GW distance is known to be highly sensitive to outliers, which can result in large inaccuracies if the outliers are given the same weight as other samples in the objective function. To mitigate this issue, we introduce a new and robust version of the GW distance called RGW. RGW features optimistically perturbed marginal constraints within a $\varphi$-divergence based ambiguity set. To make the benefits of RGW more accessible in practice, we develop a computationally efficient algorithm, Bregman proximal alternating linearization minimization, with a theoretical convergence guarantee. Through extensive experimentation, we validate our theoretical results and demonstrate the effectiveness of RGW on real-world graph learning tasks, such as subgraph matching and partial shape correspondence.
    A Text-guided Protein Design Framework. (arXiv:2302.04611v1 [cs.LG])
    Current AI-assisted protein design mainly utilizes protein sequential and structural information. Meanwhile, there exists tremendous knowledge curated by humans in the text format describing proteins' high-level properties. Yet, whether the incorporation of such text data can help protein design tasks has not been explored. To bridge this gap, we propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design. ProteinDT consists of three subsequent steps: ProteinCLAP that aligns the representation of two modalities, a facilitator that generates the protein representation from the text modality, and a decoder that generates the protein sequences from the representation. To train ProteinDT, we construct a large dataset, SwissProtCLAP, with 441K text and protein pairs. We empirically verify the effectiveness of ProteinDT from three aspects: (1) consistently superior performance on four out of six protein property prediction benchmarks; (2) over 90% accuracy for text-guided protein generation; and (3) promising results for zero-shot text-guided protein editing.
    Weakly Supervised Anomaly Detection: A Survey. (arXiv:2302.04549v1 [cs.LG])
    Anomaly detection (AD) is a crucial task in machine learning with various applications, such as detecting emerging diseases, identifying financial frauds, and detecting fake news. However, obtaining complete, accurate, and precise labels for AD tasks can be expensive and challenging due to the cost and difficulties in data annotation. To address this issue, researchers have developed AD methods that can work with incomplete, inexact, and inaccurate supervision, collectively summarized as weakly supervised anomaly detection (WSAD) methods. In this study, we present the first comprehensive survey of WSAD methods by categorizing them into the above three weak supervision settings across four data modalities (i.e., tabular, graph, time-series, and image/video data). For each setting, we provide formal definitions, key algorithms, and potential future directions. To support future research, we conduct experiments on a selected setting and release the source code, along with a collection of WSAD methods and data.
    Robust and Scalable Bayesian Online Changepoint Detection. (arXiv:2302.04759v1 [stat.ML])
    This paper proposes an online, provably robust, and scalable Bayesian approach for changepoint detection. The resulting algorithm has key advantages over previous work: it provides provable robustness by leveraging the generalised Bayesian perspective, and also addresses the scalability issues of previous attempts. Specifically, the proposed generalised Bayesian formalism leads to conjugate posteriors whose parameters are available in closed form by leveraging diffusion score matching. The resulting algorithm is exact, can be updated through simple algebra, and is more than 10 times faster than its closest competitor.
    Efficient Planning in Combinatorial Action Spaces with Applications to Cooperative Multi-Agent Reinforcement Learning. (arXiv:2302.04376v1 [cs.LG])
    A practical challenge in reinforcement learning are combinatorial action spaces that make planning computationally demanding. For example, in cooperative multi-agent reinforcement learning, a potentially large number of agents jointly optimize a global reward function, which leads to a combinatorial blow-up in the action space by the number of agents. As a minimal requirement, we assume access to an argmax oracle that allows to efficiently compute the greedy policy for any Q-function in the model class. Building on recent work in planning with local access to a simulator and linear function approximation, we propose efficient algorithms for this setting that lead to polynomial compute and query complexity in all relevant problem parameters. For the special case where the feature decomposition is additive, we further improve the bounds and extend the results to the kernelized setting with an efficient algorithm.
    Self-Supervised Node Representation Learning via Node-to-Neighbourhood Alignment. (arXiv:2302.04626v1 [cs.LG])
    Self-supervised node representation learning aims to learn node representations from unlabelled graphs that rival the supervised counterparts. The key towards learning informative node representations lies in how to effectively gain contextual information from the graph structure. In this work, we present simple-yet-effective self-supervised node representation learning via aligning the hidden representations of nodes and their neighbourhood. Our first idea achieves such node-to-neighbourhood alignment by directly maximizing the mutual information between their representations, which, we prove theoretically, plays the role of graph smoothing. Our framework is optimized via a surrogate contrastive loss and a Topology-Aware Positive Sampling (TAPS) strategy is proposed to sample positives by considering the structural dependencies between nodes, which enables offline positive selection. Considering the excessive memory overheads of contrastive learning, we further propose a negative-free solution, where the main contribution is a Graph Signal Decorrelation (GSD) constraint to avoid representation collapse and over-smoothing. The GSD constraint unifies some of the existing constraints and can be used to derive new implementations to combat representation collapse. By applying our methods on top of simple MLP-based node representation encoders, we learn node representations that achieve promising node classification performance on a set of graph-structured datasets from small- to large-scale.
    Short-Term Memory Convolutions. (arXiv:2302.04331v1 [cs.LG])
    The real-time processing of time series signals is a critical issue for many real-life applications. The idea of real-time processing is especially important in audio domain as the human perception of sound is sensitive to any kind of disturbance in perceived signals, especially the lag between auditory and visual modalities. The rise of deep learning (DL) models complicated the landscape of signal processing. Although they often have superior quality compared to standard DSP methods, this advantage is diminished by higher latency. In this work we propose novel method for minimization of inference time latency and memory consumption, called Short-Term Memory Convolution (STMC) and its transposed counterpart. The main advantage of STMC is the low latency comparable to long short-term memory (LSTM) networks. Furthermore, the training of STMC-based models is faster and more stable as the method is based solely on convolutional neural networks (CNNs). In this study we demonstrate an application of this solution to a U-Net model for a speech separation task and GhostNet model in acoustic scene classification (ASC) task. In case of speech separation we achieved a 5-fold reduction in inference time and a 2-fold reduction in latency without affecting the output quality. The inference time for ASC task was up to 4 times faster while preserving the original accuracy.
    Quantum Multi-Agent Actor-Critic Networks for Cooperative Mobile Access in Multi-UAV Systems. (arXiv:2302.04445v1 [cs.MA])
    This paper proposes a novel quantum multi-agent actor-critic networks (QMACN) algorithm for autonomously constructing a robust mobile access system using multiple unmanned aerial vehicles (UAVs). For the cooperation of multiple UAVs for autonomous mobile access, multi-agent reinforcement learning (MARL) methods are considered. In addition, we also adopt the concept of quantum computing (QC) to improve the training and inference performances. By utilizing QC, scalability and physical issues can happen. However, our proposed QMACN algorithm builds quantum critic and multiple actor networks in order to handle such problems. Thus, our proposed QMACN algorithm verifies the advantage of quantum MARL with remarkable performance improvements in terms of training speed and wireless service quality in various data-intensive evaluations. Furthermore, we validate that a noise injection scheme can be used for handling environmental uncertainties in order to realize robust mobile access. Our data-intensive simulation results verify that our proposed QMACN algorithm outperforms the other existing algorithms.
    Dual Algorithmic Reasoning. (arXiv:2302.04496v1 [cs.LG])
    Neural Algorithmic Reasoning is an emerging area of machine learning which seeks to infuse algorithmic computation in neural networks, typically by training neural models to approximate steps of classical algorithms. In this context, much of the current work has focused on learning reachability and shortest path graph algorithms, showing that joint learning on similar algorithms is beneficial for generalisation. However, when targeting more complex problems, such similar algorithms become more difficult to find. Here, we propose to learn algorithms by exploiting duality of the underlying algorithmic problem. Many algorithms solve optimisation problems. We demonstrate that simultaneously learning the dual definition of these optimisation problems in algorithmic learning allows for better learning and qualitatively better solutions. Specifically, we exploit the max-flow min-cut theorem to simultaneously learn these two algorithms over synthetically generated graphs, demonstrating the effectiveness of the proposed approach. We then validate the real-world utility of our dual algorithmic reasoner by deploying it on a challenging brain vessel classification task, which likely depends on the vessels' flow properties. We demonstrate a clear performance gain when using our model within such a context, and empirically show that learning the max-flow and min-cut algorithms together is critical for achieving such a result.
    Gentlest ascent dynamics on manifolds defined by adaptively sampled point-clouds. (arXiv:2302.04426v1 [math.DS])
    Finding saddle points of dynamical systems is an important problem in practical applications such as the study of rare events of molecular systems. Gentlest ascent dynamics (GAD) is one of a number of algorithms in existence that attempt to find saddle points in dynamical systems. It works by deriving a new dynamical system in which saddle points of the original system become stable equilibria. GAD has been recently generalized to the study of dynamical systems on manifolds (differential algebraic equations) described by equality constraints and given an extrinsic formulation. In this paper, we present an extension of GAD to manifolds defined by point-clouds and formulated intrinsically. These point-clouds are adaptively sampled during an iterative process that drives the system from the initial conformation (typically in the neighborhood of a stable equilibrium) to a saddle point. Our method requires the reactant (initial conformation), does not require the explicit constraint equations to be specified, and is purely data-driven.
    Learning Mixtures of Markov Chains with Quality Guarantees. (arXiv:2302.04680v1 [cs.LG])
    A large number of modern applications ranging from listening songs online and browsing the Web to using a navigation app on a smartphone generate a plethora of user trails. Clustering such trails into groups with a common sequence pattern can reveal significant structure in human behavior that can lead to improving user experience through better recommendations, and even prevent suicides [LMCR14]. One approach to modeling this problem mathematically is as a mixture of Markov chains. Recently, Gupta, Kumar and Vassilvitski [GKV16] introduced an algorithm (GKV-SVD) based on the singular value decomposition (SVD) that under certain conditions can perfectly recover a mixture of L chains on n states, given only the distribution of trails of length 3 (3-trail). In this work we contribute to the problem of unmixing Markov chains by highlighting and addressing two important constraints of the GKV-SVD algorithm [GKV16]: some chains in the mixture may not even be weakly connected, and secondly in practice one does not know beforehand the true number of chains. We resolve these issues in the Gupta et al. paper [GKV16]. Specifically, we propose an algebraic criterion that enables us to choose a value of L efficiently that avoids overfitting. Furthermore, we design a reconstruction algorithm that outputs the true mixture in the presence of disconnected chains and is robust to noise. We complement our theoretical results with experiments on both synthetic and real data, where we observe that our method outperforms the GKV-SVD algorithm. Finally, we empirically observe that combining an EM-algorithm with our method performs best in practice, both in terms of reconstruction error with respect to the distribution of 3-trails and the mixture of Markov Chains.
    Real-Time Visual Feedback to Guide Benchmark Creation: A Human-and-Metric-in-the-Loop Workflow. (arXiv:2302.04434v1 [cs.CL])
    Recent research has shown that language models exploit `artifacts' in benchmarks to solve tasks, rather than truly learning them, leading to inflated model performance. In pursuit of creating better benchmarks, we propose VAIDA, a novel benchmark creation paradigm for NLP, that focuses on guiding crowdworkers, an under-explored facet of addressing benchmark idiosyncrasies. VAIDA facilitates sample correction by providing realtime visual feedback and recommendations to improve sample quality. Our approach is domain, model, task, and metric agnostic, and constitutes a paradigm shift for robust, validated, and dynamic benchmark creation via human-and-metric-in-the-loop workflows. We evaluate via expert review and a user study with NASA TLX. We find that VAIDA decreases effort, frustration, mental, and temporal demands of crowdworkers and analysts, simultaneously increasing the performance of both user groups with a 45.8% decrease in the level of artifacts in created samples. As a by product of our user study, we observe that created samples are adversarial across models, leading to decreases of 31.3% (BERT), 22.5% (RoBERTa), 14.98% (GPT-3 fewshot) in performance.
    Augmenting NLP data to counter Annotation Artifacts for NLI Tasks. (arXiv:2302.04700v1 [cs.CL])
    In this paper, we explore Annotation Artifacts - the phenomena wherein large pre-trained NLP models achieve high performance on benchmark datasets but do not actually "solve" the underlying task and instead rely on some dataset artifacts (same across train, validation, and test sets) to figure out the right answer. We explore this phenomenon on the well-known Natural Language Inference task by first using contrast and adversarial examples to understand limitations to the model's performance and show one of the biases arising from annotation artifacts (the way training data was constructed by the annotators). We then propose a data augmentation technique to fix this bias and measure its effectiveness.
    SparseProp: Efficient Sparse Backpropagation for Faster Training of Neural Networks. (arXiv:2302.04852v1 [cs.LG])
    We provide a new efficient version of the backpropagation algorithm, specialized to the case where the weights of the neural network being trained are sparse. Our algorithm is general, as it applies to arbitrary (unstructured) sparsity and common layer types (e.g., convolutional or linear). We provide a fast vectorized implementation on commodity CPUs, and show that it can yield speedups in end-to-end runtime experiments, both in transfer learning using already-sparsified networks, and in training sparse networks from scratch. Thus, our results provide the first support for sparse training on commodity hardware.
    Disentangling Learning Representations with Density Estimation. (arXiv:2302.04362v1 [cs.LG])
    Disentangled learning representations have promising utility in many applications, but they currently suffer from serious reliability issues. We present Gaussian Channel Autoencoder (GCAE), a method which achieves reliable disentanglement via flexible density estimation of the latent space. GCAE avoids the curse of dimensionality of density estimation by disentangling subsets of its latent space with the Dual Total Correlation (DTC) metric, thereby representing its high-dimensional latent joint distribution as a collection of many low-dimensional conditional distributions. In our experiments, GCAE achieves highly competitive and reliable disentanglement scores compared with state-of-the-art baselines.
    Generalization in Graph Neural Networks: Improved PAC-Bayesian Bounds on Graph Diffusion. (arXiv:2302.04451v1 [cs.LG])
    Graph neural networks are widely used tools for graph prediction tasks. Motivated by their empirical performance, prior works have developed generalization bounds for graph neural networks, which scale with graph structures in terms of the maximum degree. In this paper, we present generalization bounds that instead scale with the largest singular value of the graph neural network's feature diffusion matrix. These bounds are numerically much smaller than prior bounds for real-world graphs. We also construct a lower bound of the generalization gap that matches our upper bound asymptotically. To achieve these results, we analyze a unified model that includes prior works' settings (i.e., convolutional and message-passing networks) and new settings (i.e., graph isomorphism networks). Our key idea is to measure the stability of graph neural networks against noise perturbations using Hessians. Empirically, we find that Hessian-based measurements correlate with the observed generalization gaps of graph neural networks accurately; Optimizing noise stability properties for fine-tuning pretrained graph neural networks also improves test performance on several graph-level classification tasks.
    Flat Seeking Bayesian Neural Networks. (arXiv:2302.02713v2 [cs.LG] UPDATED)
    Bayesian Neural Networks (BNNs) offer a probabilistic interpretation for deep learning models by imposing a prior distribution over model parameters and inferencing a posterior distribution based on observed data. The model sampled from the posterior distribution can be used for providing ensemble predictions and quantifying prediction uncertainty. It is well-known that deep learning models with a lower sharpness have a better generalization ability. Nonetheless, existing posterior inferences are not aware of sharpness/flatness, hence possibly leading to high sharpness for the models sampled from it. In this paper, we develop theories, the Bayesian setting, and the variational inference approach for the sharpness-aware posterior. Specifically, the models sampled from our sharpness-aware posterior and the optimal approximate posterior estimating this sharpness-aware posterior have a better flatness, hence possibly possessing a higher generalization ability. We conduct experiments by leveraging the sharpness-aware posterior with the state-of-the-art Bayesian Neural Networks, showing that the flat-seeking counterparts outperform their baselines in all metrics of interest.
    Efficient displacement convex optimization with particle gradient descent. (arXiv:2302.04753v1 [cs.LG])
    Particle gradient descent, which uses particles to represent a probability measure and performs gradient descent on particles in parallel, is widely used to optimize functions of probability measures. This paper considers particle gradient descent with a finite number of particles and establishes its theoretical guarantees to optimize functions that are \emph{displacement convex} in measures. Concretely, for Lipschitz displacement convex functions defined on probability over $\mathbb{R}^d$, we prove that $O(1/\epsilon^2)$ particles and $O(d/\epsilon^4)$ computations are sufficient to find the $\epsilon$-optimal solutions. We further provide improved complexity bounds for optimizing smooth displacement convex functions. We demonstrate the application of our results for function approximation with specific neural architectures with two-dimensional inputs.
    Exploiting Certified Defences to Attack Randomised Smoothing. (arXiv:2302.04379v1 [cs.LG])
    In guaranteeing that no adversarial examples exist within a bounded region, certification mechanisms play an important role in neural network robustness. Concerningly, this work demonstrates that the certification mechanisms themselves introduce a new, heretofore undiscovered attack surface, that can be exploited by attackers to construct smaller adversarial perturbations. While these attacks exist outside the certification region in no way invalidate certifications, minimising a perturbation's norm significantly increases the level of difficulty associated with attack detection. In comparison to baseline attacks, our new framework yields smaller perturbations more than twice as frequently as any other approach, resulting in an up to $34 \%$ reduction in the median perturbation norm. That this approach also requires $90 \%$ less computational time than approaches like PGD. That these reductions are possible suggests that exploiting this new attack vector would allow attackers to more frequently construct hard to detect adversarial attacks, by exploiting the very systems designed to defend deployed models.
    An End-to-End Framework for Marketing Effectiveness Optimization under Budget Constraint. (arXiv:2302.04477v1 [cs.LG])
    Online platforms often incentivize consumers to improve user engagement and platform revenue. Since different consumers might respond differently to incentives, individual-level budget allocation is an essential task in marketing campaigns. Recent advances in this field often address the budget allocation problem using a two-stage paradigm: the first stage estimates the individual-level treatment effects using causal inference algorithms, and the second stage invokes integer programming techniques to find the optimal budget allocation solution. Since the objectives of these two stages might not be perfectly aligned, such a two-stage paradigm could hurt the overall marketing effectiveness. In this paper, we propose a novel end-to-end framework to directly optimize the business goal under budget constraints. Our core idea is to construct a regularizer to represent the marketing goal and optimize it efficiently using gradient estimation techniques. As such, the obtained models can learn to maximize the marketing goal directly and precisely. We extensively evaluate our proposed method in both offline and online experiments, and experimental results demonstrate that our method outperforms current state-of-the-art methods. Our proposed method is currently deployed to allocate marketing budgets for hundreds of millions of users on a short video platform and achieves significant business goal improvements. Our code will be publicly available.
    NeuKron: Constant-Size Lossy Compression of Sparse Reorderable Matrices and Tensors. (arXiv:2302.04570v1 [cs.LG])
    Many real-world data are naturally represented as a sparse reorderable matrix, whose rows and columns can be arbitrarily ordered (e.g., the adjacency matrix of a bipartite graph). Storing a sparse matrix in conventional ways requires an amount of space linear in the number of non-zeros, and lossy compression of sparse matrices (e.g., Truncated SVD) typically requires an amount of space linear in the number of rows and columns. In this work, we propose NeuKron for compressing a sparse reorderable matrix into a constant-size space. NeuKron generalizes Kronecker products using a recurrent neural network with a constant number of parameters. NeuKron updates the parameters so that a given matrix is approximated by the product and reorders the rows and columns of the matrix to facilitate the approximation. The updates take time linear in the number of non-zeros in the input matrix, and the approximation of each entry can be retrieved in logarithmic time. We also extend NeuKron to compress sparse reorderable tensors (e.g. multi-layer graphs), which generalize matrices. Through experiments on ten real-world datasets, we show that NeuKron is (a) Compact: requiring up to five orders of magnitude less space than its best competitor with similar approximation errors, (b) Accurate: giving up to 10x smaller approximation error than its best competitors with similar size outputs, and (c) Scalable: successfully compressing a matrix with over 230 million non-zero entries.
    Nonlinear Random Matrices and Applications to the Sum of Squares Hierarchy. (arXiv:2302.04462v1 [cs.CC])
    We develop new tools in the theory of nonlinear random matrices and apply them to study the performance of the Sum of Squares (SoS) hierarchy on average-case problems. The SoS hierarchy is a powerful optimization technique that has achieved tremendous success for various problems in combinatorial optimization, robust statistics and machine learning. It's a family of convex relaxations that lets us smoothly trade off running time for approximation guarantees. In recent works, it's been shown to be extremely useful for recovering structure in high dimensional noisy data. It also remains our best approach towards refuting the notorious Unique Games Conjecture. In this work, we analyze the performance of the SoS hierarchy on fundamental problems stemming from statistics, theoretical computer science and statistical physics. In particular, we show subexponential-time SoS lower bounds for the problems of the Sherrington-Kirkpatrick Hamiltonian, Planted Slightly Denser Subgraph, Tensor Principal Components Analysis and Sparse Principal Components Analysis. These SoS lower bounds involve analyzing large random matrices, wherein lie our main contributions. These results offer strong evidence for the truth of and insight into the low-degree likelihood ratio hypothesis, an important conjecture that predicts the power of bounded-time algorithms for hypothesis testing. We also develop general-purpose tools for analyzing the behavior of random matrices which are functions of independent random variables. Towards this, we build on and generalize the matrix variant of the Efron-Stein inequalities. In particular, our general theorem on matrix concentration recovers various results that have appeared in the literature. We expect these random matrix theory ideas to have other significant applications.
    Towards Fairer and More Efficient Federated Learning via Multidimensional Personalized Edge Models. (arXiv:2302.04464v1 [cs.LG])
    Federated learning (FL) is an emerging technique that trains massive and geographically distributed edge data while maintaining privacy. However, FL has inherent challenges in terms of fairness and computational efficiency due to the rising heterogeneity of edges, and thus usually result in sub-optimal performance in recent state-of-the-art (SOTA) solutions. In this paper, we propose a Customized Federated Learning (CFL) system to eliminate FL heterogeneity from multiple dimensions. Specifically, CFL tailors personalized models from the specially designed global model for each client, jointly guided an online trained model-search helper and a novel aggregation algorithm. Extensive experiments demonstrate that CFL has full-stack advantages for both FL training and edge reasoning and significantly improves the SOTA performance w.r.t. model accuracy (up to 7.2% in the non-heterogeneous environment and up to 21.8% in the heterogeneous environment), efficiency, and FL fairness.
    REIN: A Comprehensive Benchmark Framework for Data Cleaning Methods in ML Pipelines. (arXiv:2302.04702v1 [cs.DB])
    Nowadays, machine learning (ML) plays a vital role in many aspects of our daily life. In essence, building well-performing ML applications requires the provision of high-quality data throughout the entire life-cycle of such applications. Nevertheless, most of the real-world tabular data suffer from different types of discrepancies, such as missing values, outliers, duplicates, pattern violation, and inconsistencies. Such discrepancies typically emerge while collecting, transferring, storing, and/or integrating the data. To deal with these discrepancies, numerous data cleaning methods have been introduced. However, the majority of such methods broadly overlook the requirements imposed by downstream ML models. As a result, the potential of utilizing these data cleaning methods in ML pipelines is predominantly unrevealed. In this work, we introduce a comprehensive benchmark, called REIN1, to thoroughly investigate the impact of data cleaning methods on various ML models. Through the benchmark, we provide answers to important research questions, e.g., where and whether data cleaning is a necessary step in ML pipelines. To this end, the benchmark examines 38 simple and advanced error detection and repair methods. To evaluate these methods, we utilized a wide collection of ML models trained on 14 publicly-available datasets covering different domains and encompassing realistic as well as synthetic error profiles.
    A Comparison of Decision Forest Inference Platforms from A Database Perspective. (arXiv:2302.04430v1 [cs.DB])
    Decision forest, including RandomForest, XGBoost, and LightGBM, is one of the most popular machine learning techniques used in many industrial scenarios, such as credit card fraud detection, ranking, and business intelligence. Because the inference process is usually performance-critical, a number of frameworks were developed and dedicated for decision forest inference, such as ONNX, TreeLite from Amazon, TensorFlow Decision Forest from Google, HummingBird from Microsoft, Nvidia FIL, and lleaves. However, these frameworks are all decoupled with data management frameworks. It is unclear whether in-database inference will improve the overall performance. In addition, these frameworks used different algorithms, optimization techniques, and parallelism models. It is unclear how these implementations will affect the overall performance and how to make design decisions for an in-database inference framework. In this work, we investigated the above questions by comprehensively comparing the end-to-end performance of the aforementioned inference frameworks and netsDB, an in-database inference framework we implemented. Through this study, we identified that netsDB is best suited for handling small-scale models on large-scale datasets and all-scale models on small-scale datasets, for which it achieved up to hundreds of times of speedup. In addition, the relation-centric representation we proposed significantly improved netsDB's performance in handling large-scale models, while the model reuse optimization we proposed further improved netsDB's performance in handling small-scale datasets.
    Geometry of Score Based Generative Models. (arXiv:2302.04411v1 [cs.LG])
    In this work, we look at Score-based generative models (also called diffusion generative models) from a geometric perspective. From a new view point, we prove that both the forward and backward process of adding noise and generating from noise are Wasserstein gradient flow in the space of probability measures. We are the first to prove this connection. Our understanding of Score-based (and Diffusion) generative models have matured and become more complete by drawing ideas from different fields like Bayesian inference, control theory, stochastic differential equation and Schrodinger bridge. However, many open questions and challenges remain. One problem, for example, is how to decrease the sampling time? We demonstrate that looking from geometric perspective enables us to answer many of these questions and provide new interpretations to some known results. Furthermore, geometric perspective enables us to devise an intuitive geometric solution to the problem of faster sampling. By augmenting traditional score-based generative models with a projection step, we show that we can generate high quality images with significantly fewer sampling-steps.
    Towards Model-Agnostic Federated Learning over Networks. (arXiv:2302.04363v1 [cs.LG])
    We present a model-agnostic federated learning method for decentralized data with an intrinsic network structure. The network structure reflects similarities between the (statistics of) local datasets and, in turn, their associated local models. Our method is an instance of empirical risk minimization, using a regularization term that is constructed from the network structure of data. In particular, we require well-connected local models, forming clusters, to yield similar predictions on a common test set. In principle our method can be applied to any collection of local models. The only restriction put on these local models is that they allow for efficient implementation of regularized empirical risk minimization (training). Such implementations might be available in the form of high-level programming frameworks such as \texttt{scikit-learn}, \texttt{Keras} or \texttt{PyTorch}.
    Data Quality-aware Mixed-precision Quantization via Hybrid Reinforcement Learning. (arXiv:2302.04453v1 [cs.AI])
    Mixed-precision quantization mostly predetermines the model bit-width settings before actual training due to the non-differential bit-width sampling process, obtaining sub-optimal performance. Worse still, the conventional static quality-consistent training setting, i.e., all data is assumed to be of the same quality across training and inference, overlooks data quality changes in real-world applications which may lead to poor robustness of the quantized models. In this paper, we propose a novel Data Quality-aware Mixed-precision Quantization framework, dubbed DQMQ, to dynamically adapt quantization bit-widths to different data qualities. The adaption is based on a bit-width decision policy that can be learned jointly with the quantization training. Concretely, DQMQ is modeled as a hybrid reinforcement learning (RL) task that combines model-based policy optimization with supervised quantization training. By relaxing the discrete bit-width sampling to a continuous probability distribution that is encoded with few learnable parameters, DQMQ is differentiable and can be directly optimized end-to-end with a hybrid optimization target considering both task performance and quantization benefits. Trained on mixed-quality image datasets, DQMQ can implicitly select the most proper bit-width for each layer when facing uneven input qualities. Extensive experiments on various benchmark datasets and networks demonstrate the superiority of DQMQ against existing fixed/mixed-precision quantization methods.
    MedDiff: Generating Electronic Health Records using Accelerated Denoising Diffusion Model. (arXiv:2302.04355v1 [cs.LG])
    Due to patient privacy protection concerns, machine learning research in healthcare has been undeniably slower and limited than in other application domains. High-quality, realistic, synthetic electronic health records (EHRs) can be leveraged to accelerate methodological developments for research purposes while mitigating privacy concerns associated with data sharing. The current state-of-the-art model for synthetic EHR generation is generative adversarial networks, which are notoriously difficult to train and can suffer from mode collapse. Denoising Diffusion Probabilistic Models, a class of generative models inspired by statistical thermodynamics, have recently been shown to generate high-quality synthetic samples in certain domains. It is unknown whether these can generalize to generation of large-scale, high-dimensional EHRs. In this paper, we present a novel generative model based on diffusion models that is the first successful application on electronic health records. Our model proposes a mechanism to perform class-conditional sampling to preserve label information. We also introduce a new sampling strategy to accelerate the inference speed. We empirically show that our model outperforms existing state-of-the-art synthetic EHR generation methods.
    Discovering interpretable Lagrangian of dynamical systems from data. (arXiv:2302.04400v1 [stat.ML])
    A complete understanding of physical systems requires models that are accurate and obeys natural conservation laws. Recent trends in representation learning involve learning Lagrangian from data rather than the direct discovery of governing equations of motion. The generalization of equation discovery techniques has huge potential; however, existing Lagrangian discovery frameworks are black-box in nature. This raises a concern about the reusability of the discovered Lagrangian. In this article, we propose a novel data-driven machine-learning algorithm to automate the discovery of interpretable Lagrangian from data. The Lagrangian are derived in interpretable forms, which also allows the automated discovery of conservation laws and governing equations of motion. The architecture of the proposed framework is designed in such a way that it allows learning the Lagrangian from a subset of the underlying domain and then generalizing for an infinite-dimensional system. The fidelity of the proposed framework is exemplified using examples described by systems of ordinary differential equations and partial differential equations where the Lagrangian and conserved quantities are known.
    Multi-task Representation Learning for Pure Exploration in Linear Bandits. (arXiv:2302.04441v1 [cs.LG])
    Despite the recent success of representation learning in sequential decision making, the study of the pure exploration scenario (i.e., identify the best option and minimize the sample complexity) is still limited. In this paper, we study multi-task representation learning for best arm identification in linear bandits (RepBAI-LB) and best policy identification in contextual linear bandits (RepBPI-CLB), two popular pure exploration settings with wide applications, e.g., clinical trials and web content optimization. In these two problems, all tasks share a common low-dimensional linear representation, and our goal is to leverage this feature to accelerate the best arm (policy) identification process for all tasks. For these problems, we design computationally and sample efficient algorithms DouExpDes and C-DouExpDes, which perform double experimental designs to plan optimal sample allocations for learning the global representation. We show that by learning the common representation among tasks, our sample complexity is significantly better than that of the native approach which solves tasks independently. To the best of our knowledge, this is the first work to demonstrate the benefits of representation learning for multi-task pure exploration.
    Partial Optimality in Cubic Correlation Clustering. (arXiv:2302.04694v1 [cs.DM])
    The higher-order correlation clustering problem is an expressive model, and recently, local search heuristics have been proposed for several applications. Certifying optimality, however, is NP-hard and practically hampered already by the complexity of the problem statement. Here, we focus on establishing partial optimality conditions for the special case of complete graphs and cubic objective functions. In addition, we define and implement algorithms for testing these conditions and examine their effect numerically, on two datasets.
    On Fairness and Stability: Is Estimator Variance a Friend or a Foe?. (arXiv:2302.04525v1 [cs.LG])
    The error of an estimator can be decomposed into a (statistical) bias term, a variance term, and an irreducible noise term. When we do bias analysis, formally we are asking the question: "how good are the predictions?" The role of bias in the error decomposition is clear: if we trust the labels/targets, then we would want the estimator to have as low bias as possible, in order to minimize error. Fair machine learning is concerned with the question: "Are the predictions equally good for different demographic/social groups?" This has naturally led to a variety of fairness metrics that compare some measure of statistical bias on subsets corresponding to socially privileged and socially disadvantaged groups. In this paper we propose a new family of performance measures based on group-wise parity in variance. We demonstrate when group-wise statistical bias analysis gives an incomplete picture, and what group-wise variance analysis can tell us in settings that differ in the magnitude of statistical bias. We develop and release an open-source library that reconciles uncertainty quantification techniques with fairness analysis, and use it to conduct an extensive empirical analysis of our variance-based fairness measures on standard benchmarks.
    Hubbard-Stratonovich Detector for Simple Trainable MIMO Signal Detection. (arXiv:2302.04461v1 [cs.IT])
    Massive multiple-input multiple-output (MIMO) is a key technology used in fifth-generation wireless communication networks and beyond. Recently, various MIMO signal detectors based on deep learning have been proposed. Especially, deep unfolding (DU), which involves unrolling of an existing iterative algorithm and embedding of trainable parameters, has been applied with remarkable detection performance. Although DU has a lesser number of trainable parameters than conventional deep neural networks, the computational complexities related to training and execution have been problematic because DU-based MIMO detectors usually utilize matrix inversion to improve their detection performance. In this study, we attempted to construct a DU-based trainable MIMO detector with the simplest structure. The proposed detector based on the Hubbard--Stratonovich (HS) transformation and DU is called the trainable HS (THS) detector. It requires only $O(1)$ trainable parameters and its training and execution cost is $O(n^2)$ per iteration, where $n$ is the number of transmitting antennas. Numerical results show that the detection performance of the THS detector is better than that of existing algorithms of the same complexity and close to that of a DU-based detector, which has higher training and execution costs than the THS detector.
    Light and Accurate: Neural Architecture Search via Two Constant Shared Weights Initialisations. (arXiv:2302.04406v1 [cs.LG])
    In recent years, zero-cost proxies are gaining ground in neural architecture search (NAS). These methods allow finding the optimal neural network for a given task faster and with a lesser computational load than conventional NAS methods. Equally important is the fact that they also shed some light on the internal workings of neural architectures. This paper presents a zero-cost metric that highly correlates with the train set accuracy across the NAS-Bench-101, NAS-Bench-201 and NAS-Bench-NLP benchmark datasets. Architectures are initialised with two distinct constant shared weights, one at a time. Then, a fixed random mini-batch of data is passed forward through each initialisation. We observe that the dispersion of the outputs between two initialisations positively correlates with trained accuracy. The correlation further improves when we normalise dispersion by average output magnitude. Our metric, epsilon, does not require gradients computation or labels. It thus unbinds the NAS procedure from training hyperparameters, loss metrics and human-labelled data. Our method is easy to integrate within existing NAS algorithms and takes a fraction of a second to evaluate a single network.
    Adaptive State-Dependent Diffusion for Derivative-Free Optimization. (arXiv:2302.04370v1 [math.OC])
    This paper develops and analyzes a stochastic derivative-free optimization strategy. A key feature is the state-dependent adaptive variance. We prove global convergence in probability with algebraic rate and give the quantitative results in numerical examples. A striking fact is that convergence is achieved without explicit information of the gradient and even without comparing different objective function values as in established methods such as the simplex method and simulated annealing. It can otherwise be compared to annealing with state-dependent temperature.
    Liver Segmentation using Turbolift Learning for CT and Cone-beam C-arm Perfusion Imaging. (arXiv:2207.10167v2 [eess.IV] UPDATED)
    Model-based reconstruction employing the time separation technique (TST) was found to improve dynamic perfusion imaging of the liver using C-arm cone-beam computed tomography (CBCT). To apply TST using prior knowledge extracted from CT perfusion data, the liver should be accurately segmented from the CT scans. Reconstructions of primary and model-based CBCT data need to be segmented for proper visualisation and interpretation of perfusion maps. This research proposes Turbolift learning, which trains a modified version of the multi-scale Attention UNet on different liver segmentation tasks serially, following the order of the trainings CT, CBCT, CBCT TST - making the previous trainings act as pre-training stages for the subsequent ones - addressing the problem of limited number of datasets for training. For the final task of liver segmentation from CBCT TST, the proposed method achieved an overall Dice scores of 0.874$\pm$0.031 and 0.905$\pm$0.007 in 6-fold and 4-fold cross-validation experiments, respectively - securing statistically significant improvements over the model, which was trained only for that task. Experiments revealed that Turbolift not only improves the overall performance of the model but also makes it robust against artefacts originating from the embolisation materials and truncation artefacts. Additionally, in-depth analyses confirmed the order of the segmentation tasks. This paper shows the potential of segmenting the liver from CT, CBCT, and CBCT TST, learning from the available limited training data, which can possibly be used in the future for the visualisation and evaluation of the perfusion maps for the treatment evaluation of liver diseases.
    Learning Dynamical Systems by Leveraging Data from Similar Systems. (arXiv:2302.04344v1 [stat.ML])
    We consider the problem of learning the dynamics of a linear system when one has access to data generated by an auxiliary system that shares similar (but not identical) dynamics, in addition to data from the true system. We use a weighted least squares approach, and provide a finite sample error bound of the learned model as a function of the number of samples and various system parameters from the two systems as well as the weight assigned to the auxiliary data. We show that the auxiliary data can help to reduce the intrinsic system identification error due to noise, at the price of adding a portion of error that is due to the differences between the two system models. We further provide a data-dependent bound that is computable when some prior knowledge about the systems is available. This bound can also be used to determine the weight that should be assigned to the auxiliary data during the model training stage.
    Bag of Tricks for Training Data Extraction from Language Models. (arXiv:2302.04460v1 [cs.CL])
    With the advance of language models, privacy protection is receiving more attention. Training data extraction is therefore of great importance, as it can serve as a potential tool to assess privacy leakage. However, due to the difficulty of this task, most of the existing methods are proof-of-concept and still not effective enough. In this paper, we investigate and benchmark tricks for improving training data extraction using a publicly available dataset. Because most existing extraction methods use a pipeline of generating-then-ranking, i.e., generating text candidates as potential training data and then ranking them based on specific criteria, our research focuses on the tricks for both text generation (e.g., sampling strategy) and text ranking (e.g., token-level criteria). The experimental results show that several previously overlooked tricks can be crucial to the success of training data extraction. Based on the GPT-Neo 1.3B evaluation results, our proposed tricks outperform the baseline by a large margin in most cases, providing a much stronger baseline for future research.
    On Computable Online Learning. (arXiv:2302.04357v1 [cs.LG])
    We initiate a study of computable online (c-online) learning, which we analyze under varying requirements for "optimality" in terms of the mistake bound. Our main contribution is to give a necessary and sufficient condition for optimal c-online learning and show that the Littlestone dimension no longer characterizes the optimal mistake bound of c-online learning. Furthermore, we introduce anytime optimal (a-optimal) online learning, a more natural conceptualization of "optimality" and a generalization of Littlestone's Standard Optimal Algorithm. We show the existence of a computational separation between a-optimal and optimal online learning, proving that a-optimal online learning is computationally more difficult. Finally, we consider online learning with no requirements for optimality, and show, under a weaker notion of computability, that the finiteness of the Littlestone dimension no longer characterizes whether a class is c-online learnable with finite mistake bound. A potential avenue for strengthening this result is suggested by exploring the relationship between c-online and CPAC learning, where we show that c-online learning is as difficult as improper CPAC learning.
    Machine Learning Capability: A standardized metric using case difficulty with applications to individualized deployment of supervised machine learning. (arXiv:2302.04386v1 [cs.LG])
    Model evaluation is a critical component in supervised machine learning classification analyses. Traditional metrics do not currently incorporate case difficulty. This renders the classification results unbenchmarked for generalization. Item Response Theory (IRT) and Computer Adaptive Testing (CAT) with machine learning can benchmark datasets independent of the end-classification results. This provides high levels of case-level information regarding evaluation utility. To showcase, two datasets were used: 1) health-related and 2) physical science. For the health dataset a two-parameter IRT model, and for the physical science dataset a polytonomous IRT model, was used to analyze predictive features and place each case on a difficulty continuum. A CAT approach was used to ascertain the algorithms' performance and applicability to new data. This method provides an efficient way to benchmark data, using only a fraction of the dataset (less than 1%) and 22-60x more computationally efficient than traditional metrics. This novel metric, termed Machine Learning Capability (MLC) has additional benefits as it is unbiased to outcome classification and a standardized way to make model comparisons within and across datasets. MLC provides a metric on the limitation of supervised machine learning algorithms. In situations where the algorithm falls short, other input(s) are required for decision-making.
    Measuring the Privacy Leakage via Graph Reconstruction Attacks on Simplicial Neural Networks (Student Abstract). (arXiv:2302.04373v1 [cs.LG])
    In this paper, we measure the privacy leakage via studying whether graph representations can be inverted to recover the graph used to generate them via graph reconstruction attack (GRA). We propose a GRA that recovers a graph's adjacency matrix from the representations via a graph decoder that minimizes the reconstruction loss between the partial graph and the reconstructed graph. We study three types of representations that are trained on the graph, i.e., representations output from graph convolutional network (GCN), graph attention network (GAT), and our proposed simplicial neural network (SNN) via a higher-order combinatorial Laplacian. Unlike the first two types of representations that only encode pairwise relationships, the third type of representation, i.e., SNN outputs, encodes higher-order interactions (e.g., homological features) between nodes. We find that the SNN outputs reveal the lowest privacy-preserving ability to defend the GRA, followed by those of GATs and GCNs, which indicates the importance of building more private representations with higher-order node information that could defend the potential threats, such as GRAs.
    Neonatal Face and Facial Landmark Detection from Video Recordings. (arXiv:2302.04341v1 [eess.IV])
    This paper explores automated face and facial landmark detection of neonates, which is an important first step in many video-based neonatal health applications, such as vital sign estimation, pain assessment, sleep-wake classification, and jaundice detection. Utilising three publicly available datasets of neonates in the clinical environment, 366 images (258 subjects) and 89 (66 subjects) were annotated for training and testing, respectively. Transfer learning was applied to two YOLO-based models, with input training images augmented with random horizontal flipping, photo-metric colour distortion, translation and scaling during each training epoch. Additionally, the re-orientation of input images and fusion of trained deep learning models was explored. Our proposed model based on YOLOv7Face outperformed existing methods with a mean average precision of 84.8% for face detection, and a normalised mean error of 0.072 for facial landmark detection. Overall, this will assist in the development of fully automated neonatal health assessment algorithms.
    (Re)Defining Expertise in Machine Learning Development. (arXiv:2302.04337v1 [cs.LG])
    Domain experts are often engaged in the development of machine learning systems in a variety of ways, such as in data collection and evaluation of system performance. At the same time, who counts as an 'expert' and what constitutes 'expertise' is not always explicitly defined. In this project, we conduct a systematic literature review of machine learning research to understand 1) the bases on which expertise is defined and recognized and 2) the roles experts play in ML development. Our goal is to produce a high-level taxonomy to highlight limits and opportunities in how experts are identified and engaged in ML research.
    An Investigation into Pre-Training Object-Centric Representations for Reinforcement Learning. (arXiv:2302.04419v1 [cs.LG])
    Unsupervised object-centric representation (OCR) learning has recently drawn attention as a new paradigm of visual representation. This is because of its potential of being an effective pre-training technique for various downstream tasks in terms of sample efficiency, systematic generalization, and reasoning. Although image-based reinforcement learning (RL) is one of the most important and thus frequently mentioned such downstream tasks, the benefit in RL has surprisingly not been investigated systematically thus far. Instead, most of the evaluations have focused on rather indirect metrics such as segmentation quality and object property prediction accuracy. In this paper, we investigate the effectiveness of OCR pre-training for image-based reinforcement learning via empirical experiments. For systematic evaluation, we introduce a simple object-centric visual RL benchmark and conduct experiments to answer questions such as ``Does OCR pre-training improve performance on object-centric tasks?'' and ``Can OCR pre-training help with out-of-distribution generalization?''. Our results provide empirical evidence for valuable insights into the effectiveness of OCR pre-training for RL and the potential limitations of its use in certain scenarios. Additionally, this study also examines the critical aspects of incorporating OCR pre-training in RL, including performance in a visually complex environment and the appropriate pooling layer to aggregate the object representations.
    Importance Sampling Deterministic Annealing for Clustering. (arXiv:2302.04421v1 [stat.ML])
    A current assumption of most clustering methods is that the training data and future data are taken from the same distribution. However, this assumption may not hold in some real-world scenarios. In this paper, we propose an importance sampling based deterministic annealing approach (ISDA) for clustering problems which minimizes the worst case of expected distortions under the constraint of distribution deviation. The distribution deviation constraint can be converted to the constraint over a set of weight distributions centered on the uniform distribution derived from importance sampling. The objective of the proposed approach is to minimize the loss under maximum degradation hence the resulting problem is a constrained minimax optimization problem which can be reformulated to an unconstrained problem using the Lagrange method and be solved by the quasi-newton algorithm. Experiment results on synthetic datasets and a real-world load forecasting problem validate the effectiveness of the proposed ISDA. Furthermore, we show that fuzzy c-means is a special case of ISDA with the logarithmic distortion. This observation sheds a new light on the relationship between fuzzy c-means and deterministic annealing clustering algorithms and provides an interesting physical and information-theoretical interpretation for fuzzy exponent $m$.
    A data variation robust learning model based on importance sampling. (arXiv:2302.04438v1 [stat.ML])
    A crucial assumption underlying the most current theory of machine learning is that the training distribution is identical to the testing distribution. However, this assumption may not hold in some real-world applications. In this paper, we propose an importance sampling based data variation robust loss (ISloss) for learning problems which minimizes the worst case of loss under the constraint of distribution deviation. The distribution deviation constraint can be converted to the constraint over a set of weight distributions centered on the uniform distribution derived from the importance sampling method. Furthermore, we reveal that there is a relationship between ISloss under the logarithmic transformation (LogISloss) and the p-norm loss. We apply the proposed LogISloss to the face verification problem on Racial Faces in the Wild dataset and show that the proposed method is robust under large distribution deviations.
    Private Quantiles Estimation in the Presence of Atoms. (arXiv:2202.08969v2 [stat.ML] UPDATED)
    We consider the differentially private estimation of multiple quantiles (MQ) of a distribution from a dataset, a key building block in modern data analysis. We apply the recent non-smoothed Inverse Sensitivity (IS) mechanism to this specific problem. We establish that the resulting method is closely related to the recently published ad hoc algorithm JointExp. In particular, they share the same computational complexity and a similar efficiency. We prove the statistical consistency of these two algorithms for continuous distributions. Furthermore, we demonstrate both theoretically and empirically that this method suffers from an important lack of performance in the case of peaked distributions, which can degrade up to a potentially catastrophic impact in the presence of atoms. Its smoothed version (i.e. by applying a max kernel to its output density) would solve this problem, but remains an open challenge to implement. As a proxy, we propose a simple and numerically efficient method called Heuristically Smoothed JointExp (HSJointExp), which is endowed with performance guarantees for a broad class of distributions and achieves results that are orders of magnitude better on problematic datasets.
    Learning by Asking for Embodied Visual Navigation and Task Completion. (arXiv:2302.04865v1 [cs.CV])
    The research community has shown increasing interest in designing intelligent embodied agents that can assist humans in accomplishing tasks. Despite recent progress on related vision-language benchmarks, most prior work has focused on building agents that follow instructions rather than endowing agents the ability to ask questions to actively resolve ambiguities arising naturally in embodied environments. To empower embodied agents with the ability to interact with humans, in this work, we propose an Embodied Learning-By-Asking (ELBA) model that learns when and what questions to ask to dynamically acquire additional information for completing the task. We evaluate our model on the TEACH vision-dialog navigation and task completion dataset. Experimental results show that ELBA achieves improved task performance compared to baseline models without question-answering capabilities.
    TPU-MLIR: A Compiler For TPU Using MLIR. (arXiv:2210.15016v2 [cs.PL] UPDATED)
    Multi-level intermediate representations (MLIR) show great promise for reducing the cost of building domain-specific compilers by providing a reusable and extensible compiler infrastructure. This work presents TPU-MLIR, an end-to-end compiler based on MLIR that deploys pre-trained neural network (NN) models to a custom ASIC called a Tensor Processing Unit (TPU). TPU-MLIR defines two new dialects to implement its functionality: 1. a Tensor operation (TOP) dialect that encodes the deep learning graph semantics and independent of the deep learning framework and 2. a TPU kernel dialect to provide a standard kernel computation on TPU. A NN model is translated to the TOP dialect and then lowered to the TPU dialect for different TPUs according to the chip's configuration. We demonstrate how to use the MLIR pass pipeline to organize and perform optimization on TPU to generate machine code. The paper also presents a verification procedure to ensure the correctness of each transform stage.
    A Survey of Knowledge Tracing. (arXiv:2105.15106v3 [cs.CY] UPDATED)
    High-quality education is one of the keys to achieving a more sustainable world. In contrast to traditional face-to-face classroom education, online education enables us to record and research a large amount of learning data for offering intelligent educational services. Knowledge Tracing (KT), which aims to monitor students' evolving knowledge state in learning, is the fundamental task to support these intelligent services. In recent years, an increasing amount of research is focused on this emerging field and considerable progress has been made. In this survey, we categorize existing KT models from a technical perspective and investigate these models in a systematic manner. Subsequently, we review abundant variants of KT models that consider more strict learning assumptions from three phases: before, during, and after learning. To better facilitate researchers and practitioners working on this field, we open source two algorithm libraries: EduData for downloading and preprocessing KT-related datasets, and EduKTM with extensible and unified implementation of existing mainstream KT models. Moreover, the development of KT cannot be separated from its applications, therefore we further present typical KT applications in different scenarios. Finally, we discuss some potential directions for future research in this fast-growing field.
    What are the mechanisms underlying metacognitive learning?. (arXiv:2302.04840v1 [cs.LG])
    How is it that humans can solve complex planning tasks so efficiently despite limited cognitive resources? One reason is its ability to know how to use its limited computational resources to make clever choices. We postulate that people learn this ability from trial and error (metacognitive reinforcement learning). Here, we systematize models of the underlying learning mechanisms and enhance them with more sophisticated additional mechanisms. We fit the resulting 86 models to human data collected in previous experiments where different phenomena of metacognitive learning were demonstrated and performed Bayesian model selection. Our results suggest that a gradient ascent through the space of cognitive strategies can explain most of the observed qualitative phenomena, and is therefore a promising candidate for explaining the mechanism underlying metacognitive learning.
    Online Subset Selection using $\alpha$-Core with no Augmented Regret. (arXiv:2209.14222v3 [cs.LG] UPDATED)
    We revisit the classic problem of optimal subset selection in the online learning set-up. Assume that the set $[N]$ consists of $N$ distinct elements. On the $t$th round, an adversary chooses a monotone reward function $f_t: 2^{[N]} \to \mathbb{R}_+$ that assigns a non-negative reward to each subset of $[N].$ An online policy selects (perhaps randomly) a subset $S_t \subseteq [N]$ consisting of $k$ elements before the reward function $f_t$ for the $t$th round is revealed to the learner. As a consequence of its choice, the policy receives a reward of $f_t(S_t)$ on the $t$th round. Our goal is to design an online sequential subset selection policy to maximize the expected cumulative reward accumulated over a time horizon. In this connection, we propose an online learning policy called SCore (Subset Selection with Core) that solves the problem for a large class of reward functions. The proposed SCore policy is based on a new polyhedral characterization of the reward functions called $\alpha$-Core - a generalization of Core from the cooperative game theory literature. We establish a learning guarantee for the SCore policy in terms of a new performance metric called $\alpha$-augmented regret. In this new metric, the performance of the online policy is compared with an unrestricted offline benchmark that can select all $N$ elements at every round. We show that a large class of reward functions, including submodular, can be efficiently optimized with the SCore policy. We also extend the proposed policy to the optimistic learning set-up where the learner has access to additional untrusted hints regarding the reward functions. Finally, we conclude the paper with a list of open problems.
    CausalEGM: a general causal inference framework by encoding generative modeling. (arXiv:2212.05925v3 [stat.ML] UPDATED)
    Although understanding and characterizing causal effects have become essential in observational studies, it is challenging when the confounders are high-dimensional. In this article, we develop a general framework $\textit{CausalEGM}$ for estimating causal effects by encoding generative modeling, which can be applied in both binary and continuous treatment settings. Under the potential outcome framework with unconfoundedness, we establish a bidirectional transformation between the high-dimensional confounders space and a low-dimensional latent space where the density is known (e.g., multivariate normal distribution). Through this, CausalEGM simultaneously decouples the dependencies of confounders on both treatment and outcome and maps the confounders to the low-dimensional latent space. By conditioning on the low-dimensional latent features, CausalEGM can estimate the causal effect for each individual or the average causal effect within a population. Our theoretical analysis shows that the excess risk for CausalEGM can be bounded through empirical process theory. Under an assumption on encoder-decoder networks, the consistency of the estimate can be guaranteed. In a series of experiments, CausalEGM demonstrates superior performance over existing methods for both binary and continuous treatments. Specifically, we find CausalEGM to be substantially more powerful than competing methods in the presence of large sample sizes and high dimensional confounders. The software of CausalEGM is freely available at https://github.com/SUwonglab/CausalEGM.
    rMultiNet: An R Package For Multilayer Networks Analysis. (arXiv:2302.04437v1 [stat.ML])
    This paper develops an R package rMultiNet to analyze multilayer network data. We provide two general frameworks from recent literature, e.g. mixture multilayer stochastic block model(MMSBM) and mixture multilayer latent space model(MMLSM) to generate the multilayer network. We also provide several methods to reveal the embedding of both nodes and layers followed by further data analysis methods, such as clustering. Three real data examples are processed in the package. The source code of rMultiNet is available at https://github.com/ChenyuzZZ73/rMultiNet.
    Unsupervised Learning of Initialization in Deep Neural Networks via Maximum Mean Discrepancy. (arXiv:2302.04369v1 [cs.LG])
    Despite the recent success of stochastic gradient descent in deep learning, it is often difficult to train a deep neural network with an inappropriate choice of its initial parameters. Even if training is successful, it has been known that the initial parameter configuration may negatively impact generalization. In this paper, we propose an unsupervised algorithm to find good initialization for input data, given that a downstream task is d-way classification. We first notice that each parameter configuration in the parameter space corresponds to one particular downstream task of d-way classification. We then conjecture that the success of learning is directly related to how diverse downstream tasks are in the vicinity of the initial parameters. We thus design an algorithm that encourages small perturbation to the initial parameter configuration leads to a diverse set of d-way classification tasks. In other words, the proposed algorithm ensures a solution to any downstream task to be near the initial parameter configuration. We empirically evaluate the proposed algorithm on various tasks derived from MNIST with a fully connected network. In these experiments, we observe that our algorithm improves average test accuracy across most of these tasks, and that such improvement is greater when the number of labelled examples is small.
    Knowledge is a Region in Weight Space for Fine-tuned Language Models. (arXiv:2302.04863v1 [cs.LG])
    Research on neural networks has largely focused on understanding a single model trained on a single dataset. However, relatively little is known about the relationships between different models, especially those trained or tested on different datasets. We address this by studying how the weight space and underlying loss landscape of different models are interconnected. Specifically, we demonstrate that fine-tuned models that were optimized for high performance, reside in well-defined regions in weight space, and vice versa -- that any model that resides anywhere in those regions also has high performance. Specifically, we show that language models that have been fine-tuned on the same dataset form a tight cluster in the weight space and that models fine-tuned on different datasets from the same underlying task form a looser cluster. Moreover, traversing around the region between the models reaches new models that perform comparably or even better than models found via fine-tuning, even on tasks that the original models were not fine-tuned on. Our findings provide insight into the relationships between models, demonstrating that a model positioned between two similar models can acquire the knowledge of both. We leverage this finding and design a method to pick a better model for efficient fine-tuning. Specifically, we show that starting from the center of the region is as good or better than the pre-trained model in 11 of 12 datasets and improves accuracy by 3.06 on average.
    Heterogeneous Federated Learning using Dynamic Model Pruning and Adaptive Gradient. (arXiv:2106.06921v2 [cs.LG] UPDATED)
    Federated Learning (FL) has emerged as a new paradigm for training machine learning models distributively without sacrificing data security and privacy. Learning models on edge devices such as mobile phones is one of the most common use cases for FL. However, Non-identical independent distributed~(non-IID) data in edge devices easily leads to training failures. Especially, over-parameterized machine learning models can easily be over-fitted on such data, hence, resulting in inefficient federated learning and poor model performance. To overcome the over-fitting issue, we proposed an adaptive dynamic pruning approach for FL, which can dynamically slim the model by dropping out unimportant parameters, hence, preventing over-fittings. Since the machine learning model's parameters react differently for different training samples, adaptive dynamic pruning will evaluate the salience of the model's parameter according to the input training sample, and only retain the salient parameter's gradients when doing back-propagation. We performed comprehensive experiments to evaluate our approach. The results show that our approach by removing the redundant parameters in neural networks can significantly reduce the over-fitting issue and greatly improves the training efficiency. In particular, when training the ResNet-32 on CIFAR-10, our approach reduces the communication cost by 57\%. We further demonstrate the inference acceleration capability of the proposed algorithm. Our approach reduces up to 50\% FLOPs inference of DNNs on edge devices while maintaining the model's quality.
    Improved Robustness and Safety for Pre-Adaptation of Meta Reinforcement Learning with Prior Regularization. (arXiv:2108.08448v2 [cs.LG] UPDATED)
    Meta Reinforcement Learning (Meta-RL) has seen substantial advancements recently. In particular, off-policy methods were developed to improve the data efficiency of Meta-RL techniques. \textit{Probabilistic embeddings for actor-critic RL} (PEARL) is a leading approach for multi-MDP adaptation problems. A major drawback of many existing Meta-RL methods, including PEARL, is that they do not explicitly consider the safety of the prior policy when it is exposed to a new task for the first time. Safety is essential for many real-world applications, including field robots and Autonomous Vehicles (AVs). In this paper, we develop the PEARL PLUS (PEARL$^+$) algorithm, which optimizes the policy for both prior (pre-adaptation) safety and posterior (after-adaptation) performance. Building on top of PEARL, our proposed PEARL$^+$ algorithm introduces a prior regularization term in the reward function and a new Q-network for recovering the state-action value under prior context assumptions, to improve the robustness to task distribution shift and safety of the trained network exposed to a new task for the first time. The performance of PEARL$^+$ is validated by solving three safety-critical problems related to robots and AVs, including two MuJoCo benchmark problems. From the simulation experiments, we show that safety of the prior policy is significantly improved and more robust to task distribution shift compared to PEARL.
    Projection-free Online Exp-concave Optimization. (arXiv:2302.04859v1 [cs.LG])
    We consider the setting of online convex optimization (OCO) with \textit{exp-concave} losses. The best regret bound known for this setting is $O(n\log{}T)$, where $n$ is the dimension and $T$ is the number of prediction rounds (treating all other quantities as constants and assuming $T$ is sufficiently large), and is attainable via the well-known Online Newton Step algorithm (ONS). However, ONS requires on each iteration to compute a projection (according to some matrix-induced norm) onto the feasible convex set, which is often computationally prohibitive in high-dimensional settings and when the feasible set admits a non-trivial structure. In this work we consider projection-free online algorithms for exp-concave and smooth losses, where by projection-free we refer to algorithms that rely only on the availability of a linear optimization oracle (LOO) for the feasible set, which in many applications of interest admits much more efficient implementations than a projection oracle. We present an LOO-based ONS-style algorithm, which using overall $O(T)$ calls to a LOO, guarantees in worst case regret bounded by $\widetilde{O}(n^{2/3}T^{2/3})$ (ignoring all quantities except for $n,T$). However, our algorithm is most interesting in an important and plausible low-dimensional data scenario: if the gradients (approximately) span a subspace of dimension at most $\rho$, $\rho << n$, the regret bound improves to $\widetilde{O}(\rho^{2/3}T^{2/3})$, and by applying standard deterministic sketching techniques, both the space and average additional per-iteration runtime requirements are only $O(\rho{}n)$ (instead of $O(n^2)$). This improves upon recently proposed LOO-based algorithms for OCO which, while having the same state-of-the-art dependence on the horizon $T$, suffer from regret/oracle complexity that scales with $\sqrt{n}$ or worse.
    A Near-Optimal Algorithm for Safe Reinforcement Learning Under Instantaneous Hard Constraints. (arXiv:2302.04375v1 [cs.LG])
    In many applications of Reinforcement Learning (RL), it is critically important that the algorithm performs safely, such that instantaneous hard constraints are satisfied at each step, and unsafe states and actions are avoided. However, existing algorithms for ''safe'' RL are often designed under constraints that either require expected cumulative costs to be bounded or assume all states are safe. Thus, such algorithms could violate instantaneous hard constraints and traverse unsafe states (and actions) in practice. Therefore, in this paper, we develop the first near-optimal safe RL algorithm for episodic Markov Decision Processes with unsafe states and actions under instantaneous hard constraints and the linear mixture model. It not only achieves a regret $\tilde{O}(\frac{d H^3 \sqrt{dK}}{\Delta_c})$ that tightly matches the state-of-the-art regret in the setting with only unsafe actions and nearly matches that in the unconstrained setting, but is also safe at each step, where $d$ is the feature-mapping dimension, $K$ is the number of episodes, $H$ is the number of steps in each episode, and $\Delta_c$ is a safety-related parameter. We also provide a lower bound $\tilde{\Omega}(\max\{dH \sqrt{K}, \frac{H}{\Delta_c^2}\})$, which indicates that the dependency on $\Delta_c$ is necessary. Further, both our algorithm design and regret analysis involve several novel ideas, which may be of independent interest.
    Lithium Metal Battery Quality Control via Transformer-CNN Segmentation. (arXiv:2302.04824v1 [cs.CV])
    Lithium metal battery (LMB) has the potential to be the next-generation battery system because of their high theoretical energy density. However, defects known as dendrites are formed by heterogeneous lithium (Li) plating, which hinder the development and utilization of LMBs. Non-destructive techniques to observe the dendrite morphology often use computerized X-ray tomography (XCT) imaging to provide cross-sectional views. To retrieve three-dimensional structures inside a battery, image segmentation becomes essential to quantitatively analyze XCT images. This work proposes a new binary semantic segmentation approach using a transformer-based neural network (T-Net) model capable of segmenting out dendrites from XCT data. In addition, we compare the performance of the proposed T-Net with three other algorithms, such as U-Net, Y-Net, and E-Net, consisting of an Ensemble Network model for XCT analysis. Our results show the advantages of using T-Net in terms of object metrics, such as mean Intersection over Union (mIoU) and mean Dice Similarity Coefficient (mDSC) as well as qualitatively through several comparative visualizations.
    Better Diffusion Models Further Improve Adversarial Training. (arXiv:2302.04638v1 [cs.CV])
    It has been recognized that the data generated by the denoising diffusion probabilistic model (DDPM) improves adversarial training. After two years of rapid development in diffusion models, a question naturally arises: can better diffusion models further improve adversarial training? This paper gives an affirmative answer by employing the most recent diffusion model which has higher efficiency ($\sim 20$ sampling steps) and image quality (lower FID score) compared with DDPM. Our adversarially trained models achieve state-of-the-art performance on RobustBench using only generated data (no external datasets). Under the $\ell_\infty$-norm threat model with $\epsilon=8/255$, our models achieve $70.69\%$ and $42.67\%$ robust accuracy on CIFAR-10 and CIFAR-100, respectively, i.e. improving upon previous state-of-the-art models by $+4.58\%$ and $+8.03\%$. Under the $\ell_2$-norm threat model with $\epsilon=128/255$, our models achieve $84.86\%$ on CIFAR-10 ($+4.44\%$). These results also beat previous works that use external data. Our code is available at https://github.com/wzekai99/DM-Improves-AT.
    Symbolic Metamodels for Interpreting Black-boxes Using Primitive Functions. (arXiv:2302.04791v1 [cs.LG])
    One approach for interpreting black-box machine learning models is to find a global approximation of the model using simple interpretable functions, which is called a metamodel (a model of the model). Approximating the black-box with a metamodel can be used to 1) estimate instance-wise feature importance; 2) understand the functional form of the model; 3) analyze feature interactions. In this work, we propose a new method for finding interpretable metamodels. Our approach utilizes Kolmogorov superposition theorem, which expresses multivariate functions as a composition of univariate functions (our primitive parameterized functions). This composition can be represented in the form of a tree. Inspired by symbolic regression, we use a modified form of genetic programming to search over different tree configurations. Gradient descent (GD) is used to optimize the parameters of a given configuration. Our method is a novel memetic algorithm that uses GD not only for training numerical constants but also for the training of building blocks. Using several experiments, we show that our method outperforms recent metamodeling approaches suggested for interpreting black-boxes.
    Learning to Select Pivotal Samples for Meta Re-weighting. (arXiv:2302.04418v1 [cs.LG])
    Sample re-weighting strategies provide a promising mechanism to deal with imperfect training data in machine learning, such as noisily labeled or class-imbalanced data. One such strategy involves formulating a bi-level optimization problem called the meta re-weighting problem, whose goal is to optimize performance on a small set of perfect pivotal samples, called meta samples. Many approaches have been proposed to efficiently solve this problem. However, all of them assume that a perfect meta sample set is already provided while we observe that the selections of meta sample set is performance critical. In this paper, we study how to learn to identify such a meta sample set from a large, imperfect training set, that is subsequently cleaned and used to optimize performance in the meta re-weighting setting. We propose a learning framework which reduces the meta samples selection problem to a weighted K-means clustering problem through rigorously theoretical analysis. We propose two clustering methods within our learning framework, Representation-based clustering method (RBC) and Gradient-based clustering method (GBC), for balancing performance and computational efficiency. Empirical studies demonstrate the performance advantage of our methods over various baseline methods.
    Complex Network for Complex Problems: A comparative study of CNN and Complex-valued CNN. (arXiv:2302.04584v1 [cs.CV])
    Neural networks, especially convolutional neural networks (CNN), are one of the most common tools these days used in computer vision. Most of these networks work with real-valued data using real-valued features. Complex-valued convolutional neural networks (CV-CNN) can preserve the algebraic structure of complex-valued input data and have the potential to learn more complex relationships between the input and the ground-truth. Although some comparisons of CNNs and CV-CNNs for different tasks have been performed in the past, a large-scale investigation comparing different models operating on different tasks has not been conducted. Furthermore, because complex features contain both real and imaginary components, CV-CNNs have double the number of trainable parameters as real-valued CNNs in terms of the actual number of trainable parameters. Whether or not the improvements in performance with CV-CNN observed in the past have been because of the complex features or just because of having double the number of trainable parameters has not yet been explored. This paper presents a comparative study of CNN, CNNx2 (CNN with double the number of trainable parameters as the CNN), and CV-CNN. The experiments were performed using seven models for two different tasks - brain tumour classification and segmentation in brain MRIs. The results have revealed that the CV-CNN models outperformed the CNN and CNNx2 models.
    Adversarial Example Does Good: Preventing Painting Imitation from Diffusion Models via Adversarial Examples. (arXiv:2302.04578v1 [cs.CV])
    Diffusion Models (DMs) achieve state-of-the-art performance in generative tasks, boosting a wave in AI for Art. Despite the success of commercialization, DMs meanwhile provide tools for copyright violations, where infringers benefit from illegally using paintings created by human artists to train DMs and generate novel paintings in a similar style. In this paper, we show that it is possible to create an image $x'$ that is similar to an image $x$ for human vision but unrecognizable for DMs. We build a framework to define and evaluate this adversarial example for diffusion models. Based on the framework, we further propose AdvDM, an algorithm to generate adversarial examples for DMs. By optimizing upon different latent variables sampled from the reverse process of DMs, AdvDM conducts a Monte-Carlo estimation of adversarial examples for DMs. Extensive experiments show that the estimated adversarial examples can effectively hinder DMs from extracting their features. Our method can be a powerful tool for human artists to protect their copyright against infringers with DM-based AI-for-Art applications.
    Q-Diffusion: Quantizing Diffusion Models. (arXiv:2302.04304v1 [cs.CV])
    Diffusion models have recently achieved great success in synthesizing diverse and high-fidelity images. However, sampling speed and memory constraints remain a major barrier to the practical adoption of diffusion models as the generation process for these models can be slow due to the need for iterative noise estimation using complex neural networks. We propose a solution to this problem by compressing the noise estimation network to accelerate the generation process using post-training quantization (PTQ). While existing PTQ approaches have not been able to effectively deal with the changing output distributions of noise estimation networks in diffusion models over multiple time steps, we are able to formulate a PTQ method that is specifically designed to handle the unique multi-timestep structure of diffusion models with a data calibration scheme using data sampled from different time steps. Experimental results show that our proposed method is able to directly quantize full-precision diffusion models into 8-bit or 4-bit models while maintaining comparable performance in a training-free manner, achieving a FID change of at most 1.88. Our approach can also be applied to text-guided image generation, and for the first time we can run stable diffusion in 4-bit weights without losing much perceptual quality, as shown in Figure 5 and Figure 9.
    Scalable Task-Driven Robotic Swarm Control via Collision Avoidance and Learning Mean-Field Control. (arXiv:2209.07420v3 [cs.RO] UPDATED)
    In recent years, reinforcement learning and its multi-agent analogue have achieved great success in solving various complex control problems. However, multi-agent reinforcement learning remains challenging both in its theoretical analysis and empirical design of algorithms, especially for large swarms of embodied robotic agents where a definitive toolchain remains part of active research. We use emerging state-of-the-art mean-field control techniques in order to convert many-agent swarm control into more classical single-agent control of distributions. This allows profiting from advances in single-agent reinforcement learning at the cost of assuming weak interaction between agents. However, the mean-field model is violated by the nature of real systems with embodied, physically colliding agents. Thus, we combine collision avoidance and learning of mean-field control into a unified framework for tractably designing intelligent robotic swarm behavior. On the theoretical side, we provide novel approximation guarantees for general mean-field control both in continuous spaces and with collision avoidance. On the practical side, we show that our approach outperforms multi-agent reinforcement learning and allows for decentralized open-loop application while avoiding collisions, both in simulation and real UAV swarms. Overall, we propose a framework for the design of swarm behavior that is both mathematically well-founded and practically useful, enabling the solution of otherwise intractable swarm problems.
    Offsite-Tuning: Transfer Learning without Full Model. (arXiv:2302.04870v1 [cs.CL])
    Transfer learning is important for foundation models to adapt to downstream tasks. However, many foundation models are proprietary, so users must share their data with model owners to fine-tune the models, which is costly and raise privacy concerns. Moreover, fine-tuning large foundation models is computation-intensive and impractical for most downstream users. In this paper, we propose Offsite-Tuning, a privacy-preserving and efficient transfer learning framework that can adapt billion-parameter foundation models to downstream data without access to the full model. In offsite-tuning, the model owner sends a light-weight adapter and a lossy compressed emulator to the data owner, who then fine-tunes the adapter on the downstream data with the emulator's assistance. The fine-tuned adapter is then returned to the model owner, who plugs it into the full model to create an adapted foundation model. Offsite-tuning preserves both parties' privacy and is computationally more efficient than the existing fine-tuning methods that require access to the full model weights. We demonstrate the effectiveness of offsite-tuning on various large language and vision foundation models. Offsite-tuning can achieve comparable accuracy as full model fine-tuning while being privacy-preserving and efficient, achieving 6.5x speedup and 5.6x memory reduction. Code is available at https://github.com/mit-han-lab/offsite-tuning.
    Near-Optimal Adversarial Reinforcement Learning with Switching Costs. (arXiv:2302.04374v1 [cs.LG])
    Switching costs, which capture the costs for changing policies, are regarded as a critical metric in reinforcement learning (RL), in addition to the standard metric of losses (or rewards). However, existing studies on switching costs (with a coefficient $\beta$ that is strictly positive and is independent of $T$) have mainly focused on static RL, where the loss distribution is assumed to be fixed during the learning process, and thus practical scenarios where the loss distribution could be non-stationary or even adversarial are not considered. While adversarial RL better models this type of practical scenarios, an open problem remains: how to develop a provably efficient algorithm for adversarial RL with switching costs? This paper makes the first effort towards solving this problem. First, we provide a regret lower-bound that shows that the regret of any algorithm must be larger than $\tilde{\Omega}( ( H S A )^{1/3} T^{2/3} )$, where $T$, $S$, $A$ and $H$ are the number of episodes, states, actions and layers in each episode, respectively. Our lower bound indicates that, due to the fundamental challenge of switching costs in adversarial RL, the best achieved regret (whose dependency on $T$ is $\tilde{O}(\sqrt{T})$) in static RL with switching costs (as well as adversarial RL without switching costs) is no longer achievable. Moreover, we propose two novel switching-reduced algorithms with regrets that match our lower bound when the transition function is known, and match our lower bound within a small factor of $\tilde{O}( H^{1/3} )$ when the transition function is unknown. Our regret analysis demonstrates the near-optimal performance of them.
    Mind the Gap: Measuring Generalization Performance Across Multiple Objectives. (arXiv:2212.04183v2 [cs.LG] UPDATED)
    Modern machine learning models are often constructed taking into account multiple objectives, e.g., minimizing inference time while also maximizing accuracy. Multi-objective hyperparameter optimization (MHPO) algorithms return such candidate models, and the approximation of the Pareto front is used to assess their performance. In practice, we also want to measure generalization when moving from the validation to the test set. However, some of the models might no longer be Pareto-optimal which makes it unclear how to quantify the performance of the MHPO method when evaluated on the test set. To resolve this, we provide a novel evaluation protocol that allows measuring the generalization performance of MHPO methods and studying its capabilities for comparing two optimization experiments.
    Performative Recommendation: Diversifying Content via Strategic Incentives. (arXiv:2302.04336v1 [cs.LG])
    The primary goal in recommendation is to suggest relevant content to users, but optimizing for accuracy often results in recommendations that lack diversity. To remedy this, conventional approaches such as re-ranking improve diversity by presenting more diverse items. Here we argue that to promote inherent and prolonged diversity, the system must encourage its creation. Towards this, we harness the performative nature of recommendation, and show how learning can incentivize strategic content creators to create diverse content. Our approach relies on a novel form of regularization that anticipates strategic changes to content, and penalizes for content homogeneity. We provide analytic and empirical results that demonstrate when and how diversity can be incentivized, and experimentally demonstrate the utility of our approach on synthetic and semi-synthetic data.  ( 2 min )
  • Open

    CausalEGM: a general causal inference framework by encoding generative modeling. (arXiv:2212.05925v3 [stat.ML] UPDATED)
    Although understanding and characterizing causal effects have become essential in observational studies, it is challenging when the confounders are high-dimensional. In this article, we develop a general framework $\textit{CausalEGM}$ for estimating causal effects by encoding generative modeling, which can be applied in both binary and continuous treatment settings. Under the potential outcome framework with unconfoundedness, we establish a bidirectional transformation between the high-dimensional confounders space and a low-dimensional latent space where the density is known (e.g., multivariate normal distribution). Through this, CausalEGM simultaneously decouples the dependencies of confounders on both treatment and outcome and maps the confounders to the low-dimensional latent space. By conditioning on the low-dimensional latent features, CausalEGM can estimate the causal effect for each individual or the average causal effect within a population. Our theoretical analysis shows that the excess risk for CausalEGM can be bounded through empirical process theory. Under an assumption on encoder-decoder networks, the consistency of the estimate can be guaranteed. In a series of experiments, CausalEGM demonstrates superior performance over existing methods for both binary and continuous treatments. Specifically, we find CausalEGM to be substantially more powerful than competing methods in the presence of large sample sizes and high dimensional confounders. The software of CausalEGM is freely available at https://github.com/SUwonglab/CausalEGM.
    rMultiNet: An R Package For Multilayer Networks Analysis. (arXiv:2302.04437v1 [stat.ML])
    This paper develops an R package rMultiNet to analyze multilayer network data. We provide two general frameworks from recent literature, e.g. mixture multilayer stochastic block model(MMSBM) and mixture multilayer latent space model(MMLSM) to generate the multilayer network. We also provide several methods to reveal the embedding of both nodes and layers followed by further data analysis methods, such as clustering. Three real data examples are processed in the package. The source code of rMultiNet is available at https://github.com/ChenyuzZZ73/rMultiNet.
    Flaky Performances when Pretraining on Relational Databases. (arXiv:2211.05213v1 [cs.LG] CROSS LISTED)
    We explore the downstream task performances for graph neural network (GNN) self-supervised learning (SSL) methods trained on subgraphs extracted from relational databases (RDBs). Intuitively, this joint use of SSL and GNNs should allow to leverage more of the available data, which could translate to better results. However, we found that naively porting contrastive SSL techniques can cause ``negative transfer'': linear evaluation on fixed representations from a pretrained model performs worse than on representations from the randomly-initialized model. Based on the conjecture that contrastive SSL conflicts with the message passing layers of the GNN, we propose InfoNode: a contrastive loss aiming to maximize the mutual information between a node's initial- and final-layer representation. The primary empirical results support our conjecture and the effectiveness of InfoNode.
    Trading Information between Latents in Hierarchical Variational Autoencoders. (arXiv:2302.04855v1 [stat.ML])
    Variational Autoencoders (VAEs) were originally motivated (Kingma & Welling, 2014) as probabilistic generative models in which one performs approximate Bayesian inference. The proposal of $\beta$-VAEs (Higgins et al., 2017) breaks this interpretation and generalizes VAEs to application domains beyond generative modeling (e.g., representation learning, clustering, or lossy data compression) by introducing an objective function that allows practitioners to trade off between the information content ("bit rate") of the latent representation and the distortion of reconstructed data (Alemi et al., 2018). In this paper, we reconsider this rate/distortion trade-off in the context of hierarchical VAEs, i.e., VAEs with more than one layer of latent variables. We identify a general class of inference models for which one can split the rate into contributions from each layer, which can then be tuned independently. We derive theoretical bounds on the performance of downstream tasks as functions of the individual layers' rates and verify our theoretical findings in large-scale experiments. Our results provide guidance for practitioners on which region in rate-space to target for a given application.
    Random survival forests with multivariate longitudinal endogenous covariates. (arXiv:2208.05801v2 [stat.ML] UPDATED)
    Predicting the individual risk of a clinical event using the complete patient history is still a major challenge for personalized medicine. Among the methods developed to compute individual dynamic predictions, the joint models have the assets of using all the available information while accounting for dropout. However, they are restricted to a very small number of longitudinal predictors. Our objective was to propose an innovative alternative solution to predict an event probability using a possibly large number of longitudinal predictors. We developed DynForest, an extension of competing-risk random survival forests that handles endogenous longitudinal predictors. At each node of the tree, the time-dependent predictors are translated into time-fixed features (using mixed models) to be used as candidates for splitting the subjects into two subgroups. The individual event probability is estimated in each tree by the Aalen-Johansen estimator of the leaf in which the subject is classified according to his/her history of predictors. The final individual prediction is given by the average of the tree-specific individual event probabilities. We carried out a simulation study to demonstrate the performances of DynForest both in a small dimensional context (in comparison with joint models) and in a large dimensional context (in comparison with a regression calibration method that ignores informative dropout). We also applied DynForest to (i) predict the individual probability of dementia in the elderly according to repeated measures of cognitive, functional, vascular and neuro-degeneration markers, and (ii) quantify the importance of each type of markers for the prediction of dementia. Implemented in the R package DynForest, our methodology provides a novel and appropriate solution for the prediction of events from any number of longitudinal endogenous predictors.
    Fully Bayesian Autoencoders with Latent Sparse Gaussian Processes. (arXiv:2302.04534v1 [cs.LG])
    Autoencoders and their variants are among the most widely used models in representation learning and generative modeling. However, autoencoder-based models usually assume that the learned representations are i.i.d. and fail to capture the correlations between the data samples. To address this issue, we propose a novel Sparse Gaussian Process Bayesian Autoencoder (SGPBAE) model in which we impose fully Bayesian sparse Gaussian Process priors on the latent space of a Bayesian Autoencoder. We perform posterior estimation for this model via stochastic gradient Hamiltonian Monte Carlo. We evaluate our approach qualitatively and quantitatively on a wide range of representation learning and generative modeling tasks and show that our approach consistently outperforms multiple alternatives relying on Variational Autoencoders.
    A Benchmark on Uncertainty Quantification for Deep Learning Prognostics. (arXiv:2302.04730v1 [cs.LG])
    Reliable uncertainty quantification on RUL prediction is crucial for informative decision-making in predictive maintenance. In this context, we assess some of the latest developments in the field of uncertainty quantification for prognostics deep learning. This includes the state-of-the-art variational inference algorithms for Bayesian neural networks (BNN) as well as popular alternatives such as Monte Carlo Dropout (MCD), deep ensembles (DE) and heteroscedastic neural networks (HNN). All the inference techniques share the same inception deep learning architecture as a functional model. We performed hyperparameter search to optimize the main variational and learning parameters of the algorithms. The performance of the methods is evaluated on a subset of the large NASA NCMAPSS dataset for aircraft engines. The assessment includes RUL prediction accuracy, the quality of predictive uncertainty, and the possibility to break down the total predictive uncertainty into its aleatoric and epistemic parts. The results show no method clearly outperforms the others in all the situations. Although all methods are close in terms of accuracy, we find differences in the way they estimate uncertainty. Thus, DE and MCD generally provide more conservative predictive uncertainty than BNN. Surprisingly, HNN can achieve strong results without the added training complexity and extra parameters of the BNN. For tasks like active learning where a separation of epistemic and aleatoric uncertainty is required, radial BNN and MCD seem the best options.
    How degenerate is the parametrization of neural networks with the ReLU activation function?. (arXiv:1905.09803v3 [cs.LG] UPDATED)
    Neural network training is usually accomplished by solving a non-convex optimization problem using stochastic gradient descent. Although one optimizes over the networks parameters, the main loss function generally only depends on the realization of the neural network, i.e. the function it computes. Studying the optimization problem over the space of realizations opens up new ways to understand neural network training. In particular, usual loss functions like mean squared error and categorical cross entropy are convex on spaces of neural network realizations, which themselves are non-convex. Approximation capabilities of neural networks can be used to deal with the latter non-convexity, which allows us to establish that for sufficiently large networks local minima of a regularized optimization problem on the realization space are almost optimal. Note, however, that each realization has many different, possibly degenerate, parametrizations. In particular, a local minimum in the parametrization space needs not correspond to a local minimum in the realization space. To establish such a connection, inverse stability of the realization map is required, meaning that proximity of realizations must imply proximity of corresponding parametrizations. We present pathologies which prevent inverse stability in general, and, for shallow networks, proceed to establish a restricted space of parametrizations on which we have inverse stability w.r.t. to a Sobolev norm. Furthermore, we show that by optimizing over such restricted sets, it is still possible to learn any function which can be learned by optimization over unrestricted sets.
    Continual Causal Effect Estimation: Challenges and Opportunities. (arXiv:2301.01026v3 [cs.LG] UPDATED)
    A further understanding of cause and effect within observational data is critical across many domains, such as economics, health care, public policy, web mining, online advertising, and marketing campaigns. Although significant advances have been made to overcome the challenges in causal effect estimation with observational data, such as missing counterfactual outcomes and selection bias between treatment and control groups, the existing methods mainly focus on source-specific and stationary observational data. Such learning strategies assume that all observational data are already available during the training phase and from only one source. This practical concern of accessibility is ubiquitous in various academic and industrial applications. That's what it boiled down to: in the era of big data, we face new challenges in causal inference with observational data, i.e., the extensibility for incrementally available observational data, the adaptability for extra domain adaptation problem except for the imbalance between treatment and control groups, and the accessibility for an enormous amount of data. In this position paper, we formally define the problem of continual treatment effect estimation, describe its research challenges, and then present possible solutions to this problem. Moreover, we will discuss future research directions on this topic.
    On Sampling with Approximate Transport Maps. (arXiv:2302.04763v1 [stat.ML])
    Transport maps can ease the sampling of distributions with non-trivial geometries by transforming them into distributions that are easier to handle. The potential of this approach has risen with the development of Normalizing Flows (NF) which are maps parameterized with deep neural networks trained to push a reference distribution towards a target. NF-enhanced samplers recently proposed blend (Markov chain) Monte Carlo methods with either (i) proposal draws from the flow or (ii) a flow-based reparametrization. In both cases, the quality of the learned transport conditions performance. The present work clarifies for the first time the relative strengths and weaknesses of these two approaches. Our study concludes that multimodal targets can reliability be handled with flow-based proposals up to moderately high dimensions. In contrast, methods relying on reparametrization struggle with multimodality but are more robust otherwise in high-dimensional settings and under poor training. To further illustrate the influence of target-proposal adequacy, we also derive a new quantitative bound for the mixing time of the Independent Metropolis-Hastings sampler.
    Domain Generalization by Functional Regression. (arXiv:2302.04724v1 [cs.LG])
    The problem of domain generalization is to learn, given data from different source distributions, a model that can be expected to generalize well on new target distributions which are only seen through unlabeled samples. In this paper, we study domain generalization as a problem of functional regression. Our concept leads to a new algorithm for learning a linear operator from marginal distributions of inputs to the corresponding conditional distributions of outputs given inputs. Our algorithm allows a source distribution-dependent construction of reproducing kernel Hilbert spaces for prediction, and, satisfies finite sample error bounds for the idealized risk. Numerical implementations and source code are available.
    Near-Optimal Adversarial Reinforcement Learning with Switching Costs. (arXiv:2302.04374v1 [cs.LG])
    Switching costs, which capture the costs for changing policies, are regarded as a critical metric in reinforcement learning (RL), in addition to the standard metric of losses (or rewards). However, existing studies on switching costs (with a coefficient $\beta$ that is strictly positive and is independent of $T$) have mainly focused on static RL, where the loss distribution is assumed to be fixed during the learning process, and thus practical scenarios where the loss distribution could be non-stationary or even adversarial are not considered. While adversarial RL better models this type of practical scenarios, an open problem remains: how to develop a provably efficient algorithm for adversarial RL with switching costs? This paper makes the first effort towards solving this problem. First, we provide a regret lower-bound that shows that the regret of any algorithm must be larger than $\tilde{\Omega}( ( H S A )^{1/3} T^{2/3} )$, where $T$, $S$, $A$ and $H$ are the number of episodes, states, actions and layers in each episode, respectively. Our lower bound indicates that, due to the fundamental challenge of switching costs in adversarial RL, the best achieved regret (whose dependency on $T$ is $\tilde{O}(\sqrt{T})$) in static RL with switching costs (as well as adversarial RL without switching costs) is no longer achievable. Moreover, we propose two novel switching-reduced algorithms with regrets that match our lower bound when the transition function is known, and match our lower bound within a small factor of $\tilde{O}( H^{1/3} )$ when the transition function is unknown. Our regret analysis demonstrates the near-optimal performance of them.
    Geometry-Complete Diffusion for 3D Molecule Generation. (arXiv:2302.04313v1 [cs.LG])
    Denoising diffusion probabilistic models (DDPMs) have recently taken the field of generative modeling by storm, pioneering new state-of-the-art results in disciplines such as computer vision and computational biology for diverse tasks ranging from text-guided image generation to structure-guided protein design. Along this latter line of research, methods such as those of Hoogeboom et al. 2022 have been proposed for unconditionally generating 3D molecules using equivariant graph neural networks (GNNs) within a DDPM framework. Toward this end, we propose GCDM, a geometry-complete diffusion model that achieves new state-of-the-art results for 3D molecule diffusion generation by leveraging the representation learning strengths offered by GNNs that perform geometry-complete message-passing. Our results with GCDM also offer preliminary insights into how physical inductive biases impact the generative dynamics of molecular DDPMs. The source code, data, and instructions to train new models or reproduce our results are freely available at https://github.com/BioinfoMachineLearning/bio-diffusion.
    Efficient displacement convex optimization with particle gradient descent. (arXiv:2302.04753v1 [cs.LG])
    Particle gradient descent, which uses particles to represent a probability measure and performs gradient descent on particles in parallel, is widely used to optimize functions of probability measures. This paper considers particle gradient descent with a finite number of particles and establishes its theoretical guarantees to optimize functions that are \emph{displacement convex} in measures. Concretely, for Lipschitz displacement convex functions defined on probability over $\mathbb{R}^d$, we prove that $O(1/\epsilon^2)$ particles and $O(d/\epsilon^4)$ computations are sufficient to find the $\epsilon$-optimal solutions. We further provide improved complexity bounds for optimizing smooth displacement convex functions. We demonstrate the application of our results for function approximation with specific neural architectures with two-dimensional inputs.
    The Sample Complexity of Approximate Rejection Sampling with Applications to Smoothed Online Learning. (arXiv:2302.04658v1 [stat.ML])
    Suppose we are given access to $n$ independent samples from distribution $\mu$ and we wish to output one of them with the goal of making the output distributed as close as possible to a target distribution $\nu$. In this work we show that the optimal total variation distance as a function of $n$ is given by $\tilde\Theta(\frac{D}{f'(n)})$ over the class of all pairs $\nu,\mu$ with a bounded $f$-divergence $D_f(\nu\|\mu)\leq D$. Previously, this question was studied only for the case when the Radon-Nikodym derivative of $\nu$ with respect to $\mu$ is uniformly bounded. We then consider an application in the seemingly very different field of smoothed online learning, where we show that recent results on the minimax regret and the regret of oracle-efficient algorithms still hold even under relaxed constraints on the adversary (to have bounded $f$-divergence, as opposed to bounded Radon-Nikodym derivative). Finally, we also study efficacy of importance sampling for mean estimates uniform over a function class and compare importance sampling with rejection sampling.
    A data variation robust learning model based on importance sampling. (arXiv:2302.04438v1 [stat.ML])
    A crucial assumption underlying the most current theory of machine learning is that the training distribution is identical to the testing distribution. However, this assumption may not hold in some real-world applications. In this paper, we propose an importance sampling based data variation robust loss (ISloss) for learning problems which minimizes the worst case of loss under the constraint of distribution deviation. The distribution deviation constraint can be converted to the constraint over a set of weight distributions centered on the uniform distribution derived from the importance sampling method. Furthermore, we reveal that there is a relationship between ISloss under the logarithmic transformation (LogISloss) and the p-norm loss. We apply the proposed LogISloss to the face verification problem on Racial Faces in the Wild dataset and show that the proposed method is robust under large distribution deviations.
    Sparse Random Networks for Communication-Efficient Federated Learning. (arXiv:2209.15328v2 [cs.LG] UPDATED)
    One main challenge in federated learning is the large communication cost of exchanging weight updates from clients to the server at each round. While prior work has made great progress in compressing the weight updates through gradient compression methods, we propose a radically different approach that does not update the weights at all. Instead, our method freezes the weights at their initial \emph{random} values and learns how to sparsify the random network for the best performance. To this end, the clients collaborate in training a \emph{stochastic} binary mask to find the optimal sparse random network within the original one. At the end of the training, the final model is a sparse network with random weights -- or a subnetwork inside the dense random network. We show improvements in accuracy, communication (less than $1$ bit per parameter (bpp)), convergence speed, and final model size (less than $1$ bpp) over relevant baselines on MNIST, EMNIST, CIFAR-10, and CIFAR-100 datasets, in the low bitrate regime under various system configurations.
    Efficient Planning in Combinatorial Action Spaces with Applications to Cooperative Multi-Agent Reinforcement Learning. (arXiv:2302.04376v1 [cs.LG])
    A practical challenge in reinforcement learning are combinatorial action spaces that make planning computationally demanding. For example, in cooperative multi-agent reinforcement learning, a potentially large number of agents jointly optimize a global reward function, which leads to a combinatorial blow-up in the action space by the number of agents. As a minimal requirement, we assume access to an argmax oracle that allows to efficiently compute the greedy policy for any Q-function in the model class. Building on recent work in planning with local access to a simulator and linear function approximation, we propose efficient algorithms for this setting that lead to polynomial compute and query complexity in all relevant problem parameters. For the special case where the feature decomposition is additive, we further improve the bounds and extend the results to the kernelized setting with an efficient algorithm.
    Local Lipschitz Bounds of Deep Neural Networks. (arXiv:2004.13135v3 [stat.ML] UPDATED)
    The Lipschitz constant is an important quantity that arises in analysing the convergence of gradient-based optimization methods. It is generally unclear how to estimate the Lipschitz constant of a complex model. Thus, this paper studies an important problem that may be useful to the broader area of non-convex optimization. The main result provides a local upper bound on the Lipschitz constants of a multi-layer feed-forward neural network and its gradient. Moreover, lower bounds are established as well, which are used to show that it is impossible to derive global upper bounds for the Lipschitz constants. In contrast to previous works, we compute the Lipschitz constants with respect to the network parameters and not with respect to the inputs. These constants are needed for the theoretical description of many step size schedulers of gradient based optimization schemes and their convergence analysis. The idea is both simple and effective. The results are extended to a generalization of neural networks, continuously deep neural networks, which are described by controlled ODEs.
    Conformal Off-policy Prediction. (arXiv:2206.06711v2 [stat.ML] UPDATED)
    Off-policy evaluation is critical in a number of applications where new policies need to be evaluated offline before online deployment. Most existing methods focus on the expected return, define the target parameter through averaging and provide a point estimator only. In this paper, we develop a novel procedure to produce reliable interval estimators for a target policy's return starting from any initial state. Our proposal accounts for the variability of the return around its expectation, focuses on the individual effect and offers valid uncertainty quantification. Our main idea lies in designing a pseudo policy that generates subsamples as if they were sampled from the target policy so that existing conformal prediction algorithms are applicable to prediction interval construction. Our methods are justified by theories, synthetic data and real data from short-video platforms.
    A Text-guided Protein Design Framework. (arXiv:2302.04611v1 [cs.LG])
    Current AI-assisted protein design mainly utilizes protein sequential and structural information. Meanwhile, there exists tremendous knowledge curated by humans in the text format describing proteins' high-level properties. Yet, whether the incorporation of such text data can help protein design tasks has not been explored. To bridge this gap, we propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design. ProteinDT consists of three subsequent steps: ProteinCLAP that aligns the representation of two modalities, a facilitator that generates the protein representation from the text modality, and a decoder that generates the protein sequences from the representation. To train ProteinDT, we construct a large dataset, SwissProtCLAP, with 441K text and protein pairs. We empirically verify the effectiveness of ProteinDT from three aspects: (1) consistently superior performance on four out of six protein property prediction benchmarks; (2) over 90% accuracy for text-guided protein generation; and (3) promising results for zero-shot text-guided protein editing.
    Optimistic Online Mirror Descent for Bridging Stochastic and Adversarial Online Convex Optimization. (arXiv:2302.04552v1 [cs.LG])
    Stochastically Extended Adversarial (SEA) model is introduced by Sachs et al. [2022] as an interpolation between stochastic and adversarial online convex optimization. Under the smoothness condition, they demonstrate that the expected regret of optimistic follow-the-regularized-leader (FTRL) depends on the cumulative stochastic variance $\sigma_{1:T}^2$ and the cumulative adversarial variation $\Sigma_{1:T}^2$ for convex functions. They also provide a slightly weaker bound based on the maximal stochastic variance $\sigma_{\max}^2$ and the maximal adversarial variation $\Sigma_{\max}^2$ for strongly convex functions. Inspired by their work, we investigate the theoretical guarantees of optimistic online mirror descent (OMD) for the SEA model. For convex and smooth functions, we obtain the same $\mathcal{O}(\sqrt{\sigma_{1:T}^2}+\sqrt{\Sigma_{1:T}^2})$ regret bound, without the convexity requirement of individual functions. For strongly convex and smooth functions, we establish an $\mathcal{O}(\min\{\log (\sigma_{1:T}^2+\Sigma_{1:T}^2), (\sigma_{\max}^2 + \Sigma_{\max}^2) \log T\})$ bound, better than their $\mathcal{O}((\sigma_{\max}^2 + \Sigma_{\max}^2) \log T)$ bound. For \mbox{exp-concave} and smooth functions, we achieve a new $\mathcal{O}(d\log(\sigma_{1:T}^2+\Sigma_{1:T}^2))$ bound. Owing to the OMD framework, we can further extend our result to obtain dynamic regret guarantees, which are more favorable in non-stationary online scenarios. The attained results allow us to recover excess risk bounds of the stochastic setting and regret bounds of the adversarial setting, and derive new guarantees for many intermediate scenarios.
    Classification of BCI-EEG based on augmented covariance matrix. (arXiv:2302.04508v1 [cs.HC])
    Objective: Electroencephalography signals are recorded as a multidimensional dataset. We propose a new framework based on the augmented covariance extracted from an autoregressive model to improve motor imagery classification. Methods: From the autoregressive model can be derived the Yule-Walker equations, which show the emergence of a symmetric positive definite matrix: the augmented covariance matrix. The state-of the art for classifying covariance matrices is based on Riemannian Geometry. A fairly natural idea is therefore to extend the standard approach using these augmented covariance matrices. The methodology for creating the augmented covariance matrix shows a natural connection with the delay embedding theorem proposed by Takens for dynamical systems. Such an embedding method is based on the knowledge of two parameters: the delay and the embedding dimension, respectively related to the lag and the order of the autoregressive model. This approach provides new methods to compute the hyper-parameters in addition to standard grid search. Results: The augmented covariance matrix performed noticeably better than any state-of-the-art methods. We will test our approach on several datasets and several subjects using the MOABB framework, using both within-session and cross-session evaluation. Conclusion: The improvement in results is due to the fact that the augmented covariance matrix incorporates not only spatial but also temporal information, incorporating nonlinear components of the signal through an embedding procedure, which allows the leveraging of dynamical systems algorithms. Significance: These results extend the concepts and the results of the Riemannian distance based classification algorithm.
    Convergence of a robust deep FBSDE method for stochastic control. (arXiv:2201.06854v5 [math.OC] UPDATED)
    In this paper, we propose a deep learning based numerical scheme for strongly coupled FBSDEs, stemming from stochastic control. It is a modification of the deep BSDE method in which the initial value to the backward equation is not a free parameter, and with a new loss function being the weighted sum of the cost of the control problem, and a variance term which coincides with the mean squared error in the terminal condition. We show by a numerical example that a direct extension of the classical deep BSDE method to FBSDEs, fails for a simple linear-quadratic control problem, and motivate why the new method works. Under regularity and boundedness assumptions on the exact controls of time continuous and time discrete control problems, we provide an error analysis for our method. We show empirically that the method converges for three different problems, one being the one that failed for a direct extension of the deep BSDE method.
    Private Quantiles Estimation in the Presence of Atoms. (arXiv:2202.08969v2 [stat.ML] UPDATED)
    We consider the differentially private estimation of multiple quantiles (MQ) of a distribution from a dataset, a key building block in modern data analysis. We apply the recent non-smoothed Inverse Sensitivity (IS) mechanism to this specific problem. We establish that the resulting method is closely related to the recently published ad hoc algorithm JointExp. In particular, they share the same computational complexity and a similar efficiency. We prove the statistical consistency of these two algorithms for continuous distributions. Furthermore, we demonstrate both theoretically and empirically that this method suffers from an important lack of performance in the case of peaked distributions, which can degrade up to a potentially catastrophic impact in the presence of atoms. Its smoothed version (i.e. by applying a max kernel to its output density) would solve this problem, but remains an open challenge to implement. As a proxy, we propose a simple and numerically efficient method called Heuristically Smoothed JointExp (HSJointExp), which is endowed with performance guarantees for a broad class of distributions and achieves results that are orders of magnitude better on problematic datasets.  ( 2 min )
    Learning Dynamical Systems by Leveraging Data from Similar Systems. (arXiv:2302.04344v1 [stat.ML])
    We consider the problem of learning the dynamics of a linear system when one has access to data generated by an auxiliary system that shares similar (but not identical) dynamics, in addition to data from the true system. We use a weighted least squares approach, and provide a finite sample error bound of the learned model as a function of the number of samples and various system parameters from the two systems as well as the weight assigned to the auxiliary data. We show that the auxiliary data can help to reduce the intrinsic system identification error due to noise, at the price of adding a portion of error that is due to the differences between the two system models. We further provide a data-dependent bound that is computable when some prior knowledge about the systems is available. This bound can also be used to determine the weight that should be assigned to the auxiliary data during the model training stage.  ( 2 min )
    Equivariant MuZero. (arXiv:2302.04798v1 [cs.LG])
    Deep reinforcement learning repeatedly succeeds in closed, well-defined domains such as games (Chess, Go, StarCraft). The next frontier is real-world scenarios, where setups are numerous and varied. For this, agents need to learn the underlying rules governing the environment, so as to robustly generalise to conditions that differ from those they were trained on. Model-based reinforcement learning algorithms, such as the highly successful MuZero, aim to accomplish this by learning a world model. However, leveraging a world model has not consistently shown greater generalisation capabilities compared to model-free alternatives. In this work, we propose improving the data efficiency and generalisation capabilities of MuZero by explicitly incorporating the symmetries of the environment in its world-model architecture. We prove that, so long as the neural networks used by MuZero are equivariant to a particular symmetry group acting on the environment, the entirety of MuZero's action-selection algorithm will also be equivariant to that group. We evaluate Equivariant MuZero on procedurally-generated MiniPacman and on Chaser from the ProcGen suite: training on a set of mazes, and then testing on unseen rotated versions, demonstrating the benefits of equivariance. Further, we verify that our performance improvements hold even when only some of the components of Equivariant MuZero obey strict equivariance, which highlights the robustness of our construction.  ( 2 min )
    Meta-ticket: Finding optimal subnetworks for few-shot learning within randomly initialized neural networks. (arXiv:2205.15619v2 [cs.LG] UPDATED)
    Few-shot learning for neural networks (NNs) is an important problem that aims to train NNs with a few data. The main challenge is how to avoid overfitting since over-parameterized NNs can easily overfit to such small dataset. Previous work (e.g. MAML by Finn et al. 2017) tackles this challenge by meta-learning, which learns how to learn from a few data by using various tasks. On the other hand, one conventional approach to avoid overfitting is restricting hypothesis spaces by endowing sparse NN structures like convolution layers in computer vision. However, although such manually-designed sparse structures are sample-efficient for sufficiently large datasets, they are still insufficient for few-shot learning. Then the following questions naturally arise: (1) Can we find sparse structures effective for few-shot learning by meta-learning? (2) What benefits will it bring in terms of meta-generalization? In this work, we propose a novel meta-learning approach, called Meta-ticket, to find optimal sparse subnetworks for few-shot learning within randomly initialized NNs. We empirically validated that Meta-ticket successfully discover sparse subnetworks that can learn specialized features for each given task. Due to this task-wise adaptation ability, Meta-ticket achieves superior meta-generalization compared to MAML-based methods especially with large NNs. The code is available at: https://github.com/dchiji-ntt/meta-ticket  ( 2 min )
    Improving Certified Robustness via Statistical Learning with Logical Reasoning. (arXiv:2003.00120v8 [cs.LG] UPDATED)
    Intensive algorithmic efforts have been made to enable the rapid improvements of certificated robustness for complex ML models recently. However, current robustness certification methods are only able to certify under a limited perturbation radius. Given that existing pure data-driven statistical approaches have reached a bottleneck, in this paper, we propose to integrate statistical ML models with knowledge (expressed as logical rules) as a reasoning component using Markov logic networks (MLN, so as to further improve the overall certified robustness. This opens new research questions about certifying the robustness of such a paradigm, especially the reasoning component (e.g., MLN). As the first step towards understanding these questions, we first prove that the computational complexity of certifying the robustness of MLN is #P-hard. Guided by this hardness result, we then derive the first certified robustness bound for MLN by carefully analyzing different model regimes. Finally, we conduct extensive experiments on five datasets including both high-dimensional images and natural language texts, and we show that the certified robustness with knowledge-based logical reasoning indeed significantly outperforms that of the state-of-the-arts.  ( 2 min )
    Lazy OCO: Online Convex Optimization on a Switching Budget. (arXiv:2102.03803v5 [cs.LG] UPDATED)
    We study a variant of online convex optimization where the player is permitted to switch decisions at most $S$ times in expectation throughout $T$ rounds. Similar problems have been addressed in prior work for the discrete decision set setting, and more recently in the continuous setting but only with an adaptive adversary. In this work, we aim to fill the gap and present computationally efficient algorithms in the more prevalent oblivious setting, establishing a regret bound of $O(T/S)$ for general convex losses and $\widetilde O(T/S^2)$ for strongly convex losses. In addition, for stochastic i.i.d.~losses, we present a simple algorithm that performs $\log T$ switches with only a multiplicative $\log T$ factor overhead in its regret in both the general and strongly convex settings. Finally, we complement our algorithms with lower bounds that match our upper bounds in some of the cases we consider.  ( 2 min )
    A Constant-per-Iteration Likelihood Ratio Test for Online Changepoint Detection for Exponential Family Models. (arXiv:2302.04743v1 [stat.CO])
    Online changepoint detection algorithms that are based on likelihood-ratio tests have been shown to have excellent statistical properties. However, a simple online implementation is computationally infeasible as, at time $T$, it involves considering $O(T)$ possible locations for the change. Recently, the FOCuS algorithm has been introduced for detecting changes in mean in Gaussian data that decreases the per-iteration cost to $O(\log T)$. This is possible by using pruning ideas, which reduce the set of changepoint locations that need to be considered at time $T$ to approximately $\log T$. We show that if one wishes to perform the likelihood ratio test for a different one-parameter exponential family model, then exactly the same pruning rule can be used, and again one need only consider approximately $\log T$ locations at iteration $T$. Furthermore, we show how we can adaptively perform the maximisation step of the algorithm so that we need only maximise the test statistic over a small subset of these possible locations. Empirical results show that the resulting online algorithm, which can detect changes under a wide range of models, has a constant-per-iteration cost on average.  ( 2 min )
    On Computable Online Learning. (arXiv:2302.04357v1 [cs.LG])
    We initiate a study of computable online (c-online) learning, which we analyze under varying requirements for "optimality" in terms of the mistake bound. Our main contribution is to give a necessary and sufficient condition for optimal c-online learning and show that the Littlestone dimension no longer characterizes the optimal mistake bound of c-online learning. Furthermore, we introduce anytime optimal (a-optimal) online learning, a more natural conceptualization of "optimality" and a generalization of Littlestone's Standard Optimal Algorithm. We show the existence of a computational separation between a-optimal and optimal online learning, proving that a-optimal online learning is computationally more difficult. Finally, we consider online learning with no requirements for optimality, and show, under a weaker notion of computability, that the finiteness of the Littlestone dimension no longer characterizes whether a class is c-online learnable with finite mistake bound. A potential avenue for strengthening this result is suggested by exploring the relationship between c-online and CPAC learning, where we show that c-online learning is as difficult as improper CPAC learning.  ( 2 min )
    Find a witness or shatter: the landscape of computable PAC learning. (arXiv:2302.04731v1 [cs.CC])
    This paper contributes to the study of CPAC learnability -- a computable version of PAC learning -- by solving three open questions from recent papers. Firstly, we prove that every improperly CPAC learnable class is contained in a class which is properly CPAC learnable with polynomial sample complexity. This confirms a conjecture by Agarwal et al (COLT 2021). Secondly, we show that there exists a decidable class of hypothesis which is properly CPAC learnable, but only with uncomputably fast growing sample complexity. This solves a question from Sterkenburg (COLT 2022). Finally, we construct a decidable class of finite Littlestone dimension which is not improperly CPAC learnable, strengthening a recent result of Sterkenburg (2022) and answering a question posed by Hasrati and Ben-David (ALT 2023). Together with previous work, our results provide a complete landscape for the learnability problem in the CPAC setting.  ( 2 min )
    Importance Sampling Deterministic Annealing for Clustering. (arXiv:2302.04421v1 [stat.ML])
    A current assumption of most clustering methods is that the training data and future data are taken from the same distribution. However, this assumption may not hold in some real-world scenarios. In this paper, we propose an importance sampling based deterministic annealing approach (ISDA) for clustering problems which minimizes the worst case of expected distortions under the constraint of distribution deviation. The distribution deviation constraint can be converted to the constraint over a set of weight distributions centered on the uniform distribution derived from importance sampling. The objective of the proposed approach is to minimize the loss under maximum degradation hence the resulting problem is a constrained minimax optimization problem which can be reformulated to an unconstrained problem using the Lagrange method and be solved by the quasi-newton algorithm. Experiment results on synthetic datasets and a real-world load forecasting problem validate the effectiveness of the proposed ISDA. Furthermore, we show that fuzzy c-means is a special case of ISDA with the logarithmic distortion. This observation sheds a new light on the relationship between fuzzy c-means and deterministic annealing clustering algorithms and provides an interesting physical and information-theoretical interpretation for fuzzy exponent $m$.  ( 2 min )
    Disentangling Learning Representations with Density Estimation. (arXiv:2302.04362v1 [cs.LG])
    Disentangled learning representations have promising utility in many applications, but they currently suffer from serious reliability issues. We present Gaussian Channel Autoencoder (GCAE), a method which achieves reliable disentanglement via flexible density estimation of the latent space. GCAE avoids the curse of dimensionality of density estimation by disentangling subsets of its latent space with the Dual Total Correlation (DTC) metric, thereby representing its high-dimensional latent joint distribution as a collection of many low-dimensional conditional distributions. In our experiments, GCAE achieves highly competitive and reliable disentanglement scores compared with state-of-the-art baselines.  ( 2 min )
    Introduction To Gaussian Process Regression In Bayesian Inverse Problems, With New ResultsOn Experimental Design For Weighted Error Measures. (arXiv:2302.04518v1 [stat.ML])
    Bayesian posterior distributions arising in modern applications, including inverse problems in partial differential equation models in tomography and subsurface flow, are often computationally intractable due to the large computational cost of evaluating the data likelihood. To alleviate this problem, we consider using Gaussian process regression to build a surrogate model for the likelihood, resulting in an approximate posterior distribution that is amenable to computations in practice. This work serves as an introduction to Gaussian process regression, in particular in the context of building surrogate models for inverse problems, and presents new insights into a suitable choice of training points. We show that the error between the true and approximate posterior distribution can be bounded by the error between the true and approximate likelihood, measured in the $L^2$-norm weighted by the true posterior, and that efficiently bounding the error between the true and approximate likelihood in this norm suggests choosing the training points in the Gaussian process surrogate model based on the true posterior.  ( 2 min )
    Generalization in Graph Neural Networks: Improved PAC-Bayesian Bounds on Graph Diffusion. (arXiv:2302.04451v1 [cs.LG])
    Graph neural networks are widely used tools for graph prediction tasks. Motivated by their empirical performance, prior works have developed generalization bounds for graph neural networks, which scale with graph structures in terms of the maximum degree. In this paper, we present generalization bounds that instead scale with the largest singular value of the graph neural network's feature diffusion matrix. These bounds are numerically much smaller than prior bounds for real-world graphs. We also construct a lower bound of the generalization gap that matches our upper bound asymptotically. To achieve these results, we analyze a unified model that includes prior works' settings (i.e., convolutional and message-passing networks) and new settings (i.e., graph isomorphism networks). Our key idea is to measure the stability of graph neural networks against noise perturbations using Hessians. Empirically, we find that Hessian-based measurements correlate with the observed generalization gaps of graph neural networks accurately; Optimizing noise stability properties for fine-tuning pretrained graph neural networks also improves test performance on several graph-level classification tasks.  ( 2 min )
    Outlier-Robust Gromov Wasserstein for Graph Data. (arXiv:2302.04610v1 [cs.LG])
    Gromov Wasserstein (GW) distance is a powerful tool for comparing and aligning probability distributions supported on different metric spaces. It has become the main modeling technique for aligning heterogeneous data for a wide range of graph learning tasks. However, the GW distance is known to be highly sensitive to outliers, which can result in large inaccuracies if the outliers are given the same weight as other samples in the objective function. To mitigate this issue, we introduce a new and robust version of the GW distance called RGW. RGW features optimistically perturbed marginal constraints within a $\varphi$-divergence based ambiguity set. To make the benefits of RGW more accessible in practice, we develop a computationally efficient algorithm, Bregman proximal alternating linearization minimization, with a theoretical convergence guarantee. Through extensive experimentation, we validate our theoretical results and demonstrate the effectiveness of RGW on real-world graph learning tasks, such as subgraph matching and partial shape correspondence.  ( 2 min )
    Robust and Scalable Bayesian Online Changepoint Detection. (arXiv:2302.04759v1 [stat.ML])
    This paper proposes an online, provably robust, and scalable Bayesian approach for changepoint detection. The resulting algorithm has key advantages over previous work: it provides provable robustness by leveraging the generalised Bayesian perspective, and also addresses the scalability issues of previous attempts. Specifically, the proposed generalised Bayesian formalism leads to conjugate posteriors whose parameters are available in closed form by leveraging diffusion score matching. The resulting algorithm is exact, can be updated through simple algebra, and is more than 10 times faster than its closest competitor.  ( 2 min )
    Discovering interpretable Lagrangian of dynamical systems from data. (arXiv:2302.04400v1 [stat.ML])
    A complete understanding of physical systems requires models that are accurate and obeys natural conservation laws. Recent trends in representation learning involve learning Lagrangian from data rather than the direct discovery of governing equations of motion. The generalization of equation discovery techniques has huge potential; however, existing Lagrangian discovery frameworks are black-box in nature. This raises a concern about the reusability of the discovered Lagrangian. In this article, we propose a novel data-driven machine-learning algorithm to automate the discovery of interpretable Lagrangian from data. The Lagrangian are derived in interpretable forms, which also allows the automated discovery of conservation laws and governing equations of motion. The architecture of the proposed framework is designed in such a way that it allows learning the Lagrangian from a subset of the underlying domain and then generalizing for an infinite-dimensional system. The fidelity of the proposed framework is exemplified using examples described by systems of ordinary differential equations and partial differential equations where the Lagrangian and conserved quantities are known.  ( 2 min )
    InfoNCE is a variational autoencoder. (arXiv:2107.02495v2 [stat.ML] UPDATED)
    There are two main approaches to self-supervised learning (SSL), generative SSL, which learns a probabilistic model of the inputs, or contrastive SSL where we design a supervised learning task to encourage good representations. We reconcile these approaches by showing that contrastive SSL methods (including InfoNCE) which maximize the mutual information (MI) implicitly learn a probabilistic model of the inputs (specifically, a variational autoencoder; VAE). In particular, when we learn the optimal prior, the VAE objective (the ELBO) becomes equal to the MI (up to a constant). In turn, for a deterministic encoder the ELBO is equal to the log Bayesian model evidence. This establishes a profound connection between Bayesian inference and information theory. However, practical InfoNCE methods do not use the MI as an objective: the MI is invariant to arbitrary invertible transformations, so using an MI objective can lead to highly entangled representations (Tschannen et al., 2019). Instead, the actual InfoNCE objective is a simplified lower bound on the MI which is loose even in the infinite sample limit. Thus, an objective that works (i.e. the actual InfoNCE objective) appears to be motivated as a loose bound on an objective that does not work (i.e. the true MI which gives arbitrarily entangled representations). We give an alternative motivation for the actual InfoNCE objective. In particular, we show that in the infinite sample limit, and for a particular choice of prior, the actual InfoNCE objective is equal to the log Bayesian model evidence (up to a constant). Thus, we argue that our VAE perspective gives a better motivation for InfoNCE than MI, as the actual InfoNCE objective is only loosely bounded by the MI, but is equal to the log Bayesian model evidence (up to a constant).  ( 2 min )
    Nonlinear Random Matrices and Applications to the Sum of Squares Hierarchy. (arXiv:2302.04462v1 [cs.CC])
    We develop new tools in the theory of nonlinear random matrices and apply them to study the performance of the Sum of Squares (SoS) hierarchy on average-case problems. The SoS hierarchy is a powerful optimization technique that has achieved tremendous success for various problems in combinatorial optimization, robust statistics and machine learning. It's a family of convex relaxations that lets us smoothly trade off running time for approximation guarantees. In recent works, it's been shown to be extremely useful for recovering structure in high dimensional noisy data. It also remains our best approach towards refuting the notorious Unique Games Conjecture. In this work, we analyze the performance of the SoS hierarchy on fundamental problems stemming from statistics, theoretical computer science and statistical physics. In particular, we show subexponential-time SoS lower bounds for the problems of the Sherrington-Kirkpatrick Hamiltonian, Planted Slightly Denser Subgraph, Tensor Principal Components Analysis and Sparse Principal Components Analysis. These SoS lower bounds involve analyzing large random matrices, wherein lie our main contributions. These results offer strong evidence for the truth of and insight into the low-degree likelihood ratio hypothesis, an important conjecture that predicts the power of bounded-time algorithms for hypothesis testing. We also develop general-purpose tools for analyzing the behavior of random matrices which are functions of independent random variables. Towards this, we build on and generalize the matrix variant of the Efron-Stein inequalities. In particular, our general theorem on matrix concentration recovers various results that have appeared in the literature. We expect these random matrix theory ideas to have other significant applications.  ( 2 min )
    Sample Complexity Using Infinite Multiview Models. (arXiv:2302.04292v1 [math.ST])
    Recent works have demonstrated that the convergence rate of a nonparametric density estimator can be greatly improved by using a low-rank estimator when the target density is a convex combination of separable probability densities with Lipschitz continuous marginals, i.e. a multiview model. However, this assumption is very restrictive and it is not clear to what degree these findings can be extended to general pdfs. This work answers this question by introducing a new way of characterizing a pdf's complexity, the non-negative Lipschitz spectrum (NL-spectrum), which, unlike smoothness properties, can be used to characterize virtually any pdf. Finite sample bounds are presented that are dependent on the target density's NL-spectrum. From this dimension-independent rates of convergence are derived that characterize when an NL-spectrum allows for a fast rate of convergence.  ( 2 min )

  • Open

    [D] Best available text to speech free AI model out there for english
    Greetings everyone. I am looking for the best text to speech AI model out there for english I am looking for links to the models you know as best If the model supports subtitle file to speech that would be even more awesome Like providing .srt or .vtt to generate speech - speeding up the necessary parts of speech to fit into durations Thank you very much again I will use this to replace audio of my older lecture recordings by providing a time generated manually corrected subtitle file like srt or vtt I am looking for any male sounding model that sounds natural ​ I have found this They colab and looks very easy to generate. I think I can automate it. But is this one the best? https://www.reddit.com/r/MachineLearning/comments/v9rigf/p_silero_tts_full_v3_release/ found this too but only female voice :/ https://www.reddit.com/r/MachineLearning/comments/ttgsr4/r_nixtts_an_incredibly_lightweight_texttospeech/ I need a male voice any other good ones? ​ submitted by /u/CeFurkan [link] [comments]  ( 43 min )
    [D] Is Efficient-Net same as MobileNetV2
    Quick question, is EfficientNet-V1 same as MobileNet-V2? I think they use the same backbone, the inverted linear residual block, no? submitted by /u/No_Oilve_6577 [link] [comments]  ( 42 min )
    [D] Speed up HuggingFace Inference Pipeline
    Running a pipeline sentiment analysis call with transformers on 16 cpu takes 6-9 seconds for one inference. How can I speed this up? My ideas, for your inputs please: Ray cluster - parallel computing, memory usage is high. Within the pipeline() call, use parameter batch_size. However, is batch_size not appropriate for cpu? HF Accelerate - not sure how to implement on a published model... Model distillation - not sure how to implement on a published model... Thanks in advance! submitted by /u/askingforhelp1111 [link] [comments]  ( 43 min )
    [R] Large Language Models Can Teach Themselves to Use Tools
    submitted by /u/MysteryInc152 [link] [comments]  ( 44 min )
    [D] Is it legal to use images or videos with copyright to train a model?
    Hello, I want to know if it is legal to use scraped video or images to train a predictive model, for example, If I scrape photos of faces in google, and after that, I share that model in order that a lot of people can detect faces in their applications, is that legal? submitted by /u/Tlaloc-Es [link] [comments]  ( 43 min )
    [D] What ML or ML-powered projects are you currently building?
    This would be for ones that aren't finished enough to post as a link on the weekend. Just things that are in progress. Include a screenshot if you can! submitted by /u/TikkunCreation [link] [comments]  ( 42 min )
    [Discussion] We are AI fully developed!
    Humans nowadays see that AI approaches human capabilities. But where is the limit? For sure AI will replace some jobs in the future and become better and more accurate. Later it will teach and train on self-taught data. Then I am asking, what is the difference between a human and AI? WHAT IF WE ARE THE FINAL FORM OF AI? Will AI later think it is a human and repeat human history by developing another AI? submitted by /u/SubZero0xFF [link] [comments]  ( 42 min )
    [D] Locally-runnable text to speech AI?
    I've got a 4090 and some stuff that I think it would be fun to have narrated. I've looked at some of the paid online options and $20-$30/mo for 2 hours of AI TTS is not gonna gut it. Can anyone point me to software that I can run locally that'll give me high quality? It seems like if people are making billions of waifus in stable diffusion there ought to be something like this out there. submitted by /u/gruevy [link] [comments]  ( 44 min )
    [D] Simulator for RL problems
    I have seen people advocate a simulator for RL problems a lot. I am not sure by simulator what do they mean exactly? Is it the exact simulation (then the problem becomes easy) or some kind of feedback loop (start with a naïve simulator and once we get data then keep improving the simulator – this looks similar to value iteration or policy iteration). I assume it’s really difficult to get a simulator for data generation (except for video games etc.). Also, If we already have a simulator, we can easily train a model-free RL (e.g. just planning). submitted by /u/atulcst [link] [comments]  ( 43 min )
    [Discussion] WASM equivalant to Gradio but without needing a server?
    I'm really impressed with gradio for making interactive webapps. I was wondering... Gradio basically runs off a server so you have to standup a server just to demo certain kinds of apps. Is there something similar out that that can handle basic tabular data plots without needing a server? I was thinking perhaps something like a WASM app that can point to csvs on AWS S3 and generate plots on the fly? submitted by /u/RogueStargun [link] [comments]  ( 43 min )
    [P] Did anyone manage to run the MusicLM implementation from lucidrains?
    I really want to play with the repo but I'm stuck at the last step of the instructions (https://github.com/lucidrains/musiclm-pytorch#usage-1). If anyone has tips, please let me know! Here's the issue I have: https://github.com/lucidrains/musiclm-pytorch/issues/13 submitted by /u/BackgroundPass2082 [link] [comments]  ( 42 min )
    [D] numbers of parameters that can affect on a model.
    Hi guys, my question is what is different between parameters and FLOPs in terms of computation times. I know that the FLOPs is related to the computation of input images. For example, higher the size, higher the figure. But, how much parameters can affect on a model compared to the metric? I understand that the weights, biases are parameters. But, the cost of computation about them makes me difficult to determine what should I get a specific model. I can measure the decision based on the FLOPs, which decrease time of training my model when they are lower. However, I also want to decide a specific model with the number of parameters. Thanks. submitted by /u/Mundane_Definition_8 [link] [comments]  ( 44 min )
    [P] I'm using Instruct GPT to show anti-clickbait summaries on youtube videos
    submitted by /u/AlesioRFM [link] [comments]  ( 46 min )
    [D] Experiences finding a job in the US as a Mexican currently working in the UK
    Hi, I am considering moving to the US, and I was wondering about the job market for people in Mexico and the chances of getting an offer. I know that in theory, it should be 'easier' due to the United States–Mexico–Canada Agreement by getting a TN visa. Are there any Mexicans here that found a job in the US as a machine learning engineer/data scientist? Would anyone have a pointer? I'll obviously research companies and send my resumes, just thought of posting here to see what is the experience of other people. submitted by /u/darcia_scientist [link] [comments]  ( 43 min )
    [P] Resume parsing + Cv analysis
    Hi ! so for my final year project I will be working on a cv parser and matching cvs with job postings, I'm thinking about fine tuning LayoutLM on my cvs dataset( of 5000 resumes or so not yet labeled) to get the structure of a resume (contact info , skills , education , etc) and then combine it with NER to identify the details in each section (name , uni name , date of start etc ) . Is it good enough or should I take another approach ? Or how would you tackle the problem ? feel free to share any ideas u have about this project Thank you ! submitted by /u/Melodic_Secretary_42 [link] [comments]  ( 45 min )
    [D] The Kaggle Book : Book by Konrad Banachewicz and Luca Massaron
    Does anyone have "The Kaggle Book : Book by Konrad Banachewicz and Luca Massaron" in pdf ?? Please share the link submitted by /u/narendra7799 [link] [comments]  ( 42 min )
    [D] Critique of statistics research from machine learning perspectives (and vice versa)?
    I was just looking around at some paper published by statisticians, I couldn't help but notice that the flavor of their research is vastly different. For example, one researcher wrote about a dozen paper on LASSO alone over the span of a decade, whereas LASSO is just given a power point slide worth of attention in ML. Why is there such a disparity and a divergence in the aim of these disciplines? Are there some good critique of these research fields from each other's perspective (not just on the technical aspects)? Perhaps by someone who works in both? submitted by /u/fromnighttilldawn [link] [comments]  ( 50 min )
    [D] Should I put my current or past affiliation on my EACL paper?
    Hey guys. I have got a paper accepted to the EACL 2023 conference. When I was working on the paper I did not have any official affiliation. I was working as an independent researcher. I have started my PhD at PSU recently. I was wondering if I should use my current affiliation on the paper. I am the corresponding author for this paper. Also, I am planning to use my PSU address for all research communications from now on instead of my gmail address. So putting my PSU affiliation would make sense in that way. So my question is, is it okay to use my current affiliation? submitted by /u/ibraheemMmoosa [link] [comments]  ( 43 min )
  • Open

    AI Dream 157 - 2K SUBS CELEBRATION! 🥳🎉 MASTERPIECE - PART 3 TEASER - AI ...
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Who remembers Clippy? I can't think of a better time for it's return #ClippyGPT
    submitted by /u/hakJav [link] [comments]  ( 41 min )
    Developed an AI tool for Google Docs - what do you think?
    submitted by /u/vfra32 [link] [comments]  ( 41 min )
    Telling ChatGPT (GPT-3) to go by Gipee Tithree ala Star Wars
    submitted by /u/the_ferryman_abides [link] [comments]  ( 40 min )
    Can Bing AI listen to podcasts and summarize them?
    submitted by /u/tlkop123 [link] [comments]  ( 41 min )
    AI Dream 157 - 2K SUBS CELEBRATION! 🥳🎉 MASTERPIECE - PART 3 TEASER - AI ...
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Create your own videos from text inputs with this easy-to-use stable-diffusion-based WebApp! Choose between up to 5 different models, a worlds'-first prompt helper, super smooth videos with interpolation and much more.
    submitted by /u/nonicknamefornic [link] [comments]  ( 6 min )
    Merge A Face & Style Together With The LORA Extraction Method
    submitted by /u/PuppetHere [link] [comments]  ( 40 min )
    Read the Latest Issue of Weekly Piece of Future for Insights about Robotics, AI, Biotech, and Space!
    submitted by /u/RushingRobotics_com [link] [comments]  ( 40 min )
    How exactly is an AI "punished"?
    Yes so - I saw quite a few people 'training' their artificial intelligence to do and learn certain things using reward and punishment. When the AI does something bad, it is punished and when it does something right it is rewarded. But how does this work exactly? How is AI punished? Does it feel pain? Is part of its code deleted? How does it feel? One such example is a special bot for Minecraft. This bot can do pretty much anything and has a hive mind with all the other bots. It can perfectly navigate through even the toughest terrain and is able to best the best of the best PVPers (weird way to write it I know XD) This AI was trained in three sections. First in singleplayer where it learned how to survive. The settings were set so that whenever the AI dies in-game it will receive punish…  ( 47 min )
    WordPress To Add AI-Generated Images And Content Writing Feature
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 40 min )
    2022 in review with Olivia Gamblin - The Machine Ethics Podcast
    submitted by /u/benbyford [link] [comments]  ( 40 min )
    Where can I find some recent and high-quality Artificial Intelligence Conferences?
    "Where can I find some recent and high-quality Artificial Intelligence Conferences to watch online? I've checked the O'Reilly website, but the videos there seem to be dated (6 years old). Any suggestions would be greatly appreciated. submitted by /u/xfocus3 [link] [comments]  ( 41 min )
    How to be a better AI
    submitted by /u/Legal-Ad-1650 [link] [comments]  ( 41 min )
    Header - Open Source AI Training Pool(Idea, just ironing out the flaws)
    THE PITCH - Chatgpt has 175 billion parameters assuming each parameter is a 64 bit float we can calculate that the approximate weight of just the weights are around 1.8 tb and just to be on the safe side 2 tb with all the logic required for the model to work. Arguably a better google, a very primitive programmer and capable of being a very capable general assistant is a very important set of traits that it seems to have perfected. But we also know it seems to have some inherent biases since the data that it was fed was probably biased and that is very dangerous in and of itself. We also don't have the weights ourselves to tinker with the chatbot. While currently modern consumer grade gpu's cannot run chatgpt smoothly, with advancement in gpu technology, specialized gpu geared towards …  ( 43 min )
    Treating AI Like Nuclear Bombs
    submitted by /u/SpawnOfCthun [link] [comments]  ( 40 min )
    How long until we get history movies with deep faked characters?
    Like a biographical movie on Stalin with a deep-faked dictator? submitted by /u/Dyllbill_ [link] [comments]  ( 40 min )
    Pretty cool AI productivity tool
    submitted by /u/arnolds112 [link] [comments]  ( 40 min )
    Is this a possible/likely scenario and is there a name to it?
    I was experimenting with ChatGPT the other day. I found it impressive but still make subtle mistakes especially in more technical fields. It could definitely trick a layperson, but to an expert it's at times just a bunch of words meshed together that don't make sense. (Interestingly, it is also completely useless when describing video game walkthroughs, like how to find a certain item or access a certain area in a game, which should be very straightforward and unambiguous information.) Now let's say AI has a 95% accuracy. Since AI generated content is much easier to produce, there will only be more and more AI generated content on the internet. And since AI is trained from the information on the internet, eventually there will come a point where AI is learning mostly from other AI-generated content. One can imagine that now the AI will only have a 95%*95% accuracy, and so on and so forth. Eventually I imagine the accuracy of the information on the internet should deteriorate to a point where an AI like chatgpt is no longer trustworthy (at least in non-academic fields), as well as those AI-generated websites that you already see a lot on the internet. Possible/likely? Is there an academic name to this possibility? submitted by /u/profmarylowe [link] [comments]  ( 45 min )
    An honest review of Nick Bostrom's book about AI development: Superintelligence
    submitted by /u/DeeMore [link] [comments]  ( 6 min )
    Photoshop tutorial created using AI voiceover (ElevenLabs)
    submitted by /u/howardpinsky [link] [comments]  ( 41 min )
    AI Dream 157 - 2K SUBS CELEBRATION! 🥳🎉 MASTERPIECE - PART 2 TEASER - AI ...
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Don't you guys think it's time for a new, relatively contextualized, subreddit for AI conversations?
    Hear me out! I am not disregarding this subreddit in any way, but since the conversation on AI has broadened, my reasoning is that we would need a much broader subreddit that also answers questions about these new models and also for users not particularly interested in the nitty gritty of AI jargon. I am particularly interested in conversations revolving around AI-powered search engines, given that Microsoft just launched Bing, and Google intends to try and catch up with Bard AI in a couple of weeks. I know there are a lot of people who also want to have these conversations and they also need a community, right? I am, in fact, thinking of creating one! submitted by /u/b_wanker [link] [comments]  ( 41 min )
  • Open

    Any Literature Regarding Reinforcement Learning in Production?
    There is a alot of literature available on building ML/DL pipelines from data transformations to monitoring. Is there similar literature available for setting up RL projects in production? What Kinda tech stack would one use for managing such projects (other then the algorithm and machine learning libraries) and why? submitted by /u/ZIGGY-Zz [link] [comments]  ( 41 min )
    Best approach for creating c++ environment for python?
    pybind11 functions that do all logic standalone build, and glue it to python with "subprocess" and communicate via pipes standalone build with HTTP communication ​ If you have used pybind11, how have you dealt with managing "state"? submitted by /u/reinforcement_agent [link] [comments]  ( 42 min )
    Simulator for RL problems
    I have seen people advocate a simulator for RL problems a lot. I am not sure by simulator what do they mean exactly? Is it the exact simulation (then the problem becomes easy) or some kind of feedback loop (start with a naïve simulator and once we get data then keep improving the simulator – this looks similar to value iteration or policy iteration). I assume it’s really difficult to get a simulator for data generation (except for video games etc.). Also, If we already have a simulator, we can easily train a model-free RL (e.g. just planning). submitted by /u/atulcst [link] [comments]  ( 42 min )
    GitHub - riiswa/planning-multi-robot-gym: A Gymnasium environment for simulating multi-robot planning.
    submitted by /u/riiswa [link] [comments]  ( 40 min )
    Imitation learning with partial observablity
    Hello, I'm working on a project where the goal is to train policy in a POMDP with imitation learning. In the POMDP, I need to use a belief module or an lstm to represent the history of observations. I do not understand exactly how I can use the demonstrations for imitation in this situation. The demonstrations are trajectories of observations by some unknown policy which do not have a belief module. My goal is to learn a policy in POMDP with both RL and imitation so that if it faces a state in the distribution of the demonstration, it should act according to what was demonstrated, other wise or should act according to the RL reward. submitted by /u/Ill_Satisfaction_865 [link] [comments]  ( 42 min )
    Communication between Agents in MARL
    I'd like to use either RLLib or StableBaselines3+PantheonRL to create a multi-agent system where the agents can share information about their surroundings with each other. I'm finding this to be very hard to implement. Does anyone have a starting point about implementing communication in MARL systems with tensor flow agents? submitted by /u/tessherelurkingnow [link] [comments]  ( 41 min )
    Is option framework proper using online learning?
    Hi, I am wondering if option framework (one of HRL framework) is proper or not following online learning framework. Intuitively, option framework should exploit offline learning method because it needs to train a different NNs or function approximators (tabular Q) with updating termination function as well. We may expect that this forces us to use replay buffer. Otherwise, the efficiency of using option framework dramatically decreases. I observed this on my experiments as well. Are there any articles mentioning this? Thanks! submitted by /u/Pikachu930 [link] [comments]  ( 41 min )
  • Open

    Google Research, 2022 & beyond: Algorithmic advances
    Posted by Vahab Mirrokni, VP and Google Fellow, Google Research (This is Part 5 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.) Robust algorithm design is the backbone of systems across Google, particularly for our ML and AI models. Hence, developing algorithms with improved efficiency, performance and speed remains a high priority as it empowers services ranging from Search and Ads to Maps and YouTube. Google Research has been at the forefront of this effort, developing many innovations from privacy-safe recommendation systems to scalable solutions for large-scale ML. In 2022, we continued this journey, and advanced the state-of-the-art in several related areas. Here we highlight our progress in a …  ( 95 min )
  • Open

    Helping companies deploy AI models more responsibly
    MIT spinout Verta offers tools to help companies introduce, monitor, and manage machine-learning models safely and at scale.  ( 10 min )
  • Open

    Identifying defense coverage schemes in NFL’s Next Gen Stats
    This post is co-written with Jonathan Jung, Mike Band, Michael Chi, and Thompson Bliss at the National Football League. A coverage scheme refers to the rules and responsibilities of each football defender tasked with stopping an offensive pass. It is at the core of understanding and analyzing any football defensive strategy. Classifying the coverage scheme […]  ( 14 min )
  • Open

    Metaverse Development: Building the Future of Virtual Reality
    The metaverse, a term popularised by science fiction, refers to a shared virtual space where users can interact with each other in a virtual environment. It’s a convergence of real and virtual worlds, creating a new reality that exists simultaneously with the physical world. With the rapid advancement of technology, particularly in the field of… Read More »Metaverse Development: Building the Future of Virtual Reality The post Metaverse Development: Building the Future of Virtual Reality appeared first on Data Science Central.  ( 20 min )
    5 Common Mistakes Blockchain Professionals Should Avoid
    The hype around blockchain technology is something people are passionate about. Many blockchain enthusiasts openly claim the efficiency of blockchain functionalities like enhancing transparency in the governance process, developing new cryptocurrencies, and better supply chain management. However, the proof of concept is applied in various industries establishing blockchain dominance.   Those organizations not undertaking the development… Read More »5 Common Mistakes Blockchain Professionals Should Avoid  The post 5 Common Mistakes Blockchain Professionals Should Avoid  appeared first on Data Science Central.  ( 21 min )
  • Open

    DiversiTree: A New Method to Efficiently Compute Diverse Sets of Near-Optimal Solutions to Mixed-Integer Optimization Problems. (arXiv:2204.03822v3 [cs.DM] UPDATED)
    While most methods for solving mixed-integer optimization problems compute a single optimal solution, a diverse set of near-optimal solutions can often lead to improved outcomes. We present a new method for finding a set of diverse solutions by emphasizing diversity within the search for near-optimal solutions. Specifically, within a branch-and-bound framework, we investigated parameterized node selection rules that explicitly consider diversity. Our results indicate that our approach significantly increases the diversity of the final solution set. When compared with two existing methods, our method runs with similar runtime as regular node selection methods and gives a diversity improvement between 12% and 190%. In contrast, popular node selection rules, such as best-first search, in some instances performed worse than state-of-the-art methods by more than 35% and gave an improvement of no more than 130%. Further, we find that our method is most effective when diversity in node selection is continuously emphasized after reaching a minimal depth in the tree and when the solution set has grown sufficiently large. Our method can be easily incorporated into integer programming solvers and has the potential to significantly increase the diversity of solution sets.  ( 2 min )
    Robust representations of oil wells' intervals via sparse attention mechanism. (arXiv:2212.14246v2 [cs.LG] UPDATED)
    Determining the characteristics of newly drilled wells (e.g. reservoir formation properties) is a major challenge. One of the corresponding tasks is a well-interval similarity assessment: if we can learn to predict which oilfields are rich and which are not by comparing them with existing ones, this will lead to significant cost reductions. There are three main requirements for applying machine learning to oil&gas data: high quality even for unreliable data, low manual effort and interpretability of the model itself. Neural networks can be used to address these challenges. The use of a self-supervised paradigm leads to automatic model construction. However, existing approaches lack interpretability, and their quality prevents their use in applications. In particular, existing approaches like LSTM suffer from short-term memory, paying more attention to the end of a sequence. Instead, neural networks with Transformer architecture cast their attention over all sequences to make a decision. To make them more efficient in terms of computational time and more robust to noisy or absent values, we introduce a limited attention mechanism similar to that of the Informer architecture that considers only top correspondences. We run experiments on an open dataset with more than $20$ wells, making our experiments reliable and suitable for industrial use. The best results were obtained with our adaptation of the Informer variant of Transformer with ROC AUC $0.982$. It outperforms classical approaches with ROC AUC $0.824$, recurrent neural networks (RNNs) with ROC AUC $0.934$ and the direct use of Transformer with ROC AUC $0.961$. We show that well-interval representations obtained by Informer are of higher quality than those extracted by RNNs. Moreover, the obtained attention is interpretable, as it corresponds to the importance of a particular part of an interval for the similarity estimation.  ( 3 min )
    Automated Identification of Toxic Code Reviews Using ToxiCR. (arXiv:2202.13056v3 [cs.SE] UPDATED)
    Toxic conversations during software development interactions may have serious repercussions on a Free and Open Source Software (FOSS) development project. For example, victims of toxic conversations may become afraid to express themselves, therefore get demotivated, and may eventually leave the project. Automated filtering of toxic conversations may help a FOSS community to maintain healthy interactions among its members. However, off-the-shelf toxicity detectors perform poorly on Software Engineering (SE) datasets, such as one curated from code review comments. To encounter this challenge, we present ToxiCR, a supervised learning-based toxicity identification tool for code review interactions. ToxiCR includes a choice to select one of the ten supervised learning algorithms, an option to select text vectorization techniques, eight preprocessing steps, and a large-scale labeled dataset of 19,571 code review comments. Two out of those eight preprocessing steps are SE domain specific. With our rigorous evaluation of the models with various combinations of preprocessing steps and vectorization techniques, we have identified the best combination for our dataset that boosts 95.8% accuracy and 88.9% F1 score. ToxiCR significantly outperforms existing toxicity detectors on our dataset. We have released our dataset, pre-trained models, evaluation results, and source code publicly available at: https://github.com/WSU-SEAL/ToxiCR  ( 2 min )
    Shortcut Detection with Variational Autoencoders. (arXiv:2302.04246v1 [cs.LG])
    For real-world applications of machine learning (ML), it is essential that models make predictions based on well-generalizing features rather than spurious correlations in the data. The identification of such spurious correlations, also known as shortcuts, is a challenging problem and has so far been scarcely addressed. In this work, we present a novel approach to detect shortcuts in image and audio datasets by leveraging variational autoencoders (VAEs). The disentanglement of features in the latent space of VAEs allows us to discover correlations in datasets and semi-automatically evaluate them for ML shortcuts. We demonstrate the applicability of our method on several real-world datasets and identify shortcuts that have not been discovered before. Based on these findings, we also investigate the construction of shortcut adversarial examples.  ( 2 min )
    Revisiting the Linear-Programming Framework for Offline RL with General Function Approximation. (arXiv:2212.13861v2 [cs.LG] UPDATED)
    Offline reinforcement learning (RL) aims to find an optimal policy for sequential decision-making using a pre-collected dataset, without further interaction with the environment. Recent theoretical progress has focused on developing sample-efficient offline RL algorithms with various relaxed assumptions on data coverage and function approximators, especially to handle the case with excessively large state-action spaces. Among them, the framework based on the linear-programming (LP) reformulation of Markov decision processes has shown promise: it enables sample-efficient offline RL with function approximation, under only partial data coverage and realizability assumptions on the function classes, with favorable computational tractability. In this work, we revisit the LP framework for offline RL, and provide a new reformulation that advances the existing results in several aspects, relaxing certain assumptions and achieving optimal statistical rates in terms of sample size. Our key enabler is to introduce proper constraints in the reformulation, instead of using any regularization as in the literature, also with careful choices of the function classes and initial state distributions. We hope our insights bring into light the use of LP formulations and the induced primal-dual minimax optimization, in offline RL.  ( 2 min )
    Alternately Optimized Graph Neural Networks. (arXiv:2206.03638v3 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have greatly advanced the semi-supervised node classification task on graphs. The majority of existing GNNs are trained in an end-to-end manner that can be viewed as tackling a bi-level optimization problem. This process is often inefficient in computation and memory usage. In this work, we propose a new optimization framework for semi-supervised learning on graphs. The proposed framework can be conveniently solved by the alternating optimization algorithms, resulting in significantly improved efficiency. Extensive experiments demonstrate that the proposed method can achieve comparable or better performance with state-of-the-art baselines while it has significantly better computation and memory efficiency.  ( 2 min )
    Making Progress Based on False Discoveries. (arXiv:2204.08809v2 [cs.LG] UPDATED)
    The study of adaptive data analysis examines how many statistical queries can be answered accurately using a fixed dataset while avoiding false discoveries (statistically inaccurate answers). In this paper, we tackle a question that precedes the field of study: Is data only valuable when it provides accurate answers to statistical queries? To answer this question, we use Stochastic Convex Optimization as a case study. In this model, algorithms are considered as analysts who query an estimate of the gradient of a noisy function at each iteration and move towards its minimizer. It is known that $O(1/\epsilon^2)$ examples can be used to minimize the objective function, but none of the existing methods depend on the accuracy of the estimated gradients along the trajectory. Therefore, we ask: How many samples are needed to minimize a noisy convex function if we require $\epsilon$-accurate estimates of $O(1/\epsilon^2)$ gradients? Or, might it be that inaccurate gradient estimates are \emph{necessary} for finding the minimum of a stochastic convex function at an optimal statistical rate? We provide two partial answers to this question. First, we show that a general analyst (queries that may be maliciously chosen) requires $\Omega(1/\epsilon^3)$ samples, ruling out the possibility of a foolproof mechanism. Second, we show that, under certain assumptions on the oracle, $\tilde \Omega(1/\epsilon^{2.5})$ samples are necessary for gradient descent to interact with the oracle. Our results are in contrast to classical bounds that show that $O(1/\epsilon^2)$ samples can optimize the population risk to an accuracy of $O(\epsilon)$, but with spurious gradients.  ( 2 min )
    Guaranteed Conformance of Neurosymbolic Models to Natural Constraints. (arXiv:2212.01346v4 [cs.LG] UPDATED)
    Deep neural networks have emerged as the workhorse for a large section of robotics and control applications, especially as models for dynamical systems. Such data-driven models are in turn used for designing and verifying autonomous systems. This is particularly useful in modeling medical systems where data can be leveraged to individualize treatment. In safety-critical applications, it is important that the data-driven model is conformant to established knowledge from the natural sciences. Such knowledge is often available or can often be distilled into a (possibly black-box) model $M$. For instance, the unicycle model (which encodes Newton's laws) for an F1 racing car. In this light, we consider the following problem - given a model $M$ and state transition dataset, we wish to best approximate the system model while being bounded distance away from $M$. We propose a method to guarantee this conformance. Our first step is to distill the dataset into few representative samples called memories, using the idea of a growing neural gas. Next, using these memories we partition the state space into disjoint subsets and compute bounds that should be respected by the neural network, when the input is drawn from a particular subset. This serves as a symbolic wrapper for guaranteed conformance. We argue theoretically that this only leads to bounded increase in approximation error; which can be controlled by increasing the number of memories. We experimentally show that on three case studies (Car Model, Drones, and Artificial Pancreas), our constrained neurosymbolic models conform to specified $M$ models (each encoding various constraints) with order-of-magnitude improvements compared to the augmented Lagrangian and vanilla training methods. Our code can be found at https://github.com/kaustubhsridhar/Constrained_Models  ( 3 min )
    User-Aware Algorithmic Recourse with Preference Elicitation. (arXiv:2205.13743v2 [cs.LG] UPDATED)
    Counterfactual interventions are a powerful tool to explain the decisions of a black-box decision process and to enable algorithmic recourse. They are a sequence of actions that, if performed by a user, can overturn an unfavourable decision made by an automated decision system. However, most of the current methods provide interventions without considering the user's preferences. In this work, we propose a shift of paradigm by providing a novel formalization which considers the user as an active part of the process rather than a mere target. Following the preference elicitation setting, we introduce the first human-in-the-loop approach to perform algorithmic recourse. We also present a polynomial procedure to ask questions which maximize the Expected Utility of Selection (EUS), a measure of the utility of the choice set that accounts for the uncertainty with respect to both the model and the user response. We use it to iteratively refine our cost estimates in a Bayesian fashion. We integrate this preference elicitation strategy into a reinforcement learning agent coupled with Monte Carlo Tree Search for the efficient exploration, so as to provide personalized interventions achieving algorithmic recourse. An experimental evaluation of synthetic and real-world datasets shows that a handful of queries allows for achieving a substantial reduction in the cost of interventions with respect to user-independent alternatives.  ( 2 min )
    Look where you look! Saliency-guided Q-networks for generalization in visual Reinforcement Learning. (arXiv:2209.09203v3 [cs.LG] UPDATED)
    Deep reinforcement learning policies, despite their outstanding efficiency in simulated visual control tasks, have shown disappointing ability to generalize across disturbances in the input training images. Changes in image statistics or distracting background elements are pitfalls that prevent generalization and real-world applicability of such control policies. We elaborate on the intuition that a good visual policy should be able to identify which pixels are important for its decision, and preserve this identification of important sources of information across images. This implies that training of a policy with small generalization gap should focus on such important pixels and ignore the others. This leads to the introduction of saliency-guided Q-networks (SGQN), a generic method for visual reinforcement learning, that is compatible with any value function learning method. SGQN vastly improves the generalization capability of Soft Actor-Critic agents and outperforms existing stateof-the-art methods on the Deepmind Control Generalization benchmark, setting a new reference in terms of training efficiency, generalization gap, and policy interpretability.  ( 2 min )
    Latent Neural ODEs with Sparse Bayesian Multiple Shooting. (arXiv:2210.03466v2 [cs.LG] UPDATED)
    Training dynamic models, such as neural ODEs, on long trajectories is a hard problem that requires using various tricks, such as trajectory splitting, to make model training work in practice. These methods are often heuristics with poor theoretical justifications, and require iterative manual tuning. We propose a principled multiple shooting technique for neural ODEs that splits the trajectories into manageable short segments, which are optimised in parallel, while ensuring probabilistic control on continuity over consecutive segments. We derive variational inference for our shooting-based latent neural ODE models and propose amortized encodings of irregularly sampled trajectories with a transformer-based recognition network with temporal attention and relative positional encoding. We demonstrate efficient and stable training, and state-of-the-art performance on multiple large-scale benchmark datasets.  ( 2 min )
    Versatile Skill Control via Self-supervised Adversarial Imitation of Unlabeled Mixed Motions. (arXiv:2209.07899v2 [cs.RO] UPDATED)
    Learning diverse skills is one of the main challenges in robotics. To this end, imitation learning approaches have achieved impressive results. These methods require explicitly labeled datasets or assume consistent skill execution to enable learning and active control of individual behaviors, which limits their applicability. In this work, we propose a cooperative adversarial method for obtaining single versatile policies with controllable skill sets from unlabeled datasets containing diverse state transition patterns by maximizing their discriminability. Moreover, we show that by utilizing unsupervised skill discovery in the generative adversarial imitation learning framework, novel and useful skills emerge with successful task fulfillment. Finally, the obtained versatile policies are tested on an agile quadruped robot called Solo 8 and present faithful replications of diverse skills encoded in the demonstrations.  ( 2 min )
    Identify ambiguous tasks combining crowdsourced labels by weighting Areas Under the Margin. (arXiv:2209.15380v2 [cs.LG] UPDATED)
    In supervised learning - for instance in image classification - modern massive datasets are commonly labeled by a crowd of workers. The obtained labels in this crowdsourcing setting are then aggregated for training. The aggregation step generally leverages a per-worker trust score. Yet, such worker-centric approaches discard each task's ambiguity. Some intrinsically ambiguous tasks might even fool expert workers, which could eventually be harmful to the learning step. In a standard supervised learning setting - with one label per task - the Area Under the Margin (AUM) is tailored to identify mislabeled data. We adapt the AUM to identify ambiguous tasks in crowdsourced learning scenarios, introducing the Weighted AUM (WAUM). The WAUM is an average of AUMs weighted by task-dependent scores. We show that the WAUM can help discard ambiguous tasks from the training set, leading to better generalization or calibration performance. We report improvements over existing strategies for learning a crowd, both for simulated settings and for the CIFAR-10H, LabelMe and Music crowdsourced datasets.  ( 2 min )
    Reconstruction-guided attention improves the robustness and shape processing of neural networks. (arXiv:2209.13620v2 [cs.CV] UPDATED)
    Many visual phenomena suggest that humans use top-down generative or reconstructive processes to create visual percepts (e.g., imagery, object completion, pareidolia), but little is known about the role reconstruction plays in robust object recognition. We built an iterative encoder-decoder network that generates an object reconstruction and used it as top-down attentional feedback to route the most relevant spatial and feature information to feed-forward object recognition processes. We tested this model using the challenging out-of-distribution digit recognition dataset, MNIST-C, where 15 different types of transformation and corruption are applied to handwritten digit images. Our model showed strong generalization performance against various image perturbations, on average outperforming all other models including feedforward CNNs and adversarially trained networks. Our model is particularly robust to blur, noise, and occlusion corruptions, where shape perception plays an important role. Ablation studies further reveal two complementary roles of spatial and feature-based attention in robust object recognition, with the former largely consistent with spatial masking benefits in the attention literature (the reconstruction serves as a mask) and the latter mainly contributing to the model's inference speed (i.e., number of time steps to reach a certain confidence threshold) by reducing the space of possible object hypotheses. We also observed that the model sometimes hallucinates a non-existing pattern out of noise, leading to highly interpretable human-like errors. Our study shows that modeling reconstruction-based feedback endows AI systems with a powerful attention mechanism, which can help us understand the role of generating perception in human visual processing.  ( 2 min )
    Scalars are universal: Equivariant machine learning, structured like classical physics. (arXiv:2106.06610v4 [cs.LG] UPDATED)
    There has been enormous progress in the last few years in designing neural networks that respect the fundamental symmetries and coordinate freedoms of physical law. Some of these frameworks make use of irreducible representations, some make use of high-order tensor objects, and some apply symmetry-enforcing constraints. Different physical laws obey different combinations of fundamental symmetries, but a large fraction (possibly all) of classical physics is equivariant to translation, rotation, reflection (parity), boost (relativity), and permutations. Here we show that it is simple to parameterize universally approximating polynomial functions that are equivariant under these symmetries, or under the Euclidean, Lorentz, and Poincar\'e groups, at any dimensionality $d$. The key observation is that nonlinear O($d$)-equivariant (and related-group-equivariant) functions can be universally expressed in terms of a lightweight collection of scalars -- scalar products and scalar contractions of the scalar, vector, and tensor inputs. We complement our theory with numerical examples that show that the scalar-based method is simple, efficient, and scalable.  ( 2 min )
    What do we learn? Debunking the Myth of Unsupervised Outlier Detection. (arXiv:2206.03698v2 [cs.CV] UPDATED)
    Even though auto-encoders (AEs) have the desirable property of learning compact representations without labels and have been widely applied to out-of-distribution (OoD) detection, they are generally still poorly understood and are used incorrectly in detecting outliers where the normal and abnormal distributions are strongly overlapping. In general, the learned manifold is assumed to contain key information that is only important for describing samples within the training distribution, and that the reconstruction of outliers leads to high residual errors. However, recent work suggests that AEs are likely to be even better at reconstructing some types of OoD samples. In this work, we challenge this assumption and investigate what auto-encoders actually learn when they are posed to solve two different tasks. First, we propose two metrics based on the Fr\'echet inception distance (FID) and confidence scores of a trained classifier to assess whether AEs can learn the training distribution and reliably recognize samples from other domains. Second, we investigate whether AEs are able to synthesize normal images from samples with abnormal regions, on a more challenging lung pathology detection task. We have found that state-of-the-art (SOTA) AEs are either unable to constrain the latent manifold and allow reconstruction of abnormal patterns, or they are failing to accurately restore the inputs from their latent distribution, resulting in blurred or misaligned reconstructions. We propose novel deformable auto-encoders (MorphAEus) to learn perceptually aware global image priors and locally adapt their morphometry based on estimated dense deformation fields. We demonstrate superior performance over unsupervised methods in detecting OoD and pathology.  ( 2 min )
    Equivariance-aware Architectural Optimization of Neural Networks. (arXiv:2210.05484v3 [cs.LG] UPDATED)
    Incorporating equivariance to symmetry groups as a constraint during neural network training can improve performance and generalization for tasks exhibiting those symmetries, but such symmetries are often not perfectly nor explicitly present. This motivates algorithmically optimizing the architectural constraints imposed by equivariance. We propose the equivariance relaxation morphism, which preserves functionality while reparameterizing a group equivariant layer to operate with equivariance constraints on a subgroup, as well as the [G]-mixed equivariant layer, which mixes layers constrained to different groups to enable within-layer equivariance optimization. We further present evolutionary and differentiable neural architecture search (NAS) algorithms that utilize these mechanisms respectively for equivariance-aware architectural optimization. Experiments across a variety of datasets show the benefit of dynamically constrained equivariance to find effective architectures with approximate equivariance.  ( 2 min )
    Streaming Encoding Algorithms for Scalable Hyperdimensional Computing. (arXiv:2209.09868v4 [cs.LG] UPDATED)
    Hyperdimensional computing (HDC) is a paradigm for data representation and learning originating in computational neuroscience. HDC represents data as high-dimensional, low-precision vectors which can be used for a variety of information processing tasks like learning or recall. The mapping to high-dimensional space is a fundamental problem in HDC, and existing methods encounter scalability issues when the input data itself is high-dimensional. In this work, we explore a family of streaming encoding techniques based on hashing. We show formally that these methods enjoy comparable guarantees on performance for learning applications while being substantially more efficient than existing alternatives. We validate these results experimentally on a popular high-dimensional classification problem and show that our approach easily scales to very large data sets.
    A Graph Neural Network Approach to Automated Model Building in Cryo-EM Maps. (arXiv:2210.00006v3 [q-bio.QM] UPDATED)
    Electron cryo-microscopy (cryo-EM) produces three-dimensional (3D) maps of the electrostatic potential of biological macromolecules, including proteins. Along with knowledge about the imaged molecules, cryo-EM maps allow de novo atomic modelling, which is typically done through a laborious manual process. Taking inspiration from recent advances in machine learning applications to protein structure prediction, we propose a graph neural network (GNN) approach for automated model building of proteins in cryo-EM maps. The GNN acts on a graph with nodes assigned to individual amino acids and edges representing the protein chain. Combining information from the voxel-based cryo-EM data, the amino acid sequence data and prior knowledge about protein geometries, the GNN refines the geometry of the protein chain and classifies the amino acids for each of its nodes. Application to 28 test cases shows that our approach outperforms the state-of-the-art and approximates manual building for cryo-EM maps with resolutions better than 3.5 \r{A}.
    Flow Matching for Generative Modeling. (arXiv:2210.02747v2 [cs.LG] UPDATED)
    We introduce a new paradigm for generative modeling built on Continuous Normalizing Flows (CNFs), allowing us to train CNFs at unprecedented scale. Specifically, we present the notion of Flow Matching (FM), a simulation-free approach for training CNFs based on regressing vector fields of fixed conditional probability paths. Flow Matching is compatible with a general family of Gaussian probability paths for transforming between noise and data samples -- which subsumes existing diffusion paths as specific instances. Interestingly, we find that employing FM with diffusion paths results in a more robust and stable alternative for training diffusion models. Furthermore, Flow Matching opens the door to training CNFs with other, non-diffusion probability paths. An instance of particular interest is using Optimal Transport (OT) displacement interpolation to define the conditional probability paths. These paths are more efficient than diffusion paths, provide faster training and sampling, and result in better generalization. Training CNFs using Flow Matching on ImageNet leads to consistently better performance than alternative diffusion-based methods in terms of both likelihood and sample quality, and allows fast and reliable sample generation using off-the-shelf numerical ODE solvers.
    Quantifying Context Mixing in Transformers. (arXiv:2301.12971v2 [cs.CL] UPDATED)
    Self-attention weights and their transformed variants have been the main source of information for analyzing token-to-token interactions in Transformer-based models. But despite their ease of interpretation, these weights are not faithful to the models' decisions as they are only one part of an encoder, and other components in the encoder layer can have considerable impact on information mixing in the output representations. In this work, by expanding the scope of analysis to the whole encoder block, we propose Value Zeroing, a novel context mixing score customized for Transformers that provides us with a deeper understanding of how information is mixed at each encoder layer. We demonstrate the superiority of our context mixing score over other analysis methods through a series of complementary evaluations with different viewpoints based on linguistically informed rationales, probing, and faithfulness analysis.
    Physics-Constrained Climate Downscaling. (arXiv:2208.05424v2 [physics.ao-ph] UPDATED)
    The availability of reliable, high-resolution climate and weather data is important to inform long-term decisions on climate adaptation and mitigation and to guide rapid responses to extreme events. Forecasting models are limited by computational costs and, therefore, often generate coarse-resolution predictions. Statistical downscaling, including super-resolution methods from deep learning, can provide an efficient method of upsampling low-resolution data. However, despite achieving visually compelling results in some cases, such models frequently violate conservation laws when predicting physical variables. In order to conserve physical quantities, we develop methods that guarantee physical constraints are satisfied by a deep learning downscaling model while also improving their performance according to traditional metrics. We compare different constraining approaches and demonstrate their applicability across different neural architectures as well as a variety of climate and weather datasets. Besides enabling faster and more accurate climate predictions, we also show that our novel methodologies can improve super-resolution for satellite data and standard datasets.
    Semi-Supervised Offline Reinforcement Learning with Action-Free Trajectories. (arXiv:2210.06518v2 [cs.LG] UPDATED)
    Natural agents can effectively learn from multiple data sources that differ in size, quality, and types of measurements. We study this heterogeneity in the context of offline reinforcement learning (RL) by introducing a new, practically motivated semi-supervised setting. Here, an agent has access to two sets of trajectories: labelled trajectories containing state, action, reward triplets at every timestep, along with unlabelled trajectories that contain only state and reward information. For this setting, we develop and study a simple meta-algorithmic pipeline that learns an inverse dynamics model on the labelled data to obtain proxy-labels for the unlabelled data, followed by the use of any offline RL algorithm on the true and proxy-labelled trajectories. Empirically, we find this simple pipeline to be highly successful - on several D4RL benchmarks, certain offline RL algorithms can match the performance of variants trained on a fully labelled dataset even when we label only 10% trajectories from the low return regime. To strengthen our understanding, we perform a large-scale controlled empirical study investigating the interplay of data-centric properties of the labelled and unlabelled datasets, with algorithmic design choices (e.g., choice of inverse dynamics, offline RL algorithm) to identify general trends and best practices for training RL agents on semi-supervised offline datasets.
    Rank-1 Matrix Completion with Gradient Descent and Small Random Initialization. (arXiv:2212.09396v2 [stat.ML] UPDATED)
    The nonconvex formulation of matrix completion problem has received significant attention in recent years due to its affordable complexity compared to the convex formulation. Gradient descent (GD) is the simplest yet efficient baseline algorithm for solving nonconvex optimization problems. The success of GD has been witnessed in many different problems in both theory and practice when it is combined with random initialization. However, previous works on matrix completion require either careful initialization or regularizers to prove the convergence of GD. In this work, we study the rank-1 symmetric matrix completion and prove that GD converges to the ground truth when small random initialization is used. We show that in logarithmic amount of iterations, the trajectory enters the region where local convergence occurs. We provide an upper bound on the initialization size that is sufficient to guarantee the convergence and show that a larger initialization can be used as more samples are available. We observe that implicit regularization effect of GD plays a critical role in the analysis, and for the entire trajectory, it prevents each entry from becoming much larger than the others.
    Convolutional Learning on Multigraphs. (arXiv:2209.11354v2 [cs.LG] UPDATED)
    Graph convolutional learning has led to many exciting discoveries in diverse areas. However, in some applications, traditional graphs are insufficient to capture the structure and intricacies of the data. In such scenarios, multigraphs arise naturally as discrete structures in which complex dynamics can be embedded. In this paper, we develop convolutional information processing on multigraphs and introduce convolutional multigraph neural networks (MGNNs). To capture the complex dynamics of information diffusion within and across each of the multigraph's classes of edges, we formalize a convolutional signal processing model, defining the notions of signals, filtering, and frequency representations on multigraphs. Leveraging this model, we develop a multigraph learning architecture, including a sampling procedure to reduce computational complexity. The introduced architecture is applied towards optimal wireless resource allocation and a hate speech localization task, offering improved performance over traditional graph neural networks.
    Relative Probability on Finite Outcome Spaces: A Systematic Examination of its Axiomatization, Properties, and Applications. (arXiv:2212.14555v2 [stat.ML] UPDATED)
    This work proposes a view of probability as a relative measure rather than an absolute one. To demonstrate this concept, we focus on finite outcome spaces and develop three fundamental axioms that establish requirements for relative probability functions. We then provide a library of examples of these functions and a system for composing them. Additionally, we discuss a relative version of Bayesian inference and its digital implementation. Finally, we prove the topological closure of the relative probability space, highlighting its ability to preserve information under limits.
    WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. (arXiv:2207.01206v4 [cs.CL] UPDATED)
    Existing benchmarks for grounding language in interactive environments either lack real-world linguistic elements, or prove difficult to scale up due to substantial human involvement in the collection of data or feedback signals. To bridge this gap, we develop WebShop -- a simulated e-commerce website environment with $1.18$ million real-world products and $12,087$ crowd-sourced text instructions. Given a text instruction specifying a product requirement, an agent needs to navigate multiple types of webpages and issue diverse actions to find, customize, and purchase an item. WebShop provides several challenges for language grounding including understanding compositional instructions, query (re-)formulation, comprehending and acting on noisy text in webpages, and performing strategic exploration. We collect over $1,600$ human demonstrations for the task, and train and evaluate a diverse range of agents using reinforcement learning, imitation learning, and pre-trained image and language models. Our best model achieves a task success rate of $29\%$, which outperforms rule-based heuristics ($9.6\%$) but is far lower than human expert performance ($59\%$). We also analyze agent and human trajectories and ablate various model components to provide insights for developing future agents with stronger language understanding and decision making abilities. Finally, we show that agents trained on WebShop exhibit non-trivial sim-to-real transfer when evaluated on amazon.com and ebay.com, indicating the potential value of WebShop in developing practical web-based agents that can operate in the wild.
    On Grounded Planning for Embodied Tasks with Language Models. (arXiv:2209.00465v2 [cs.AI] UPDATED)
    Language models (LMs) have demonstrated their capability in possessing commonsense knowledge of the physical world, a crucial aspect of performing tasks in everyday life. However, it remains unclear whether they have the capacity to generate grounded, executable plans for embodied tasks. This is a challenging task as LMs lack the ability to perceive the environment through vision and feedback from the physical environment. In this paper, we address this important research question and present the first investigation into the topic. Our novel problem formulation, named G-PlanET, inputs a high-level goal and a data table about objects in a specific environment, and then outputs a step-by-step actionable plan for a robotic agent to follow. To facilitate the study, we establish an evaluation protocol and design a dedicated metric, KAS, to assess the quality of the plans. Our experiments demonstrate that the use of tables for encoding the environment and an iterative decoding strategy can significantly enhance the LMs' ability in grounded planning. Our analysis also reveals interesting and non-trivial findings.
    LLEDA -- Lifelong Self-Supervised Domain Adaptation. (arXiv:2211.09027v2 [cs.LG] UPDATED)
    Humans and animals have the ability to continuously learn new information over their lifetime without losing previously acquired knowledge. However, artificial neural networks struggle with this due to new information conflicting with old knowledge, resulting in catastrophic forgetting. The complementary learning systems (CLS) theory suggests that the interplay between hippocampus and neocortex systems enables long-term and efficient learning in the mammalian brain, with memory replay facilitating the interaction between these two systems to reduce forgetting. The proposed Lifelong Self-Supervised Domain Adaptation (LLEDA) framework draws inspiration from the CLS theory and mimics the interaction between two networks: a DA network inspired by the hippocampus that quickly adjusts to changes in data distribution and an SSL network inspired by the neocortex that gradually learns domain-agnostic general representations. LLEDA's latent replay technique facilitates communication between these two networks by reactivating and replaying the past memory latent representations to stabilise long-term generalisation and retention without interfering with the previously learned information. Extensive experiments demonstrate that the proposed method outperforms several other methods resulting in a long-term adaptation while being less prone to catastrophic forgetting when transferred to new domains.
    Transformer Neural Processes: Uncertainty-Aware Meta Learning Via Sequence Modeling. (arXiv:2207.04179v2 [cs.LG] UPDATED)
    Neural Processes (NPs) are a popular class of approaches for meta-learning. Similar to Gaussian Processes (GPs), NPs define distributions over functions and can estimate uncertainty in their predictions. However, unlike GPs, NPs and their variants suffer from underfitting and often have intractable likelihoods, which limit their applications in sequential decision making. We propose Transformer Neural Processes (TNPs), a new member of the NP family that casts uncertainty-aware meta learning as a sequence modeling problem. We learn TNPs via an autoregressive likelihood-based objective and instantiate it with a novel transformer-based architecture. The model architecture respects the inductive biases inherent to the problem structure, such as invariance to the observed data points and equivariance to the unobserved points. We further investigate knobs within the TNP framework that tradeoff expressivity of the decoding distribution with extra computation. Empirically, we show that TNPs achieve state-of-the-art performance on various benchmark problems, outperforming all previous NP variants on meta regression, image completion, contextual multi-armed bandits, and Bayesian optimization.
    Adversarial Self-Attention for Language Understanding. (arXiv:2206.12608v3 [cs.CL] UPDATED)
    Deep neural models (e.g. Transformer) naturally learn spurious features, which create a ``shortcut'' between the labels and inputs, thus impairing the generalization and robustness. This paper advances the self-attention mechanism to its robust variant for Transformer-based pre-trained language models (e.g. BERT). We propose \textit{Adversarial Self-Attention} mechanism (ASA), which adversarially biases the attentions to effectively suppress the model reliance on features (e.g. specific keywords) and encourage its exploration of broader semantics. We conduct a comprehensive evaluation across a wide range of tasks for both pre-training and fine-tuning stages. For pre-training, ASA unfolds remarkable performance gains compared to naive training for longer steps. For fine-tuning, ASA-empowered models outweigh naive models by a large margin considering both generalization and robustness.
    Bandwidth Selection for Gaussian Kernel Ridge Regression via Jacobian Control. (arXiv:2205.11956v2 [stat.ML] UPDATED)
    Most machine learning methods require tuning of hyper-parameters. For kernel ridge regression (KRR) with the Gaussian kernel, the hyper-parameter is the bandwidth. The bandwidth specifies the length-scale of the kernel and has to be carefully selected in order to obtain a model with good generalization. The default method for bandwidth selection is cross-validation which often yields good results, albeit at high computational costs. Furthermore, the estimates provided by cross-validation tend to have very high variance, especially when training data are scarce. Inspired by Jacobian regularization, we formulate an approximate expression for how the derivatives of the functions inferred by KRR with the Gaussian kernel depend on the kernel bandwidth. We then use this expression to propose a closed-form, computationally feather-light, bandwidth selection heuristic based on controlling the Jacobian. In addition, the Jacobian expression illuminates how the bandwidth selection is a trade-off between the smoothness of the inferred function, and the conditioning of the training data kernel matrix. We show on real and synthetic data that compared to cross-validation, our method is considerably more stable in terms of bandwidth selection, and, for small data sets, provides better predictions.
    Modified Policy Iteration for Exponential Cost Risk Sensitive MDPs. (arXiv:2302.03811v1 [cs.LG])
    Modified policy iteration (MPI) also known as optimistic policy iteration is at the core of many reinforcement learning algorithms. It works by combining elements of policy iteration and value iteration. The convergence of MPI has been well studied in the case of discounted and average-cost MDPs. In this work, we consider the exponential cost risk-sensitive MDP formulation, which is known to provide some robustness to model parameters. Although policy iteration and value iteration have been well studied in the context of risk sensitive MDPs, modified policy iteration is relatively unexplored. We provide the first proof that MPI also converges for the risk-sensitive problem in the case of finite state and action spaces. Since the exponential cost formulation deals with the multiplicative Bellman equation, our main contribution is a convergence proof which is quite different than existing results for discounted and risk-neutral average-cost problems. The proof of approximate modified policy iteration for risk sensitive MDPs is also provided in the appendix.  ( 2 min )
    Provably Efficient Offline Goal-Conditioned Reinforcement Learning with General Function Approximation and Single-Policy Concentrability. (arXiv:2302.03770v1 [cs.LG])
    Goal-conditioned reinforcement learning (GCRL) refers to learning general-purpose skills which aim to reach diverse goals. In particular, offline GCRL only requires purely pre-collected datasets to perform training tasks without additional interactions with the environment. Although offline GCRL has become increasingly prevalent and many previous works have demonstrated its empirical success, the theoretical understanding of efficient offline GCRL algorithms is not well established, especially when the state space is huge and the offline dataset only covers the policy we aim to learn. In this paper, we propose a novel provably efficient algorithm (the sample complexity is $\tilde{O}({\rm poly}(1/\epsilon))$ where $\epsilon$ is the desired suboptimality of the learned policy) with general function approximation. Our algorithm only requires nearly minimal assumptions of the dataset (single-policy concentrability) and the function class (realizability). Moreover, our algorithm consists of two uninterleaved optimization steps, which we refer to as $V$-learning and policy learning, and is computationally stable since it does not involve minimax optimization. To the best of our knowledge, this is the first algorithm with general function approximation and single-policy concentrability that is both statistically efficient and computationally stable.  ( 2 min )
    Self-Supervised Unseen Object Instance Segmentation via Long-Term Robot Interaction. (arXiv:2302.03793v1 [cs.RO])
    We introduce a novel robotic system for improving unseen object instance segmentation in the real world by leveraging long-term robot interaction with objects. Previous approaches either grasp or push an object and then obtain the segmentation mask of the grasped or pushed object after one action. Instead, our system defers the decision on segmenting objects after a sequence of robot pushing actions. By applying multi-object tracking and video object segmentation on the images collected via robot pushing, our system can generate segmentation masks of all the objects in these images in a self-supervised way. These include images where objects are very close to each other, and segmentation errors usually occur on these images for existing object segmentation networks. We demonstrate the usefulness of our system by fine-tuning segmentation networks trained on synthetic data with real-world data collected by our system. We show that, after fine-tuning, the segmentation accuracy of the networks is significantly improved both in the same domain and across different domains. In addition, we verify that the fine-tuned networks improve top-down robotic grasping of unseen objects in the real world.  ( 2 min )
    Graph Signal Sampling for Inductive One-Bit Matrix Completion: a Closed-form Solution. (arXiv:2302.03933v1 [cs.LG])
    Inductive one-bit matrix completion is motivated by modern applications such as recommender systems, where new users would appear at test stage with the ratings consisting of only ones and no zeros. We propose a unified graph signal sampling framework which enjoys the benefits of graph signal analysis and processing. The key idea is to transform each user's ratings on the items to a function (signal) on the vertices of an item-item graph, then learn structural graph properties to recover the function from its values on certain vertices -- the problem of graph signal sampling. We propose a class of regularization functionals that takes into account discrete random label noise in the graph vertex domain, then develop the GS-IMC approach which biases the reconstruction towards functions that vary little between adjacent vertices for noise reduction. Theoretical result shows that accurate reconstructions can be achieved under mild conditions. For the online setting, we develop a Bayesian extension, i.e., BGS-IMC which considers continuous random Gaussian noise in the graph Fourier domain and builds upon a prediction-correction update algorithm to obtain the unbiased and minimum-variance reconstruction. Both GS-IMC and BGS-IMC have closed-form solutions and thus are highly scalable in large data. Experiments show that our methods achieve state-of-the-art performance on public benchmarks.  ( 2 min )
    Non-Stationary Bandits with Knapsack Problems with Advice. (arXiv:2302.04182v1 [cs.LG])
    We consider a non-stationary Bandits with Knapsack problem. The outcome distribution at each time is scaled by a non-stationary quantity that signifies changing demand volumes. Instead of studying settings with limited non-stationarity, we investigate how online predictions on the total demand volume $Q$ allows us to improve our performance guarantees. We show that, without any prediction, any online algorithm incurs a linear-in-$T$ regret. In contrast, with online predictions on $Q$, we propose an online algorithm that judiciously incorporates the predictions, and achieve regret bounds that depends on the accuracy of the predictions. These bounds are shown to be tight in settings when prediction accuracy improves across time. Our theoretical results are corroborated by our numerical findings.
    Deep Learning Based Walking Tasks Classification in Older Adults using fNIRS. (arXiv:2102.03987v3 [cs.LG] UPDATED)
    Decline in gait features is common in older adults and an indicator of increased risk of disability, morbidity, and mortality. Under dual task walking (DTW) conditions, further degradation in the performance of both the gait and the secondary cognitive task were found in older adults which were significantly correlated to falls history. Cortical control of gait, specifically in the pre-frontal cortex (PFC) as measured by functional near infrared spectroscopy (fNIRS), during DTW in older adults has recently been studied. However, the automatic classification of differences in cognitive activations under single and dual task gait conditions has not been extensively studied yet. In this paper, we formulate this as a classification task and leverage deep learning to perform automatic classification of STW, DTW and single cognitive task (STA). We conduct analysis on the data samples which reveals the characteristics on the difference between HbO2 and Hb values that are subsequently used as additional features. We perform feature engineering to formulate the fNIRS features as a 3-channel image and apply various image processing techniques for data augmentation to enhance the performance of deep learning models. Experimental results show that pre-trained deep learning models that are fine-tuned using the collected fNIRS dataset together with gender and cognitive status information can achieve around 81\% classification accuracy which is about 10\% higher than the traditional machine learning algorithms. We further perform an ablation study to identify rankings of features such as the fNIRS levels and/or voxel locations on the contribution of the classification task.
    Decision trees compensate for model misspecification. (arXiv:2302.04081v1 [stat.ML])
    The best-performing models in ML are not interpretable. If we can explain why they outperform, we may be able to replicate these mechanisms and obtain both interpretability and performance. One example are decision trees and their descendent gradient boosting machines (GBMs). These perform well in the presence of complex interactions, with tree depth governing the order of interactions. However, interactions cannot fully account for the depth of trees found in practice. We confirm 5 alternative hypotheses about the role of tree depth in performance in the absence of true interactions, and present results from experiments on a battery of datasets. Part of the success of tree models is due to their robustness to various forms of mis-specification. We present two methods for robust generalized linear models (GLMs) addressing the composite and mixed response scenarios.
    Algorithmic Collective Action in Machine Learning. (arXiv:2302.04262v1 [cs.LG])
    We initiate a principled study of algorithmic collective action on digital platforms that deploy machine learning algorithms. We propose a simple theoretical model of a collective interacting with a firm's learning algorithm. The collective pools the data of participating individuals and executes an algorithmic strategy by instructing participants how to modify their own data to achieve a collective goal. We investigate the consequences of this model in three fundamental learning-theoretic settings: the case of a nonparametric optimal learning algorithm, a parametric risk minimizer, and gradient-based optimization. In each setting, we come up with coordinated algorithmic strategies and characterize natural success criteria as a function of the collective's size. Complementing our theory, we conduct systematic experiments on a skill classification task involving tens of thousands of resumes from a gig platform for freelancers. Through more than two thousand model training runs of a BERT-like language model, we see a striking correspondence emerge between our empirical observations and the predictions made by our theory. Taken together, our theory and experiments broadly support the conclusion that algorithmic collectives of exceedingly small fractional size can exert significant control over a platform's learning algorithm.
    Transformers Can Do Bayesian Inference. (arXiv:2112.10510v6 [cs.LG] UPDATED)
    Currently, it is hard to reap the benefits of deep learning for Bayesian methods, which allow the explicit specification of prior knowledge and accurately capture model uncertainty. We present Prior-Data Fitted Networks (PFNs). PFNs leverage large-scale machine learning techniques to approximate a large set of posteriors. The only requirement for PFNs to work is the ability to sample from a prior distribution over supervised learning tasks (or functions). Our method restates the objective of posterior approximation as a supervised classification problem with a set-valued input: it repeatedly draws a task (or function) from the prior, draws a set of data points and their labels from it, masks one of the labels and learns to make probabilistic predictions for it based on the set-valued input of the rest of the data points. Presented with a set of samples from a new supervised learning task as input, PFNs make probabilistic predictions for arbitrary other data points in a single forward propagation, having learned to approximate Bayesian inference. We demonstrate that PFNs can near-perfectly mimic Gaussian processes and also enable efficient Bayesian inference for intractable problems, with over 200-fold speedups in multiple setups compared to current methods. We obtain strong results in very diverse areas such as Gaussian process regression, Bayesian neural networks, classification for small tabular data sets, and few-shot image classification, demonstrating the generality of PFNs. Code and trained PFNs are released at https://github.com/automl/TransformersCanDoBayesianInference.
    WF-UNet: Weather Fusion UNet for Precipitation Nowcasting. (arXiv:2302.04102v1 [cs.LG])
    Designing early warning systems for harsh weather and its effects, such as urban flooding or landslides, requires accurate short-term forecasts (nowcasts) of precipitation. Nowcasting is a significant task with several environmental applications, such as agricultural management or increasing flight safety. In this study, we investigate the use of a UNet core-model and its extension for precipitation nowcasting in western Europe for up to 3 hours ahead. In particular, we propose the Weather Fusion UNet (WF-UNet) model, which utilizes the Core 3D-UNet model and integrates precipitation and wind speed variables as input in the learning process and analyze its influences on the precipitation target task. We have collected six years of precipitation and wind radar images from Jan 2016 to Dec 2021 of 14 European countries, with 1-hour temporal resolution and 31 square km spatial resolution based on the ERA5 dataset, provided by Copernicus, the European Union's Earth observation programme. We compare the proposed WF-UNet model to persistence model as well as other UNet based architectures that are trained only using precipitation radar input data. The obtained results show that WF-UNet outperforms the other examined best-performing architectures by 22%, 8% and 6% lower MSE at a horizon of 1, 2 and 3 hours respectively.
    Improved Langevin Monte Carlo for stochastic optimization via landscape modification. (arXiv:2302.03973v1 [math.PR])
    Given a target function $H$ to minimize or a target Gibbs distribution $\pi_{\beta}^0 \propto e^{-\beta H}$ to sample from in the low temperature, in this paper we propose and analyze Langevin Monte Carlo (LMC) algorithms that run on an alternative landscape as specified by $H^f_{\beta,c,1}$ and target a modified Gibbs distribution $\pi^f_{\beta,c,1} \propto e^{-\beta H^f_{\beta,c,1}}$, where the landscape of $H^f_{\beta,c,1}$ is a transformed version of that of $H$ which depends on the parameters $f,\beta$ and $c$. While the original Log-Sobolev constant affiliated with $\pi^0_{\beta}$ exhibits exponential dependence on both $\beta$ and the energy barrier $M$ in the low temperature regime, with appropriate tuning of these parameters and subject to assumptions on $H$, we prove that the energy barrier of the transformed landscape is reduced which consequently leads to polynomial dependence on both $\beta$ and $M$ in the modified Log-Sobolev constant associated with $\pi^f_{\beta,c,1}$. This yield improved total variation mixing time bounds and improved convergence toward a global minimum of $H$. We stress that the technique developed in this paper is not only limited to LMC and is broadly applicable to other gradient-based optimization or sampling algorithms.
    The Modern Mathematics of Deep Learning. (arXiv:2105.04026v2 [cs.LG] UPDATED)
    We describe the new field of mathematical analysis of deep learning. This field emerged around a list of research questions that were not answered within the classical framework of learning theory. These questions concern: the outstanding generalization power of overparametrized neural networks, the role of depth in deep architectures, the apparent absence of the curse of dimensionality, the surprisingly successful optimization performance despite the non-convexity of the problem, understanding what features are learned, why deep architectures perform exceptionally well in physical problems, and which fine aspects of an architecture affect the behavior of a learning task in which way. We present an overview of modern approaches that yield partial answers to these questions. For selected approaches, we describe the main ideas in more detail.
    Machine Learning for Synthetic Data Generation: a Review. (arXiv:2302.04062v1 [cs.LG])
    Data plays a crucial role in machine learning. However, in real-world applications, there are several problems with data, e.g., data are of low quality; a limited number of data points lead to under-fitting of the machine learning model; it is hard to access the data due to privacy, safety and regulatory concerns. \textit{Synthetic data generation} offers a promising new avenue, as it can be shared and used in ways that real-world data cannot. This paper systematically reviews the existing works that leverage machine learning models for synthetic data generation. Specifically, we discuss the synthetic data generation works from several perspectives: (i) applications, including computer vision, speech, natural language, healthcare, and business; (ii) machine learning methods, particularly neural network architectures and deep generative models; (iii) privacy and fairness issue. In addition, we identify the challenges and opportunities in this emerging field and suggest future research directions.
    On the Complexity of Computing G\"odel Numbers. (arXiv:2302.04213v1 [math.LO])
    Given a computable sequence of natural numbers, it is a natural task to find a G\"odel number of a program that generates this sequence. It is easy to see that this problem is neither continuous nor computable. In algorithmic learning theory this problem is well studied from several perspectives and one question studied there is for which sequences this problem is at least learnable in the limit. Here we study the problem on all computable sequences and we classify the Weihrauch complexity of it. For this purpose we can, among other methods, utilize the amalgamation technique known from learning theory. As a benchmark for the classification we use closed and compact choice problems and their jumps on natural numbers, and we argue that these problems correspond to induction and boundedness principles, as they are known from the Kirby-Paris hierarchy in reverse mathematics. We provide a topological as well as a computability-theoretic classification, which reveal some significant differences.
    A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech. (arXiv:2302.04215v1 [eess.AS])
    Recent Text-to-Speech (TTS) systems trained on reading or acted corpora have achieved near human-level naturalness. The diversity of human speech, however, often goes beyond the coverage of these corpora. We believe the ability to handle such diversity is crucial for AI systems to achieve human-level communication. Our work explores the use of more abundant real-world data for building speech synthesizers. We train TTS systems using real-world speech from YouTube and podcasts. We observe the mismatch between training and inference alignments in mel-spectrogram based autoregressive models, leading to unintelligible synthesis, and demonstrate that learned discrete codes within multiple code groups effectively resolves this issue. We introduce our MQTTS system whose architecture is designed for multiple code generation and monotonic alignment, along with the use of a clean silence prompt to improve synthesis quality. We conduct ablation analyses to identify the efficacy of our methods. We show that MQTTS outperforms existing TTS systems in several objective and subjective measures.
    PreCNet: Next-Frame Video Prediction Based on Predictive Coding. (arXiv:2004.14878v3 [cs.CV] UPDATED)
    Predictive coding, currently a highly influential theory in neuroscience, has not been widely adopted in machine learning yet. In this work, we transform the seminal model of Rao and Ballard (1999) into a modern deep learning framework while remaining maximally faithful to the original schema. The resulting network we propose (PreCNet) is tested on a widely used next frame video prediction benchmark, which consists of images from an urban environment recorded from a car-mounted camera, and achieves state-of-the-art performance. Performance on all measures (MSE, PSNR, SSIM) was further improved when a larger training set (2M images from BDD100k), pointing to the limitations of the KITTI training set. This work demonstrates that an architecture carefully based in a neuroscience model, without being explicitly tailored to the task at hand, can exhibit exceptional performance.
    Combining Variational Autoencoders and Physical Bias for Improved Microscopy Data Analysis. (arXiv:2302.04216v1 [cs.LG])
    Electron and scanning probe microscopy produce vast amounts of data in the form of images or hyperspectral data, such as EELS or 4D STEM, that contain information on a wide range of structural, physical, and chemical properties of materials. To extract valuable insights from these data, it is crucial to identify physically separate regions in the data, such as phases, ferroic variants, and boundaries between them. In order to derive an easily interpretable feature analysis, combining with well-defined boundaries in a principled and unsupervised manner, here we present a physics augmented machine learning method which combines the capability of Variational Autoencoders to disentangle factors of variability within the data and the physics driven loss function that seeks to minimize the total length of the discontinuities in images corresponding to latent representations. Our method is applied to various materials, including NiO-LSMO, BiFeO3, and graphene. The results demonstrate the effectiveness of our approach in extracting meaningful information from large volumes of imaging data. The fully notebook containing implementation of the code and analysis workflow is available at https://github.com/arpanbiswas52/PaperNotebooks
    The Test of Tests: A Framework For Differentially Private Hypothesis Testing. (arXiv:2302.04260v1 [stat.ME])
    We present a generic framework for creating differentially private versions of any hypothesis test in a black-box way. We analyze the resulting tests analytically and experimentally. Most crucially, we show good practical performance for small data sets, showing that at epsilon = 1 we only need 5-6 times as much data as in the fully public setting. We compare our work to the one existing framework of this type, as well as to several individually-designed private hypothesis tests. Our framework is higher power than other generic solutions and at least competitive with (and often better than) individually-designed tests.
    Graph Representation Learning via Aggregation Enhancement. (arXiv:2201.12843v4 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have become a powerful tool for processing graph-structured data but still face challenges in effectively aggregating and propagating information between layers, which limits their performance. We tackle this problem with the kernel regression (KR) approach, using KR loss as the primary loss in self-supervised settings or as a regularization term in supervised settings. We show substantial performance improvements compared to state-of-the-art in both scenarios on multiple transductive and inductive node classification datasets, especially for deep networks. As opposed to mutual information (MI), KR loss is convex and easy to estimate in high-dimensional cases, even though it indirectly maximizes the MI between its inputs. Our work highlights the potential of KR to advance the field of graph representation learning and enhance the performance of GNNs. The code to reproduce our experiments is available at https://github.com/Anonymous1252022/KR_for_GNNs
    A Multimodal Sensing Ring for Quantification of Scratch Intensity. (arXiv:2302.03813v1 [cs.LG])
    An objective measurement of the debilitating symptom, chronic itch, is necessary for improvements in patient care for numerous medical conditions. While wearable devices have shown promise for scratch detection, they are currently unable to estimate scratch intensity, preventing a comprehensive understanding of the effect of itch on an individual. In this work, we present a framework for the estimation of scratch intensity in addition to scratch detection consisting of a multimodal wearable ring device and machine learning algorithms for regression of scratch intensity on a 0-600 mW mechanical power scale that can be mapped to a 0-10 continuous scale. We evaluate the performance of our algorithms on 20 individuals using Leave One Subject Out (LOSO) Cross Validation (CV) and using data from 14 additional participants, we show that our algorithms achieve clinically-relevant discrimination of scratching intensity levels. This work demonstrates that a finger-worn device can provide multidimensional, objective, real-time measures for the action of scratching.  ( 2 min )
    MMA-RNN: A Multi-level Multi-task Attention-based Recurrent Neural Network for Discrimination and Localization of Atrial Fibrillation. (arXiv:2302.03731v1 [cs.LG])
    The automatic detection of atrial fibrillation based on electrocardiograph (ECG) signals has received wide attention both clinically and practically. It is challenging to process ECG signals with cyclical pattern, varying length and unstable quality due to noise and distortion. Besides, there has been insufficient research on separating persistent atrial fibrillation from paroxysmal atrial fibrillation, and little discussion on locating the onsets and end points of AF episodes. It is even more arduous to perform well on these two distinct but interrelated tasks, while avoiding the mistakes inherent from stage-by-stage approaches. This paper proposes the Multi-level Multi-task Attention-based Recurrent Neural Network for three-class discrimination on patients and localization of the exact timing of AF episodes. Our model captures three-level sequential features based on a hierarchical architecture utilizing Bidirectional Long and Short-Term Memory Network (Bi-LSTM) and attention layers, and accomplishes the two tasks simultaneously with a multi-head classifier. The model is designed as an end-to-end framework to enhance information interaction and reduce error accumulation. Finally, we conduct experiments on CPSC 2021 dataset and the result demonstrates the superior performance of our method, indicating the potential application of MMA-RNN to wearable mobile devices for routine AF monitoring and early diagnosis.
    Fundamental Performance Limits for Sensor-Based Robot Control and Policy Learning. (arXiv:2202.00129v3 [cs.RO] UPDATED)
    Our goal is to develop theory and algorithms for establishing fundamental limits on performance for a given task imposed by a robot's sensors. In order to achieve this, we define a quantity that captures the amount of task-relevant information provided by a sensor. Using a novel version of the generalized Fano inequality from information theory, we demonstrate that this quantity provides an upper bound on the highest achievable expected reward for one-step decision making tasks. We then extend this bound to multi-step problems via a dynamic programming approach. We present algorithms for numerically computing the resulting bounds, and demonstrate our approach on three examples: (i) the lava problem from the literature on partially observable Markov decision processes, (ii) an example with continuous state and observation spaces corresponding to a robot catching a freely-falling object, and (iii) obstacle avoidance using a depth sensor with non-Gaussian noise. We demonstrate the ability of our approach to establish strong limits on achievable performance for these problems by comparing our upper bounds with achievable lower bounds (computed by synthesizing or learning concrete control policies).
    Understanding Why ViT Trains Badly on Small Datasets: An Intuitive Perspective. (arXiv:2302.03751v1 [cs.CV])
    Vision transformer (ViT) is an attention neural network architecture that is shown to be effective for computer vision tasks. However, compared to ResNet-18 with a similar number of parameters, ViT has a significantly lower evaluation accuracy when trained on small datasets. To facilitate studies in related fields, we provide a visual intuition to help understand why it is the case. We first compare the performance of the two models and confirm that ViT has less accuracy than ResNet-18 when trained on small datasets. We then interpret the results by showing attention map visualization for ViT and feature map visualization for ResNet-18. The difference is further analyzed through a representation similarity perspective. We conclude that the representation of ViT trained on small datasets is hugely different from ViT trained on large datasets, which may be the reason why the performance drops a lot on small datasets.  ( 2 min )
    Evaluation of Interpretability for Deep Learning algorithms in EEG Emotion Recognition: A case study in Autism. (arXiv:2111.13208v5 [eess.SP] UPDATED)
    Current models on Explainable Artificial Intelligence (XAI) have shown an evident and quantified lack of reliability for measuring feature-relevance when statistically entangled features are proposed for training deep classifiers. There has been an increase in the application of Deep Learning in clinical trials to predict early diagnosis of neuro-developmental disorders, such as Autism Spectrum Disorder (ASD). However, the inclusion of more reliable saliency-maps to obtain more trustworthy and interpretable metrics using neural activity features is still insufficiently mature for practical applications in diagnostics or clinical trials. Moreover, in ASD research the inclusion of deep classifiers that use neural measures to predict viewed facial emotions is relatively unexplored. Therefore, in this study we propose the evaluation of a Convolutional Neural Network (CNN) for electroencephalography (EEG)-based facial emotion recognition decoding complemented with a novel RemOve-And-Retrain (ROAR) methodology to recover highly relevant features used in the classifier. Specifically, we compare well-known relevance maps such as Layer-Wise Relevance Propagation (LRP), PatternNet, Pattern-Attribution, and Smooth-Grad Squared. This study is the first to consolidate a more transparent feature-relevance calculation for a successful EEG-based facial emotion recognition using a within-subject-trained CNN in typically-developed and ASD individuals.
    DropMessage: Unifying Random Dropping for Graph Neural Networks. (arXiv:2204.10037v3 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) are powerful tools for graph representation learning. Despite their rapid development, GNNs also face some challenges, such as over-fitting, over-smoothing, and non-robustness. Previous works indicate that these problems can be alleviated by random dropping methods, which integrate augmented data into models by randomly masking parts of the input. However, some open problems of random dropping on GNNs remain to be solved. First, it is challenging to find a universal method that are suitable for all cases considering the divergence of different datasets and models. Second, augmented data introduced to GNNs causes the incomplete coverage of parameters and unstable training process. Third, there is no theoretical analysis on the effectiveness of random dropping methods on GNNs. In this paper, we propose a novel random dropping method called DropMessage, which performs dropping operations directly on the propagated messages during the message-passing process. More importantly, we find that DropMessage provides a unified framework for most existing random dropping methods, based on which we give theoretical analysis of their effectiveness. Furthermore, we elaborate the superiority of DropMessage: it stabilizes the training process by reducing sample variance; it keeps information diversity from the perspective of information theory, enabling it become a theoretical upper bound of other methods. To evaluate our proposed method, we conduct experiments that aims for multiple tasks on five public datasets and two industrial datasets with various backbone models. The experimental results show that DropMessage has the advantages of both effectiveness and generalization, and can significantly alleviate the problems mentioned above.
    Learning How to Infer Partial MDPs for In-Context Adaptation and Exploration. (arXiv:2302.04250v1 [cs.LG])
    To generalize across tasks, an agent should acquire knowledge from past tasks that facilitate adaptation and exploration in future tasks. We focus on the problem of in-context adaptation and exploration, where an agent only relies on context, i.e., history of states, actions and/or rewards, rather than gradient-based updates. Posterior sampling (extension of Thompson sampling) is a promising approach, but it requires Bayesian inference and dynamic programming, which often involve unknowns (e.g., a prior) and costly computations. To address these difficulties, we use a transformer to learn an inference process from training tasks and consider a hypothesis space of partial models, represented as small Markov decision processes that are cheap for dynamic programming. In our version of the Symbolic Alchemy benchmark, our method's adaptation speed and exploration-exploitation balance approach those of an exact posterior sampling oracle. We also show that even though partial models exclude relevant information from the environment, they can nevertheless lead to good policies.
    Adversarial Prompting for Black Box Foundation Models. (arXiv:2302.04237v1 [cs.LG])
    Prompting interfaces allow users to quickly adjust the output of generative models in both vision and language. However, small changes and design choices in the prompt can lead to significant differences in the output. In this work, we develop a black-box framework for generating adversarial prompts for unstructured image and text generation. These prompts, which can be standalone or prepended to benign prompts, induce specific behaviors into the generative process, such as generating images of a particular object or biasing the frequency of specific letters in the generated text.
    Efficient Adversarial Contrastive Learning via Robustness-Aware Coreset Selection. (arXiv:2302.03857v1 [cs.LG])
    Adversarial contrastive learning (ACL) does not require expensive data annotations but outputs a robust representation that withstands adversarial attacks and also generalizes to a wide range of downstream tasks. However, ACL needs tremendous running time to generate the adversarial variants of all training data, which limits its scalability to large datasets. To speed up ACL, this paper proposes a robustness-aware coreset selection (RCS) method. RCS does not require label information and searches for an informative subset that minimizes a representational divergence, which is the distance of the representation between natural data and their virtual adversarial variants. The vanilla solution of RCS via traversing all possible subsets is computationally prohibitive. Therefore, we theoretically transform RCS into a surrogate problem of submodular maximization, of which the greedy search is an efficient solution with an optimality guarantee for the original problem. Empirically, our comprehensive results corroborate that RCS can speed up ACL by a large margin without significantly hurting the robustness and standard transferability. Notably, to the best of our knowledge, we are the first to conduct ACL efficiently on the large-scale ImageNet-1K dataset to obtain an effective robust representation via RCS.  ( 2 min )
    Connections and Equivalences between the Nystr\"om Method and Sparse Variational Gaussian Processes. (arXiv:2106.01121v2 [stat.ML] UPDATED)
    We investigate the connections between sparse approximation methods for making kernel methods and Gaussian processes (GPs) scalable to large-scale data, focusing on the Nystr\"om method and the Sparse Variational Gaussian Processes (SVGP). While sparse approximation methods for GPs and kernel methods share some algebraic similarities, the literature lacks a deep understanding of how and why they are related. This may pose an obstacle to the communications between the GP and kernel communities, making it difficult to transfer results from one side to the other. Our motivation is to remove this obstacle, by clarifying the connections between the sparse approximations for GPs and kernel methods. In this work, we study the two popular approaches, the Nystr\"om and SVGP approximations, in the context of a regression problem, and establish various connections and equivalences between them. In particular, we provide an RKHS interpretation of the SVGP approximation, and show that the Evidence Lower Bound of the SVGP contains the objective function of the Nystr\"om approximation, revealing the origin of the algebraic equivalence between the two approaches. We also study recently established convergence results for the SVGP and how they are related to the approximation quality of the Nystr\"om method.
    PASTA: Pessimistic Assortment Optimization. (arXiv:2302.03821v1 [cs.LG])
    We consider a class of assortment optimization problems in an offline data-driven setting. A firm does not know the underlying customer choice model but has access to an offline dataset consisting of the historically offered assortment set, customer choice, and revenue. The objective is to use the offline dataset to find an optimal assortment. Due to the combinatorial nature of assortment optimization, the problem of insufficient data coverage is likely to occur in the offline dataset. Therefore, designing a provably efficient offline learning algorithm becomes a significant challenge. To this end, we propose an algorithm referred to as Pessimistic ASsortment opTimizAtion (PASTA for short) designed based on the principle of pessimism, that can correctly identify the optimal assortment by only requiring the offline data to cover the optimal assortment under general settings. In particular, we establish a regret bound for the offline assortment optimization problem under the celebrated multinomial logit model. We also propose an efficient computational procedure to solve our pessimistic assortment optimization problem. Numerical studies demonstrate the superiority of the proposed method over the existing baseline method.  ( 2 min )
    TVAE: Triplet-Based Variational Autoencoder using Metric Learning. (arXiv:1802.04403v3 [stat.ML] UPDATED)
    Deep metric learning has been demonstrated to be highly effective in learning semantic representation and encoding information that can be used to measure data similarity, by relying on the embedding learned from metric learning. At the same time, variational autoencoder (VAE) has widely been used to approximate inference and proved to have a good performance for directed probabilistic models. However, for traditional VAE, the data label or feature information are intractable. Similarly, traditional representation learning approaches fail to represent many salient aspects of the data. In this project, we propose a novel integrated framework to learn latent embedding in VAE by incorporating deep metric learning. The features are learned by optimizing a triplet loss on the mean vectors of VAE in conjunction with standard evidence lower bound (ELBO) of VAE. This approach, which we call Triplet based Variational Autoencoder (TVAE), allows us to capture more fine-grained information in the latent embedding. Our model is tested on MNIST data set and achieves a high triplet accuracy of 95.60% while the traditional VAE (Kingma & Welling, 2013) achieves triplet accuracy of 75.08%.
    PFGM++: Unlocking the Potential of Physics-Inspired Generative Models. (arXiv:2302.04265v1 [cs.LG])
    We introduce a new family of physics-inspired generative models termed PFGM++ that unifies diffusion models and Poisson Flow Generative Models (PFGM). These models realize generative trajectories for $N$ dimensional data by embedding paths in $N{+}D$ dimensional space while still controlling the progression with a simple scalar norm of the $D$ additional variables. The new models reduce to PFGM when $D{=}1$ and to diffusion models when $D{\to}\infty$. The flexibility of choosing $D$ allows us to trade off robustness against rigidity as increasing $D$ results in more concentrated coupling between the data and the additional variable norms. We dispense with the biased large batch field targets used in PFGM and instead provide an unbiased perturbation-based objective similar to diffusion models. To explore different choices of $D$, we provide a direct alignment method for transferring well-tuned hyperparameters from diffusion models ($D{\to} \infty$) to any finite $D$ values. Our experiments show that models with finite $D$ can be superior to previous state-of-the-art diffusion models on CIFAR-10/FFHQ $64{\times}64$ datasets, with FID scores of $1.91/2.43$ when $D{=}2048/128$. In addition, we demonstrate that models with smaller $D$ exhibit improved robustness against modeling errors. Code is available at https://github.com/Newbeeer/pfgmpp
    Machine learning classification of non-Markovian noise disturbing quantum dynamics. (arXiv:2101.03221v3 [quant-ph] UPDATED)
    In this paper machine learning and artificial neural network models are proposed for the classification of external noise sources affecting a given quantum dynamics. For this purpose, we train and then validate support vector machine, multi-layer perceptron and recurrent neural network models with different complexity and accuracy, to solve supervised binary classification problems. As a result, we demonstrate the high efficacy of such tools in classifying noisy quantum dynamics using simulated data sets from different realizations of the quantum system dynamics. In addition, we show that for a successful classification one just needs to measure, in a sequence of discrete time instants, the probabilities that the analysed quantum system is in one of the allowed positions or energy configurations. Albeit the training of machine learning models is here performed on synthetic data, our approach is expected to find application in experimental schemes, as e.g. for the noise benchmarking of noisy intermediate-scale quantum devices.
    DIFF2: Differential Private Optimization via Gradient Differences for Nonconvex Distributed Learning. (arXiv:2302.03884v1 [cs.LG])
    Differential private optimization for nonconvex smooth objective is considered. In the previous work, the best known utility bound is $\widetilde O(\sqrt{d}/(n\varepsilon_\mathrm{DP}))$ in terms of the squared full gradient norm, which is achieved by Differential Private Gradient Descent (DP-GD) as an instance, where $n$ is the sample size, $d$ is the problem dimensionality and $\varepsilon_\mathrm{DP}$ is the differential privacy parameter. To improve the best known utility bound, we propose a new differential private optimization framework called \emph{DIFF2 (DIFFerential private optimization via gradient DIFFerences)} that constructs a differential private global gradient estimator with possibly quite small variance based on communicated \emph{gradient differences} rather than gradients themselves. It is shown that DIFF2 with a gradient descent subroutine achieves the utility of $\widetilde O(d^{2/3}/(n\varepsilon_\mathrm{DP})^{4/3})$, which can be significantly better than the previous one in terms of the dependence on the sample size $n$. To the best of our knowledge, this is the first fundamental result to improve the standard utility $\widetilde O(\sqrt{d}/(n\varepsilon_\mathrm{DP}))$ for nonconvex objectives. Additionally, a more computational and communication efficient subroutine is combined with DIFF2 and its theoretical analysis is also given. Numerical experiments are conducted to validate the superiority of DIFF2 framework.
    Operator Shifting for Model-based Policy Evaluation. (arXiv:2110.12658v3 [cs.LG] UPDATED)
    In model-based reinforcement learning, the transition matrix and reward vector are often estimated from random samples subject to noise. Even if the estimated model is an unbiased estimate of the true underlying model, the value function computed from the estimated model is biased. We introduce an operator shifting method for reducing the error introduced by the estimated model. When the error is in the residual norm, we prove that the shifting factor is always positive and upper bounded by $1+O\left(1/n\right)$, where $n$ is the number of samples used in learning each row of the transition matrix. We also propose a practical numerical algorithm for implementing the operator shifting.  ( 2 min )
    Diagnosing and Rectifying Vision Models using Language. (arXiv:2302.04269v1 [cs.LG])
    Recent multi-modal contrastive learning models have demonstrated the ability to learn an embedding space suitable for building strong vision classifiers, by leveraging the rich information in large-scale image-caption datasets. Our work highlights a distinct advantage of this multi-modal embedding space: the ability to diagnose vision classifiers through natural language. The traditional process of diagnosing model behaviors in deployment settings involves labor-intensive data acquisition and annotation. Our proposed method can discover high-error data slices, identify influential attributes and further rectify undesirable model behaviors, without requiring any visual data. Through a combination of theoretical explanation and empirical verification, we present conditions under which classifiers trained on embeddings from one modality can be equivalently applied to embeddings from another modality. On a range of image datasets with known error slices, we demonstrate that our method can effectively identify the error slices and influential attributes, and can further use language to rectify failure modes of the classifier.
    Generalizing Neural Wave Functions. (arXiv:2302.04168v1 [cs.LG])
    Recent neural network-based wave functions have achieved state-of-the-art accuracies in modeling ab-initio ground-state potential energy surface. However, these networks can only solve different spatial arrangements of the same set of atoms. To overcome this limitation, we present Graph-learned Orbital Embeddings (Globe), a neural network-based reparametrization method that can adapt neural wave functions to different molecules. We achieve this by combining a localization method for molecular orbitals with spatial message-passing networks. Further, we propose a locality-driven wave function, the Molecular Oribtal Network (Moon), tailored to solving Schr\"odinger equations of different molecules jointly. In our experiments, we find Moon requiring 8 times fewer steps to converge to similar accuracies as previous methods when trained on different molecules jointly while Globe enabling the transfer from smaller to larger molecules. Further, our analysis shows that Moon converges similarly to recent transformer-based wave functions on larger molecules. In both the computational chemistry and machine learning literature, we are the first to demonstrate that a single wave function can solve the Schr\"odinger equation of molecules with different atoms jointly.
    Exploratory Analysis of Federated Learning Methods with Differential Privacy on MIMIC-III. (arXiv:2302.04208v1 [cs.LG])
    Background: Federated learning methods offer the possibility of training machine learning models on privacy-sensitive data sets, which cannot be easily shared. Multiple regulations pose strict requirements on the storage and usage of healthcare data, leading to data being in silos (i.e. locked-in at healthcare facilities). The application of federated algorithms on these datasets could accelerate disease diagnostic, drug development, as well as improve patient care. Methods: We present an extensive evaluation of the impact of different federation and differential privacy techniques when training models on the open-source MIMIC-III dataset. We analyze a set of parameters influencing a federated model performance, namely data distribution (homogeneous and heterogeneous), communication strategies (communication rounds vs. local training epochs), federation strategies (FedAvg vs. FedProx). Furthermore, we assess and compare two differential privacy (DP) techniques during model training: a stochastic gradient descent-based differential privacy algorithm (DP-SGD), and a sparse vector differential privacy technique (DP-SVT). Results: Our experiments show that extreme data distributions across sites (imbalance either in the number of patients or the positive label ratios between sites) lead to a deterioration of model performance when trained using the FedAvg strategy. This issue is resolved when using FedProx with the use of appropriate hyperparameter tuning. Furthermore, the results show that both differential privacy techniques can reach model performances similar to those of models trained without DP, however at the expense of a large quantifiable privacy leakage. Conclusions: We evaluate empirically the benefits of two federation strategies and propose optimal strategies for the choice of parameters when using differential privacy techniques.
    Boundary Graph Neural Networks for 3D Simulations. (arXiv:2106.11299v5 [cs.LG] UPDATED)
    The abundance of data has given machine learning considerable momentum in natural sciences and engineering, though modeling of physical processes is often difficult. A particularly tough problem is the efficient representation of geometric boundaries. Triangularized geometric boundaries are well understood and ubiquitous in engineering applications. However, it is notoriously difficult to integrate them into machine learning approaches due to their heterogeneity with respect to size and orientation. In this work, we introduce an effective theory to model particle-boundary interactions, which leads to our new Boundary Graph Neural Networks (BGNNs) that dynamically modify graph structures to obey boundary conditions. The new BGNNs are tested on complex 3D granular flow processes of hoppers, rotating drums and mixers, which are all standard components of modern industrial machinery but still have complicated geometry. BGNNs are evaluated in terms of computational efficiency as well as prediction accuracy of particle flows and mixing entropies. BGNNs are able to accurately reproduce 3D granular flows within simulation uncertainties over hundreds of thousands of simulation timesteps. Most notably, in our experiments, particles stay within the geometric objects without using handcrafted conditions or restrictions.
    IRTCI: Item Response Theory for Categorical Imputation. (arXiv:2302.04165v1 [stat.ML])
    Most datasets suffer from partial or complete missing values, which has downstream limitations on the available models on which to test the data and on any statistical inferences that can be made from the data. Several imputation techniques have been designed to replace missing data with stand in values. The various approaches have implications for calculating clinical scores, model building and model testing. The work showcased here offers a novel means for categorical imputation based on item response theory (IRT) and compares it against several methodologies currently used in the machine learning field including k-nearest neighbors (kNN), multiple imputed chained equations (MICE) and Amazon Web Services (AWS) deep learning method, Datawig. Analyses comparing these techniques were performed on three different datasets that represented ordinal, nominal and binary categories. The data were modified so that they also varied on both the proportion of data missing and the systematization of the missing data. Two different assessments of performance were conducted: accuracy in reproducing the missing values, and predictive performance using the imputed data. Results demonstrated that the new method, Item Response Theory for Categorical Imputation (IRTCI), fared quite well compared to currently used methods, outperforming several of them in many conditions. Given the theoretical basis for the new approach, and the unique generation of probabilistic terms for determining category belonging for missing cells, IRTCI offers a viable alternative to current approaches.
    The Hardware Impact of Quantization and Pruning for Weights in Spiking Neural Networks. (arXiv:2302.04174v1 [cs.LG])
    Energy efficient implementations and deployments of Spiking neural networks (SNNs) have been of great interest due to the possibility of developing artificial systems that can achieve the computational powers and energy efficiency of the biological brain. Efficient implementations of SNNs on modern digital hardware are also inspired by advances in machine learning and deep neural networks (DNNs). Two techniques widely employed in the efficient deployment of DNNs -- the quantization and pruning of parameters, can both compress the model size, reduce memory footprints, and facilitate low-latency execution. The interaction between quantization and pruning and how they might impact model performance on SNN accelerators is currently unknown. We study various combinations of pruning and quantization in isolation, cumulatively, and simultaneously (jointly) to a state-of-the-art SNN targeting gesture recognition for dynamic vision sensor cameras (DVS). We show that this state-of-the-art model is amenable to aggressive parameter quantization, not suffering from any loss in accuracy down to ternary weights. However, pruning only maintains iso-accuracy up to 80% sparsity, which results in 45% more energy than the best quantization on our architectural model. Applying both pruning and quantization can result in an accuracy loss to offer a favourable trade-off on the energy-accuracy Pareto-frontier for the given hardware configuration.
    Attending to Graph Transformers. (arXiv:2302.04181v1 [cs.LG])
    Recently, transformer architectures for graphs emerged as an alternative to established techniques for machine learning with graphs, such as graph neural networks. So far, they have shown promising empirical results, e.g., on molecular prediction datasets, often attributed to their ability to circumvent graph neural networks' shortcomings, such as over-smoothing and over-squashing. Here, we derive a taxonomy of graph transformer architectures, bringing some order to this emerging field. We overview their theoretical properties, survey structural and positional encodings, and discuss extensions for important graph classes, e.g., 3D molecular graphs. Empirically, we probe how well graph transformers can recover various graph properties, how well they can deal with heterophilic graphs, and to what extent they prevent over-squashing. Further, we outline open challenges and research direction to stimulate future work. Our code is available at https://github.com/luis-mueller/probing-graph-transformers.
    Combining self-labeling and demand based active learning for non-stationary data streams. (arXiv:2302.04141v1 [cs.LG])
    Learning from non-stationary data streams is a research direction that gains increasing interest as more data in form of streams becomes available, for example from social media, smartphones, or industrial process monitoring. Most approaches assume that the ground truth of the samples becomes available (possibly with some delay) and perform supervised online learning in the test-then-train scheme. While this assumption might be valid in some scenarios, it does not apply to all settings. In this work, we focus on scarcely labeled data streams and explore the potential of self-labeling in gradually drifting data streams. We formalize this setup and propose a novel online $k$-nn classifier that combines self-labeling and demand-based active learning.
    Can Physics-Informed Neural Networks beat the Finite Element Method?. (arXiv:2302.04107v1 [math.NA])
    Partial differential equations play a fundamental role in the mathematical modelling of many processes and systems in physical, biological and other sciences. To simulate such processes and systems, the solutions of PDEs often need to be approximated numerically. The finite element method, for instance, is a usual standard methodology to do so. The recent success of deep neural networks at various approximation tasks has motivated their use in the numerical solution of PDEs. These so-called physics-informed neural networks and their variants have shown to be able to successfully approximate a large range of partial differential equations. So far, physics-informed neural networks and the finite element method have mainly been studied in isolation of each other. In this work, we compare the methodologies in a systematic computational study. Indeed, we employ both methods to numerically solve various linear and nonlinear partial differential equations: Poisson in 1D, 2D, and 3D, Allen-Cahn in 1D, semilinear Schr\"odinger in 1D and 2D. We then compare computational costs and approximation accuracies. In terms of solution time and accuracy, physics-informed neural networks have not been able to outperform the finite element method in our study. In some experiments, they were faster at evaluating the solved PDE.
    Predicting the performance of hybrid ventilation in buildings using a multivariate attention-based biLSTM Encoder-Decoder neural network. (arXiv:2302.04126v1 [cs.LG])
    Hybrid ventilation (coupling natural and mechanical ventilation) is an energy-efficient solution to provide fresh air for most climates, given that it has a reliable control system. To operate such systems optimally, a high-fidelity control-oriented model is required. It should enable near-real time forecast of the indoor air temperature and humidity based on operational conditions such as window opening and HVAC schedules. However, widely used physics-based simulation models (i.e., white-box models) are labour-intensive and computationally expensive. Alternatively, black-box models based on artificial neural networks can be trained to be good estimators for building dynamics. This paper investigates the capabilities of a multivariate multi-head attention-based long short-term memory (LSTM) encoder-decoder neural network to predict indoor air conditions of a building equipped with hybrid ventilation. The deep neural network used for this study aims to predict indoor air temperature dynamics when a window is opened and closed, respectively. Training and test data were generated from detailed multi-zone office building model (EnergyPlus). The deep neural network is able to accurately predict indoor air temperature of five zones whenever a window was opened and closed.
    DynGFN: Bayesian Dynamic Causal Discovery using Generative Flow Networks. (arXiv:2302.04178v1 [cs.LG])
    Learning the causal structure of observable variables is a central focus for scientific discovery. Bayesian causal discovery methods tackle this problem by learning a posterior over the set of admissible graphs given our priors and observations. Existing methods primarily consider observations from static systems and assume the underlying causal structure takes the form of a directed acyclic graph (DAG). In settings with dynamic feedback mechanisms that regulate the trajectories of individual variables, this acyclicity assumption fails unless we account for time. We focus on learning Bayesian posteriors over cyclic graphs and treat causal discovery as a problem of sparse identification of a dynamical system. This imposes a natural temporal causal order between variables and captures cyclic feedback loops through time. Under this lens, we propose a new framework for Bayesian causal discovery for dynamical systems and present a novel generative flow network architecture (DynGFN) tailored for this task. Our results indicate that DynGFN learns posteriors that better encapsulate the distributions over admissible cyclic causal structures compared to counterpart state-of-the-art approaches.
    Two-step hyperparameter optimization method: Accelerating hyperparameter search by using a fraction of a training dataset. (arXiv:2302.03845v1 [cs.LG])
    Hyperparameter optimization (HPO) can be an important step in machine learning model development, but our common practice is archaic -- primarily using a manual or grid search. This is partly because adopting an advanced HPO algorithm entails extra complexity to workflow and longer computation time. This imposes a significant hurdle to machine learning (ML) applications since the choice of suboptimal hyperparameters limits the performance of ML models, ultimately failing to harness the full potential of ML techniques. In this article, we present a two-step HPO method as a strategy to minimize compute and wait time as a lesson learned during applied ML parameterization work. A preliminary evaluation of hyperparameters is first conducted on a small subset of a training dataset, then top-performing candidate models are re-evaluated after retraining with an entire training dataset. This two-step HPO method can be applied to any HPO search algorithm, and we argue it has attractive efficiencies. As a case study, we present our recent application of the two-step HPO method to the development of neural network emulators of aerosol activation. Using only 5% of a training dataset in the initial step is sufficient to find optimal hyperparameter configurations from much more extensive sampling. The benefits of HPO are then revealed by analysis of hyperparameters and model performance, revealing a minimal model complexity required to achieve the best performance, and the diversity of top-performing models harvested from the HPO process allows us to choose a high-performing model with a low inference cost for efficient use in GCMs.  ( 2 min )
    Improving the Model Consistency of Decentralized Federated Learning. (arXiv:2302.04083v1 [cs.LG])
    To mitigate the privacy leakages and communication burdens of Federated Learning (FL), decentralized FL (DFL) discards the central server and each client only communicates with its neighbors in a decentralized communication network. However, existing DFL suffers from high inconsistency among local clients, which results in severe distribution shift and inferior performance compared with centralized FL (CFL), especially on heterogeneous data or sparse communication topology. To alleviate this issue, we propose two DFL algorithms named DFedSAM and DFedSAM-MGS to improve the performance of DFL. Specifically, DFedSAM leverages gradient perturbation to generate local flat models via Sharpness Aware Minimization (SAM), which searches for models with uniformly low loss values. DFedSAM-MGS further boosts DFedSAM by adopting Multiple Gossip Steps (MGS) for better model consistency, which accelerates the aggregation of local flat models and better balances communication complexity and generalization. Theoretically, we present improved convergence rates $\small \mathcal{O}\big(\frac{1}{\sqrt{KT}}+\frac{1}{T}+\frac{1}{K^{1/2}T^{3/2}(1-\lambda)^2}\big)$ and $\small \mathcal{O}\big(\frac{1}{\sqrt{KT}}+\frac{1}{T}+\frac{\lambda^Q+1}{K^{1/2}T^{3/2}(1-\lambda^Q)^2}\big)$ in non-convex setting for DFedSAM and DFedSAM-MGS, respectively, where $1-\lambda$ is the spectral gap of gossip matrix and $Q$ is the number of MGS. Empirically, our methods can achieve competitive performance compared with CFL methods and outperform existing DFL methods.
    ZipLM: Hardware-Aware Structured Pruning of Language Models. (arXiv:2302.04089v1 [cs.LG])
    The breakthrough performance of large language models (LLMs) comes with large computational footprints and high deployment costs. In this paper, we progress towards resolving this problem by proposing a new structured compression approach for LLMs, called ZipLM, which provides state-of-the-art compression-vs-accuracy results, while guaranteeing to match a set of (achievable) target speedups on any given target hardware. Specifically, given a task, a model, an inference environment, as well as a set of speedup targets, ZipLM identifies and removes redundancies in the model through iterative structured shrinking of the model's weight matrices. Importantly, ZipLM works in both, the post-training/one-shot and the gradual compression setting, where it produces a set of accurate models in a single run, making it highly-efficient in practice. Our approach is based on new structured pruning and knowledge distillation techniques, and consistently outperforms prior structured compression methods in terms of accuracy-versus-speedup in experiments on BERT- and GPT-family models. In particular, when compressing GPT2 model, it outperforms DistilGPT2 while being 60% smaller and 30% faster. Further, ZipLM matches performance of heavily optimized MobileBERT model, obtained via extensive architecture search, by simply pruning the baseline BERT-large architecture, and outperforms all prior BERT-base compression techniques like CoFi, MiniLM and TinyBERT.
    Automating Code-Related Tasks Through Transformers: The Impact of Pre-training. (arXiv:2302.04048v1 [cs.SE])
    Transformers have gained popularity in the software engineering (SE) literature. These deep learning models are usually pre-trained through a self-supervised objective, meant to provide the model with basic knowledge about a language of interest (e.g., Java). A classic pre-training objective is the masked language model (MLM), in which a percentage of tokens from the input (e.g., a Java method) is masked, with the model in charge of predicting them. Once pre-trained, the model is then fine-tuned to support the specific downstream task of interest (e.g., code summarization). While there is evidence suggesting the boost in performance provided by pre-training, little is known about the impact of the specific pre-training objective(s) used. Indeed, MLM is just one of the possible pre-training objectives and recent work from the natural language processing field suggest that pre-training objectives tailored for the specific downstream task of interest may substantially boost the model's performance. In this study, we focus on the impact of pre-training objectives on the performance of transformers when automating code-related tasks. We start with a systematic literature review aimed at identifying the pre-training objectives used in SE. Then, we pre-train 32 transformers using both (i) generic pre-training objectives usually adopted in SE; and (ii) pre-training objectives tailored to specific code-related tasks subject of our experimentation, namely bug-fixing, code summarization, and code completion. We also compare the pre-trained models with non pre-trained ones. Our results show that: (i) pre-training helps in boosting performance only if the amount of fine-tuning data available is small; (ii) the MLM objective is usually sufficient to maximize the prediction performance of the model, even when comparing it with pre-training objectives specialized for the downstream task at hand.
    Local Law 144: A Critical Analysis of Regression Metrics. (arXiv:2302.04119v1 [cs.LG])
    The use of automated decision tools in recruitment has received an increasing amount of attention. In November 2021, the New York City Council passed a legislation (Local Law 144) that mandates bias audits of Automated Employment Decision Tools. From 15th April 2023, companies that use automated tools for hiring or promoting employees are required to have these systems audited by an independent entity. Auditors are asked to compute bias metrics that compare outcomes for different groups, based on sex/gender and race/ethnicity categories at a minimum. Local Law 144 proposes novel bias metrics for regression tasks (scenarios where the automated system scores candidates with a continuous range of values). A previous version of the legislation proposed a bias metric that compared the mean scores of different groups. The new revised bias metric compares the proportion of candidates in each group that falls above the median. In this paper, we argue that both metrics fail to capture distributional differences over the whole domain, and therefore cannot reliably detect bias. We first introduce two metrics, as possible alternatives to the legislation metrics. We then compare these metrics over a range of theoretical examples, for which the legislation proposed metrics seem to underestimate bias. Finally, we study real data and show that the legislation metrics can similarly fail in a real-world recruitment application.
    DeepVATS: Deep Visual Analytics for Time Series. (arXiv:2302.03858v1 [cs.LG])
    The field of Deep Visual Analytics (DVA) has recently arisen from the idea of developing Visual Interactive Systems supported by deep learning techniques, in order to provide them with large-scale data processing capabilities and to unify their implementation across different data modalities and domains of application. In this paper we present DeepVATS, an open-source tool that brings the field of DVA into time series data. DeepVATS trains, in a self-supervised way, a masked time series autoencoder that reconstructs patches of a time series, and projects the knowledge contained in the embeddings of that model in an interactive plot, from which time series patterns and anomalies emerge and can be easily spotted. The tool has been tested on both synthetic and real datasets, and its code is publicly available on https://github.com/vrodriguezf/deepvats  ( 2 min )
    Decentralized Riemannian Algorithm for Nonconvex Minimax Problems. (arXiv:2302.03825v1 [cs.LG])
    The minimax optimization over Riemannian manifolds (possibly nonconvex constraints) has been actively applied to solve many problems, such as robust dimensionality reduction and deep neural networks with orthogonal weights (Stiefel manifold). Although many optimization algorithms for minimax problems have been developed in the Euclidean setting, it is difficult to convert them into Riemannian cases, and algorithms for nonconvex minimax problems with nonconvex constraints are even rare. On the other hand, to address the big data challenges, decentralized (serverless) training techniques have recently been emerging since they can reduce communications overhead and avoid the bottleneck problem on the server node. Nonetheless, the algorithm for decentralized Riemannian minimax problems has not been studied. In this paper, we study the distributed nonconvex-strongly-concave minimax optimization problem over the Stiefel manifold and propose both deterministic and stochastic minimax methods. The local model is non-convex strong-concave and the Steifel manifold is a non-convex set. The global function is represented as the finite sum of local functions. For the deterministic setting, we propose DRGDA and prove that our deterministic method achieves a gradient complexity of $O( \epsilon^{-2})$ under mild conditions. For the stochastic setting, we propose DRSGDA and prove that our stochastic method achieves a gradient complexity of $O(\epsilon^{-4})$. The DRGDA and DRSGDA are the first algorithms for distributed minimax optimization with nonconvex constraints with exact convergence. Extensive experimental results on the Deep Neural Networks (DNNs) training over the Stiefel manifold demonstrate the efficiency of our algorithms.
    A Model for Forecasting Air Quality Index in Port Harcourt Nigeria Using Bi-LSTM Algorithm. (arXiv:2302.03930v1 [cs.LG])
    The release of toxic gases by industries, emissions from vehicles, and an increase in the concentration of harmful gases and particulate matter in the atmosphere are all contributing factors to the deterioration of the quality of the air. Factors such as industries, urbanization, population growth, and the increased use of vehicles contribute to the rapid increase in pollution levels, which can adversely impact human health. This paper presents a model for forecasting the air quality index in Nigeria using the Bi-directional LSTM model. The air pollution data was downloaded from an online database (UCL). The dataset was pre-processed using both pandas tools in python. The pre-processed result was used as input features in training a Bi-LSTM model in making future forecasts of the values of the particulate matter Pm2.5, and Pm10. The Bi-LSTM model was evaluated using some evaluation parameters such as mean square error, mean absolute error, absolute mean square, and R^2 square. The result of the Bi-LSTM shows a mean square error of 52.99%, relative mean square error of 7.28%, mean absolute error of 3.4%, and R^2 square of 97%. The model. This shows that the model follows a seamless trend in forecasting the air quality in Port Harcourt, Nigeria.  ( 2 min )
    Neural Artistic Style Transfer with Conditional Adversaria. (arXiv:2302.03875v1 [cs.CV])
    A neural artistic style transformation (NST) model can modify the appearance of a simple image by adding the style of a famous image. Even though the transformed images do not look precisely like artworks by the same artist of the respective style images, the generated images are appealing. Generally, a trained NST model specialises in a style, and a single image represents that style. However, generating an image under a new style is a tedious process, which includes full model training. In this paper, we present two methods that step toward the style image independent neural style transfer model. In other words, the trained model could generate semantically accurate generated image under any content, style image input pair. Our novel contribution is a unidirectional-GAN model that ensures the Cyclic consistency by the model architecture.Furthermore, this leads to much smaller model size and an efficient training and validation phase.
    A Scale-Independent Multi-Objective Reinforcement Learning with Convergence Analysis. (arXiv:2302.04179v1 [cs.LG])
    Many sequential decision-making problems need optimization of different objectives which possibly conflict with each other. The conventional way to deal with a multi-task problem is to establish a scalar objective function based on a linear combination of different objectives. However, for the case of having conflicting objectives with different scales, this method needs a trial-and-error approach to properly find proper weights for the combination. As such, in most cases, this approach cannot guarantee an optimal Pareto solution. In this paper, we develop a single-agent scale-independent multi-objective reinforcement learning on the basis of the Advantage Actor-Critic (A2C) algorithm. A convergence analysis is then done for the devised multi-objective algorithm providing a convergence-in-mean guarantee. We then perform some experiments over a multi-task problem to evaluate the performance of the proposed algorithm. Simulation results show the superiority of developed multi-objective A2C approach against the single-objective algorithm.
    Federated Minimax Optimization with Client Heterogeneity. (arXiv:2302.04249v1 [cs.LG])
    Minimax optimization has seen a surge in interest with the advent of modern applications such as GANs, and it is inherently more challenging than simple minimization. The difficulty is exacerbated by the training data residing at multiple edge devices or \textit{clients}, especially when these clients can have heterogeneous datasets and local computation capabilities. We propose a general federated minimax optimization framework that subsumes such settings and several existing methods like Local SGDA. We show that naive aggregation of heterogeneous local progress results in optimizing a mismatched objective function -- a phenomenon previously observed in standard federated minimization. To fix this problem, we propose normalizing the client updates by the number of local steps undertaken between successive communication rounds. We analyze the convergence of the proposed algorithm for classes of nonconvex-concave and nonconvex-nonconcave functions and characterize the impact of heterogeneous client data, partial client participation, and heterogeneous local computations. Our analysis works under more general assumptions on the intra-client noise and inter-client heterogeneity than so far considered in the literature. For all the function classes considered, we significantly improve the existing computation and communication complexity results. Experimental results support our theoretical claims.
    FFHR: Fully and Flexible Hyperbolic Representation for Knowledge Graph Completion. (arXiv:2302.04088v1 [cs.LG])
    Learning hyperbolic embeddings for knowledge graph (KG) has gained increasing attention due to its superiority in capturing hierarchies. However, some important operations in hyperbolic space still lack good definitions, making existing methods unable to fully leverage the merits of hyperbolic space. Specifically, they suffer from two main limitations: 1) existing Graph Convolutional Network (GCN) methods in hyperbolic space rely on tangent space approximation, which would incur approximation error in representation learning, and 2) due to the lack of inner product operation definition in hyperbolic space, existing methods can only measure the plausibility of facts (links) with hyperbolic distance, which is difficult to capture complex data patterns. In this work, we contribute: 1) a Full Poincar\'{e} Multi-relational GCN that achieves graph information propagation in hyperbolic space without requiring any approximation, and 2) a hyperbolic generalization of Euclidean inner product that is beneficial to capture both hierarchical and complex patterns. On this basis, we further develop a \textbf{F}ully and \textbf{F}lexible \textbf{H}yperbolic \textbf{R}epresentation framework (\textbf{FFHR}) that is able to transfer recent Euclidean-based advances to hyperbolic space. We demonstrate it by instantiating FFHR with four representative KGC methods. Extensive experiments on benchmark datasets validate the superiority of our FFHRs over their Euclidean counterparts as well as state-of-the-art hyperbolic embedding methods.
    ED-Batch: Efficient Automatic Batching of Dynamic Neural Networks via Learned Finite State Machines. (arXiv:2302.03851v1 [cs.LG])
    Batching has a fundamental influence on the efficiency of deep neural network (DNN) execution. However, for dynamic DNNs, efficient batching is particularly challenging as the dataflow graph varies per input instance. As a result, state-of-the-art frameworks use heuristics that result in suboptimal batching decisions. Further, batching puts strict restrictions on memory adjacency and can lead to high data movement costs. In this paper, we provide an approach for batching dynamic DNNs based on finite state machines, which enables the automatic discovery of batching policies specialized for each DNN via reinforcement learning. Moreover, we find that memory planning that is aware of the batching policy can save significant data movement overheads, which is automated by a PQ tree-based algorithm we introduce. Experimental results show that our framework speeds up state-of-the-art frameworks by on average 1.15x, 1.39x, and 2.45x for chain-based, tree-based, and lattice-based DNNs across CPU and GPU.  ( 2 min )
    On Generalized Degree Fairness in Graph Neural Networks. (arXiv:2302.03881v1 [cs.LG])
    Conventional graph neural networks (GNNs) are often confronted with fairness issues that may stem from their input, including node attributes and neighbors surrounding a node. While several recent approaches have been proposed to eliminate the bias rooted in sensitive attributes, they ignore the other key input of GNNs, namely the neighbors of a node, which can introduce bias since GNNs hinge on neighborhood structures to generate node representations. In particular, the varying neighborhood structures across nodes, manifesting themselves in drastically different node degrees, give rise to the diverse behaviors of nodes and biased outcomes. In this paper, we first define and generalize the degree bias using a generalized definition of node degree as a manifestation and quantification of different multi-hop structures around different nodes. To address the bias in the context of node classification, we propose a novel GNN framework called Generalized Degree Fairness-centric Graph Neural Network (Deg-FairGNN). Specifically, in each GNN layer, we employ a learnable debiasing function to generate debiasing contexts, which modulate the layer-wise neighborhood aggregation to eliminate the degree bias originating from the diverse degrees among nodes. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our model on both accuracy and fairness metrics.  ( 2 min )
    Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models. (arXiv:2302.03900v1 [cs.CV])
    Recent advancements in large scale text-to-image models have opened new possibilities for guiding the creation of images through human-devised natural language. However, while prior literature has primarily focused on the generation of individual images, it is essential to consider the capability of these models to ensure coherency within a sequence of images to fulfill the demands of real-world applications such as storytelling. To address this, here we present a novel neural pipeline for generating a coherent storybook from the plain text of a story. Specifically, we leverage a combination of a pre-trained Large Language Model and a text-guided Latent Diffusion Model to generate coherent images. While previous story synthesis frameworks typically require a large-scale text-to-image model trained on expensive image-caption pairs to maintain the coherency, we employ simple textual inversion techniques along with detector-based semantic image editing which allows zero-shot generation of the coherent storybook. Experimental results show that our proposed method outperforms state-of-the-art image editing baselines.  ( 2 min )
    Topological Deep Learning: A Review of an Emerging Paradigm. (arXiv:2302.03836v1 [cs.LG])
    Topological data analysis (TDA) provides insight into data shape. The summaries obtained by these methods are principled global descriptions of multi-dimensional data whilst exhibiting stable properties such as robustness to deformation and noise. Such properties are desirable in deep learning pipelines but they are typically obtained using non-TDA strategies. This is partly caused by the difficulty of combining TDA constructs (e.g. barcode and persistence diagrams) with current deep learning algorithms. Fortunately, we are now witnessing a growth of deep learning applications embracing topologically-guided components. In this survey, we review the nascent field of topological deep learning by first revisiting the core concepts of TDA. We then explore how the use of TDA techniques has evolved over time to support deep learning frameworks, and how they can be integrated into different aspects of deep learning. Furthermore, we touch on TDA usage for analyzing existing deep models; deep topological analytics. Finally, we discuss the challenges and future prospects of topological deep learning.  ( 2 min )
    Towards Inferential Reproducibility of Machine Learning Research. (arXiv:2302.04054v1 [cs.LG])
    Reliability of machine learning evaluation -- the consistency of observed evaluation scores across replicated model training runs -- is affected by several sources of nondeterminism which can be regarded as measurement noise. Current tendencies to remove noise in order to enforce reproducibility of research results neglect inherent nondeterminism at the implementation level and disregard crucial interaction effects between algorithmic noise factors and data properties. This limits the scope of conclusions that can be drawn from such experiments. Instead of removing noise, we propose to incorporate several sources of variance, including their interaction with data properties, into an analysis of significance and reliability of machine learning evaluation, with the aim to draw inferences beyond particular instances of trained models. We show how to use linear mixed effects models (LMEMs) to analyze performance evaluation scores, and to conduct statistical inference with a generalized likelihood ratio test (GLRT). This allows us to incorporate arbitrary sources of noise like meta-parameter variations into statistical significance testing, and to assess performance differences conditional on data properties. Furthermore, a variance component analysis (VCA) enables the analysis of the contribution of noise sources to overall variance and the computation of a reliability coefficient by the ratio of substantial to total variance.
    A prototype-oriented clustering for domain shift with source privacy. (arXiv:2302.03807v1 [cs.LG])
    Unsupervised clustering under domain shift (UCDS) studies how to transfer the knowledge from abundant unlabeled data from multiple source domains to learn the representation of the unlabeled data in a target domain. In this paper, we introduce Prototype-oriented Clustering with Distillation (PCD) to not only improve the performance and applicability of existing methods for UCDS, but also address the concerns on protecting the privacy of both the data and model of the source domains. PCD first constructs a source clustering model by aligning the distributions of prototypes and data. It then distills the knowledge to the target model through cluster labels provided by the source model while simultaneously clustering the target data. Finally, it refines the target model on the target domain data without guidance from the source model. Experiments across multiple benchmarks show the effectiveness and generalizability of our source-private clustering method.
    Sample-efficient Multi-objective Molecular Optimization with GFlowNets. (arXiv:2302.04040v1 [cs.LG])
    Many crucial scientific problems involve designing novel molecules with desired properties, which can be formulated as an expensive black-box optimization problem over the discrete chemical space. Computational methods have achieved initial success but still struggle with simultaneously optimizing multiple competing properties in a sample-efficient manner. In this work, we propose a multi-objective Bayesian optimization (MOBO) algorithm leveraging the hypernetwork-based GFlowNets (HN-GFN) as an acquisition function optimizer, with the purpose of sampling a diverse batch of candidate molecular graphs from an approximate Pareto front. Using a single preference-conditioned hypernetwork, HN-GFN learns to explore various trade-offs between objectives. Inspired by reinforcement learning, we further propose a hindsight-like off-policy strategy to share high-performing molecules among different preferences in order to speed up learning for HN-GFN. Through synthetic experiments, we illustrate that HN-GFN has adequate capacity to generalize over preferences. Extensive experiments show that our framework outperforms the best baselines by a large margin in terms of hypervolume in various real-world MOBO settings.
    Monge, Bregman and Occam: Interpretable Optimal Transport in High-Dimensions with Feature-Sparse Maps. (arXiv:2302.04065v1 [stat.ML])
    Optimal transport (OT) theory focuses, among all maps $T:\mathbb{R}^d\rightarrow \mathbb{R}^d$ that can morph a probability measure onto another, on those that are the ``thriftiest'', i.e. such that the averaged cost $c(x, T(x))$ between $x$ and its image $T(x)$ be as small as possible. Many computational approaches have been proposed to estimate such Monge maps when $c$ is the $\ell_2^2$ distance, e.g., using entropic maps [Pooladian'22], or neural networks [Makkuva'20, Korotin'20]. We propose a new model for transport maps, built on a family of translation invariant costs $c(x, y):=h(x-y)$, where $h:=\tfrac{1}{2}\|\cdot\|_2^2+\tau$ and $\tau$ is a regularizer. We propose a generalization of the entropic map suitable for $h$, and highlight a surprising link tying it with the Bregman centroids of the divergence $D_h$ generated by $h$, and the proximal operator of $\tau$. We show that choosing a sparsity-inducing norm for $\tau$ results in maps that apply Occam's razor to transport, in the sense that the displacement vectors $\Delta(x):= T(x)-x$ they induce are sparse, with a sparsity pattern that varies depending on $x$. We showcase the ability of our method to estimate meaningful OT maps for high-dimensional single-cell transcription data, in the $34000$-$d$ space of gene counts for cells, without using dimensionality reduction, thus retaining the ability to interpret all displacements at the gene level.
    Clinical BioBERT Hyperparameter Optimization using Genetic Algorithm Clinical BioBERT Hyperparameter Optimization using Genetic Algorithm. (arXiv:2302.03822v1 [cs.AI])
    Clinical factors account only for a small portion, about 10-30%, of the controllable factors that affect an individual's health outcomes. The remaining factors include where a person was born and raised, where he/she pursued their education, what their work and family environment is like, etc. These factors are collectively referred to as Social Determinants of Health (SDoH). The majority of SDoH data is recorded in unstructured clinical notes by physicians and practitioners. Recording SDoH data in a structured manner (in an EHR) could greatly benefit from a dedicated ontology of SDoH terms. Our research focuses on extracting sentences from clinical notes, making use of such an SDoH ontology (called SOHO) to provide appropriate concepts. We utilize recent advancements in Deep Learning to optimize the hyperparameters of a Clinical BioBERT model for SDoH text. A genetic algorithm-based hyperparameter tuning regimen was implemented to identify optimal parameter settings. To implement a complete classifier, we pipelined Clinical BioBERT with two subsequent linear layers and two dropout layers. The output predicts whether a text fragment describes an SDoH issue of the patient. We compared the AdamW, Adafactor, and LAMB optimizers. In our experiments, AdamW outperformed the others in terms of accuracy.  ( 2 min )
    Explainable Label-flipping Attacks on Human Emotion Assessment System. (arXiv:2302.04109v1 [cs.LG])
    This paper's main goal is to provide an attacker's point of view on data poisoning assaults that use label-flipping during the training phase of systems that use electroencephalogram (EEG) signals to evaluate human emotion. To attack different machine learning classifiers such as Adaptive Boosting (AdaBoost) and Random Forest dedicated to the classification of 4 different human emotions using EEG signals, this paper proposes two scenarios of label-flipping methods. The results of the studies show that the proposed data poison attacksm based on label-flipping are successful regardless of the model, but different models show different degrees of resistance to the assaults. In addition, numerous Explainable Artificial Intelligence (XAI) techniques are used to explain the data poison attacks on EEG signal-based human emotion evaluation systems.
    WAT: Improve the Worst-class Robustness in Adversarial Training. (arXiv:2302.04025v1 [cs.LG])
    Deep Neural Networks (DNN) have been shown to be vulnerable to adversarial examples. Adversarial training (AT) is a popular and effective strategy to defend against adversarial attacks. Recent works (Benz et al., 2020; Xu et al., 2021; Tian et al., 2021) have shown that a robust model well-trained by AT exhibits a remarkable robustness disparity among classes, and propose various methods to obtain consistent robust accuracy across classes. Unfortunately, these methods sacrifice a good deal of the average robust accuracy. Accordingly, this paper proposes a novel framework of worst-class adversarial training and leverages no-regret dynamics to solve this problem. Our goal is to obtain a classifier with great performance on worst-class and sacrifice just a little average robust accuracy at the same time. We then rigorously analyze the theoretical properties of our proposed algorithm, and the generalization error bound in terms of the worst-class robust risk. Furthermore, we propose a measurement to evaluate the proposed method in terms of both the average and worst-class accuracies. Experiments on various datasets and networks show that our proposed method outperforms the state-of-the-art approaches.
    Revisit the Algorithm Selection Problem for TSP with Spatial Information Enhanced Graph Neural Networks. (arXiv:2302.04035v1 [cs.LG])
    Algorithm selection is a well-known problem where researchers investigate how to construct useful features representing the problem instances and then apply feature-based machine learning models to predict which algorithm works best with the given instance. However, even for simple optimization problems such as Euclidean Traveling Salesman Problem (TSP), there lacks a general and effective feature representation for problem instances. The important features of TSP are relatively well understood in the literature, based on extensive domain knowledge and post-analysis of the solutions. In recent years, Convolutional Neural Network (CNN) has become a popular approach to select algorithms for TSP. Compared to traditional feature-based machine learning models, CNN has an automatic feature-learning ability and demands less domain expertise. However, it is still required to generate intermediate representations, i.e., multiple images to represent TSP instances first. In this paper, we revisit the algorithm selection problem for TSP, and propose a novel Graph Neural Network (GNN), called GINES. GINES takes the coordinates of cities and distances between cities as input. It is composed of a new message-passing mechanism and a local neighborhood feature extractor to learn spatial information of TSP instances. We evaluate GINES on two benchmark datasets. The results show that GINES outperforms CNN and the original GINE models. It is better than the traditional handcrafted feature-based approach on one dataset. The code and dataset will be released in the final version of this paper.
    On the Richness of Calibration. (arXiv:2302.04118v1 [cs.LG])
    Probabilistic predictions can be evaluated through comparisons with observed label frequencies, that is, through the lens of calibration. Recent scholarship on algorithmic fairness has started to look at a growing variety of calibration-based objectives under the name of multi-calibration but has still remained fairly restricted. In this paper, we explore and analyse forms of evaluation through calibration by making explicit the choices involved in designing calibration scores. We organise these into three grouping choices and a choice concerning the agglomeration of group errors. This provides a framework for comparing previously proposed calibration scores and helps to formulate novel ones with desirable mathematical properties. In particular, we explore the possibility of grouping datapoints based on their input features rather than on predictions and formally demonstrate advantages of such approaches. We also characterise the space of suitable agglomeration functions for group errors, generalising previously proposed calibration scores. Complementary to such population-level scores, we explore calibration scores at the individual level and analyse their relationship to choices of grouping. We draw on these insights to introduce and axiomatise fairness deviation measures for population-level scores. We demonstrate that with appropriate choices of grouping, these novel global fairness scores can provide notions of (sub-)group or individual fairness.
    Multi-view Feature Extraction based on Dual Contrastive Head. (arXiv:2302.03932v1 [cs.CV])
    Multi-view feature extraction is an efficient approach for alleviating the issue of dimensionality in highdimensional multi-view data. Contrastive learning (CL), which is a popular self-supervised learning method, has recently attracted considerable attention. Most CL-based methods were constructed only from the sample level. In this study, we propose a novel multiview feature extraction method based on dual contrastive head, which introduce structural-level contrastive loss into sample-level CL-based method. Structural-level CL push the potential subspace structures consistent in any two cross views, which assists sample-level CL to extract discriminative features more effectively. Furthermore, it is proven that the relationships between structural-level CL and mutual information and probabilistic intraand inter-scatter, which provides the theoretical support for the excellent performance. Finally, numerical experiments on six real datasets demonstrate the superior performance of the proposed method compared to existing methods.
    Learning-based Online Optimization for Autonomous Mobility-on-Demand Fleet Control. (arXiv:2302.03963v1 [math.OC])
    Autonomous mobility-on-demand systems are a viable alternative to mitigate many transportation-related externalities in cities, such as rising vehicle volumes in urban areas and transportation-related pollution. However, the success of these systems heavily depends on efficient and effective fleet control strategies. In this context, we study online control algorithms for autonomous mobility-on-demand systems and develop a novel hybrid combinatorial optimization enriched machine learning pipeline which learns online dispatching and rebalancing policies from optimal full-information solutions. We test our hybrid pipeline on large-scale real-world scenarios with different vehicle fleet sizes and various request densities. We show that our approach outperforms state-of-the-art greedy, and model-predictive control approaches with respect to various KPIs, e.g., by up to 17.1% and on average by 6.3% in terms of realized profit.
    Zero-shot Sim2Real Adaptation Across Environments. (arXiv:2302.04013v1 [cs.LG])
    Simulation based learning often provides a cost-efficient recourse to reinforcement learning applications in robotics. However, simulators are generally incapable of accurately replicating real-world dynamics, and thus bridging the sim2real gap is an important problem in simulation based learning. Current solutions to bridge the sim2real gap involve hybrid simulators that are augmented with neural residual models. Unfortunately, they require a separate residual model for each individual environment configuration (i.e., a fixed setting of environment variables such as mass, friction etc.), and thus are not transferable to new environments quickly. To address this issue, we propose a Reverse Action Transformation (RAT) policy which learns to imitate simulated policies in the real-world. Once learnt from a single environment, RAT can then be deployed on top of a Universal Policy Network to achieve zero-shot adaptation to new environments. We empirically evaluate our approach in a set of continuous control tasks and observe its advantage as a few-shot and zero-shot learner over competing baselines.
    Probabilistic Attention based on Gaussian Processes for Deep Multiple Instance Learning. (arXiv:2302.04061v1 [cs.LG])
    Multiple Instance Learning (MIL) is a weakly supervised learning paradigm that is becoming increasingly popular because it requires less labeling effort than fully supervised methods. This is especially interesting for areas where the creation of large annotated datasets remains challenging, as in medicine. Although recent deep learning MIL approaches have obtained state-of-the-art results, they are fully deterministic and do not provide uncertainty estimations for the predictions. In this work, we introduce the Attention Gaussian Process (AGP) model, a novel probabilistic attention mechanism based on Gaussian Processes for deep MIL. AGP provides accurate bag-level predictions as well as instance-level explainability, and can be trained end-to-end. Moreover, its probabilistic nature guarantees robustness to overfitting on small datasets and uncertainty estimations for the predictions. The latter is especially important in medical applications, where decisions have a direct impact on the patient's health. The proposed model is validated experimentally as follows. First, its behavior is illustrated in two synthetic MIL experiments based on the well-known MNIST and CIFAR-10 datasets, respectively. Then, it is evaluated in three different real-world cancer detection experiments. AGP outperforms state-of-the-art MIL approaches, including deterministic deep learning ones. It shows a strong performance even on a small dataset with less than 100 labels and generalizes better than competing methods on an external test set. Moreover, we experimentally show that predictive uncertainty correlates with the risk of wrong predictions, and therefore it is a good indicator of reliability in practice. Our code is publicly available.
    Approximately Optimal Core Shapes for Tensor Decompositions. (arXiv:2302.03886v1 [cs.DS])
    This work studies the combinatorial optimization problem of finding an optimal core tensor shape, also called multilinear rank, for a size-constrained Tucker decomposition. We give an algorithm with provable approximation guarantees for its reconstruction error via connections to higher-order singular values. Specifically, we introduce a novel Tucker packing problem, which we prove is NP-hard, and give a polynomial-time approximation scheme based on a reduction to the 2-dimensional knapsack problem with a matroid constraint. We also generalize our techniques to tree tensor network decompositions. We implement our algorithm using an integer programming solver, and show that its solution quality is competitive with (and sometimes better than) the greedy algorithm that uses the true Tucker decomposition loss at each step, while also running up to 1000x faster.
    Rover: An online Spark SQL tuning service via generalized transfer learning. (arXiv:2302.04046v1 [cs.LG])
    Distributed data analytic engines like Spark are common choices to process massive data in industry. However, the performance of Spark SQL highly depends on the choice of configurations, where the optimal ones vary with the executed workloads. Among various alternatives for Spark SQL tuning, Bayesian optimization (BO) is a popular framework that finds near-optimal configurations given sufficient budget, but it suffers from the re-optimization issue and is not practical in real production. When applying transfer learning to accelerate the tuning process, we notice two domain-specific challenges: 1) most previous work focus on transferring tuning history, while expert knowledge from Spark engineers is of great potential to improve the tuning performance but is not well studied so far; 2) history tasks should be carefully utilized, where using dissimilar ones lead to a deteriorated performance in production. In this paper, we present Rover, a deployed online Spark SQL tuning service for efficient and safe search on industrial workloads. To address the challenges, we propose generalized transfer learning to boost the tuning performance based on external knowledge, including expert-assisted Bayesian optimization and controlled history transfer. Experiments on public benchmarks and real-world tasks show the superiority of Rover over competitive baselines. Notably, Rover saves an average of 50.1% of the memory cost on 12k real-world Spark SQL tasks in 20 iterations, among which 76.2% of the tasks achieve a significant memory reduction of over 60%.
    QS-ADN: Quasi-Supervised Artifact Disentanglement Network for Low-Dose CT Image Denoising by Local Similarity Among Unpaired Data. (arXiv:2302.03916v1 [cs.LG])
    Deep learning has been successfully applied to low-dose CT (LDCT) image denoising for reducing potential radiation risk. However, the widely reported supervised LDCT denoising networks require a training set of paired images, which is expensive to obtain and cannot be perfectly simulated. Unsupervised learning utilizes unpaired data and is highly desirable for LDCT denoising. As an example, an artifact disentanglement network (ADN) relies on unparied images and obviates the need for supervision but the results of artifact reduction are not as good as those through supervised learning.An important observation is that there is often hidden similarity among unpaired data that can be utilized. This paper introduces a new learning mode, called quasi-supervised learning, to empower the ADN for LDCT image denoising.For every LDCT image, the best matched image is first found from an unpaired normal-dose CT (NDCT) dataset. Then, the matched pairs and the corresponding matching degree as prior information are used to construct and train our ADN-type network for LDCT denoising.The proposed method is different from (but compatible with) supervised and semi-supervised learning modes and can be easily implemented by modifying existing networks. The experimental results show that the method is competitive with state-of-the-art methods in terms of noise suppression and contextual fidelity. The code and working dataset are publicly available at https://github.com/ruanyuhui/ADN-QSDL.git.
    InMyFace: Inertial and Mechanomyography-Based Sensor Fusion for Wearable Facial Activity Recognition. (arXiv:2302.04024v1 [cs.LG])
    Recognizing facial activity is a well-understood (but non-trivial) computer vision problem. However, reliable solutions require a camera with a good view of the face, which is often unavailable in wearable settings. Furthermore, in wearable applications, where systems accompany users throughout their daily activities, a permanently running camera can be problematic for privacy (and legal) reasons. This work presents an alternative solution based on the fusion of wearable inertial sensors, planar pressure sensors, and acoustic mechanomyography (muscle sounds). The sensors were placed unobtrusively in a sports cap to monitor facial muscle activities related to facial expressions. We present our integrated wearable sensor system, describe data fusion and analysis methods, and evaluate the system in an experiment with thirteen subjects from different cultural backgrounds (eight countries) and both sexes (six women and seven men). In a one-model-per-user scheme and using a late fusion approach, the system yielded an average F1 score of 85.00% for the case where all sensing modalities are combined. With a cross-user validation and a one-model-for-all-user scheme, an F1 score of 79.00% was obtained for thirteen participants (six females and seven males). Moreover, in a hybrid fusion (cross-user) approach and six classes, an average F1 score of 82.00% was obtained for eight users. The results are competitive with state-of-the-art non-camera-based solutions for a cross-user study. In addition, our unique set of participants demonstrates the inclusiveness and generalizability of the approach.
    Fortuna: A Library for Uncertainty Quantification in Deep Learning. (arXiv:2302.04019v1 [cs.LG])
    We present Fortuna, an open-source library for uncertainty quantification in deep learning. Fortuna supports a range of calibration techniques, such as conformal prediction that can be applied to any trained neural network to generate reliable uncertainty estimates, and scalable Bayesian inference methods that can be applied to Flax-based deep neural networks trained from scratch for improved uncertainty quantification and accuracy. By providing a coherent framework for advanced uncertainty quantification methods, Fortuna simplifies the process of benchmarking and helps practitioners build robust AI systems.
    Information-Theoretic Diffusion. (arXiv:2302.03792v1 [cs.LG])
    Denoising diffusion models have spurred significant gains in density modeling and image generation, precipitating an industrial revolution in text-guided AI art generation. We introduce a new mathematical foundation for diffusion models inspired by classic results in information theory that connect Information with Minimum Mean Square Error regression, the so-called I-MMSE relations. We generalize the I-MMSE relations to exactly relate the data distribution to an optimal denoising regression problem, leading to an elegant refinement of existing diffusion bounds. This new insight leads to several improvements for probability distribution estimation, including theoretical justification for diffusion model ensembling. Remarkably, our framework shows how continuous and discrete probabilities can be learned with the same regression objective, avoiding domain-specific generative models used in variational methods. Code to reproduce experiments is provided at this http URL and simplified demonstration code is at this http URL  ( 2 min )
    Sketchy: Memory-efficient Adaptive Regularization with Frequent Directions. (arXiv:2302.03764v1 [stat.ML])
    Adaptive regularization methods that exploit more than the diagonal entries exhibit state of the art performance for many tasks, but can be prohibitive in terms of memory and running time. We find the spectra of the Kronecker-factored gradient covariance matrix in deep learning (DL) training tasks are concentrated on a small leading eigenspace that changes throughout training, motivating a low-rank sketching approach. We describe a generic method for reducing memory and compute requirements of maintaining a matrix preconditioner using the Frequent Directions (FD) sketch. Our technique allows interpolation between resource requirements and the degradation in regret guarantees with rank $k$: in the online convex optimization (OCO) setting over dimension $d$, we match full-matrix $d^2$ memory regret using only $dk$ memory up to additive error in the bottom $d-k$ eigenvalues of the gradient covariance. Further, we show extensions of our work to Shampoo, placing the method on the memory-quality Pareto frontier of several large scale benchmarks.  ( 2 min )
    Compositional Score Modeling for Simulation-based Inference. (arXiv:2209.14249v2 [cs.LG] UPDATED)
    Neural Posterior Estimation methods for simulation-based inference can be ill-suited for dealing with posterior distributions obtained by conditioning on multiple observations, as they tend to require a large number of simulator calls to learn accurate approximations. In contrast, Neural Likelihood Estimation methods can handle multiple observations at inference time after learning from individual observations, but they rely on standard inference methods, such as MCMC or variational inference, which come with certain performance drawbacks. We introduce a new method based on conditional score modeling that enjoys the benefits of both approaches. We model the scores of the (diffused) posterior distributions induced by individual observations, and introduce a way of combining the learned scores to approximately sample from the target posterior distribution. Our approach is sample-efficient, can naturally aggregate multiple observations at inference time, and avoids the drawbacks of standard inference methods.
    Fully-Dynamic Approximate Decision Trees With Worst-Case Update Time Guarantees. (arXiv:2302.03994v1 [cs.DS])
    We give the first algorithm that maintains an approximate decision tree over an arbitrary sequence of insertions and deletions of labeled examples, with strong guarantees on the worst-case running time per update request. For instance, we show how to maintain a decision tree where every vertex has Gini gain within an additive $\alpha$ of the optimum by performing $O\Big(\frac{d\,(\log n)^4}{\alpha^3}\Big)$ elementary operations per update, where $d$ is the number of features and $n$ the maximum size of the active set (the net result of the update requests). We give similar bounds for the information gain and the variance gain. In fact, all these bounds are corollaries of a more general result, stated in terms of decision rules -- functions that, given a set $S$ of labeled examples, decide whether to split $S$ or predict a label. Decision rules give a unified view of greedy decision tree algorithms regardless of the example and label domains, and lead to a general notion of $\epsilon$-approximate decision trees that, for natural decision rules such as those used by ID3 or C4.5, implies the gain approximation guarantees above. The heart of our work provides a deterministic algorithm that, given any decision rule and any $\epsilon > 0$, maintains an $\epsilon$-approximate tree using $O\!\left(\frac{d\, f(n)}{n} \operatorname{poly}\frac{h}{\epsilon}\right)$ operations per update, where $f(n)$ is the complexity of evaluating the rule over a set of $n$ examples and $h$ is the maximum height of the maintained tree.
    Classification of Methods to Reduce Clinical Alarm Signals for Remote Patient Monitoring: A Critical Review. (arXiv:2302.03885v1 [cs.LG])
    Remote Patient Monitoring (RPM) is an emerging technology paradigm that helps reduce clinician workload by automated monitoring and raising intelligent alarm signals. High sensitivity and intelligent data-processing algorithms used in RPM devices result in frequent false-positive alarms, resulting in alarm fatigue. This study aims to critically review the existing literature to identify the causes of these false-positive alarms and categorize the various interventions used in the literature to eliminate these causes. That act as a catalog and helps in false alarm reduction algorithm design. A step-by-step approach to building an effective alarm signal generator for clinical use has been proposed in this work. Second, the possible causes of false-positive alarms amongst RPM applications were analyzed from the literature. Third, a critical review has been done of the various interventions used in the literature depending on causes and classification based on four major approaches: clinical knowledge, physiological data, medical sensor devices, and clinical environments. A practical clinical alarm strategy could be developed by following our pentagon approach. The first phase of this approach emphasizes identifying the various causes for the high number of false-positive alarms. Future research will focus on developing a false alarm reduction method using data mining.
    Cut your Losses with Squentropy. (arXiv:2302.03952v1 [cs.LG])
    Nearly all practical neural models for classification are trained using cross-entropy loss. Yet this ubiquitous choice is supported by little theoretical or empirical evidence. Recent work (Hui & Belkin, 2020) suggests that training using the (rescaled) square loss is often superior in terms of the classification accuracy. In this paper we propose the "squentropy" loss, which is the sum of two terms: the cross-entropy loss and the average square loss over the incorrect classes. We provide an extensive set of experiments on multi-class classification problems showing that the squentropy loss outperforms both the pure cross entropy and rescaled square losses in terms of the classification accuracy. We also demonstrate that it provides significantly better model calibration than either of these alternative losses and, furthermore, has less variance with respect to the random initialization. Additionally, in contrast to the square loss, squentropy loss can typically be trained using exactly the same optimization parameters, including the learning rate, as the standard cross-entropy loss, making it a true "plug-and-play" replacement. Finally, unlike the rescaled square loss, multiclass squentropy contains no parameters that need to be adjusted.
    Structural hierarchical learning for energy networks. (arXiv:2302.03978v1 [cs.LG])
    Many sectors nowadays require accurate and coherent predictions across their organization to effectively operate. Otherwise, decision-makers would be planning using disparate views of the future, resulting in inconsistent decisions across their sectors. To secure coherency across hierarchies, recent research has put forward hierarchical learning, a coherency-informed hierarchical regressor leveraging the power of machine learning thanks to a custom loss function founded on optimal reconciliation methods. While promising potentials were outlined, results exhibited discordant performances in which coherency information only improved hierarchical forecasts in one setting. This work proposes to tackle these obstacles by investigating custom neural network designs inspired by the topological structures of hierarchies. Results unveil that, in a data-limited setting, structural models with fewer connections perform overall best and demonstrate the coherency information value for both accuracy and coherency forecasting performances, provided individual forecasts were generated within reasonable accuracy limits. Overall, this work expands and improves hierarchical learning methods thanks to a structurally-scaled learning mechanism extension coupled with tailored network designs, producing a resourceful, data-efficient, and information-rich learning process.
    Efficient Joint Learning for Clinical Named Entity Recognition and Relation Extraction Using Fourier Networks: A Use Case in Adverse Drug Events. (arXiv:2302.04185v1 [cs.CL])
    Current approaches for clinical information extraction are inefficient in terms of computational costs and memory consumption, hindering their application to process large-scale electronic health records (EHRs). We propose an efficient end-to-end model, the Joint-NER-RE-Fourier (JNRF), to jointly learn the tasks of named entity recognition and relation extraction for documents of variable length. The architecture uses positional encoding and unitary batch sizes to process variable length documents and uses a weight-shared Fourier network layer for low-complexity token mixing. Finally, we reach the theoretical computational complexity lower bound for relation extraction using a selective pooling strategy and distance-aware attention weights with trainable polynomial distance functions. We evaluated the JNRF architecture using the 2018 N2C2 ADE benchmark to jointly extract medication-related entities and relations in variable-length EHR summaries. JNRF outperforms rolling window BERT with selective pooling by 0.42%, while being twice as fast to train. Compared to state-of-the-art BiLSTM-CRF architectures on the N2C2 ADE benchmark, results show that the proposed approach trains 22 times faster and reduces GPU memory consumption by 1.75 folds, with a reasonable performance tradeoff of 90%, without the use of external tools, hand-crafted rules or post-processing. Given the significant carbon footprint of deep learning models and the current energy crises, these methods could support efficient and cleaner information extraction in EHRs and other types of large-scale document databases.
    Fast Linear Model Trees by PILOT. (arXiv:2302.03931v1 [stat.ML])
    Linear model trees are regression trees that incorporate linear models in the leaf nodes. This preserves the intuitive interpretation of decision trees and at the same time enables them to better capture linear relationships, which is hard for standard decision trees. But most existing methods for fitting linear model trees are time consuming and therefore not scalable to large data sets. In addition, they are more prone to overfitting and extrapolation issues than standard regression trees. In this paper we introduce PILOT, a new algorithm for linear model trees that is fast, regularized, stable and interpretable. PILOT trains in a greedy fashion like classic regression trees, but incorporates an $L^2$ boosting approach and a model selection rule for fitting linear models in the nodes. The abbreviation PILOT stands for $PI$ecewise $L$inear $O$rganic $T$ree, where `organic' refers to the fact that no pruning is carried out. PILOT has the same low time and space complexity as CART without its pruning. An empirical study indicates that PILOT tends to outperform standard decision trees and other linear model trees on a variety of data sets. Moreover, we prove its consistency in an additive model setting under weak assumptions. When the data is generated by a linear model, the convergence rate is polynomial.
    ASTRIDE: Adaptive Symbolization for Time Series Databases. (arXiv:2302.04097v1 [cs.LG])
    We introduce ASTRIDE (Adaptive Symbolization for Time seRIes DatabasEs), a novel symbolic representation of time series, along with its accelerated variant FASTRIDE (Fast ASTRIDE). Unlike most symbolization procedures, ASTRIDE is adaptive during both the segmentation step by performing change-point detection and the quantization step by using quantiles. Instead of proceeding signal by signal, ASTRIDE builds a dictionary of symbols that is common to all signals in a data set. We also introduce D-GED (Dynamic General Edit Distance), a novel similarity measure on symbolic representations based on the general edit distance. We demonstrate the performance of the ASTRIDE and FASTRIDE representations compared to SAX (Symbolic Aggregate approXimation), 1d-SAX, SFA (Symbolic Fourier Approximation), and ABBA (Adaptive Brownian Bridge-based Aggregation) on reconstruction and, when applicable, on classification tasks. These algorithms are evaluated on 86 univariate equal-size data sets from the UCR Time Series Classification Archive. An open source GitHub repository called astride is made available to reproduce all the experiments in Python.
    What Matters In The Structured Pruning of Generative Language Models?. (arXiv:2302.03773v1 [cs.CL])
    Auto-regressive large language models such as GPT-3 require enormous computational resources to use. Traditionally, structured pruning methods are employed to reduce resource usage. However, their application to and efficacy for generative language models is heavily under-explored. In this paper we conduct an comprehensive evaluation of common structured pruning methods, including magnitude, random, and movement pruning on the feed-forward layers in GPT-type models. Unexpectedly, random pruning results in performance that is comparable to the best established methods, across multiple natural language generation tasks. To understand these results, we provide a framework for measuring neuron-level redundancy of models pruned by different methods, and discover that established structured pruning methods do not take into account the distinctiveness of neurons, leaving behind excess redundancies. In view of this, we introduce Globally Unique Movement (GUM) to improve the uniqueness of neurons in pruned models. We then discuss the effects of our techniques on different redundancy metrics to explain the improved performance.
    Fairness in Matching under Uncertainty. (arXiv:2302.03810v1 [cs.LG])
    The prevalence and importance of algorithmic two-sided marketplaces has drawn attention to the issue of fairness in such settings. Algorithmic decisions are used in assigning students to schools, users to advertisers, and applicants to job interviews. These decisions should heed the preferences of individuals, and simultaneously be fair with respect to their merits (synonymous with fit, future performance, or need). Merits conditioned on observable features are always uncertain, a fact that is exacerbated by the widespread use of machine learning algorithms to infer merit from the observables. As our key contribution, we carefully axiomatize a notion of individual fairness in the two-sided marketplace setting which respects the uncertainty in the merits; indeed, it simultaneously recognizes uncertainty as the primary potential cause of unfairness and an approach to address it. We design a linear programming framework to find fair utility-maximizing distributions over allocations, and we show that the linear program is robust to perturbations in the estimated parameters of the uncertain merit distributions, a key property in combining the approach with machine learning techniques.
    Noise2Music: Text-conditioned Music Generation with Diffusion Models. (arXiv:2302.03917v1 [cs.SD])
    We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator model, which generates an intermediate representation conditioned on text, and a cascader model, which generates high-fidelity audio conditioned on the intermediate representation and possibly the text, are trained and utilized in succession to generate high-fidelity music. We explore two options for the intermediate representation, one using a spectrogram and the other using audio with lower fidelity. We find that the generated audio is not only able to faithfully reflect key elements of the text prompt such as genre, tempo, instruments, mood, and era, but goes beyond to ground fine-grained semantics of the prompt. Pretrained large language models play a key role in this story -- they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models. Generated examples: https://google-research.github.io/noise2music
    Predictable MDP Abstraction for Unsupervised Model-Based RL. (arXiv:2302.03921v1 [cs.LG])
    A key component of model-based reinforcement learning (RL) is a dynamics model that predicts the outcomes of actions. Errors in this predictive model can degrade the performance of model-based controllers, and complex Markov decision processes (MDPs) can present exceptionally difficult prediction problems. To mitigate this issue, we propose predictable MDP abstraction (PMA): instead of training a predictive model on the original MDP, we train a model on a transformed MDP with a learned action space that only permits predictable, easy-to-model actions, while covering the original state-action space as much as possible. As a result, model learning becomes easier and more accurate, which allows robust, stable model-based planning or model-based RL. This transformation is learned in an unsupervised manner, before any task is specified by the user. Downstream tasks can then be solved with model-based control in a zero-shot fashion, without additional environment interactions. We theoretically analyze PMA and empirically demonstrate that PMA leads to significant improvements over prior unsupervised model-based RL approaches in a range of benchmark environments. Our code and videos are available at https://seohong.me/projects/pma/
    Standing Between Past and Future: Spatio-Temporal Modeling for Multi-Camera 3D Multi-Object Tracking. (arXiv:2302.03802v1 [cs.CV])
    This work proposes an end-to-end multi-camera 3D multi-object tracking (MOT) framework. It emphasizes spatio-temporal continuity and integrates both past and future reasoning for tracked objects. Thus, we name it "Past-and-Future reasoning for Tracking" (PF-Track). Specifically, our method adapts the "tracking by attention" framework and represents tracked instances coherently over time with object queries. To explicitly use historical cues, our "Past Reasoning" module learns to refine the tracks and enhance the object features by cross-attending to queries from previous frames and other objects. The "Future Reasoning" module digests historical information and predicts robust future trajectories. In the case of long-term occlusions, our method maintains the object positions and enables re-association by integrating motion predictions. On the nuScenes dataset, our method improves AMOTA by a large margin and remarkably reduces ID-Switches by 90% compared to prior approaches, which is an order of magnitude less. The code and models are made available at https://github.com/TRI-ML/PF-Track.
    Layered State Discovery for Incremental Autonomous Exploration. (arXiv:2302.03789v1 [cs.LG])
    We study the autonomous exploration (AX) problem proposed by Lim & Auer (2012). In this setting, the objective is to discover a set of $\epsilon$-optimal policies reaching a set $\mathcal{S}_L^{\rightarrow}$ of incrementally $L$-controllable states. We introduce a novel layered decomposition of the set of incrementally $L$-controllable states that is based on the iterative application of a state-expansion operator. We leverage these results to design Layered Autonomous Exploration (LAE), a novel algorithm for AX that attains a sample complexity of $\tilde{\mathcal{O}}(LS^{\rightarrow}_{L(1+\epsilon)}\Gamma_{L(1+\epsilon)} A \ln^{12}(S^{\rightarrow}_{L(1+\epsilon)})/\epsilon^2)$, where $S^{\rightarrow}_{L(1+\epsilon)}$ is the number of states that are incrementally $L(1+\epsilon)$-controllable, $A$ is the number of actions, and $\Gamma_{L(1+\epsilon)}$ is the branching factor of the transitions over such states. LAE improves over the algorithm of Tarbouriech et al. (2020a) by a factor of $L^2$ and it is the first algorithm for AX that works in a countably-infinite state space. Moreover, we show that, under a certain identifiability assumption, LAE achieves minimax-optimal sample complexity of $\tilde{\mathcal{O}}(LS^{\rightarrow}_{L}A\ln^{12}(S^{\rightarrow}_{L})/\epsilon^2)$, outperforming existing algorithms and matching for the first time the lower bound proved by Cai et al. (2022) up to logarithmic factors.
    Prediction approaches for partly missing multi-omics covariate data: A literature review and an empirical comparison study. (arXiv:2302.03991v1 [q-bio.GN])
    As the availability of omics data has increased in the last few years, more multi-omics data have been generated, that is, high-dimensional molecular data consisting of several types such as genomic, transcriptomic, or proteomic data, all obtained from the same patients. Such data lend themselves to being used as covariates in automatic outcome prediction because each omics type may contribute unique information, possibly improving predictions compared to using only one omics data type. Frequently, however, in the training data and the data to which automatic prediction rules should be applied, the test data, the different omics data types are not available for all patients. We refer to this type of data as block-wise missing multi-omics data. First, we provide a literature review on existing prediction methods applicable to such data. Subsequently, using a collection of 13 publicly available multi-omics data sets, we compare the predictive performances of several of these approaches for different block-wise missingness patterns. Finally, we discuss the results of this empirical comparison study and draw some tentative conclusions.
    Intrinsic Rewards from Self-Organizing Feature Maps for Exploration in Reinforcement Learning. (arXiv:2302.04125v1 [cs.LG])
    We introduce an exploration bonus for deep reinforcement learning methods calculated using self-organising feature maps. Our method uses adaptive resonance theory (ART) providing online, unsupervised clustering to quantify the novelty of a state. This heuristic is used to add an intrinsic reward to the extrinsic reward signal for then to optimize the agent to maximize the sum of these two rewards. We find that this method was able to play the game Ordeal at a human level after a comparable number of training epochs to ICM arXiv:1705.05464. Agents augmented with RND arXiv:1810.12894 were unable to achieve the same level of performance in our space of hyperparameters.
    Robustness to Spurious Correlations Improves Semantic Out-of-Distribution Detection. (arXiv:2302.04132v1 [cs.LG])
    Methods which utilize the outputs or feature representations of predictive models have emerged as promising approaches for out-of-distribution (OOD) detection of image inputs. However, these methods struggle to detect OOD inputs that share nuisance values (e.g. background) with in-distribution inputs. The detection of shared-nuisance out-of-distribution (SN-OOD) inputs is particularly relevant in real-world applications, as anomalies and in-distribution inputs tend to be captured in the same settings during deployment. In this work, we provide a possible explanation for SN-OOD detection failures and propose nuisance-aware OOD detection to address them. Nuisance-aware OOD detection substitutes a classifier trained via empirical risk minimization and cross-entropy loss with one that 1. is trained under a distribution where the nuisance-label relationship is broken and 2. yields representations that are independent of the nuisance under this distribution, both marginally and conditioned on the label. We can train a classifier to achieve these objectives using Nuisance-Randomized Distillation (NuRD), an algorithm developed for OOD generalization under spurious correlations. Output- and feature-based nuisance-aware OOD detection perform substantially better than their original counterparts, succeeding even when detection based on domain generalization algorithms fails to improve performance.
    Concept Algebra for Text-Controlled Vision Models. (arXiv:2302.03693v1 [cs.CL])
    This paper concerns the control of text-guided generative models, where a user provides a natural language prompt and the model generates samples based on this input. Prompting is intuitive, general, and flexible. However, there are significant limitations: prompting can fail in surprising ways, and it is often unclear how to find a prompt that will elicit some desired target behavior. A core difficulty for developing methods to overcome these issues is that failures are know-it-when-you-see-it -- it's hard to fix bugs if you can't state precisely what the model should have done! In this paper, we introduce a formalization of "what the user intended" in terms of latent concepts implicit to the data generating process that the model was trained on. This formalization allows us to identify some fundamental limitations of prompting. We then use the formalism to develop concept algebra to overcome these limitations. Concept algebra is a way of directly manipulating the concepts expressed in the output through algebraic operations on a suitably defined representation of input prompts. We give examples using concept algebra to overcome limitations of prompting, including concept transfer through arithmetic, and concept nullification through projection. Code available at https://github.com/zihao12/concept-algebra.
    Eliciting User Preferences for Personalized Multi-Objective Decision Making through Comparative Feedback. (arXiv:2302.03805v1 [cs.LG])
    In classic reinforcement learning (RL) and decision making problems, policies are evaluated with respect to a scalar reward function, and all optimal policies are the same with regards to their expected return. However, many real-world problems involve balancing multiple, sometimes conflicting, objectives whose relative priority will vary according to the preferences of each user. Consequently, a policy that is optimal for one user might be sub-optimal for another. In this work, we propose a multi-objective decision making framework that accommodates different user preferences over objectives, where preferences are learned via policy comparisons. Our model consists of a Markov decision process with a vector-valued reward function, with each user having an unknown preference vector that expresses the relative importance of each objective. The goal is to efficiently compute a near-optimal policy for a given user. We consider two user feedback models. We first address the case where a user is provided with two policies and returns their preferred policy as feedback. We then move to a different user feedback model, where a user is instead provided with two small weighted sets of representative trajectories and selects the preferred one. In both cases, we suggest an algorithm that finds a nearly optimal policy for the user using a small number of comparison queries.
    Towards causally linking architectural parametrizations to algorithmic bias in neural networks. (arXiv:2302.03750v1 [cs.CV])
    Training dataset biases are by far the most scrutinized factors when explaining algorithmic biases of neural networks. In contrast, hyperparameters related to the neural network architecture, e.g., the number of layers or choice of activation functions, have largely been ignored even though different network parameterizations are known to induce different implicit biases over learned features. For example, convolutional kernel size has been shown to bias CNNs towards different frequencies. In order to study the effect of these hyperparameters, we designed a causal framework for linking an architectural hyperparameter to algorithmic bias. Our framework is experimental, in that several versions of a network are trained with an intervention to a specific hyperparameter, and the resulting causal effect of this choice on performance bias is measured. We focused on the causal relationship between sensitivity to high-frequency image details and face analysis classification performance across different subpopulations (race/gender). In this work, we show that modifying a CNN hyperparameter (convolutional kernel size), even in one layer of a CNN, will not only change a fundamental characteristic of the learned features (frequency content) but that this change can vary significantly across data subgroups (race/gender populations) leading to biased generalization performance even in the presence of a balanced dataset.
    GraphGUIDE: interpretable and controllable conditional graph generation with discrete Bernoulli diffusion. (arXiv:2302.03790v1 [cs.LG])
    Diffusion models achieve state-of-the-art performance in generating realistic objects and have been successfully applied to images, text, and videos. Recent work has shown that diffusion can also be defined on graphs, including graph representations of drug-like molecules. Unfortunately, it remains difficult to perform conditional generation on graphs in a way which is interpretable and controllable. In this work, we propose GraphGUIDE, a novel framework for graph generation using diffusion models, where edges in the graph are flipped or set at each discrete time step. We demonstrate GraphGUIDE on several graph datasets, and show that it enables full control over the conditional generation of arbitrary structural properties without relying on predefined labels. Our framework for graph diffusion can have a large impact on the interpretable conditional generation of graphs, including the generation of drug-like molecules with desired properties in a way which is informed by experimental evidence.
    Toward a Theory of Causation for Interpreting Neural Code Models. (arXiv:2302.03788v1 [cs.SE])
    Neural Language Models of Code, or Neural Code Models (NCMs), are rapidly progressing from research prototypes to commercial developer tools. As such, understanding the capabilities and limitations of such models is becoming critical. However, the abilities of these models are typically measured using automated metrics that often only reveal a portion of their real-world performance. While, in general, the performance of NCMs appears promising, currently much is unknown about how such models arrive at decisions. To this end, this paper introduces $do_{code}$, a post-hoc interpretability methodology specific to NCMs that is capable of explaining model predictions. $do_{code}$ is based upon causal inference to enable programming language-oriented explanations. While the theoretical underpinnings of $do_{code}$ are extensible to exploring different model properties, we provide a concrete instantiation that aims to mitigate the impact of spurious correlations by grounding explanations of model behavior in properties of programming languages. To demonstrate the practical benefit of $do_{code}$, we illustrate the insights that our framework can provide by performing a case study on two popular deep learning architectures and nine NCMs. The results of this case study illustrate that our studied NCMs are sensitive to changes in code syntax and statistically learn to predict tokens related to blocks of code (e.g., brackets, parenthesis, semicolon) with less confounding bias as compared to other programming language constructs. These insights demonstrate the potential of $do_{code}$ as a useful model debugging mechanism that may aid in discovering biases and limitations in NCMs.
    The XPRESS Challenge: Xray Projectomic Reconstruction -- Extracting Segmentation with Skeletons. (arXiv:2302.03819v1 [cs.CV])
    The wiring and connectivity of neurons form a structural basis for the function of the nervous system. Advances in volume electron microscopy (EM) and image segmentation have enabled mapping of circuit diagrams (connectomics) within local regions of the mouse brain. However, applying volume EM over the whole brain is not currently feasible due to technological challenges. As a result, comprehensive maps of long-range connections between brain regions are lacking. Recently, we demonstrated that X-ray holographic nanotomography (XNH) can provide high-resolution images of brain tissue at a much larger scale than EM. In particular, XNH is wellsuited to resolve large, myelinated axon tracts (white matter) that make up the bulk of long-range connections (projections) and are critical for inter-region communication. Thus, XNH provides an imaging solution for brain-wide projectomics. However, because XNH data is typically collected at lower resolutions and larger fields-of-view than EM, accurate segmentation of XNH images remains an important challenge that we present here. In this task, we provide volumetric XNH images of cortical white matter axons from the mouse brain along with ground truth annotations for axon trajectories. Manual voxel-wise annotation of ground truth is a time-consuming bottleneck for training segmentation networks. On the other hand, skeleton-based ground truth is much faster to annotate, and sufficient to determine connectivity. Therefore, we encourage participants to develop methods to leverage skeleton-based training. To this end, we provide two types of ground-truth annotations: a small volume of voxel-wise annotations and a larger volume with skeleton-based annotations. Entries will be evaluated on how accurately the submitted segmentations agree with the ground-truth skeleton annotations.
    AI and Core Electoral Processes: Mapping the Horizons. (arXiv:2302.03774v1 [cs.CY])
    Significant enthusiasm around AI uptake has been witnessed across societies globally. The electoral process -- the time, place and manner of elections within democratic nations -- has been among those very rare sectors in which AI has not penetrated much. Electoral management bodies in many countries have recently started exploring and deliberating over the use of AI in the electoral process. In this paper, we consider five representative avenues within the core electoral process which have potential for AI usage, and map the challenges involved in using AI within them. These five avenues are: voter list maintenance, determining polling booth locations, polling booth protection processes, voter authentication and video monitoring of elections. Within each of these avenues, we lay down the context, illustrate current or potential usage of AI, and discuss extant or potential ramifications of AI usage, and potential directions for mitigating risks while considering AI usage. We believe that the scant current usage of AI within electoral processes provides a very rare opportunity, that of being able to deliberate on the risks and mitigation possibilities, prior to real and widespread AI deployment. This paper is an attempt to map the horizons of risks and opportunities in using AI within the electoral processes and to help shape the debate around the topic.
    Participatory Systems for Personalized Prediction. (arXiv:2302.03874v1 [cs.LG])
    Machine learning models are often personalized based on information that is protected, sensitive, self-reported, or costly to acquire. These models use information about people, but do not facilitate nor inform their \emph{consent}. Individuals cannot opt out of reporting information that a model needs to personalize their predictions, nor tell if they would benefit from personalization in the first place. In this work, we introduce a new family of prediction models, called \emph{participatory systems}, that allow individuals to opt into personalization at prediction time. We present a model-agnostic algorithm to learn participatory systems for supervised learning tasks where models are personalized with categorical group attributes. We conduct a comprehensive empirical study of participatory systems in clinical prediction tasks, comparing them to common approaches for personalization and imputation. Our results demonstrate that participatory systems can facilitate and inform consent in a way that improves performance and privacy across all groups who report personal data.
    Data-driven Protection of Transformers, Phase Angle Regulators, and Transmission Lines in Interconnected Power Systems. (arXiv:2302.03826v1 [eess.SP])
    This dissertation highlights the growing interest in and adoption of machine learning (ML) approaches for fault detection in modern power grids. Once a fault has occurred, it must be identified quickly and preventative steps must be taken to remove or insulate it. As a result, detecting, locating, and classifying faults early and accurately can improve safety and dependability while reducing downtime and hardware damage. ML-based solutions and tools to carry out effective data processing and analysis to aid power system operations and decision-making are becoming preeminent with better system condition awareness and data availability. Power transformers, Phase Shift Transformers or Phase Angle Regulators, and transmission lines are critical components in power systems, and ensuring their safety is a primary issue. Differential relays are commonly employed to protect transformers, whereas distance relays are utilized to protect transmission lines. Magnetizing inrush, overexcitation, and current transformer saturation make transformer protection a challenge. Furthermore, non-standard phase shift, series core saturation, low turn-to-turn, and turn-to-ground fault currents are non-traditional problems associated with Phase Angle Regulators. Faults during symmetrical power swings and unstable power swings may cause mal-operation of distance relays and unintentional and uncontrolled islanding. The distance relays also mal-operate for transmission lines connected to type-3 wind farms. The conventional protection techniques would no longer be adequate to address the above challenges due to limitations in handling and analyzing massive amounts of data, limited generalizability, incapability to model non-linear systems, etc. These limitations of differential and distance protection methods bring forward the motivation of using ML in addressing various protection challenges.
    Persuading a Behavioral Agent: Approximately Best Responding and Learning. (arXiv:2302.03719v1 [cs.GT])
    The classic Bayesian persuasion model assumes a Bayesian and best-responding receiver. We study a relaxation of the Bayesian persuasion model where the receiver can approximately best respond to the sender's signaling scheme. We show that, under natural assumptions, (1) the sender can find a signaling scheme that guarantees itself an expected utility almost as good as its optimal utility in the classic model, no matter what approximately best-responding strategy the receiver uses; (2) on the other hand, there is no signaling scheme that gives the sender much more utility than its optimal utility in the classic model, even if the receiver uses the approximately best-responding strategy that is best for the sender. Together, (1) and (2) imply that the approximately best-responding behavior of the receiver does not affect the sender's maximal achievable utility a lot in the Bayesian persuasion problem. The proofs of both results rely on the idea of robustification of a Bayesian persuasion scheme: given a pair of the sender's signaling scheme and the receiver's strategy, we can construct another signaling scheme such that the receiver prefers to use that strategy in the new scheme more than in the original scheme, and the two schemes give the sender similar utilities. As an application of our main result (1), we show that, in a repeated Bayesian persuasion model where the receiver learns to respond to the sender by some algorithms, the sender can do almost as well as in the classic model. Interestingly, unlike (2), with a learning receiver the sender can sometimes do much better than in the classic model.
    Characterizing Financial Market Coverage using Artificial Intelligence. (arXiv:2302.03694v1 [q-fin.ST])
    This paper scrutinizes a database of over 4900 YouTube videos to characterize financial market coverage. Financial market coverage generates a large number of videos. Therefore, watching these videos to derive actionable insights could be challenging and complex. In this paper, we leverage Whisper, a speech-to-text model from OpenAI, to generate a text corpus of market coverage videos from Bloomberg and Yahoo Finance. We employ natural language processing to extract insights regarding language use from the market coverage. Moreover, we examine the prominent presence of trending topics and their evolution over time, and the impacts that some individuals and organizations have on the financial market. Our characterization highlights the dynamics of the financial market coverage and provides valuable insights reflecting broad discussions regarding recent financial events and the world economy.
    Federated Learning as Variational Inference: A Scalable Expectation Propagation Approach. (arXiv:2302.04228v1 [cs.LG])
    The canonical formulation of federated learning treats it as a distributed optimization problem where the model parameters are optimized against a global loss function that decomposes across client loss functions. A recent alternative formulation instead treats federated learning as a distributed inference problem, where the goal is to infer a global posterior from partitioned client data (Al-Shedivat et al., 2021). This paper extends the inference view and describes a variational inference formulation of federated learning where the goal is to find a global variational posterior that well-approximates the true posterior. This naturally motivates an expectation propagation approach to federated learning (FedEP), where approximations to the global posterior are iteratively refined through probabilistic message-passing between the central server and the clients. We conduct an extensive empirical study across various algorithmic considerations and describe practical strategies for scaling up expectation propagation to the modern federated setting. We apply FedEP on standard federated learning benchmarks and find that it outperforms strong baselines in terms of both convergence speed and accuracy.
    How to Trust Your Diffusion Model: A Convex Optimization Approach to Conformal Risk Control. (arXiv:2302.03791v1 [stat.ML])
    Score-based generative modeling, informally referred to as diffusion models, continue to grow in popularity across several important domains and tasks. While they provide high-quality and diverse samples from empirical distributions, important questions remain on the reliability and trustworthiness of these sampling procedures for their responsible use in critical scenarios. Conformal prediction is a modern tool to construct finite-sample, distribution-free uncertainty guarantees for any black-box predictor. In this work, we focus on image-to-image regression tasks and we present a generalization of the Risk-Controlling Prediction Sets (RCPS) procedure, that we term $K$-RCPS, which allows to $(i)$ provide entrywise calibrated intervals for future samples of any diffusion model, and $(ii)$ control a certain notion of risk with respect to a ground truth image with minimal mean interval length. Differently from existing conformal risk control procedures, ours relies on a novel convex optimization approach that allows for multidimensional risk control while provably minimizing the mean interval length. We illustrate our approach on two real-world image denoising problems: on natural images of faces as well as on computed tomography (CT) scans of the abdomen, demonstrating state of the art performance.
    Optimal Stochastic Non-smooth Non-convex Optimization through Online-to-Non-convex Conversion. (arXiv:2302.03775v1 [cs.LG])
    We present new algorithms for optimizing non-smooth, non-convex stochastic objectives based on a novel analysis technique. This improves the current best-known complexity for finding a $(\delta,\epsilon)$-stationary point from $O(\epsilon^{-4}\delta^{-1})$ stochastic gradient queries to $O(\epsilon^{-3}\delta^{-1})$, which we also show to be optimal. Our primary technique is a reduction from non-smooth non-convex optimization to online learning, after which our results follow from standard regret bounds in online learning. For deterministic and second-order smooth objectives, applying more advanced optimistic online learning techniques enables a new complexity of $O(\epsilon^{-1.5}\delta^{-0.5})$. Our techniques also recover all optimal or best-known results for finding $\epsilon$ stationary points of smooth or second-order smooth objectives in both stochastic and deterministic settings.
    Taming Local Effects in Graph-based Spatiotemporal Forecasting. (arXiv:2302.04071v1 [cs.LG])
    Spatiotemporal graph neural networks have shown to be effective in time series forecasting applications, achieving better performance than standard univariate predictors in several settings. These architectures take advantage of a graph structure and relational inductive biases to learn a single (global) inductive model to predict any number of the input time series, each associated with a graph node. Despite the gain achieved in computational and data efficiency w.r.t. fitting a set of local models, relying on a single global model can be a limitation whenever some of the time series are generated by a different spatiotemporal stochastic process. The main objective of this paper is to understand the interplay between globality and locality in graph-based spatiotemporal forecasting, while contextually proposing a methodological framework to rationalize the practice of including trainable node embeddings in such architectures. We ascribe to trainable node embeddings the role of amortizing the learning of specialized components. Moreover, embeddings allow for 1) effectively combining the advantages of shared message-passing layers with node-specific parameters and 2) efficiently transferring the learned model to new node sets. Supported by strong empirical evidence, we provide insights and guidelines for specializing graph-based models to the dynamics of each time series and show how this aspect plays a crucial role in obtaining accurate predictions.
    Policy Evaluation in Decentralized POMDPs with Belief Sharing. (arXiv:2302.04151v1 [cs.LG])
    Most works on multi-agent reinforcement learning focus on scenarios where the state of the environment is fully observable. In this work, we consider a cooperative policy evaluation task in which agents are not assumed to observe the environment state directly. Instead, agents can only have access to noisy observations and to belief vectors. It is well-known that finding global posterior distributions under multi-agent settings is generally NP-hard. As a remedy, we propose a fully decentralized belief forming strategy that relies on individual updates and on localized interactions over a communication network. In addition to the exchange of the beliefs, agents exploit the communication network by exchanging value function parameter estimates as well. We analytically show that the proposed strategy allows information to diffuse over the network, which in turn allows the agents' parameters to have a bounded difference with a centralized baseline. A multi-sensor target tracking application is considered in the simulations.
    A Survey on Event Prediction Methods from a Systems Perspective: Bringing Together Disparate Research Areas. (arXiv:2302.04018v1 [cs.AI])
    Event prediction is the ability of anticipating future events, i.e., future real-world occurrences, and aims to support the user in deciding on actions that change future events towards a desired state. An event prediction method learns the relation between features of past events and future events. It is applied to newly observed events to predict corresponding future events that are evaluated with respect to the user's desired future state. If the predicted future events do not comply with this state, actions are taken towards achieving desirable future states. Evidently, event prediction is valuable in many application domains such as business and natural disasters. The diversity of application domains results in a diverse range of methods that are scattered across various research areas which, in turn, use different terminology for event prediction methods. Consequently, sharing methods and knowledge for developing future event prediction methods is restricted. To facilitate knowledge sharing on account of a comprehensive classification, integration, and assessment of event prediction methods, we combine taxonomies and take a systems perspective to integrate event prediction methods into a single system, elicit requirements and assess existing work with respect to the requirements. Based on the assessment, we identify open challenges and discuss future research directions.
    Systematically Finding Security Vulnerabilities in Black-Box Code Generation Models. (arXiv:2302.04012v1 [cs.CR])
    Recently, large language models for code generation have achieved breakthroughs in several programming language tasks. Their advances in competition-level programming problems have made them an emerging pillar in AI-assisted pair programming. Tools such as GitHub Copilot are already part of the daily programming workflow and are used by more than a million developers. The training data for these models is usually collected from open-source repositories (e.g., GitHub) that contain software faults and security vulnerabilities. This unsanitized training data can lead language models to learn these vulnerabilities and propagate them in the code generation procedure. Given the wide use of these models in the daily workflow of developers, it is crucial to study the security aspects of these models systematically. In this work, we propose the first approach to automatically finding security vulnerabilities in black-box code generation models. To achieve this, we propose a novel black-box inversion approach based on few-shot prompting. We evaluate the effectiveness of our approach by examining code generation models in the generation of high-risk security weaknesses. We show that our approach automatically and systematically finds 1000s of security vulnerabilities in various code generation models, including the commercial black-box model GitHub Copilot.
    Leveraging User-Triggered Supervision in Contextual Bandits. (arXiv:2302.03784v1 [cs.LG])
    We study contextual bandit (CB) problems, where the user can sometimes respond with the best action in a given context. Such an interaction arises, for example, in text prediction or autocompletion settings, where a poor suggestion is simply ignored and the user enters the desired text instead. Crucially, this extra feedback is user-triggered on only a subset of the contexts. We develop a new framework to leverage such signals, while being robust to their biased nature. We also augment standard CB algorithms to leverage the signal, and show improved regret guarantees for the resulting algorithms under a variety of conditions on the helpfulness of and bias inherent in this feedback.
    A Systematic Performance Analysis of Deep Perceptual Loss Networks Breaks Transfer Learning Conventions. (arXiv:2302.04032v1 [cs.CV])
    Deep perceptual loss is a type of loss function in computer vision that aims to mimic human perception by using the deep features extracted from neural networks. In recent years the method has been applied to great effect on a host of interesting computer vision tasks, especially for tasks with image or image-like outputs. Many applications of the method use pretrained networks, often convolutional networks, for loss calculation. Despite the increased interest and broader use, more effort is needed toward exploring which networks to use for calculating deep perceptual loss and from which layers to extract the features. This work aims to rectify this by systematically evaluating a host of commonly used and readily available, pretrained networks for a number of different feature extraction points on four existing use cases of deep perceptual loss. The four use cases are implementations of previous works where the selected networks and extraction points are evaluated instead of the networks and extraction points used in the original work. The experimental tasks are dimensionality reduction, image segmentation, super-resolution, and perceptual similarity. The performance on these four tasks, attributes of the networks, and extraction points are then used as a basis for an in-depth analysis. This analysis uncovers essential information regarding which architectures provide superior performance for deep perceptual loss and how to choose an appropriate extraction point for a particular task and dataset. Furthermore, the work discusses the implications of the results for deep perceptual loss and the broader field of transfer learning. The results break commonly held assumptions in transfer learning, which imply that deep perceptual loss deviates from most transfer learning settings or that these assumptions need a thorough re-evaluation.
    Futuristic Variations and Analysis in Fundus Images Corresponding to Biological Traits. (arXiv:2302.03839v1 [eess.IV])
    Fundus image captures rear of an eye, and which has been studied for the diseases identification, classification, segmentation, generation, and biological traits association using handcrafted, conventional, and deep learning methods. In biological traits estimation, most of the studies have been carried out for the age prediction and gender classification with convincing results. However, the current study utilizes the cutting-edge deep learning (DL) algorithms to estimate biological traits in terms of age and gender together with associating traits to retinal visuals. For the traits association, our study embeds aging as the label information into the proposed DL model to learn knowledge about the effected regions with aging. Our proposed DL models, named FAG-Net and FGC-Net, correspondingly estimate biological traits (age and gender) and generates fundus images. FAG-Net can generate multiple variants of an input fundus image given a list of ages as conditions. Our study analyzes fundus images and their corresponding association with biological traits, and predicts of possible spreading of ocular disease on fundus images given age as condition to the generative model. Our proposed models outperform the randomly selected state of-the-art DL models.
    DDeMON: Ontology-based function prediction by Deep Learning from Dynamic Multiplex Networks. (arXiv:2302.03907v1 [q-bio.GN])
    Biological systems can be studied at multiple levels of information, including gene, protein, RNA and different interaction networks levels. The goal of this work is to explore how the fusion of systems' level information with temporal dynamics of gene expression can be used in combination with non-linear approximation power of deep neural networks to predict novel gene functions in a non-model organism potato \emph{Solanum tuberosum}. We propose DDeMON (Dynamic Deep learning from temporal Multiplex Ontology-annotated Networks), an approach for scalable, systems-level inference of function annotation using time-dependent multiscale biological information. The proposed method, which is capable of considering billions of potential links between the genes of interest, was applied on experimental gene expression data and the background knowledge network to reliably classify genes with unknown function into five different functional ontology categories, linked to the experimental data set. Predicted novel functions of genes were validated using extensive protein domain search approach.
    Investigating the role of model-based learning in exploration and transfer. (arXiv:2302.04009v1 [cs.LG])
    State of the art reinforcement learning has enabled training agents on tasks of ever increasing complexity. However, the current paradigm tends to favor training agents from scratch on every new task or on collections of tasks with a view towards generalizing to novel task configurations. The former suffers from poor data efficiency while the latter is difficult when test tasks are out-of-distribution. Agents that can effectively transfer their knowledge about the world pose a potential solution to these issues. In this paper, we investigate transfer learning in the context of model-based agents. Specifically, we aim to understand when exactly environment models have an advantage and why. We find that a model-based approach outperforms controlled model-free baselines for transfer learning. Through ablations, we show that both the policy and dynamics model learnt through exploration matter for successful transfer. We demonstrate our results across three domains which vary in their requirements for transfer: in-distribution procedural (Crafter), in-distribution identical (RoboDesk), and out-of-distribution (Meta-World). Our results show that intrinsic exploration combined with environment models present a viable direction towards agents that are self-supervised and able to generalize to novel reward functions.
    Analyzing the Performance of Deep Encoder-Decoder Networks as Surrogates for a Diffusion Equation. (arXiv:2302.03786v1 [cs.LG])
    Neural networks (NNs) have proven to be a viable alternative to traditional direct numerical algorithms, with the potential to accelerate computational time by several orders of magnitude. In the present paper we study the use of encoder-decoder convolutional neural network (CNN) as surrogates for steady-state diffusion solvers. The construction of such surrogates requires the selection of an appropriate task, network architecture, training set structure and size, loss function, and training algorithm hyperparameters. It is well known that each of these factors can have a significant impact on the performance of the resultant model. Our approach employs an encoder-decoder CNN architecture, which we posit is particularly well-suited for this task due to its ability to effectively transform data, as opposed to merely compressing it. We systematically evaluate a range of loss functions, hyperparameters, and training set sizes. Our results indicate that increasing the size of the training set has a substantial effect on reducing performance fluctuations and overall error. Additionally, we observe that the performance of the model exhibits a logarithmic dependence on the training set size. Furthermore, we investigate the effect on model performance by using different subsets of data with varying features. Our results highlight the importance of sampling the configurational space in an optimal manner, as this can have a significant impact on the performance of the model and the required training time. In conclusion, our results suggest that training a model with a pre-determined error performance bound is not a viable approach, as it does not guarantee that edge cases with errors larger than the bound do not exist. Furthermore, as most surrogate tasks involve a high dimensional landscape, an ever increasing training set size is, in principle, needed, however it is not a practical solution.
    A Weighted Normalized Boundary Loss for Reducing the Hausdorff Distance in Medical Imaging Segmentation. (arXiv:2302.03868v1 [eess.IV])
    Within medical imaging segmentation, the Dice coefficient and Hausdorff-based metrics are standard measures of success for deep learning models. However, modern loss functions for medical image segmentation often only consider the Dice coefficient or similar region-based metrics during training. As a result, segmentation architectures trained over such loss functions run the risk of achieving high accuracy for the Dice coefficient but low accuracy for Hausdorff-based metrics. Low accuracy on Hausdorff-based metrics can be problematic for applications such as tumor segmentation, where such benchmarks are crucial. For example, high Dice scores accompanied by significant Hausdorff errors could indicate that the predictions fail to detect small tumors. We propose the Weighted Normalized Boundary Loss, a novel loss function to minimize Hausdorff-based metrics with more desirable numerical properties than current methods and with weighting terms for class imbalance. Our loss function outperforms other losses when tested on the BraTS dataset using a standard 3D U-Net and the state-of-the-art nnUNet architectures. These results suggest we can improve segmentation accuracy with our novel loss function.
    SLaM: Student-Label Mixing for Semi-Supervised Knowledge Distillation. (arXiv:2302.03806v1 [cs.LG])
    Semi-supervised knowledge distillation is a powerful training paradigm for generating compact and lightweight student models in settings where the amount of labeled data is limited but one has access to a large pool of unlabeled data. The idea is that a large teacher model is utilized to generate ``smoothed'' pseudo-labels for the unlabeled dataset which are then used for training the student model. Despite its success in a wide variety of applications, a shortcoming of this approach is that the teacher's pseudo-labels are often noisy, leading to impaired student performance. In this paper, we present a principled method for semi-supervised knowledge distillation that we call Student-Label Mixing (SLaM) and we show that it consistently improves over prior approaches by evaluating it on several standard benchmarks. Finally, we show that SLaM comes with theoretical guarantees; along the way we give an algorithm improving the best-known sample complexity for learning halfspaces with margin under random classification noise, and provide the first convergence analysis for so-called ``forward loss-adjustment" methods.
    Does the End Justify the Means? On the Moral Justification of Fairness-Aware Machine Learning. (arXiv:2202.08536v2 [cs.LG] UPDATED)
    Fairness-aware machine learning (fair-ml) techniques are algorithmic interventions designed to ensure that individuals who are affected by the predictions of a machine learning model are treated fairly, typically measured in terms of a quantitative fairness metric. Despite the multitude of fairness metrics and fair-ml algorithms, there is still little guidance on the suitability of different approaches in practice. In this paper, we present a framework for moral reasoning about the justification of fairness metrics and explore the moral implications of the use of fair-ml algorithms that optimize for them. In particular, we argue that whether a distribution of outcomes is fair, depends not only on the cause of inequalities but also on what moral claims decision subjects have to receive a particular benefit or avoid a burden. We use our framework to analyze the suitability of two fairness metrics under different circumstances. Subsequently, we explore moral arguments that support or reject the use of the fair-ml algorithm introduced by Hardt et al. (2016). We argue that under very specific circumstances, particular metrics correspond to a fair distribution of burdens and benefits. However, we also illustrate that enforcing a fairness metric by means of a fair-ml algorithm may not result in the fair distribution of outcomes and can have several undesirable side effects. We end with a call for a more holistic evaluation of fair-ml algorithms, beyond their direct optimization objectives.
    Finding Short Signals in Long Irregular Time Series with Continuous-Time Attention Policy Networks. (arXiv:2302.04052v1 [cs.LG])
    Irregularly-sampled time series (ITS) are native to high-impact domains like healthcare, where measurements are collected over time at uneven intervals. However, for many classification problems, only small portions of long time series are often relevant to the class label. In this case, existing ITS models often fail to classify long series since they rely on careful imputation, which easily over- or under-samples the relevant regions. Using this insight, we then propose CAT, a model that classifies multivariate ITS by explicitly seeking highly-relevant portions of an input series' timeline. CAT achieves this by integrating three components: (1) A Moment Network learns to seek relevant moments in an ITS's continuous timeline using reinforcement learning. (2) A Receptor Network models the temporal dynamics of both observations and their timing localized around predicted moments. (3) A recurrent Transition Model models the sequence of transitions between these moments, cultivating a representation with which the series is classified. Using synthetic and real data, we find that CAT outperforms ten state-of-the-art methods by finding short signals in long irregular time series.
    Non-zero-sum Game Control for Multi-vehicle Driving via Reinforcement Learning. (arXiv:2302.03958v1 [cs.AI])
    When a vehicle drives on the road, its behaviors will be affected by surrounding vehicles. Prediction and decision should not be considered as two separate stages because all vehicles make decisions interactively. This paper constructs the multi-vehicle driving scenario as a non-zero-sum game and proposes a novel game control framework, which consider prediction, decision and control as a whole. The mutual influence of interactions between vehicles is considered in this framework because decisions are made by Nash equilibrium strategy. To efficiently obtain the strategy, ADP, a model-based reinforcement learning method, is used to solve coupled Hamilton-Jacobi-Bellman equations. Driving performance is evaluated by tracking, efficiency, safety and comfort indices. Experiments show that our algorithm could drive perfectly by directly controlling acceleration and steering angle. Vehicles could learn interactive behaviors such as overtaking and pass. In summary, we propose a non-zero-sum game framework for modeling multi-vehicle driving, provide an effective way to solve the Nash equilibrium driving strategy, and validate at non-signalized intersections.
  • Open

    Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models. (arXiv:2302.03900v1 [cs.CV])
    Recent advancements in large scale text-to-image models have opened new possibilities for guiding the creation of images through human-devised natural language. However, while prior literature has primarily focused on the generation of individual images, it is essential to consider the capability of these models to ensure coherency within a sequence of images to fulfill the demands of real-world applications such as storytelling. To address this, here we present a novel neural pipeline for generating a coherent storybook from the plain text of a story. Specifically, we leverage a combination of a pre-trained Large Language Model and a text-guided Latent Diffusion Model to generate coherent images. While previous story synthesis frameworks typically require a large-scale text-to-image model trained on expensive image-caption pairs to maintain the coherency, we employ simple textual inversion techniques along with detector-based semantic image editing which allows zero-shot generation of the coherent storybook. Experimental results show that our proposed method outperforms state-of-the-art image editing baselines.  ( 2 min )
    Bandwidth Selection for Gaussian Kernel Ridge Regression via Jacobian Control. (arXiv:2205.11956v2 [stat.ML] UPDATED)
    Most machine learning methods require tuning of hyper-parameters. For kernel ridge regression (KRR) with the Gaussian kernel, the hyper-parameter is the bandwidth. The bandwidth specifies the length-scale of the kernel and has to be carefully selected in order to obtain a model with good generalization. The default method for bandwidth selection is cross-validation which often yields good results, albeit at high computational costs. Furthermore, the estimates provided by cross-validation tend to have very high variance, especially when training data are scarce. Inspired by Jacobian regularization, we formulate an approximate expression for how the derivatives of the functions inferred by KRR with the Gaussian kernel depend on the kernel bandwidth. We then use this expression to propose a closed-form, computationally feather-light, bandwidth selection heuristic based on controlling the Jacobian. In addition, the Jacobian expression illuminates how the bandwidth selection is a trade-off between the smoothness of the inferred function, and the conditioning of the training data kernel matrix. We show on real and synthetic data that compared to cross-validation, our method is considerably more stable in terms of bandwidth selection, and, for small data sets, provides better predictions.
    Reactmine: a statistical search algorithm for inferring chemical reactions from time series data. (arXiv:2209.03185v2 [q-bio.QM] UPDATED)
    Inferring chemical reaction networks (CRN) from concentration time series is a challenge encouragedby the growing availability of quantitative temporal data at the cellular level. This motivates thedesign of algorithms to infer the preponderant reactions between the molecular species observed ina given biochemical process, and build CRN structure and kinetics models. Existing ODE-basedinference methods such as SINDy resort to least square regression combined with sparsity-enforcingpenalization, such as Lasso. However, we observe that these methods fail to learn sparse modelswhen the input time series are only available in wild type conditions, i.e. without the possibility toplay with combinations of zeroes in the initial conditions. We present a CRN inference algorithmwhich enforces sparsity by inferring reactions in a sequential fashion within a search tree of boundeddepth, ranking the inferred reaction candidates according to the variance of their kinetics on theirsupporting transitions, and re-optimizing the kinetic parameters of the CRN candidates on the wholetrace in a final pass. We show that Reactmine succeeds both on simulation data by retrievinghidden CRNs where SINDy fails, and on two real datasets, one of fluorescence videomicroscopyof cell cycle and circadian clock markers, the other one of biomedical measurements of systemiccircadian biomarkers possibly acting on clock gene expression in peripheral organs, by inferringpreponderant regulations in agreement with previous model-based analyses. The code is available athttps://gitlab.inria.fr/julmarti/crninf/ together with introductory notebooks.
    Transformers Can Do Bayesian Inference. (arXiv:2112.10510v6 [cs.LG] UPDATED)
    Currently, it is hard to reap the benefits of deep learning for Bayesian methods, which allow the explicit specification of prior knowledge and accurately capture model uncertainty. We present Prior-Data Fitted Networks (PFNs). PFNs leverage large-scale machine learning techniques to approximate a large set of posteriors. The only requirement for PFNs to work is the ability to sample from a prior distribution over supervised learning tasks (or functions). Our method restates the objective of posterior approximation as a supervised classification problem with a set-valued input: it repeatedly draws a task (or function) from the prior, draws a set of data points and their labels from it, masks one of the labels and learns to make probabilistic predictions for it based on the set-valued input of the rest of the data points. Presented with a set of samples from a new supervised learning task as input, PFNs make probabilistic predictions for arbitrary other data points in a single forward propagation, having learned to approximate Bayesian inference. We demonstrate that PFNs can near-perfectly mimic Gaussian processes and also enable efficient Bayesian inference for intractable problems, with over 200-fold speedups in multiple setups compared to current methods. We obtain strong results in very diverse areas such as Gaussian process regression, Bayesian neural networks, classification for small tabular data sets, and few-shot image classification, demonstrating the generality of PFNs. Code and trained PFNs are released at https://github.com/automl/TransformersCanDoBayesianInference.
    Relative Probability on Finite Outcome Spaces: A Systematic Examination of its Axiomatization, Properties, and Applications. (arXiv:2212.14555v2 [stat.ML] UPDATED)
    This work proposes a view of probability as a relative measure rather than an absolute one. To demonstrate this concept, we focus on finite outcome spaces and develop three fundamental axioms that establish requirements for relative probability functions. We then provide a library of examples of these functions and a system for composing them. Additionally, we discuss a relative version of Bayesian inference and its digital implementation. Finally, we prove the topological closure of the relative probability space, highlighting its ability to preserve information under limits.
    Revisiting the Linear-Programming Framework for Offline RL with General Function Approximation. (arXiv:2212.13861v2 [cs.LG] UPDATED)
    Offline reinforcement learning (RL) aims to find an optimal policy for sequential decision-making using a pre-collected dataset, without further interaction with the environment. Recent theoretical progress has focused on developing sample-efficient offline RL algorithms with various relaxed assumptions on data coverage and function approximators, especially to handle the case with excessively large state-action spaces. Among them, the framework based on the linear-programming (LP) reformulation of Markov decision processes has shown promise: it enables sample-efficient offline RL with function approximation, under only partial data coverage and realizability assumptions on the function classes, with favorable computational tractability. In this work, we revisit the LP framework for offline RL, and provide a new reformulation that advances the existing results in several aspects, relaxing certain assumptions and achieving optimal statistical rates in terms of sample size. Our key enabler is to introduce proper constraints in the reformulation, instead of using any regularization as in the literature, also with careful choices of the function classes and initial state distributions. We hope our insights bring into light the use of LP formulations and the induced primal-dual minimax optimization, in offline RL.
    The Modern Mathematics of Deep Learning. (arXiv:2105.04026v2 [cs.LG] UPDATED)
    We describe the new field of mathematical analysis of deep learning. This field emerged around a list of research questions that were not answered within the classical framework of learning theory. These questions concern: the outstanding generalization power of overparametrized neural networks, the role of depth in deep architectures, the apparent absence of the curse of dimensionality, the surprisingly successful optimization performance despite the non-convexity of the problem, understanding what features are learned, why deep architectures perform exceptionally well in physical problems, and which fine aspects of an architecture affect the behavior of a learning task in which way. We present an overview of modern approaches that yield partial answers to these questions. For selected approaches, we describe the main ideas in more detail.
    Rank-1 Matrix Completion with Gradient Descent and Small Random Initialization. (arXiv:2212.09396v2 [stat.ML] UPDATED)
    The nonconvex formulation of matrix completion problem has received significant attention in recent years due to its affordable complexity compared to the convex formulation. Gradient descent (GD) is the simplest yet efficient baseline algorithm for solving nonconvex optimization problems. The success of GD has been witnessed in many different problems in both theory and practice when it is combined with random initialization. However, previous works on matrix completion require either careful initialization or regularizers to prove the convergence of GD. In this work, we study the rank-1 symmetric matrix completion and prove that GD converges to the ground truth when small random initialization is used. We show that in logarithmic amount of iterations, the trajectory enters the region where local convergence occurs. We provide an upper bound on the initialization size that is sufficient to guarantee the convergence and show that a larger initialization can be used as more samples are available. We observe that implicit regularization effect of GD plays a critical role in the analysis, and for the entire trajectory, it prevents each entry from becoming much larger than the others.
    Boundary Graph Neural Networks for 3D Simulations. (arXiv:2106.11299v5 [cs.LG] UPDATED)
    The abundance of data has given machine learning considerable momentum in natural sciences and engineering, though modeling of physical processes is often difficult. A particularly tough problem is the efficient representation of geometric boundaries. Triangularized geometric boundaries are well understood and ubiquitous in engineering applications. However, it is notoriously difficult to integrate them into machine learning approaches due to their heterogeneity with respect to size and orientation. In this work, we introduce an effective theory to model particle-boundary interactions, which leads to our new Boundary Graph Neural Networks (BGNNs) that dynamically modify graph structures to obey boundary conditions. The new BGNNs are tested on complex 3D granular flow processes of hoppers, rotating drums and mixers, which are all standard components of modern industrial machinery but still have complicated geometry. BGNNs are evaluated in terms of computational efficiency as well as prediction accuracy of particle flows and mixing entropies. BGNNs are able to accurately reproduce 3D granular flows within simulation uncertainties over hundreds of thousands of simulation timesteps. Most notably, in our experiments, particles stay within the geometric objects without using handcrafted conditions or restrictions.
    Riemannian block SPD coupling manifold and its application to optimal transport. (arXiv:2201.12933v2 [math.FA] UPDATED)
    In this work, we study the optimal transport (OT) problem between symmetric positive definite (SPD) matrix-valued measures. We formulate the above as a generalized optimal transport problem where the cost, the marginals, and the coupling are represented as block matrices and each component block is a SPD matrix. The summation of row blocks and column blocks in the coupling matrix are constrained by the given block-SPD marginals. We endow the set of such block-coupling matrices with a novel Riemannian manifold structure. This allows to exploit the versatile Riemannian optimization framework to solve generic SPD matrix-valued OT problems. We illustrate the usefulness of the proposed approach in several applications.
    Identify ambiguous tasks combining crowdsourced labels by weighting Areas Under the Margin. (arXiv:2209.15380v2 [cs.LG] UPDATED)
    In supervised learning - for instance in image classification - modern massive datasets are commonly labeled by a crowd of workers. The obtained labels in this crowdsourcing setting are then aggregated for training. The aggregation step generally leverages a per-worker trust score. Yet, such worker-centric approaches discard each task's ambiguity. Some intrinsically ambiguous tasks might even fool expert workers, which could eventually be harmful to the learning step. In a standard supervised learning setting - with one label per task - the Area Under the Margin (AUM) is tailored to identify mislabeled data. We adapt the AUM to identify ambiguous tasks in crowdsourced learning scenarios, introducing the Weighted AUM (WAUM). The WAUM is an average of AUMs weighted by task-dependent scores. We show that the WAUM can help discard ambiguous tasks from the training set, leading to better generalization or calibration performance. We report improvements over existing strategies for learning a crowd, both for simulated settings and for the CIFAR-10H, LabelMe and Music crowdsourced datasets.
    Fast Linear Model Trees by PILOT. (arXiv:2302.03931v1 [stat.ML])
    Linear model trees are regression trees that incorporate linear models in the leaf nodes. This preserves the intuitive interpretation of decision trees and at the same time enables them to better capture linear relationships, which is hard for standard decision trees. But most existing methods for fitting linear model trees are time consuming and therefore not scalable to large data sets. In addition, they are more prone to overfitting and extrapolation issues than standard regression trees. In this paper we introduce PILOT, a new algorithm for linear model trees that is fast, regularized, stable and interpretable. PILOT trains in a greedy fashion like classic regression trees, but incorporates an $L^2$ boosting approach and a model selection rule for fitting linear models in the nodes. The abbreviation PILOT stands for $PI$ecewise $L$inear $O$rganic $T$ree, where `organic' refers to the fact that no pruning is carried out. PILOT has the same low time and space complexity as CART without its pruning. An empirical study indicates that PILOT tends to outperform standard decision trees and other linear model trees on a variety of data sets. Moreover, we prove its consistency in an additive model setting under weak assumptions. When the data is generated by a linear model, the convergence rate is polynomial.
    Federated Minimax Optimization with Client Heterogeneity. (arXiv:2302.04249v1 [cs.LG])
    Minimax optimization has seen a surge in interest with the advent of modern applications such as GANs, and it is inherently more challenging than simple minimization. The difficulty is exacerbated by the training data residing at multiple edge devices or \textit{clients}, especially when these clients can have heterogeneous datasets and local computation capabilities. We propose a general federated minimax optimization framework that subsumes such settings and several existing methods like Local SGDA. We show that naive aggregation of heterogeneous local progress results in optimizing a mismatched objective function -- a phenomenon previously observed in standard federated minimization. To fix this problem, we propose normalizing the client updates by the number of local steps undertaken between successive communication rounds. We analyze the convergence of the proposed algorithm for classes of nonconvex-concave and nonconvex-nonconcave functions and characterize the impact of heterogeneous client data, partial client participation, and heterogeneous local computations. Our analysis works under more general assumptions on the intra-client noise and inter-client heterogeneity than so far considered in the literature. For all the function classes considered, we significantly improve the existing computation and communication complexity results. Experimental results support our theoretical claims.
    Inverse Models for Estimating the Initial Condition of Spatio-Temporal Advection-Diffusion Processes. (arXiv:2302.04134v1 [stat.ME])
    Inverse problems involve making inference about unknown parameters of a physical process using observational data. This paper investigates an important class of inverse problems -- the estimation of the initial condition of a spatio-temporal advection-diffusion process using spatially sparse data streams. Three spatial sampling schemes are considered, including irregular, non-uniform and shifted uniform sampling. The irregular sampling scheme is the general scenario, while computationally efficient solutions are available in the spectral domain for non-uniform and shifted uniform sampling. For each sampling scheme, the inverse problem is formulated as a regularized convex optimization problem that minimizes the distance between forward model outputs and observations. The optimization problem is solved by the Alternating Direction Method of Multipliers algorithm, which also handles the situation when a linear inequality constraint (e.g., non-negativity) is imposed on the model output. Numerical examples are presented, code is made available on GitHub, and discussions are provided to generate some useful insights of the proposed inverse modeling approaches.
    TVAE: Triplet-Based Variational Autoencoder using Metric Learning. (arXiv:1802.04403v3 [stat.ML] UPDATED)
    Deep metric learning has been demonstrated to be highly effective in learning semantic representation and encoding information that can be used to measure data similarity, by relying on the embedding learned from metric learning. At the same time, variational autoencoder (VAE) has widely been used to approximate inference and proved to have a good performance for directed probabilistic models. However, for traditional VAE, the data label or feature information are intractable. Similarly, traditional representation learning approaches fail to represent many salient aspects of the data. In this project, we propose a novel integrated framework to learn latent embedding in VAE by incorporating deep metric learning. The features are learned by optimizing a triplet loss on the mean vectors of VAE in conjunction with standard evidence lower bound (ELBO) of VAE. This approach, which we call Triplet based Variational Autoencoder (TVAE), allows us to capture more fine-grained information in the latent embedding. Our model is tested on MNIST data set and achieves a high triplet accuracy of 95.60% while the traditional VAE (Kingma & Welling, 2013) achieves triplet accuracy of 75.08%.
    Connections and Equivalences between the Nystr\"om Method and Sparse Variational Gaussian Processes. (arXiv:2106.01121v2 [stat.ML] UPDATED)
    We investigate the connections between sparse approximation methods for making kernel methods and Gaussian processes (GPs) scalable to large-scale data, focusing on the Nystr\"om method and the Sparse Variational Gaussian Processes (SVGP). While sparse approximation methods for GPs and kernel methods share some algebraic similarities, the literature lacks a deep understanding of how and why they are related. This may pose an obstacle to the communications between the GP and kernel communities, making it difficult to transfer results from one side to the other. Our motivation is to remove this obstacle, by clarifying the connections between the sparse approximations for GPs and kernel methods. In this work, we study the two popular approaches, the Nystr\"om and SVGP approximations, in the context of a regression problem, and establish various connections and equivalences between them. In particular, we provide an RKHS interpretation of the SVGP approximation, and show that the Evidence Lower Bound of the SVGP contains the objective function of the Nystr\"om approximation, revealing the origin of the algebraic equivalence between the two approaches. We also study recently established convergence results for the SVGP and how they are related to the approximation quality of the Nystr\"om method.
    Extragradient-Type Methods with $\mathcal{O}(1/k)$ Convergence Rates for Co-Hypomonotone Inclusions. (arXiv:2302.04099v1 [math.OC])
    In this paper, we develop two ``Nesterov's accelerated'' variants of the well-known extragradient method to approximate a solution of a co-hypomonotone inclusion constituted by the sum of two operators, where one is Lipschitz continuous and the other is possibly multivalued. The first scheme can be viewed as an accelerated variant of Tseng's forward-backward-forward splitting method, while the second one is a variant of the reflected forward-backward splitting method, which requires only one evaluation of the Lipschitz operator, and one resolvent of the multivalued operator. Under a proper choice of the algorithmic parameters and appropriate conditions on the co-hypomonotone parameter, we theoretically prove that both algorithms achieve $\mathcal{O}(1/k)$ convergence rates on the norm of the residual, where $k$ is the iteration counter. Our results can be viewed as alternatives of a recent class of Halpern-type schemes for root-finding problems.
    Compositional Score Modeling for Simulation-based Inference. (arXiv:2209.14249v2 [cs.LG] UPDATED)
    Neural Posterior Estimation methods for simulation-based inference can be ill-suited for dealing with posterior distributions obtained by conditioning on multiple observations, as they tend to require a large number of simulator calls to learn accurate approximations. In contrast, Neural Likelihood Estimation methods can handle multiple observations at inference time after learning from individual observations, but they rely on standard inference methods, such as MCMC or variational inference, which come with certain performance drawbacks. We introduce a new method based on conditional score modeling that enjoys the benefits of both approaches. We model the scores of the (diffused) posterior distributions induced by individual observations, and introduce a way of combining the learned scores to approximately sample from the target posterior distribution. Our approach is sample-efficient, can naturally aggregate multiple observations at inference time, and avoids the drawbacks of standard inference methods.
    Making Progress Based on False Discoveries. (arXiv:2204.08809v2 [cs.LG] UPDATED)
    The study of adaptive data analysis examines how many statistical queries can be answered accurately using a fixed dataset while avoiding false discoveries (statistically inaccurate answers). In this paper, we tackle a question that precedes the field of study: Is data only valuable when it provides accurate answers to statistical queries? To answer this question, we use Stochastic Convex Optimization as a case study. In this model, algorithms are considered as analysts who query an estimate of the gradient of a noisy function at each iteration and move towards its minimizer. It is known that $O(1/\epsilon^2)$ examples can be used to minimize the objective function, but none of the existing methods depend on the accuracy of the estimated gradients along the trajectory. Therefore, we ask: How many samples are needed to minimize a noisy convex function if we require $\epsilon$-accurate estimates of $O(1/\epsilon^2)$ gradients? Or, might it be that inaccurate gradient estimates are \emph{necessary} for finding the minimum of a stochastic convex function at an optimal statistical rate? We provide two partial answers to this question. First, we show that a general analyst (queries that may be maliciously chosen) requires $\Omega(1/\epsilon^3)$ samples, ruling out the possibility of a foolproof mechanism. Second, we show that, under certain assumptions on the oracle, $\tilde \Omega(1/\epsilon^{2.5})$ samples are necessary for gradient descent to interact with the oracle. Our results are in contrast to classical bounds that show that $O(1/\epsilon^2)$ samples can optimize the population risk to an accuracy of $O(\epsilon)$, but with spurious gradients.
    Flow Matching for Generative Modeling. (arXiv:2210.02747v2 [cs.LG] UPDATED)
    We introduce a new paradigm for generative modeling built on Continuous Normalizing Flows (CNFs), allowing us to train CNFs at unprecedented scale. Specifically, we present the notion of Flow Matching (FM), a simulation-free approach for training CNFs based on regressing vector fields of fixed conditional probability paths. Flow Matching is compatible with a general family of Gaussian probability paths for transforming between noise and data samples -- which subsumes existing diffusion paths as specific instances. Interestingly, we find that employing FM with diffusion paths results in a more robust and stable alternative for training diffusion models. Furthermore, Flow Matching opens the door to training CNFs with other, non-diffusion probability paths. An instance of particular interest is using Optimal Transport (OT) displacement interpolation to define the conditional probability paths. These paths are more efficient than diffusion paths, provide faster training and sampling, and result in better generalization. Training CNFs using Flow Matching on ImageNet leads to consistently better performance than alternative diffusion-based methods in terms of both likelihood and sample quality, and allows fast and reliable sample generation using off-the-shelf numerical ODE solvers.
    Evaluating probabilistic forecasts of extremes using continuous ranked probability score distributions. (arXiv:1905.04022v4 [stat.ME] UPDATED)
    Verifying probabilistic forecasts for extreme events is a highly active research area because popular media and public opinions are naturally focused on extreme events, and biased conclusions are readily made. In this context, classical verification methods tailored for extreme events, such as thresholded and weighted scoring rules, have undesirable properties that cannot be mitigated, and the well-known continuous ranked probability score (CRPS) is no exception. In this paper, we define a formal framework for assessing the behavior of forecast evaluation procedures with respect to extreme events, which we use to demonstrate that assessment based on the expectation of a proper score is not suitable for extremes. Alternatively, we propose studying the properties of the CRPS as a random variable by using extreme value theory to address extreme event verification. An index is introduced to compare calibrated forecasts, which summarizes the ability of probabilistic forecasts for predicting extremes. The strengths and limitations of this method are discussed using both theoretical arguments and simulations.
    Fortuna: A Library for Uncertainty Quantification in Deep Learning. (arXiv:2302.04019v1 [cs.LG])
    We present Fortuna, an open-source library for uncertainty quantification in deep learning. Fortuna supports a range of calibration techniques, such as conformal prediction that can be applied to any trained neural network to generate reliable uncertainty estimates, and scalable Bayesian inference methods that can be applied to Flax-based deep neural networks trained from scratch for improved uncertainty quantification and accuracy. By providing a coherent framework for advanced uncertainty quantification methods, Fortuna simplifies the process of benchmarking and helps practitioners build robust AI systems.
    Exploratory Analysis of Federated Learning Methods with Differential Privacy on MIMIC-III. (arXiv:2302.04208v1 [cs.LG])
    Background: Federated learning methods offer the possibility of training machine learning models on privacy-sensitive data sets, which cannot be easily shared. Multiple regulations pose strict requirements on the storage and usage of healthcare data, leading to data being in silos (i.e. locked-in at healthcare facilities). The application of federated algorithms on these datasets could accelerate disease diagnostic, drug development, as well as improve patient care. Methods: We present an extensive evaluation of the impact of different federation and differential privacy techniques when training models on the open-source MIMIC-III dataset. We analyze a set of parameters influencing a federated model performance, namely data distribution (homogeneous and heterogeneous), communication strategies (communication rounds vs. local training epochs), federation strategies (FedAvg vs. FedProx). Furthermore, we assess and compare two differential privacy (DP) techniques during model training: a stochastic gradient descent-based differential privacy algorithm (DP-SGD), and a sparse vector differential privacy technique (DP-SVT). Results: Our experiments show that extreme data distributions across sites (imbalance either in the number of patients or the positive label ratios between sites) lead to a deterioration of model performance when trained using the FedAvg strategy. This issue is resolved when using FedProx with the use of appropriate hyperparameter tuning. Furthermore, the results show that both differential privacy techniques can reach model performances similar to those of models trained without DP, however at the expense of a large quantifiable privacy leakage. Conclusions: We evaluate empirically the benefits of two federation strategies and propose optimal strategies for the choice of parameters when using differential privacy techniques.
    Operator Shifting for Model-based Policy Evaluation. (arXiv:2110.12658v3 [cs.LG] UPDATED)
    In model-based reinforcement learning, the transition matrix and reward vector are often estimated from random samples subject to noise. Even if the estimated model is an unbiased estimate of the true underlying model, the value function computed from the estimated model is biased. We introduce an operator shifting method for reducing the error introduced by the estimated model. When the error is in the residual norm, we prove that the shifting factor is always positive and upper bounded by $1+O\left(1/n\right)$, where $n$ is the number of samples used in learning each row of the transition matrix. We also propose a practical numerical algorithm for implementing the operator shifting.  ( 2 min )
    Sketchy: Memory-efficient Adaptive Regularization with Frequent Directions. (arXiv:2302.03764v1 [stat.ML])
    Adaptive regularization methods that exploit more than the diagonal entries exhibit state of the art performance for many tasks, but can be prohibitive in terms of memory and running time. We find the spectra of the Kronecker-factored gradient covariance matrix in deep learning (DL) training tasks are concentrated on a small leading eigenspace that changes throughout training, motivating a low-rank sketching approach. We describe a generic method for reducing memory and compute requirements of maintaining a matrix preconditioner using the Frequent Directions (FD) sketch. Our technique allows interpolation between resource requirements and the degradation in regret guarantees with rank $k$: in the online convex optimization (OCO) setting over dimension $d$, we match full-matrix $d^2$ memory regret using only $dk$ memory up to additive error in the bottom $d-k$ eigenvalues of the gradient covariance. Further, we show extensions of our work to Shampoo, placing the method on the memory-quality Pareto frontier of several large scale benchmarks.  ( 2 min )
    PASTA: Pessimistic Assortment Optimization. (arXiv:2302.03821v1 [cs.LG])
    We consider a class of assortment optimization problems in an offline data-driven setting. A firm does not know the underlying customer choice model but has access to an offline dataset consisting of the historically offered assortment set, customer choice, and revenue. The objective is to use the offline dataset to find an optimal assortment. Due to the combinatorial nature of assortment optimization, the problem of insufficient data coverage is likely to occur in the offline dataset. Therefore, designing a provably efficient offline learning algorithm becomes a significant challenge. To this end, we propose an algorithm referred to as Pessimistic ASsortment opTimizAtion (PASTA for short) designed based on the principle of pessimism, that can correctly identify the optimal assortment by only requiring the offline data to cover the optimal assortment under general settings. In particular, we establish a regret bound for the offline assortment optimization problem under the celebrated multinomial logit model. We also propose an efficient computational procedure to solve our pessimistic assortment optimization problem. Numerical studies demonstrate the superiority of the proposed method over the existing baseline method.  ( 2 min )
    Optimal Stochastic Non-smooth Non-convex Optimization through Online-to-Non-convex Conversion. (arXiv:2302.03775v1 [cs.LG])
    We present new algorithms for optimizing non-smooth, non-convex stochastic objectives based on a novel analysis technique. This improves the current best-known complexity for finding a $(\delta,\epsilon)$-stationary point from $O(\epsilon^{-4}\delta^{-1})$ stochastic gradient queries to $O(\epsilon^{-3}\delta^{-1})$, which we also show to be optimal. Our primary technique is a reduction from non-smooth non-convex optimization to online learning, after which our results follow from standard regret bounds in online learning. For deterministic and second-order smooth objectives, applying more advanced optimistic online learning techniques enables a new complexity of $O(\epsilon^{-1.5}\delta^{-0.5})$. Our techniques also recover all optimal or best-known results for finding $\epsilon$ stationary points of smooth or second-order smooth objectives in both stochastic and deterministic settings.  ( 2 min )
    Algorithmic Collective Action in Machine Learning. (arXiv:2302.04262v1 [cs.LG])
    We initiate a principled study of algorithmic collective action on digital platforms that deploy machine learning algorithms. We propose a simple theoretical model of a collective interacting with a firm's learning algorithm. The collective pools the data of participating individuals and executes an algorithmic strategy by instructing participants how to modify their own data to achieve a collective goal. We investigate the consequences of this model in three fundamental learning-theoretic settings: the case of a nonparametric optimal learning algorithm, a parametric risk minimizer, and gradient-based optimization. In each setting, we come up with coordinated algorithmic strategies and characterize natural success criteria as a function of the collective's size. Complementing our theory, we conduct systematic experiments on a skill classification task involving tens of thousands of resumes from a gig platform for freelancers. Through more than two thousand model training runs of a BERT-like language model, we see a striking correspondence emerge between our empirical observations and the predictions made by our theory. Taken together, our theory and experiments broadly support the conclusion that algorithmic collectives of exceedingly small fractional size can exert significant control over a platform's learning algorithm.  ( 2 min )
    How to Trust Your Diffusion Model: A Convex Optimization Approach to Conformal Risk Control. (arXiv:2302.03791v1 [stat.ML])
    Score-based generative modeling, informally referred to as diffusion models, continue to grow in popularity across several important domains and tasks. While they provide high-quality and diverse samples from empirical distributions, important questions remain on the reliability and trustworthiness of these sampling procedures for their responsible use in critical scenarios. Conformal prediction is a modern tool to construct finite-sample, distribution-free uncertainty guarantees for any black-box predictor. In this work, we focus on image-to-image regression tasks and we present a generalization of the Risk-Controlling Prediction Sets (RCPS) procedure, that we term $K$-RCPS, which allows to $(i)$ provide entrywise calibrated intervals for future samples of any diffusion model, and $(ii)$ control a certain notion of risk with respect to a ground truth image with minimal mean interval length. Differently from existing conformal risk control procedures, ours relies on a novel convex optimization approach that allows for multidimensional risk control while provably minimizing the mean interval length. We illustrate our approach on two real-world image denoising problems: on natural images of faces as well as on computed tomography (CT) scans of the abdomen, demonstrating state of the art performance.  ( 2 min )
    Towards Inferential Reproducibility of Machine Learning Research. (arXiv:2302.04054v1 [cs.LG])
    Reliability of machine learning evaluation -- the consistency of observed evaluation scores across replicated model training runs -- is affected by several sources of nondeterminism which can be regarded as measurement noise. Current tendencies to remove noise in order to enforce reproducibility of research results neglect inherent nondeterminism at the implementation level and disregard crucial interaction effects between algorithmic noise factors and data properties. This limits the scope of conclusions that can be drawn from such experiments. Instead of removing noise, we propose to incorporate several sources of variance, including their interaction with data properties, into an analysis of significance and reliability of machine learning evaluation, with the aim to draw inferences beyond particular instances of trained models. We show how to use linear mixed effects models (LMEMs) to analyze performance evaluation scores, and to conduct statistical inference with a generalized likelihood ratio test (GLRT). This allows us to incorporate arbitrary sources of noise like meta-parameter variations into statistical significance testing, and to assess performance differences conditional on data properties. Furthermore, a variance component analysis (VCA) enables the analysis of the contribution of noise sources to overall variance and the computation of a reliability coefficient by the ratio of substantial to total variance.  ( 2 min )
    Sample-efficient Multi-objective Molecular Optimization with GFlowNets. (arXiv:2302.04040v1 [cs.LG])
    Many crucial scientific problems involve designing novel molecules with desired properties, which can be formulated as an expensive black-box optimization problem over the discrete chemical space. Computational methods have achieved initial success but still struggle with simultaneously optimizing multiple competing properties in a sample-efficient manner. In this work, we propose a multi-objective Bayesian optimization (MOBO) algorithm leveraging the hypernetwork-based GFlowNets (HN-GFN) as an acquisition function optimizer, with the purpose of sampling a diverse batch of candidate molecular graphs from an approximate Pareto front. Using a single preference-conditioned hypernetwork, HN-GFN learns to explore various trade-offs between objectives. Inspired by reinforcement learning, we further propose a hindsight-like off-policy strategy to share high-performing molecules among different preferences in order to speed up learning for HN-GFN. Through synthetic experiments, we illustrate that HN-GFN has adequate capacity to generalize over preferences. Extensive experiments show that our framework outperforms the best baselines by a large margin in terms of hypervolume in various real-world MOBO settings.  ( 2 min )
    On the Richness of Calibration. (arXiv:2302.04118v1 [cs.LG])
    Probabilistic predictions can be evaluated through comparisons with observed label frequencies, that is, through the lens of calibration. Recent scholarship on algorithmic fairness has started to look at a growing variety of calibration-based objectives under the name of multi-calibration but has still remained fairly restricted. In this paper, we explore and analyse forms of evaluation through calibration by making explicit the choices involved in designing calibration scores. We organise these into three grouping choices and a choice concerning the agglomeration of group errors. This provides a framework for comparing previously proposed calibration scores and helps to formulate novel ones with desirable mathematical properties. In particular, we explore the possibility of grouping datapoints based on their input features rather than on predictions and formally demonstrate advantages of such approaches. We also characterise the space of suitable agglomeration functions for group errors, generalising previously proposed calibration scores. Complementary to such population-level scores, we explore calibration scores at the individual level and analyse their relationship to choices of grouping. We draw on these insights to introduce and axiomatise fairness deviation measures for population-level scores. We demonstrate that with appropriate choices of grouping, these novel global fairness scores can provide notions of (sub-)group or individual fairness.  ( 2 min )
    Cut your Losses with Squentropy. (arXiv:2302.03952v1 [cs.LG])
    Nearly all practical neural models for classification are trained using cross-entropy loss. Yet this ubiquitous choice is supported by little theoretical or empirical evidence. Recent work (Hui & Belkin, 2020) suggests that training using the (rescaled) square loss is often superior in terms of the classification accuracy. In this paper we propose the "squentropy" loss, which is the sum of two terms: the cross-entropy loss and the average square loss over the incorrect classes. We provide an extensive set of experiments on multi-class classification problems showing that the squentropy loss outperforms both the pure cross entropy and rescaled square losses in terms of the classification accuracy. We also demonstrate that it provides significantly better model calibration than either of these alternative losses and, furthermore, has less variance with respect to the random initialization. Additionally, in contrast to the square loss, squentropy loss can typically be trained using exactly the same optimization parameters, including the learning rate, as the standard cross-entropy loss, making it a true "plug-and-play" replacement. Finally, unlike the rescaled square loss, multiclass squentropy contains no parameters that need to be adjusted.  ( 2 min )
    Scalars are universal: Equivariant machine learning, structured like classical physics. (arXiv:2106.06610v4 [cs.LG] UPDATED)
    There has been enormous progress in the last few years in designing neural networks that respect the fundamental symmetries and coordinate freedoms of physical law. Some of these frameworks make use of irreducible representations, some make use of high-order tensor objects, and some apply symmetry-enforcing constraints. Different physical laws obey different combinations of fundamental symmetries, but a large fraction (possibly all) of classical physics is equivariant to translation, rotation, reflection (parity), boost (relativity), and permutations. Here we show that it is simple to parameterize universally approximating polynomial functions that are equivariant under these symmetries, or under the Euclidean, Lorentz, and Poincar\'e groups, at any dimensionality $d$. The key observation is that nonlinear O($d$)-equivariant (and related-group-equivariant) functions can be universally expressed in terms of a lightweight collection of scalars -- scalar products and scalar contractions of the scalar, vector, and tensor inputs. We complement our theory with numerical examples that show that the scalar-based method is simple, efficient, and scalable.  ( 2 min )
    Leveraging User-Triggered Supervision in Contextual Bandits. (arXiv:2302.03784v1 [cs.LG])
    We study contextual bandit (CB) problems, where the user can sometimes respond with the best action in a given context. Such an interaction arises, for example, in text prediction or autocompletion settings, where a poor suggestion is simply ignored and the user enters the desired text instead. Crucially, this extra feedback is user-triggered on only a subset of the contexts. We develop a new framework to leverage such signals, while being robust to their biased nature. We also augment standard CB algorithms to leverage the signal, and show improved regret guarantees for the resulting algorithms under a variety of conditions on the helpfulness of and bias inherent in this feedback.  ( 2 min )
    Learning How to Infer Partial MDPs for In-Context Adaptation and Exploration. (arXiv:2302.04250v1 [cs.LG])
    To generalize across tasks, an agent should acquire knowledge from past tasks that facilitate adaptation and exploration in future tasks. We focus on the problem of in-context adaptation and exploration, where an agent only relies on context, i.e., history of states, actions and/or rewards, rather than gradient-based updates. Posterior sampling (extension of Thompson sampling) is a promising approach, but it requires Bayesian inference and dynamic programming, which often involve unknowns (e.g., a prior) and costly computations. To address these difficulties, we use a transformer to learn an inference process from training tasks and consider a hypothesis space of partial models, represented as small Markov decision processes that are cheap for dynamic programming. In our version of the Symbolic Alchemy benchmark, our method's adaptation speed and exploration-exploitation balance approach those of an exact posterior sampling oracle. We also show that even though partial models exclude relevant information from the environment, they can nevertheless lead to good policies.  ( 2 min )
    Improved Langevin Monte Carlo for stochastic optimization via landscape modification. (arXiv:2302.03973v1 [math.PR])
    Given a target function $H$ to minimize or a target Gibbs distribution $\pi_{\beta}^0 \propto e^{-\beta H}$ to sample from in the low temperature, in this paper we propose and analyze Langevin Monte Carlo (LMC) algorithms that run on an alternative landscape as specified by $H^f_{\beta,c,1}$ and target a modified Gibbs distribution $\pi^f_{\beta,c,1} \propto e^{-\beta H^f_{\beta,c,1}}$, where the landscape of $H^f_{\beta,c,1}$ is a transformed version of that of $H$ which depends on the parameters $f,\beta$ and $c$. While the original Log-Sobolev constant affiliated with $\pi^0_{\beta}$ exhibits exponential dependence on both $\beta$ and the energy barrier $M$ in the low temperature regime, with appropriate tuning of these parameters and subject to assumptions on $H$, we prove that the energy barrier of the transformed landscape is reduced which consequently leads to polynomial dependence on both $\beta$ and $M$ in the modified Log-Sobolev constant associated with $\pi^f_{\beta,c,1}$. This yield improved total variation mixing time bounds and improved convergence toward a global minimum of $H$. We stress that the technique developed in this paper is not only limited to LMC and is broadly applicable to other gradient-based optimization or sampling algorithms.  ( 2 min )
    Monge, Bregman and Occam: Interpretable Optimal Transport in High-Dimensions with Feature-Sparse Maps. (arXiv:2302.04065v1 [stat.ML])
    Optimal transport (OT) theory focuses, among all maps $T:\mathbb{R}^d\rightarrow \mathbb{R}^d$ that can morph a probability measure onto another, on those that are the ``thriftiest'', i.e. such that the averaged cost $c(x, T(x))$ between $x$ and its image $T(x)$ be as small as possible. Many computational approaches have been proposed to estimate such Monge maps when $c$ is the $\ell_2^2$ distance, e.g., using entropic maps [Pooladian'22], or neural networks [Makkuva'20, Korotin'20]. We propose a new model for transport maps, built on a family of translation invariant costs $c(x, y):=h(x-y)$, where $h:=\tfrac{1}{2}\|\cdot\|_2^2+\tau$ and $\tau$ is a regularizer. We propose a generalization of the entropic map suitable for $h$, and highlight a surprising link tying it with the Bregman centroids of the divergence $D_h$ generated by $h$, and the proximal operator of $\tau$. We show that choosing a sparsity-inducing norm for $\tau$ results in maps that apply Occam's razor to transport, in the sense that the displacement vectors $\Delta(x):= T(x)-x$ they induce are sparse, with a sparsity pattern that varies depending on $x$. We showcase the ability of our method to estimate meaningful OT maps for high-dimensional single-cell transcription data, in the $34000$-$d$ space of gene counts for cells, without using dimensionality reduction, thus retaining the ability to interpret all displacements at the gene level.  ( 2 min )
    DIFF2: Differential Private Optimization via Gradient Differences for Nonconvex Distributed Learning. (arXiv:2302.03884v1 [cs.LG])
    Differential private optimization for nonconvex smooth objective is considered. In the previous work, the best known utility bound is $\widetilde O(\sqrt{d}/(n\varepsilon_\mathrm{DP}))$ in terms of the squared full gradient norm, which is achieved by Differential Private Gradient Descent (DP-GD) as an instance, where $n$ is the sample size, $d$ is the problem dimensionality and $\varepsilon_\mathrm{DP}$ is the differential privacy parameter. To improve the best known utility bound, we propose a new differential private optimization framework called \emph{DIFF2 (DIFFerential private optimization via gradient DIFFerences)} that constructs a differential private global gradient estimator with possibly quite small variance based on communicated \emph{gradient differences} rather than gradients themselves. It is shown that DIFF2 with a gradient descent subroutine achieves the utility of $\widetilde O(d^{2/3}/(n\varepsilon_\mathrm{DP})^{4/3})$, which can be significantly better than the previous one in terms of the dependence on the sample size $n$. To the best of our knowledge, this is the first fundamental result to improve the standard utility $\widetilde O(\sqrt{d}/(n\varepsilon_\mathrm{DP}))$ for nonconvex objectives. Additionally, a more computational and communication efficient subroutine is combined with DIFF2 and its theoretical analysis is also given. Numerical experiments are conducted to validate the superiority of DIFF2 framework.  ( 2 min )
    Gibbsian polar slice sampling. (arXiv:2302.03945v1 [stat.ME])
    Polar slice sampling (Roberts & Rosenthal, 2002) is a Markov chain approach for approximate sampling of distributions that is difficult, if not impossible, to implement efficiently, but behaves provably well with respect to the dimension. By updating the directional and radial components of chain iterates separately, we obtain a family of samplers that mimic polar slice sampling, and yet can be implemented efficiently. Numerical experiments for a variety of settings indicate that our proposed algorithm outperforms the two most closely related approaches, elliptical slice sampling (Murray et al., 2010) and hit-and-run uniform slice sampling (MacKay, 2003). We prove the well-definedness and convergence of our methods under suitable assumptions on the target distribution.  ( 2 min )
    Decision trees compensate for model misspecification. (arXiv:2302.04081v1 [stat.ML])
    The best-performing models in ML are not interpretable. If we can explain why they outperform, we may be able to replicate these mechanisms and obtain both interpretability and performance. One example are decision trees and their descendent gradient boosting machines (GBMs). These perform well in the presence of complex interactions, with tree depth governing the order of interactions. However, interactions cannot fully account for the depth of trees found in practice. We confirm 5 alternative hypotheses about the role of tree depth in performance in the absence of true interactions, and present results from experiments on a battery of datasets. Part of the success of tree models is due to their robustness to various forms of mis-specification. We present two methods for robust generalized linear models (GLMs) addressing the composite and mixed response scenarios.  ( 2 min )
    IRTCI: Item Response Theory for Categorical Imputation. (arXiv:2302.04165v1 [stat.ML])
    Most datasets suffer from partial or complete missing values, which has downstream limitations on the available models on which to test the data and on any statistical inferences that can be made from the data. Several imputation techniques have been designed to replace missing data with stand in values. The various approaches have implications for calculating clinical scores, model building and model testing. The work showcased here offers a novel means for categorical imputation based on item response theory (IRT) and compares it against several methodologies currently used in the machine learning field including k-nearest neighbors (kNN), multiple imputed chained equations (MICE) and Amazon Web Services (AWS) deep learning method, Datawig. Analyses comparing these techniques were performed on three different datasets that represented ordinal, nominal and binary categories. The data were modified so that they also varied on both the proportion of data missing and the systematization of the missing data. Two different assessments of performance were conducted: accuracy in reproducing the missing values, and predictive performance using the imputed data. Results demonstrated that the new method, Item Response Theory for Categorical Imputation (IRTCI), fared quite well compared to currently used methods, outperforming several of them in many conditions. Given the theoretical basis for the new approach, and the unique generation of probabilistic terms for determining category belonging for missing cells, IRTCI offers a viable alternative to current approaches.  ( 2 min )
    Concept Algebra for Text-Controlled Vision Models. (arXiv:2302.03693v1 [cs.CL])
    This paper concerns the control of text-guided generative models, where a user provides a natural language prompt and the model generates samples based on this input. Prompting is intuitive, general, and flexible. However, there are significant limitations: prompting can fail in surprising ways, and it is often unclear how to find a prompt that will elicit some desired target behavior. A core difficulty for developing methods to overcome these issues is that failures are know-it-when-you-see-it -- it's hard to fix bugs if you can't state precisely what the model should have done! In this paper, we introduce a formalization of "what the user intended" in terms of latent concepts implicit to the data generating process that the model was trained on. This formalization allows us to identify some fundamental limitations of prompting. We then use the formalism to develop concept algebra to overcome these limitations. Concept algebra is a way of directly manipulating the concepts expressed in the output through algebraic operations on a suitably defined representation of input prompts. We give examples using concept algebra to overcome limitations of prompting, including concept transfer through arithmetic, and concept nullification through projection. Code available at https://github.com/zihao12/concept-algebra.  ( 2 min )

  • Open

    Where can you create AI music like this? Sounds surprisingly good
    submitted by /u/ziroxonline [link] [comments]  ( 40 min )
    AI software developer: specs -> product
    Today we have many different systems to develop a business application that meets our businesses. For example, many companies use salesforce, ms dynamics, sap and many other systems as ERP; at the same time today, we have systems like airtables, google appsheet - no-low-code platforms - generally abstract configurable databases with low-code features to write custom logic. What if we can develop some "AI specs translator" based on Chat GPT that would be able to parse specification and translate it into the final deployed system (specs into API requests for no code platform)? Example (specs): User roles: Administrator Warehouse manager Sales rep Tables: Product. Fields: id, name, description Warehouse: Fields: id, reference to product, count Order: id, reference to product, count, status Views: Order page: the page allows a sales rep to place an order for a product Kanban board page allows warehouse managers to complete created orders. Do you think it would work? submitted by /u/evgenyzhurko [link] [comments]  ( 41 min )
    How do you discover new AI?
    Is there a consumer hub/site that acts as a directory? So many AI tools and companies popping up every day - I feel like there must be some kind of marketplace or 'App Store' equivalent just for AI? submitted by /u/mesjer123 [link] [comments]  ( 41 min )
    Use florid, baroque language to describe a baked potato
    submitted by /u/Imagine-your-success [link] [comments]  ( 40 min )
    AI certificates?
    So ive heard this is a thing, what kinds of certificates are there and are these enough to enter the workforce/freelance? Im a graduate in Nanotechnology but i couldn't get anything out of it not even unpaid internships, so im looking for other options, specifically certifications as to start small. And since i have some coding knowledge AI certification got me curious. submitted by /u/AffectionateAffect5 [link] [comments]  ( 41 min )
    I made Image analyzing discord bot that can write Poem/Description/Title of your image using GPT3 and CLIP, link in comment
    submitted by /u/red3vil96 [link] [comments]  ( 41 min )
    Good AI for removing vocals from songs?
    Hello everyone, I run a D&D game and there are many songs with lyrics that I want the instrumental version of to play during battles, in towns, etc. Most of them do not have an instrumental version that I can find. Does anyone know any good AI programs that remove vocals from songs? Free is preferable because I’m a broke college student but I am willing to pay if needed as I want a good end product. submitted by /u/Kittykittyredcat [link] [comments]  ( 41 min )
    OpenAI Text Classifier Hands-On Review: ChatGPT’s Own AI Detection - Pros & Cons
    The following guide provides an independent review of how well this OpenAI detection software performs and how its capabilities stack up against competitors (for finding A!-generated text and plagiarism) OpenAI Text Classifier: ChatGPT’s Own AI Detection - Review submitted by /u/thumbsdrivesmecrazy [link] [comments]  ( 41 min )
    Microsoft Bing’s Overnight Success Story: 10x Downloads After ChatGPT Integration Announcement
    submitted by /u/liquidocelotYT [link] [comments]  ( 44 min )
    what if ai was making the best jokes for each person
    Hello, yall have propably seen that ai generated tv show on twitch Seinfeld. And so i thought about an ai making jokes specifically for you, jokes that you laught at every time u hear them. But i think that after some time, you would stop laughing at them. My question is, will u still find something funny, when the best jokes made specifically for you arent funny anymore ? (Or do u think we will endlessly laught at those jokes?) ​ Sorry for my english xd submitted by /u/maxovina [link] [comments]  ( 41 min )
    "Welcome to Carl's Jr., fuck you, I'm eating."
    submitted by /u/sifeliz [link] [comments]  ( 40 min )
    AI that generates 1-click replies based on auto-categorised emails. If you add in info like your calendar link to send, it's like having an executive assistant for your inbox
    submitted by /u/Ok-Craft-9908 [link] [comments]  ( 41 min )
    Share your fine-tuned GPT model with the world.
    Hey Guys, I and some friends are mostly done building a platform that gives users API access to fine-tuned GPT models. This is something we wanted to see in the world and decided to build ourselves. We're now looking for individuals with high-quality, fine-tuned GPT models that they would like to see on our platform. This is a great opportunity for those who have spent time and effort fine-tuning models to showcase their work to a wider audience, receive recognition, and monetize their work. If you have a fine-tuned GPT model that you would like to share on our platform, please reach out to us. We are always looking for new models to add to our platform, and we are eager to hear from you. Whether you have created a model that is specific to a particular domain or industry, or you have developed a model that is particularly advanced and highly specialized, we would love to hear from you. So, if you have a great fine-tuned GPT model, or you know someone who does, send me a message. We look forward to hearing from you soon! Thanks, submitted by /u/Aggravating_Art_173 [link] [comments]  ( 41 min )
    I put ChatGPT through the Marcel Proust Questionnaire
    submitted by /u/okanaganjournal [link] [comments]  ( 40 min )
    Student hacks new Bing chatbot search aka "Sydney"
    submitted by /u/Number_5_alive [link] [comments]  ( 40 min )
    Story-telling AI that I could feed data to and then ask to write a similar replica?
    Hello, I was wondering if there's a free or premium story-telling AI model that I could feed data to, for example, passages from a particular author or pages from their book, and then ask the AI to create a story using that author's writing style, dictionary, or ideas. A while ago I watched a Youtube video, in which a person taught an AI to write screenplays in the style of a certain author and I'd like to do the same, except with short stories. Is it possible to do so without any coding knowledge? Thanks. submitted by /u/Basil1sk17 [link] [comments]  ( 41 min )
    Am I the only one who has become addicted to AI advancements news?
    All the AI news has managed to get me very excited but now I'm craving my next hit! We all know great things that will be possible in the future, but this rapid new cycle has left me craving my next fix of news. before I was content to glance at AI news and go oh that interesting now I'm bloodshot eyed scrolling twitter and YouTube to see if there is any new advancements that I missed... It's not just me right? submitted by /u/Nargodian [link] [comments]  ( 42 min )
    Would this concept be possible with AI?
    I've recently developed an interest in AI and plan on learning how to make my own, starting simple and getting more complex as I go. My question is, if you can train an AI on any data you want to get an output, could you in theory train an AI with source and compiled code with the intention of getting the AI to spit out an approximation source code for a compiled program? Its an interesting concept to toy with, but I'm not sure if it's viable. Let me know what you think! submitted by /u/OmaeWaWakiYaku [link] [comments]  ( 41 min )
    [OC] "Théâtre D'opéra Spatial by Midjourney (with the help of a human)"
    submitted by /u/AdministrativeLet996 [link] [comments]  ( 41 min )
    Microsoft Prometheus is GPT-4 in Software (Is GPT-4 inside Prometheus Model?)
    submitted by /u/BackgroundResult [link] [comments]  ( 40 min )
    Google's Bard AI ChatBot Wiped Off $100 Billion In Market Cap After Factual Error In First Demo
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 43 min )
    TEXTure: Text to 3D textures
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    What are the best resources to stay up to date with the latest news ?
    I am very interested in the field of artificial intelligence. Can you recommend me some YouTube channels, websites/blogs, magazines and other places on the internet where I would be informed of the latest news in this environment ? submitted by /u/HelloWorldCLang [link] [comments]  ( 42 min )
    A.I. Rick & Morty Analyzing Real-Time A.I. Generated Poster Art
    submitted by /u/MindCluster [link] [comments]  ( 40 min )
  • Open

    How do you/should you put less importance on actions which have less steps after them due to arbitrary episode length?
    For example, in an environment a game might take 1,000 steps and actions might take some time to realize reward but realize cost immediately. An action taken at step 998 might be a good action if the game continued beyond 1000 steps, but it could get punished because the game stops arbitrarily early. Is there any logic in just culling the last n actions/discounted returns? submitted by /u/JustTaxLandLol [link] [comments]  ( 41 min )
    Curriculum vs. hierarchical RL
    Title says it all. I am wondering about the difference and the relationship (if at all) between curriculum RL and hierarchical RL. Can anyone enlighten me please? The concepts seem to have similar themes Thanks in advance! submitted by /u/acorntje [link] [comments]  ( 42 min )
    What exploration-hard or sparse reward environment do you recommend for research?
    submitted by /u/OutOfCharm [link] [comments]  ( 41 min )
    RL agent beating all main bosses of Mega Man X4
    submitted by /u/victorsevero [link] [comments]  ( 44 min )
  • Open

    Detect signatures on documents or images using the signatures feature in Amazon Textract
    Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from any document or image. AnalyzeDocument Signatures is a feature within Amazon Textract that offers the ability to automatically detect signatures on any document. This can reduce the need for human review, custom code, or ML experience. In this post, […]  ( 7 min )
    Monitoring Lake Mead drought using the new Amazon SageMaker geospatial capabilities
    Earth’s changing climate poses an increased risk of drought due to global warming. Since 1880, the global temperature has increased 1.01 °C. Since 1993, sea levels have risen 102.5 millimeters. Since 2002, the land ice sheets in Antarctica have been losing mass at a rate of 151.0 billion metric tons per year. In 2022, the […]  ( 10 min )
  • Open

    Amplification at the Quantum limit
    Posted by Ted White and Ofer Naaman, Staff Research Scientists, Google Quantum AI The Google Quantum AI team is building quantum computers with superconducting microwave circuits, but much like a classical computer the superconducting processor at the heart of these computers is only part of the story. An entire technology stack of peripheral hardware is required to make the quantum computer work properly. In many cases these parts must be custom designed, requiring extensive research and development to reach the highest levels of performance. In this post, we highlight one aspect of this supplemental hardware: our superconducting microwave amplifiers. In “Readout of a Quantum Processor with High Dynamic Range Josephson Parametric Amplifiers”, published in Applied Physics Letters, w…  ( 92 min )
  • Open

    3 Questions: Leo Anthony Celi on ChatGPT and medicine
    The chatbot’s success on the medical licensing exam shows that the test — and medical education — are flawed, Celi says.  ( 8 min )
  • Open

    [D] Using LLMs as decision engines
    I just finished reading the paper "Pre-Trained Language Models for Interactive Decision Making" (https://arxiv.org/abs/2202.01771). As I understand it, the authors are using a language model to "generate" an optimal path to an objective, in test environments like VirtualHome and BabyAI. Reinforcement and imitation learning are evaluated as ways for the model to self-improve. This is the first time I've seen a language model being used to "solve a problem" that isn't a language one. It seems to open up so many new possibilties. Has this been done before? Are there other examples of LMs being used as decision engines? What's the state of the art? Any interesting applications you've seen? Side question: I imagine there were AI approaches to navigating VirtualHome and BabyAI that were NOT language-model based. What is the standard modeling approach to these kinds of problems? submitted by /u/These-Assignment-936 [link] [comments]  ( 44 min )
    [D] Plot Best Run for Accuracy or Mean across runs?
    I've ran two image classification model 5 times on a dataset. Model A has a mean best accuracy of 95.03% while Model B has a mean best accuracy of 95.3% However, Model A has a max best accuracy of 95.75% while Model B has a mean best accuracy of 95.5%. I am wanting to report these results in a paper to a conference/journal. When plotting the test accuracy per epoch, should I only report the results for the best run or should I take the mean of the test accuracies over all 5 runs per epoch for plotting? submitted by /u/MyActualUserName99 [link] [comments]  ( 43 min )
    [D] Dense correspondence between image/mesh
    The goal is to create a model that can make correspondences between images and meshes. Just like we see in image registration, where we have an image next to each other and then N matches (lines) that show similar features. But in this case will be an image and a mesh of a specific object. Have you some tips and ideas that could help me how to attack this problem? submitted by /u/henistein [link] [comments]  ( 43 min )
    [Discussion] Looking for opinions: Scale Spellbook vs. Snorkel Flow vs....?
    Hey everyone. Has anyone used Snorkel Flow, Scale Spellbook or other alternatives (please advise) to test multiple foundation models and migrate between them? E.g. comparing GPT3 vs GPT-J or GPT-Neo etc. Need help moving to a smaller/cheaper model - cheers! submitted by /u/fourcornerclub [link] [comments]  ( 42 min )
    [D] Similarity b/w two vectors
    how to calculate similarity between two vectors? I want a similarity metric that take into accounts both the directions and magnitudes of vectors. submitted by /u/TKMater [link] [comments]  ( 43 min )
    [D]Image Recognition ability of machine learning in financial markets questions
    I want to understand if it is possible to build an image recognition system. Let me explain. Take a historical price chart of the stock market but make the background black and the price white in the form of bars (OHLC). Keep the scale the same for all, let's say 100 bars of data - the top of the scale is 1 and the bottom 0, as to not bring actual price into play. Now assume you could capture these images over and over again through historic price charts. They all have the same amount of bars, color, and scaling. Now insert a current stock chart 'X'. Is it possible to have this database review all of the charts and give the charts of the past that are most similar to the current chart now? And include some sort of similarities correlation [90%,77%, etc.] Here is what I'm not looking for: For it to predict the future For it to tell me to buy and sell What I am looking for is: Similarities of past prices The correlation of the similarities A list of the top [10,20,30] charts and their correlation of them to the current. I understand past historical prices are different than future and economic data is complex and many variables are at play as well. submitted by /u/Ready-Acanthaceae970 [link] [comments]  ( 45 min )
    [D] Story-telling AI that I could feed data to and then ask to write a similar replica?
    Hello, I was wondering if there's a free or premium story-telling AI model that I could feed data to, for example, passages from a particular author or pages from their book, and then ask the AI to create a story using that author's writing style, dictionary, or ideas. A while ago I watched a Youtube video, in which a person taught an AI to write screenplays in the style of a certain author and I'd like to do the same, except with short stories. Is it possible to do so without any coding knowledge? Thanks. submitted by /u/Basil1sk17 [link] [comments]  ( 43 min )
    [D] Constrained Optimization in Deep Learning
    Clearly, large scale deep learning approaches in image classification or NLP use all sorts of Regularization mechanisms, but the parameters are typically unconstrained (i.e., every weight can theoretically attain any real value). In many Machine Learning domains, constrained optimization (e.g. via Projected Gradient Descent or Frank-Wolfe) plays a huge role. I was wondering whether there are large-scale Deep Learning applications which rely on constrained optimization approaches? When I say large-scale, I mean large CNNs, transformers, diffusion models or the like. Are there settings where constrained optimization would even be a preferred approach, but not efficient/stable enough? Happy for any paper suggestions or thoughts! Thanks! submitted by /u/d0cmorris [link] [comments]  ( 44 min )
    [D] Latent spaces and weather forecasting/nowcasting
    Hi, everyone. First time, long time. My background is weather analysis to DL applications in weather, and I had a question I wanted to ask the community writ large. The question is about latent spaces and how, specifically, the DeepMind group used them in their radar nowcasting model DGMR (see links to prior threads below). In the DGMR paper itself (https://www.nature.com/articles/s41586-021-03854-z), the architecture looks like a U-net with some ConvGRU2D flair in the decoder and some temporal consistency checks from the discriminator. There is also what they call a "latent conditioning stack." From some deeper readings, I think the model is a descendant of BigGAN, since both use an explicit latent space among other similarities. This leads to my question and general curiosity... How is this latent space seeded? My prior experience with latent space toy models (DCGAN, for example) is that unless you seed the RNG explicitly, then performing a restart of the model to continue training mucks up the distribution. Fairly standard RNG issues. Is it really as simple as, for example, latent_vector = tf.random.truncated_normal([batch_size, grid_size_parameters], seed=42) I feel like I'm missing something. Why does this work at all? Why is a latent space necessary in this context? They state explicitly in their paper that they require this stack to generalize results to datasets that are larger (in a HxW sense) than the one on which they trained, but I can't wrap my head around why an extended latent vector for a larger grid size works. If anyone can point me in the right direction or help me understand, I'd greatly appreciate it. Links to prior threads: https://www.reddit.com/r/MachineLearning/comments/pyfjz7/r_deepminds_weather_forecasting_model_nowcasting/ https://old.reddit.com/r/MachineLearning/comments/py0289/r_skilful_precipitation_nowcasting_using_deep/ submitted by /u/PsyEclipse [link] [comments]  ( 44 min )
    [D] Can we use Ray for distributed training on vertex ai ? Can someone provide me examples for the same ? Also which dataframe libraries you guys used for training machine learning models on huge datasets (100 gb+) (because pandas can't handle huge data).
    Same as title. Also the dataframe library should support machine learning libraries submitted by /u/RstarPhoneix [link] [comments]  ( 42 min )
    [D] RTX 3090 with i7 7700k, training bottleneck
    Hey guys I have an older PC(5 years) with an i7 7700k processor. I want to buy an Nvidia RTX 3090 for training large language models. I can t find any benchmark for CPU bottleneck when training, let s say an GPT 2 large model. Has anyone have any experience with this set-up similar set-up ? submitted by /u/Available_Lion_652 [link] [comments]  ( 45 min )
    [R] Research Seminar by Neural Magic: AC/DC: Alternating Compressed/DeCompressed Training of Deep Neural Networks
    At Neural Magic, we are proud to be at the forefront of cutting-edge machine learning research, with a particular focus on model compression. Our internal Lunch and Learn seminars are a weekly opportunity for our team to share their research and collaborate on new ideas. We believe in the importance of open-source contributions, which is why we are thrilled to announce that for a second time, we are opening the seminar to the wider community. On Wednesday, February 23, 2022, I will be sharing our work on AC/DC, a framework for sparse-training models. This research was done in partnership with IST Austria. Join me and the Neural Magic team for this exciting presentation and be sure to keep an eye out for future speakers in the coming months! ​ You can reserve your spot for the presentation here. submitted by /u/dtransposed [link] [comments]  ( 43 min )
    [P] Creating an embedding from a CNN
    Hi all: I have trained a CNN (efficietnet-b3) to classify the degree of a disease on medical images. I would like to create an embedding both to visualize relationships between images (after projecting to 2d or 3d-space) and to find similar images to one given. I have tried using the output of the last convolution both before and after pooling for all train images (~30.000) but the result is mediocre: images non-alike are quite close in the embedding and plotting it in 2 or 3d just show a point cloud with no obvious pattern. I have also tried to use the class activation map (the output of the convolutional layer after pooling and multiplying by the weights of the classifier of the predicted class). This is quite better, but class are not separated too clearly in the scatterplot. Is there any other sensible way to generate the embeddings? I have tried using the hidden representation of earlier convolutional layers, but some of them are so huge (~650.000 features per sample) creating a reasonable sized embedding would require very aggressive PCA. ​ Example of the scatter plot of the heatmap embedding. While it is okayish (classes are more or less spatially localized) it would be great to find an embedding that creates more visible clusters for each class. https://preview.redd.it/l7smdyuml6ha1.png?width=543&format=png&auto=webp&s=fcc757246fca79ca9fd1378562739c05baf6697e ​ submitted by /u/zanzagaes2 [link] [comments]  ( 45 min )
    [PROJECT] Text cluster of MxN dimensions as training set for AI?
    I have a text clustering project. It clusters texts in MxN dimensions. M is a subset of N, where N is the total number of domains. The text corpus is a set of academic papers. The clusters are cross disciplinary subjects, defined by M. Clusters are identified by MANOVA tests of sets of cross products. Goal is to identify texts of interest for research. E.g. identify clusters of papers relevant to a combination of subjects, or identify areas of research by their cluster, or identify outlier research. This is a N versus NP problem. It requires a great deal of processing time to cluster texts. I may do so for a corpus of 10k research papers, but that is a static set, and papers cannot be appended to the corpus without affecting all other clusters of the corpus. So I am considering creating a training set of 10k papers, and writing an AI to identify and cluster texts without comparing it to the rest of the corpus. I want feedback and ideas. I wont specify what I am looking for yet, because I am certain some of the responses here will point out something I did not consider. So please, comment with your thoughts. Tell me what you know. Give me your ideas. submitted by /u/Smedskjaer [link] [comments]  ( 43 min )
    [D] Bees: a new unit of measurement for ML model size
    Would like to hear about what you guys think about this approach? submitted by /u/ThePerson654321 [link] [comments]  ( 43 min )
    [P] Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl
    We are happy to announce the support of OpenAI Whisper model (ASR task) on Kernl. We focused on high quality transcription in a latency sensitive scenario, meaning: whisper-large-v2 weights beam search 5 (as recommended in the related paper) We measured a 2.3x speedup on Nvidia A100 GPU (2.4x on 3090 RTX) compared to Hugging Face implementation using FP16 mixed precision on transcribing librispeech test set (over 2600 examples). For now, OpenAI implementation is not yet PyTorch 2.0 compliant. In the post below, we discuss what worked (CUDA Graph), our tricks (to significantly reduce memory footprint), and what did not pay off (Flash attention and some other custom Triton kernels). Kernl repository: https://github.com/ELS-RD/kernl Reproduction script: https://github.com/ELS-RD…  ( 62 min )
    [D] Format for ICML tutorial submission?
    Hello! I'm quite new to this. I was wondering what the right format is for submitting a successful tutorial proposal. Should I just use the LaTeX style files but modify the content for a tutorial proposal? submitted by /u/Acceptable_League160 [link] [comments]  ( 6 min )
    [D] Are there emergent abilities of image models?
    Just finished reading the Stanford/Google survey paper (https://arxiv.org/abs/2206.07682) on emergent abilities of large language models. It made me wonder: do image generation models have emergent abilities, too? Do we know? I can't quite wrap my head around what such an ability would even look like. Figured maybe other folks had given this a think. submitted by /u/These-Assignment-936 [link] [comments]  ( 46 min )
    [D] Are there any AI model that I can use to improve very bad quality sound recording? Removing noise and improving overall quality
    I have got old lecture recordings I want to improve their sound quality I have tested adobe AI noise removal but not very good I also tested descript studio sound not very good either I wonder if there are any public model, github repo, github project, hugging face repo that I can use to remove noise and improve sound quality of existing audio recordings? Thank you so much for replies Recordings are in English Here example recording that needs to be cleaned 5 min audio : https://sndup.net/stjs/ full lecture : https://youtu.be/2zY1dQDGl3o submitted by /u/CeFurkan [link] [comments]  ( 45 min )
  • Open

    GPT in 60 Lines of NumPy
    submitted by /u/nickb [link] [comments]  ( 40 min )
    Using Model Weights in Another Model
    Hello everyone, I wrote two models from scratch. I want to use first model weights in the second model. However, these two models have not same architecture. I wrote the code as follows: # Load the pre-trained model model_CovidNet = load_model("CovidNet_Model.h5") # Access the weights CovidNet_weights = model_CovidNet.get_weights() This my first layer in the model and I want to use the weights as follows: x = Conv2D(64, 7, strides = 2, padding = 'same', kernel_initializer = CovidNet_weight)(input) I am getting dimension error. How can I solve this problem? How can I use other model weights in my other model? I'd be appreciate to any kind help. submitted by /u/Hungry-Engineer-5696 [link] [comments]  ( 41 min )
    Researchers Discover a More Flexible Approach to Machine Learning: “Liquid” neural nets, based on a worm’s nervous system
    submitted by /u/nickb [link] [comments]  ( 41 min )
    TEXTure: Text to 3D textures
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    Modifying and Reducing Threshold Values with Each Iteration
    Are there any examples of neural networks that, when training, reduce a threshold value when comparing what they get to the correct output each time they iterate? Say a neural network is doing number recognition. Are there any examples of a neural network requiring less and less error in identifying the numbers for more and more iterations through the neural network? This would mean that the neural network must spend more and more time training the parameters(or hyperparameters in some cases I suppose) for every iteration it has performed unless the error is reduced substantially. submitted by /u/AstroBullivant [link] [comments]  ( 41 min )
  • Open

    Crossing Continents: XPENG G9 SUV and P7 Sedan Set Course for Scandinavia, the Netherlands
    Electric automaker XPENG’s flagship G9 SUV and P7 sports sedan are now available for order in Sweden, Denmark, Norway and the Netherlands — an expansion revealed last week at the eCar Expo in Stockholm. The intelligent electric vehicles are built on the high-performance NVIDIA DRIVE Orin centralized compute architecture and deliver AI capabilities that are Read article >  ( 5 min )
    3D Artist Brings Ride and Joy to Automotive Designs With Real-Time Renders Using NVIDIA RTX
    Designing automotive visualizations can be incredibly time consuming. To make the renders look as realistic as possible, artists need to consider material textures, paints, realistic lighting and reflections, and more. For 3D artist David Baylis, it’s important to include these details and still create high-resolution renders in a short amount of time. That’s why he Read article >  ( 6 min )
    Gather Your Party: GFN Thursday Brings ‘Baldur’s Gate 3’ to the Cloud
    Venture to the Forgotten Realms this GFN Thursday in Baldur’s Gate 3, streaming on GeForce NOW. Celebrations for the cloud gaming service’s third anniversary continue with a Dying Light 2 reward that’s to die for. It’s the cherry on top of three new titles joining the GeForce NOW library this week. Roll for Initiative Mysterious Read article >  ( 5 min )
  • Open

    Recognizing squares
    Suppose you’re given a number and you’d like to tell whether its a square, or at least you’d like to be able to determine quickly if it’s not a square. This post began as a thread I wrote on Twitter. For starters, the last digit of a square in base 10 must be 0, 1, […] Recognizing squares first appeared on John D. Cook.  ( 5 min )
  • Open

    Linear Partial Monitoring for Sequential Decision-Making: Algorithms, Regret Bounds and Applications. (arXiv:2302.03683v1 [cs.LG])
    Partial monitoring is an expressive framework for sequential decision-making with an abundance of applications, including graph-structured and dueling bandits, dynamic pricing and transductive feedback models. We survey and extend recent results on the linear formulation of partial monitoring that naturally generalizes the standard linear bandit setting. The main result is that a single algorithm, information-directed sampling (IDS), is (nearly) worst-case rate optimal in all finite-action games. We present a simple and unified analysis of stochastic partial monitoring, and further extend the model to the contextual and kernelized setting.  ( 2 min )
    Fuzzy Expert Systems for Prediction of ICU Admission in Patients with COVID-19. (arXiv:2104.12868v2 [cs.LG] UPDATED)
    The pandemic COVID-19 disease has had a dramatic impact on almost all countries around the world so that many hospitals have been overwhelmed with Covid-19 cases. As medical resources are limited, deciding on the proper allocation of these resources is a very crucial issue. Besides, uncertainty is a major factor that can affect decisions, especially in medical fields. To cope with this issue, we use fuzzy logic (FL) as one of the most suitable methods in modeling systems with high uncertainty and complexity. We intend to make use of the advantages of FL in decisions on cases that need to treat in ICU. In this study, an interval type-2 fuzzy expert system is proposed for prediction of ICU admission in COVID-19 patients. For this prediction task, we also developed an adaptive neuro-fuzzy inference system (ANFIS). Finally, the results of these fuzzy systems are compared to some well-known classification methods such as Naive Bayes (NB), Case-Based Reasoning (CBR), Decision Tree (DT), and K Nearest Neighbor (KNN). The results show that the type-2 fuzzy expert system and ANFIS models perform competitively in terms of accuracy and F-measure compared to the other system modeling techniques.  ( 2 min )
    A Survey of Supernet Optimization and its Applications: Spatial and Temporal Optimization for Neural Architecture Search. (arXiv:2204.03916v2 [cs.CV] UPDATED)
    This survey focuses on categorizing and evaluating the methods of supernet optimization in the field of Neural Architecture Search (NAS). Supernet optimization involves training a single, over-parameterized network that encompasses the search space of all possible network architectures. The survey analyses supernet optimization methods based on their approaches to spatial and temporal optimization. Spatial optimization relates to optimizing the architecture and parameters of the supernet and its subnets, while temporal optimization deals with improving the efficiency of selecting architectures from the supernet. The benefits, limitations, and potential applications of these methods in various tasks and settings, including transferability, domain generalization, and Transformer models, are also discussed.  ( 2 min )
    Projective Ranking-based GNN Evasion Attacks. (arXiv:2202.12993v3 [cs.LG] UPDATED)
    Graph neural networks (GNNs) offer promising learning methods for graph-related tasks. However, GNNs are at risk of adversarial attacks. Two primary limitations of the current evasion attack methods are highlighted: (1) The current GradArgmax ignores the "long-term" benefit of the perturbation. It is faced with zero-gradient and invalid benefit estimates in certain situations. (2) In the reinforcement learning-based attack methods, the learned attack strategies might not be transferable when the attack budget changes. To this end, we first formulate the perturbation space and propose an evaluation framework and the projective ranking method. We aim to learn a powerful attack strategy then adapt it as little as possible to generate adversarial samples under dynamic budget settings. In our method, based on mutual information, we rank and assess the attack benefits of each perturbation for an effective attack strategy. By projecting the strategy, our method dramatically minimizes the cost of learning a new attack strategy when the attack budget changes. In the comparative assessment with GradArgmax and RL-S2V, the results show our method owns high attack performance and effective transferability. The visualization of our method also reveals various attack patterns in the generation of adversarial samples.  ( 2 min )
    Don't Blame the Annotator: Bias Already Starts in the Annotation Instructions. (arXiv:2205.00415v2 [cs.CL] UPDATED)
    In recent years, progress in NLU has been driven by benchmarks. These benchmarks are typically collected by crowdsourcing, where annotators write examples based on annotation instructions crafted by dataset creators. In this work, we hypothesize that annotators pick up on patterns in the crowdsourcing instructions, which bias them to write many similar examples that are then over-represented in the collected data. We study this form of bias, termed instruction bias, in 14 recent NLU benchmarks, showing that instruction examples often exhibit concrete patterns, which are propagated by crowdworkers to the collected data. This extends previous work (Geva et al., 2019) and raises a new concern of whether we are modeling the dataset creator's instructions, rather than the task. Through a series of experiments, we show that, indeed, instruction bias can lead to overestimation of model performance, and that models struggle to generalize beyond biases originating in the crowdsourcing instructions. We further analyze the influence of instruction bias in terms of pattern frequency and model size, and derive concrete recommendations for creating future NLU benchmarks.  ( 2 min )
    Recent advances in the Self-Referencing Embedding Strings (SELFIES) library. (arXiv:2302.03620v1 [physics.chem-ph])
    String-based molecular representations play a crucial role in cheminformatics applications, and with the growing success of deep learning in chemistry, have been readily adopted into machine learning pipelines. However, traditional string-based representations such as SMILES are often prone to syntactic and semantic errors when produced by generative models. To address these problems, a novel representation, SELF-referencIng Embedded Strings (SELFIES), was proposed that is inherently 100% robust, alongside an accompanying open-source implementation. Since then, we have generalized SELFIES to support a wider range of molecules and semantic constraints and streamlined its underlying grammar. We have implemented this updated representation in subsequent versions of \selfieslib, where we have also made major advances with respect to design, efficiency, and supported features. Hence, we present the current status of \selfieslib (version 2.1.1) in this manuscript.  ( 2 min )
    Automated Huntington's Disease Prognosis via Biomedical Signals and Shallow Machine Learning. (arXiv:2302.03605v1 [eess.SP])
    Huntington's disease (HD) is a rare, genetically-determined brain disorder that limits the life of the patient, although early prognosis of HD can substantially improve the patient's quality of life. Current HD prognosis methods include using a variety of complex biomarkers such as clinical and imaging factors, however these methods have many shortfalls, such as their resource demand and failure to distinguish symptomatic and asymptomatic patients. Quantitative biomedical signaling has been used for diagnosis of other neurological disorders such as schizophrenia, and has potential for exposing abnormalities in HD patients. In this project, we used a premade, certified dataset collected at a clinic with 27 HD positive patients, 36 controls, and 6 unknowns with electroencephalography, electrocardiography, and functional near-infrared spectroscopy data. We first preprocessed the data and extracted a variety of features from both the transformed and raw signals, after which we applied a plethora of shallow machine learning techniques. We found the highest accuracy was achieved by a scaled-out Extremely Randomized Trees algorithm, with area under the curve of the receiver operator characteristic of 0.963 and accuracy of 91.353%. The subsequent feature analysis showed that 60.865% of the features had p<0.05, with the features from the raw signal being most significant. The results indicate the promise of neural and cardiac signals for marking abnormalities in HD, as well as evaluating the progression of the disease in  ( 2 min )
    Action Matching: Learning Stochastic Dynamics from Samples. (arXiv:2210.06662v2 [cs.LG] UPDATED)
    Learning the continuous dynamics of a system from snapshots of its time evolution is a problem which appears throughout natural sciences and machine learning, including in quantum systems, single-cell biological data, and generative modeling. In these settings, we assume that only uncorrelated samples rather than full trajectory data are available. In order to better understand the systems under observation, we would like to learn a model of the underlying process that allows us to propagate samples in time and thereby simulate entire individual trajectories. In this work, we propose Action Matching, a method for learning a rich family of dynamics using only independent samples from its time evolution. We derive a tractable training objective, which does not rely on explicit assumptions about the underlying dynamics and does not require back-propagation through differential equation or optimal transport solvers. Inspired by connections with optimal transport, we derive extensions of Action Matching to learn stochastic differential equations and dynamics involving creation or destruction of probability mass. Finally, we showcase applications of Action Matching by achieving competitive performance in a diverse set of experiments from biology, physics, and generative modeling.
    Black-Box Testing of Deep Neural Networks Through Test Case Diversity. (arXiv:2112.12591v5 [cs.SE] UPDATED)
    Deep Neural Networks (DNNs) have been extensively used in many areas including image processing, medical diagnostics, and autonomous driving. However, DNNs can exhibit erroneous behaviours that may lead to critical errors, especially when used in safety-critical systems. Inspired by testing techniques for traditional software systems, researchers have proposed neuron coverage criteria, as an analogy to source code coverage, to guide the testing of DNN models. Despite very active research on DNN coverage, several recent studies have questioned the usefulness of such criteria in guiding DNN testing. Further, from a practical standpoint, these criteria are white-box as they require access to the internals or training data of DNN models, which is in many contexts not feasible or convenient. In this paper, we investigate black-box input diversity metrics as an alternative to white-box coverage criteria. To this end, we first select and adapt three diversity metrics and study, in a controlled manner, their capacity to measure actual diversity in input sets. We then analyse their statistical association with fault detection using four datasets and five DNN models. We further compare diversity with state-of-the-art white-box coverage criteria. Our experiments show that relying on the diversity of image features embedded in test input sets is a more reliable indicator than coverage criteria to effectively guide the testing of DNNs. Indeed, we found that one of our selected black-box diversity metrics far outperforms existing coverage criteria in terms of fault-revealing capability and computational time. Results also confirm the suspicions that state-of-the-art coverage metrics are not adequate to guide the construction of test input sets to detect as many faults as possible with natural inputs.
    Deep Linear Networks can Benignly Overfit when Shallow Ones Do. (arXiv:2209.09315v2 [cs.LG] UPDATED)
    We bound the excess risk of interpolating deep linear networks trained using gradient flow. In a setting previously used to establish risk bounds for the minimum $\ell_2$-norm interpolant, we show that randomly initialized deep linear networks can closely approximate or even match known bounds for the minimum $\ell_2$-norm interpolant. Our analysis also reveals that interpolating deep linear models have exactly the same conditional variance as the minimum $\ell_2$-norm solution. Since the noise affects the excess risk only through the conditional variance, this implies that depth does not improve the algorithm's ability to "hide the noise". Our simulations verify that aspects of our bounds reflect typical behavior for simple data distributions. We also find that similar phenomena are seen in simulations with ReLU networks, although the situation there is more nuanced.
    Accelerated Nonnegative Tensor Completion via Integer Programming. (arXiv:2211.15770v2 [cs.LG] UPDATED)
    The problem of tensor completion has applications in healthcare, computer vision, and other domains. However, past approaches to tensor completion have faced a tension in that they either have polynomial-time computation but require exponentially more samples than the information-theoretic rate, or they use fewer samples but require solving NP-hard problems for which there are no known practical algorithms. A recent approach, based on integer programming, resolves this tension for nonnegative tensor completion. It achieves the information-theoretic sample complexity rate and deploys the Blended Conditional Gradients algorithm, which requires a linear (in numerical tolerance) number of oracle steps to converge to the global optimum. The tradeoff in this approach is that, in the worst case, the oracle step requires solving an integer linear program. Despite this theoretical limitation, numerical experiments show that this algorithm can, on certain instances, scale up to 100 million entries while running on a personal computer. The goal of this paper is to further enhance this algorithm, with the intention to expand both the breadth and scale of instances that can be solved. We explore several variants that can maintain the same theoretical guarantees as the algorithm, but offer potentially faster computation. We consider different data structures, acceleration of gradient descent steps, and the use of the Blended Pairwise Conditional Gradients algorithm. We describe the original approach and these variants, and conduct numerical experiments in order to explore various tradeoffs in these algorithmic design choices.
    Causal Effect Identification in Cluster DAGs. (arXiv:2202.12263v2 [stat.ME] UPDATED)
    Reasoning about the effect of interventions and counterfactuals is a fundamental task found throughout the data sciences. A collection of principles, algorithms, and tools has been developed for performing such tasks in the last decades (Pearl, 2000). One of the pervasive requirements found throughout this literature is the articulation of assumptions, which commonly appear in the form of causal diagrams. Despite the power of this approach, there are significant settings where the knowledge necessary to specify a causal diagram over all variables is not available, particularly in complex, high-dimensional domains. In this paper, we introduce a new graphical modeling tool called cluster DAGs (for short, C-DAGs) that allows for the partial specification of relationships among variables based on limited prior knowledge, alleviating the stringent requirement of specifying a full causal diagram. A C-DAG specifies relationships between clusters of variables, while the relationships between the variables within a cluster are left unspecified, and can be seen as a graphical representation of an equivalence class of causal diagrams that share the relationships among the clusters. We develop the foundations and machinery for valid inferences over C-DAGs about the clusters of variables at each layer of Pearl's Causal Hierarchy (Pearl and Mackenzie 2018; Bareinboim et al. 2020) - L1 (probabilistic), L2 (interventional), and L3 (counterfactual). In particular, we prove the soundness and completeness of d-separation for probabilistic inference in C-DAGs. Further, we demonstrate the validity of Pearl's do-calculus rules over C-DAGs and show that the standard ID identification algorithm is sound and complete to systematically compute causal effects from observational data given a C-DAG. Finally, we show that C-DAGs are valid for performing counterfactual inferences about clusters of variables.
    Distribution estimation and change-point estimation for time series via DNN-based GANs. (arXiv:2211.14577v2 [cs.LG] UPDATED)
    The generative adversarial networks (GANs) have recently been applied to estimating the distribution of independent and identically distributed data, and have attracted a lot of research attention. In this paper, we use the blocking technique to demonstrate the effectiveness of GANs for estimating the distribution of stationary time series. Theoretically, we derive a non-asymptotic error bound for the Deep Neural Network (DNN)-based GANs estimator for the stationary distribution of the time series. Based on our theoretical analysis, we propose an algorithm for estimating the change point in time series distribution. The two main results are verified by two Monte Carlo experiments respectively, one is to estimate the joint stationary distribution of $5$-tuple samples of a 20 dimensional AR(3) model, the other is about estimating the change point at the combination of two different stationary time series. A real world empirical application to the human activity recognition dataset highlights the potential of the proposed methods.
    FedVeca: Federated Vectorized Averaging on Non-IID Data with Adaptive Bi-directional Global Objective. (arXiv:2209.13803v2 [cs.LG] UPDATED)
    Federated Learning (FL) is a distributed machine learning framework to alleviate the data silos, where decentralized clients collaboratively learn a global model without sharing their private data. However, the clients' Non-Independent and Identically Distributed (Non-IID) data negatively affect the trained model, and clients with different numbers of local updates may cause significant gaps to the local gradients in each communication round. In this paper, we propose a Federated Vectorized Averaging (FedVeca) method to address the above problem on Non-IID data. Specifically, we set a novel objective for the global model which is related to the local gradients. The local gradient is defined as a bi-directional vector with step size and direction, where the step size is the number of local updates and the direction is divided into positive and negative according to our definition. In FedVeca, the direction is influenced by the step size, thus we average the bi-directional vectors to reduce the effect of different step sizes. Then, we theoretically analyze the relationship between the step sizes and the global objective, and obtain upper bounds on the step sizes per communication round. Based on the upper bounds, we design an algorithm for the server and the client to adaptively adjusts the step sizes that make the objective close to the optimum. Finally, we conduct experiments on different datasets, models and scenarios by building a prototype system, and the experimental results demonstrate the effectiveness and efficiency of the FedVeca method.
    Variance-Aware Sparse Linear Bandits. (arXiv:2205.13450v3 [cs.LG] UPDATED)
    It is well-known that for sparse linear bandits, when ignoring the dependency on sparsity which is much smaller than the ambient dimension, the worst-case minimax regret is $\widetilde{\Theta}\left(\sqrt{dT}\right)$ where $d$ is the ambient dimension and $T$ is the number of rounds. On the other hand, in the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve $\widetilde{\mathcal O}(1)$ regret, which is (nearly) independent of $d$ and $T$. In this paper, we present the first variance-aware regret guarantee for sparse linear bandits: $\widetilde{\mathcal O}\left(\sqrt{d\sum_{t=1}^T \sigma_t^2} + 1\right)$, where $\sigma_t^2$ is the variance of the noise at the $t$-th round. This bound naturally interpolates the regret bounds for the worst-case constant-variance regime (i.e., $\sigma_t \equiv \Omega(1)$) and the benign deterministic regimes (i.e., $\sigma_t \equiv 0$). To achieve this variance-aware regret guarantee, we develop a general framework that converts any variance-aware linear bandit algorithm to a variance-aware algorithm for sparse linear bandits in a "black-box" manner. Specifically, we take two recent algorithms as black boxes to illustrate that the claimed bounds indeed hold, where the first algorithm can handle unknown-variance cases and the second one is more efficient.
    Compact Graph Representation of crystal structures using Point-wise Distance Distributions. (arXiv:2212.11246v2 [physics.comp-ph] UPDATED)
    Use of graphs to represent crystal structures has become popular in recent years as they provide a natural translation from atoms and bonds to nodes and edges. Graphs capture structure, while remaining invariant to the symmetries that crystals display. Several works in property prediction, including those with state-of-the-art results, make use of the Crystal Graph. The present work offers a graph based on Point-wise Distance Distributions which retains symmetrical invariance, decreases computational load, and yields similar or better prediction accuracy on both experimental and simulated crystals.
    Learning Continuous Rotation Canonicalization with Radial Beam Sampling. (arXiv:2206.10690v2 [cs.CV] UPDATED)
    Nearly all state of the art vision models are sensitive to image rotations. Existing methods often compensate for missing inductive biases by using augmented training data to learn pseudo-invariances. Alongside the resource demanding data inflation process, predictions often poorly generalize. The inductive biases inherent to convolutional neural networks allow for translation equivariance through kernels acting parallely to the horizontal and vertical axes of the pixel grid. This inductive bias, however, does not allow for rotation equivariance. We propose a radial beam sampling strategy along with radial kernels operating on these beams to inherently incorporate center-rotation covariance. Together with an angle distance loss, we present a radial beam-based image canonicalization model, short BIC. Our model allows for maximal continuous angle regression and canonicalizes arbitrary center-rotated input images. As a pre-processing model, this enables rotation-invariant vision pipelines with model-agnostic rotation-sensitive downstream predictions. We show that our end-to-end trained angle regressor is able to predict continuous rotation angles on several vision datasets, i.e. FashionMNIST, CIFAR10, COIL100, and LFW.
    Towards Out-of-Distribution Adversarial Robustness. (arXiv:2210.03150v3 [cs.LG] UPDATED)
    Adversarial robustness continues to be a major challenge for deep learning. A core issue is that robustness to one type of attack often fails to transfer to other attacks. While prior work establishes a theoretical trade-off in robustness against different $L_p$ norms, we show that there is potential for improvement against many commonly used attacks by adopting a domain generalisation approach. Concretely, we treat each type of attack as a domain, and apply the Risk Extrapolation method (REx), which promotes similar levels of robustness against all training attacks. Compared to existing methods, we obtain similar or superior worst-case adversarial robustness on attacks seen during training. Moreover, we achieve superior performance on families or tunings of attacks only encountered at test time. On ensembles of attacks, our approach improves the accuracy from 3.4% the best existing baseline to 25.9% on MNIST, and from 16.9% to 23.5% on CIFAR10.
    Take One Gram of Neural Features, Get Enhanced Group Robustness. (arXiv:2208.12625v2 [cs.LG] UPDATED)
    Predictive performance of machine learning models trained with empirical risk minimization (ERM) can degrade considerably under distribution shifts. The presence of spurious correlations in training datasets leads ERM-trained models to display high loss when evaluated on minority groups not presenting such correlations. Extensive attempts have been made to develop methods improving worst-group robustness. However, they require group information for each training input or at least, a validation set with group labels to tune their hyperparameters, which may be expensive to get or unknown a priori. In this paper, we address the challenge of improving group robustness without group annotation during training or validation. To this end, we propose to partition the training dataset into groups based on Gram matrices of features extracted by an ``identification'' model and to apply robust optimization based on these pseudo-groups. In the realistic context where no group labels are available, our experiments show that our approach not only improves group robustness over ERM but also outperforms all recent baselines
    Scaleformer: Iterative Multi-scale Refining Transformers for Time Series Forecasting. (arXiv:2206.04038v4 [cs.LG] UPDATED)
    The performance of time series forecasting has recently been greatly improved by the introduction of transformers. In this paper, we propose a general multi-scale framework that can be applied to the state-of-the-art transformer-based time series forecasting models (FEDformer, Autoformer, etc.). By iteratively refining a forecasted time series at multiple scales with shared weights, introducing architecture adaptations, and a specially-designed normalization scheme, we are able to achieve significant performance improvements, from 5.5% to 38.5% across datasets and transformer architectures, with minimal additional computational overhead. Via detailed ablation studies, we demonstrate the effectiveness of each of our contributions across the architecture and methodology. Furthermore, our experiments on various public datasets demonstrate that the proposed improvements outperform their corresponding baseline counterparts. Our code is publicly available in https://github.com/BorealisAI/scaleformer.
    Less is More: Understanding Word-level Textual Adversarial Attack via n-gram Frequency Descend. (arXiv:2302.02568v2 [cs.CL] UPDATED)
    Word-level textual adversarial attacks have achieved striking performance in fooling natural language processing models. However, the fundamental questions of why these attacks are effective, and the intrinsic properties of the adversarial examples (AEs), are still not well understood. This work attempts to interpret textual attacks through the lens of $n$-gram frequency. Specifically, it is revealed that existing word-level attacks exhibit a strong tendency toward generation of examples with $n$-gram frequency descend ($n$-FD). Intuitively, this finding suggests a natural way to improve model robustness by training the model on the $n$-FD examples. To verify this idea, we devise a model-agnostic and gradient-free AE generation approach that relies solely on the $n$-gram frequency information, and further integrate it into the recently proposed convex hull framework for adversarial training. Surprisingly, the resultant method performs quite similarly to the original gradient-based method in terms of model robustness. These findings provide a human-understandable perspective for interpreting word-level textual adversarial attacks, and a new direction to improve model robustness.
    A Data Driven Method for Multi-step Prediction of Ship Roll Motion in High Sea States. (arXiv:2207.12673v3 [cs.LG] UPDATED)
    Ship roll motion in high sea states has large amplitudes and nonlinear dynamics, and its prediction is significant for operability, safety, and survivability. This paper presents a novel data-driven methodology to provide a multi-step prediction of ship roll motions in high sea states. A hybrid neural network is proposed that combines long short-term memory (LSTM) and convolutional neural network (CNN) in parallel. The motivation is to extract the nonlinear dynamic characteristics and the hydrodynamic memory information through the advantage of CNN and LSTM, respectively. For the feature selection, the time histories of motion states and wave heights are selected to involve sufficient information. Taken a scaled KCS as the study object, the ship motions in sea state 7 irregular long-crested waves are simulated and used for the validation. The results show that at least one period of roll motion can be accurately predicted. Compared with the single LSTM and CNN methods, the proposed method has better performance in predicting the amplitude of roll angles. Besides, the comparison results also demonstrate that selecting motion states and wave heights as feature space improves the prediction accuracy, verifying the effectiveness of the proposed method.
    Symmetric Pruning in Quantum Neural Networks. (arXiv:2208.14057v2 [quant-ph] UPDATED)
    Many fundamental properties of a quantum system are captured by its Hamiltonian and ground state. Despite the significance of ground states preparation (GSP), this task is classically intractable for large-scale Hamiltonians. Quantum neural networks (QNNs), which exert the power of modern quantum machines, have emerged as a leading protocol to conquer this issue. As such, how to enhance the performance of QNNs becomes a crucial topic in GSP. Empirical evidence showed that QNNs with handcraft symmetric ansatzes generally experience better trainability than those with asymmetric ansatzes, while theoretical explanations have not been explored. To fill this knowledge gap, here we propose the effective quantum neural tangent kernel (EQNTK) and connect this concept with over-parameterization theory to quantify the convergence of QNNs towards the global optima. We uncover that the advance of symmetric ansatzes attributes to their large EQNTK value with low effective dimension, which requests few parameters and quantum circuit depth to reach the over-parameterization regime permitting a benign loss landscape and fast convergence. Guided by EQNTK, we further devise a symmetric pruning (SP) scheme to automatically tailor a symmetric ansatz from an over-parameterized and asymmetric one to greatly improve the performance of QNNs when the explicit symmetry information of Hamiltonian is unavailable. Extensive numerical simulations are conducted to validate the analytical results of EQNTK and the effectiveness of SP.
    N-Gram Nearest Neighbor Machine Translation. (arXiv:2301.12866v2 [cs.CL] UPDATED)
    Nearest neighbor machine translation augments the Autoregressive Translation~(AT) with $k$-nearest-neighbor retrieval, by comparing the similarity between the token-level context representations of the target tokens in the query and the datastore. However, the token-level representation may introduce noise when translating ambiguous words, or fail to provide accurate retrieval results when the representation generated by the model contains indistinguishable context information, e.g., Non-Autoregressive Translation~(NAT) models. In this paper, we propose a novel $n$-gram nearest neighbor retrieval method that is model agnostic and applicable to both AT and NAT models. Specifically, we concatenate the adjacent $n$-gram hidden representations as the key, while the tuple of corresponding target tokens is the value. In inference, we propose tailored decoding algorithms for AT and NAT models respectively. We demonstrate that the proposed method consistently outperforms the token-level method on both AT and NAT models as well on general as on domain adaptation translation tasks. On domain adaptation, the proposed method brings $1.03$ and $2.76$ improvements regarding the average BLEU score on AT and NAT models respectively.
    Incremental Spectral Learning in Fourier Neural Operator. (arXiv:2211.15188v3 [cs.LG] UPDATED)
    Recently, neural networks have proven their impressive ability to solve partial differential equations (PDEs). Among them, Fourier neural operator (FNO) has shown success in learning solution operators for highly non-linear problems such as turbulence flow. FNO learns weights over different frequencies and as a regularization procedure, it only retains frequencies below a fixed threshold. However, manually selecting such an appropriate threshold for frequencies can be challenging, as an incorrect threshold can lead to underfitting or overfitting. To this end, we propose Incremental Fourier Neural Operator (IFNO) that incrementally adds frequency modes by increasing the truncation threshold adaptively during training. We show that IFNO reduces the testing loss by more than 10% while using 20% fewer frequency modes, compared to the standard FNO training on the Kolmogorov Flow (with Reynolds number up to 5000) under the few-data regime.
    Supervised Metric Learning to Rank for Retrieval via Contextual Similarity Optimization. (arXiv:2210.01908v2 [cs.LG] UPDATED)
    There is extensive interest in metric learning methods for image retrieval. Many metric learning loss functions focus on learning a correct ranking of training samples, but strongly overfit semantically inconsistent labels and require a large amount of data. To address these shortcomings, we propose a new metric learning method, called contextual loss, which optimizes contextual similarity in addition to cosine similarity. Our contextual loss implicitly enforces semantic consistency among neighbors while converging to the correct ranking. We empirically show that the proposed loss is more robust to label noise, and is less prone to overfitting even when a large portion of train data is withheld. Extensive experiments demonstrate that our method achieves a new state-of-the-art across four image retrieval benchmarks and multiple different evaluation settings. Code is available at: https://github.com/Chris210634/metric-learning-using-contextual-similarity
    Integral Continual Learning Along the Tangent Vector Field of Tasks. (arXiv:2211.13108v2 [cs.LG] UPDATED)
    We propose a lightweight continual learning method which incorporates information from specialized datasets incrementally, by integrating it along the vector field of "generalist" models. The tangent plane to the specialist model acts as a generalist guide and avoids the kind of over-fitting that leads to catastrophic forgetting, while exploiting the convexity of the optimization landscape in the tangent plane. It maintains a small fixed-size memory buffer, as low as 0.4% of the source datasets, which is updated by simple resampling. Our method achieves state-of-the-art across various buffer sizes for different datasets. Specifically, in the class-incremental setting we outperform the existing methods that do not require distillation by an average of 18.77% and 28.48%, for Seq-CIFAR-10 and Seq-TinyImageNet respectively. Our method can easily be combined with existing replay-based continual learning methods. When memory buffer constraints are relaxed to allow storage of metadata such as logits, we attain state-of-the-art accuracy with an error reduction of 17.84% towards the paragon performance on Seq-CIFAR-10.
    Neural-network solutions to stochastic reaction networks. (arXiv:2210.01169v2 [q-bio.MN] UPDATED)
    The stochastic reaction network in which chemical species evolve through a set of reactions is widely used to model stochastic processes in physics, chemistry and biology. To characterize the evolving joint probability distribution in the state space of species counts requires solving a system of ordinary differential equations, the chemical master equation, where the size of the counting state space increases exponentially with the type of species, making it challenging to investigate the stochastic reaction network. Here, we propose a machine-learning approach using the variational autoregressive network to solve the chemical master equation. Training the autoregressive network employs the policy gradient algorithm in the reinforcement learning framework, which does not require any data simulated in prior by another method. Different from simulating single trajectories, the approach tracks the time evolution of the joint probability distribution, and supports direct sampling of configurations and computing their normalized joint probabilities. We apply the approach to representative examples in physics and biology, and demonstrate that it accurately generates the probability distribution over time. The variational autoregressive network exhibits a plasticity in representing the multimodal distribution, cooperates with the conservation law, enables time-dependent reaction rates, and is efficient for high-dimensional reaction networks with allowing a flexible upper count limit. The results suggest a general approach to investigate stochastic reaction networks based on modern machine learning.
    Graph Neural Networks for Molecules. (arXiv:2209.05582v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs), which are capable of learning representations from graphical data, are naturally suitable for modeling molecular systems. This review introduces GNNs and their various applications for small organic molecules. GNNs rely on message-passing operations, a generic yet powerful framework, to update node features iteratively. Many researches design GNN architectures to effectively learn topological information of 2D molecule graphs as well as geometric information of 3D molecular systems. GNNs have been implemented in a wide variety of molecular applications, including molecular property prediction, molecular scoring and docking, molecular optimization and de novo generation, molecular dynamics simulation, etc. Besides, the review also summarizes the recent development of self-supervised learning for molecules with GNNs.
    RegMixup: Mixup as a Regularizer Can Surprisingly Improve Accuracy and Out Distribution Robustness. (arXiv:2206.14502v2 [cs.LG] UPDATED)
    We show that the effectiveness of the well celebrated Mixup [Zhang et al., 2018] can be further improved if instead of using it as the sole learning objective, it is utilized as an additional regularizer to the standard cross-entropy loss. This simple change not only provides much improved accuracy but also significantly improves the quality of the predictive uncertainty estimation of Mixup in most cases under various forms of covariate shifts and out-of-distribution detection experiments. In fact, we observe that Mixup yields much degraded performance on detecting out-of-distribution samples possibly, as we show empirically, because of its tendency to learn models that exhibit high-entropy throughout; making it difficult to differentiate in-distribution samples from out-distribution ones. To show the efficacy of our approach (RegMixup), we provide thorough analyses and experiments on vision datasets (ImageNet & CIFAR-10/100) and compare it with a suite of recent approaches for reliable uncertainty estimation.
    Deep Reinforcement Learning for Traffic Light Control in Intelligent Transportation Systems. (arXiv:2302.03669v1 [cs.LG])
    Smart traffic lights in intelligent transportation systems (ITSs) are envisioned to greatly increase traffic efficiency and reduce congestion. Deep reinforcement learning (DRL) is a promising approach to adaptively control traffic lights based on the real-time traffic situation in a road network. However, conventional methods may suffer from poor scalability. In this paper, we investigate deep reinforcement learning to control traffic lights, and both theoretical analysis and numerical experiments show that the intelligent behavior ``greenwave" (i.e., a vehicle will see a progressive cascade of green lights, and not have to brake at any intersection) emerges naturally a grid road network, which is proved to be the optimal policy in an avenue with multiple cross streets. As a first step, we use two DRL algorithms for the traffic light control problems in two scenarios. In a single road intersection, we verify that the deep Q-network (DQN) algorithm delivers a thresholding policy; and in a grid road network, we adopt the deep deterministic policy gradient (DDPG) algorithm. Secondly, numerical experiments show that the DQN algorithm delivers the optimal control, and the DDPG algorithm with passive observations has the capability to produce on its own a high-level intelligent behavior in a grid road network, namely, the ``greenwave" policy emerges. We also verify the ``greenwave" patterns in a $5 \times 10$ grid road network. Thirdly, the ``greenwave" patterns demonstrate that DRL algorithms produce favorable solutions since the ``greenwave" policy shown in experiment results is proved to be optimal in a specified traffic model (an avenue with multiple cross streets). The delivered policies both in a single road intersection and a grid road network demonstrate the scalability of DRL algorithms.
    Breaking the Curse of Multiagents in a Large State Space: RL in Markov Games with Independent Linear Function Approximation. (arXiv:2302.03673v1 [cs.LG])
    We propose a new model, independent linear Markov game, for multi-agent reinforcement learning with a large state space and a large number of agents. This is a class of Markov games with independent linear function approximation, where each agent has its own function approximation for the state-action value functions that are marginalized by other players' policies. We design new algorithms for learning the Markov coarse correlated equilibria (CCE) and Markov correlated equilibria (CE) with sample complexity bounds that only scale polynomially with each agent's own function class complexity, thus breaking the curse of multiagents. In contrast, existing works for Markov games with function approximation have sample complexity bounds scale with the size of the \emph{joint action space} when specialized to the canonical tabular Markov game setting, which is exponentially large in the number of agents. Our algorithms rely on two key technical innovations: (1) utilizing policy replay to tackle non-stationarity incurred by multiple agents and the use of function approximation; (2) separating learning Markov equilibria and exploration in the Markov games, which allows us to use the full-information no-regret learning oracle instead of the stronger bandit-feedback no-regret learning oracle used in the tabular setting. Furthermore, we propose an iterative-best-response type algorithm that can learn pure Markov Nash equilibria in independent linear Markov potential games. In the tabular case, by adapting the policy replay mechanism for independent linear Markov games, we propose an algorithm with $\widetilde{O}(\epsilon^{-2})$ sample complexity to learn Markov CCE, which improves the state-of-the-art result $\widetilde{O}(\epsilon^{-3})$ in Daskalakis et al. 2022, where $\epsilon$ is the desired accuracy, and also significantly improves other problem parameters.
    Data-driven anisotropic finite viscoelasticity using neural ordinary differential equations. (arXiv:2302.03598v1 [cond-mat.soft])
    We develop a fully data-driven model of anisotropic finite viscoelasticity using neural ordinary differential equations as building blocks. We replace the Helmholtz free energy function and the dissipation potential with data-driven functions that a priori satisfy physics-based constraints such as objectivity and the second law of thermodynamics. Our approach enables modeling viscoelastic behavior of materials under arbitrary loads in three-dimensions even with large deformations and large deviations from the thermodynamic equilibrium. The data-driven nature of the governing potentials endows the model with much needed flexibility in modeling the viscoelastic behavior of a wide class of materials. We train the model using stress-strain data from biological and synthetic materials including humain brain tissue, blood clots, natural rubber and human myocardium and show that the data-driven method outperforms traditional, closed-form models of viscoelasticity.
    MA2QL: A Minimalist Approach to Fully Decentralized Multi-Agent Reinforcement Learning. (arXiv:2209.08244v2 [cs.LG] UPDATED)
    Decentralized learning has shown great promise for cooperative multi-agent reinforcement learning (MARL). However, non-stationarity remains a significant challenge in fully decentralized learning. In the paper, we tackle the non-stationarity problem in the simplest and fundamental way and propose multi-agent alternate Q-learning (MA2QL), where agents take turns updating their Q-functions by Q-learning. MA2QL is a minimalist approach to fully decentralized cooperative MARL but is theoretically grounded. We prove that when each agent guarantees $\varepsilon$-convergence at each turn, their joint policy converges to a Nash equilibrium. In practice, MA2QL only requires minimal changes to independent Q-learning (IQL). We empirically evaluate MA2QL on a variety of cooperative multi-agent tasks. Results show MA2QL consistently outperforms IQL, which verifies the effectiveness of MA2QL, despite such minimal changes.
    A Cloud-Based Energy Management Strategy for Hybrid Electric City Bus Considering Real-Time Passenger Load Prediction. (arXiv:2010.15239v2 [eess.SY] UPDATED)
    Electric city bus gains popularity in recent years for its low greenhouse gas emission, low noise level, etc. Different from a passenger car, the weight of a city bus varies significantly with different amounts of onboard passengers. After analyzing the importance of battery aging and passenger load effects on an optimal energy management strategy, this study introduces the passenger load prediction into the hybrid-electric city buses energy management problem, which is not well studied in the existing literature. The average model, Decision Tree, Gradient Boost Decision Tree, and Neural Networks models are compared in the passenger load prediction. The Gradient Boost Decision Tree model is selected due to its best accuracy and high stability. Given the predicted passenger load, a dynamic programming algorithm determines the optimal power demand for supercapacitor and battery by optimizing the battery aging and energy usage leveraging cloud techniques. Then, rule extraction is conducted on dynamic programming results, and the rule is real-time loaded to the vehicle onboard controller to handle prediction errors and uncertainties. The proposed cloud-based Dynamic Programming and rule extraction framework with the passenger load prediction show 4% and 11% lower bus operating costs in off-peak and peak hours, respectively. The operating cost by the proposed framework is less than 1% of the dynamic programming with the true passenger load information.
    A Fuzzy-set-based Joint Distribution Adaptation Method for Regression and its Application to Online Damage Quantification for Structural Digital Twin. (arXiv:2211.02656v2 [cs.CE] UPDATED)
    Online damage quantification suffers from insufficient labeled data that weakens its accuracy. In this context, adopting the domain adaptation on historical labeled data from similar structures/damages or simulated digital twin data to assist the current diagnosis task would be beneficial. However, most domain adaptation methods are designed for classification and cannot efficiently address damage quantification, a regression problem with continuous real-valued labels. This study first proposes a novel domain adaptation method, the Online Fuzzy-set-based Joint Distribution Adaptation for Regression, to address this challenge. By converting the continuous real-valued labels to fuzzy class labels via fuzzy sets, the marginal and conditional distribution discrepancy are simultaneously measured to achieve the domain adaptation for the damage quantification task. Thanks to the superiority of the proposed method, a state-of-the-art online damage quantification framework based on domain adaptation is presented. Finally, the framework has been comprehensively demonstrated with a damaged helicopter panel, in which three types of damage domain adaptations (across different damage locations, across different damage types, and from simulation to experiment) are all conducted, proving the accuracy of damage quantification can be significantly improved in a realistic environment. It is expected that the proposed approach to be applied to the fleet-level digital twin considering the individual differences.
    Long Horizon Temperature Scaling. (arXiv:2302.03686v1 [cs.LG])
    Temperature scaling is a popular technique for tuning the sharpness of a model distribution. It is used extensively for sampling likely generations and calibrating model uncertainty, and even features as a controllable parameter to many large language models in deployment. However, autoregressive models rely on myopic temperature scaling that greedily optimizes the next token. To address this, we propose Long Horizon Temperature Scaling (LHTS), a novel approach for sampling from temperature-scaled joint distributions. LHTS is compatible with all likelihood-based models, and optimizes for the long-horizon likelihood of samples. We derive a temperature-dependent LHTS objective, and show that fine-tuning a model on a range of temperatures produces a single model capable of generation with a controllable long-horizon temperature parameter. We experiment with LHTS on image diffusion models and character/language autoregressive models, demonstrating advantages over myopic temperature scaling in likelihood and sample quality, and showing improvements in accuracy on a multiple choice analogy task by $10\%$.
    Analyzing Tree Architectures in Ensembles via Neural Tangent Kernel. (arXiv:2205.12904v2 [cs.LG] UPDATED)
    A soft tree is an actively studied variant of a decision tree that updates splitting rules using the gradient method. Although soft trees can take various architectures, their impact is not theoretically well known. In this paper, we formulate and analyze the Neural Tangent Kernel (NTK) induced by soft tree ensembles for arbitrary tree architectures. This kernel leads to the remarkable finding that only the number of leaves at each depth is relevant for the tree architecture in ensemble learning with an infinite number of trees. In other words, if the number of leaves at each depth is fixed, the training behavior in function space and the generalization performance are exactly the same across different tree architectures, even if they are not isomorphic. We also show that the NTK of asymmetric trees like decision lists does not degenerate when they get infinitely deep. This is in contrast to the perfect binary trees, whose NTK is known to degenerate and leads to worse generalization performance for deeper trees.
    Federated Learning with Regularized Client Participation. (arXiv:2302.03662v1 [cs.LG])
    Federated Learning (FL) is a distributed machine learning approach where multiple clients work together to solve a machine learning task. One of the key challenges in FL is the issue of partial participation, which occurs when a large number of clients are involved in the training process. The traditional method to address this problem is randomly selecting a subset of clients at each communication round. In our research, we propose a new technique and design a novel regularized client participation scheme. Under this scheme, each client joins the learning process every $R$ communication rounds, which we refer to as a meta epoch. We have found that this participation scheme leads to a reduction in the variance caused by client sampling. Combined with the popular FedAvg algorithm (McMahan et al., 2017), it results in superior rates under standard assumptions. For instance, the optimization term in our main convergence bound decreases linearly with the product of the number of communication rounds and the size of the local dataset of each client, and the statistical term scales with step size quadratically instead of linearly (the case for client sampling with replacement), leading to better convergence rate $\mathcal{O}\left(\frac{1}{T^2}\right)$ compared to $\mathcal{O}\left(\frac{1}{T}\right)$, where $T$ is the total number of communication rounds. Furthermore, our results permit arbitrary client availability as long as each client is available for training once per each meta epoch.
    Generalization Bounds of Nonconvex-(Strongly)-Concave Stochastic Minimax Optimization. (arXiv:2205.14278v2 [math.OC] UPDATED)
    This paper takes an initial step to systematically investigate the generalization bounds of algorithms for solving nonconvex-(strongly)-concave (NC-SC/NC-C) stochastic minimax optimization measured by the stationarity of primal functions. We first establish algorithm-agnostic generalization bounds via uniform convergence between the empirical minimax problem and the population minimax problem. The sample complexities for achieving $\epsilon$-generalization are $\tilde{\mathcal{O}}(d\kappa^2\epsilon^{-2})$ and $\tilde{\mathcal{O}}(d\epsilon^{-4})$ for NC-SC and NC-C settings, respectively, where $d$ is the dimension and $\kappa$ is the condition number. We further study the algorithm-dependent generalization bounds via stability arguments of algorithms. In particular, we introduce a novel stability notion for minimax problems and build a connection between generalization bounds and the stability notion. As a result, we establish algorithm-dependent generalization bounds for stochastic gradient descent ascent (SGDA) algorithm and the more general sampling-determined algorithms.
    Graph Kernels Based on Multi-scale Graph Embeddings. (arXiv:2206.00979v2 [cs.LG] UPDATED)
    Graph kernels are conventional methods for computing graph similarities. However, most of the R-convolution graph kernels face two challenges: 1) They cannot compare graphs at multiple different scales, and 2) they do not consider the distributions of substructures when computing the kernel matrix. These two challenges limit their performances. To mitigate the two challenges, we propose a novel graph kernel called the Multi-scale Path-pattern Graph kernel (MPG), at the heart of which is the multi-scale path-pattern node feature map. Each element of the path-pattern node feature map is the number of occurrences of a path-pattern around a node. A path-pattern is constructed by the concatenation of all the node labels in a path of a truncated BFS tree rooted at each node. Since the path-pattern node feature map can only compare graphs at local scales, we incorporate into it the multiple different scales of the graph structure, which are captured by the truncated BFS trees of different depth. We use the Wasserstein distance to compute the similarity between the multi-scale path-pattern node feature maps of two graphs, considering the distributions of path-patterns. We empirically validate MPG on various benchmark graph datasets and demonstrate that it achieves state-of-the-art performance.
    Deep Ensembles for Graphs with Higher-order Dependencies. (arXiv:2205.13988v3 [cs.LG] UPDATED)
    Graph neural networks (GNNs) continue to achieve state-of-the-art performance on many graph learning tasks, but rely on the assumption that a given graph is a sufficient approximation of the true neighborhood structure. When a system contains higher-order sequential dependencies, we show that the tendency of traditional graph representations to underfit each node's neighborhood causes existing GNNs to generalize poorly. To address this, we propose a novel Deep Graph Ensemble (DGE), which captures neighborhood variance by training an ensemble of GNNs on different neighborhood subspaces of the same node within a higher-order network structure. We show that DGE consistently outperforms existing GNNs on semisupervised and supervised tasks on six real-world data sets with known higher-order dependencies, even under a similar parameter budget. We demonstrate that learning diverse and accurate base classifiers is central to DGE's success, and discuss the implications of these findings for future work on ensembles of GNNs.
    Population-size-Aware Policy Optimization for Mean-Field Games. (arXiv:2302.03364v1 [cs.LG])
    In this work, we attempt to bridge the two fields of finite-agent and infinite-agent games, by studying how the optimal policies of agents evolve with the number of agents (population size) in mean-field games, an agent-centric perspective in contrast to the existing works focusing typically on the convergence of the empirical distribution of the population. To this end, the premise is to obtain the optimal policies of a set of finite-agent games with different population sizes. However, either deriving the closed-form solution for each game is theoretically intractable, training a distinct policy for each game is computationally intensive, or directly applying the policy trained in a game to other games is sub-optimal. We address these challenges through the Population-size-Aware Policy Optimization (PAPO). Our contributions are three-fold. First, to efficiently generate efficient policies for games with different population sizes, we propose PAPO, which unifies two natural options (augmentation and hypernetwork) and achieves significantly better performance. PAPO consists of three components: i) the population-size encoding which transforms the original value of population size to an equivalent encoding to avoid training collapse, ii) a hypernetwork to generate a distinct policy for each game conditioned on the population size, and iii) the population size as an additional input to the generated policy. Next, we construct a multi-task-based training procedure to efficiently train the neural networks of PAPO by sampling data from multiple games with different population sizes. Finally, extensive experiments on multiple environments show the significant superiority of PAPO over baselines, and the analysis of the evolution of the generated policies further deepens our understanding of the two fields of finite-agent and infinite-agent games.
    Co-Imitation: Learning Design and Behaviour by Imitation. (arXiv:2209.01207v2 [cs.LG] UPDATED)
    The co-adaptation of robots has been a long-standing research endeavour with the goal of adapting both body and behaviour of a system for a given task, inspired by the natural evolution of animals. Co-adaptation has the potential to eliminate costly manual hardware engineering as well as improve the performance of systems. The standard approach to co-adaptation is to use a reward function for optimizing behaviour and morphology. However, defining and constructing such reward functions is notoriously difficult and often a significant engineering effort. This paper introduces a new viewpoint on the co-adaptation problem, which we call co-imitation: finding a morphology and a policy that allow an imitator to closely match the behaviour of a demonstrator. To this end we propose a co-imitation methodology for adapting behaviour and morphology by matching state distributions of the demonstrator. Specifically, we focus on the challenging scenario with mismatched state- and action-spaces between both agents. We find that co-imitation increases behaviour similarity across a variety of tasks and settings, and demonstrate co-imitation by transferring human walking, jogging and kicking skills onto a simulated humanoid.
    Merging satellite and gauge-measured precipitation using LightGBM with an emphasis on extreme quantiles. (arXiv:2302.03606v1 [eess.SP])
    Knowing the actual precipitation in space and time is critical in hydrological modelling applications, yet the spatial coverage with rain gauge stations is limited due to economic constraints. Gridded satellite precipitation datasets offer an alternative option for estimating the actual precipitation by covering uniformly large areas, albeit related estimates are not accurate. To improve precipitation estimates, machine learning is applied to merge rain gauge-based measurements and gridded satellite precipitation products. In this context, observed precipitation plays the role of the dependent variable, while satellite data play the role of predictor variables. Random forests is the dominant machine learning algorithm in relevant applications. In those spatial predictions settings, point predictions (mostly the mean or the median of the conditional distribution) of the dependent variable are issued. Here we propose, issuing probabilistic spatial predictions of precipitation using Light Gradient Boosting Machine (LightGBM). LightGBM is a boosting algorithm, highlighted by prize-winning entries in prediction and forecasting competitions. To assess LightGBM, we contribute a large-scale application that includes merging daily precipitation measurements in contiguous US with PERSIANN and GPM-IMERG satellite precipitation data. We focus on extreme quantiles of the probability distribution of the dependent variable, where LightGBM outperforms quantile regression forests (QRF, a variant of random forests) in terms of quantile score. LightGBM and QRF show similar performance when predicting functionals at the centre of the conditional probability distribution, including the conditional median. Our study offers understanding of probabilistic predictions in spatial settings using machine learning.
    Deep Class-Incremental Learning: A Survey. (arXiv:2302.03648v1 [cs.CV])
    Deep models, e.g., CNNs and Vision Transformers, have achieved impressive achievements in many vision tasks in the closed world. However, novel classes emerge from time to time in our ever-changing world, requiring a learning system to acquire new knowledge continually. For example, a robot needs to understand new instructions, and an opinion monitoring system should analyze emerging topics every day. Class-Incremental Learning (CIL) enables the learner to incorporate the knowledge of new classes incrementally and build a universal classifier among all seen classes. Correspondingly, when directly training the model with new class instances, a fatal problem occurs -- the model tends to catastrophically forget the characteristics of former ones, and its performance drastically degrades. There have been numerous efforts to tackle catastrophic forgetting in the machine learning community. In this paper, we survey comprehensively recent advances in deep class-incremental learning and summarize these methods from three aspects, i.e., data-centric, model-centric, and algorithm-centric. We also provide a rigorous and unified evaluation of 16 methods in benchmark image classification tasks to find out the characteristics of different algorithms empirically. Furthermore, we notice that the current comparison protocol ignores the influence of memory budget in model storage, which may result in unfair comparison and biased results. Hence, we advocate fair comparison by aligning the memory budget in evaluation, as well as several memory-agnostic performance measures. The source code to reproduce these evaluations is available at https://github.com/zhoudw-zdw/CIL_Survey/
    Black Box Optimization Using QUBO and the Cross Entropy Method. (arXiv:2206.12510v2 [cs.LG] UPDATED)
    Black-box optimization (BBO) can be used to optimize functions whose analytic form is unknown. A common approach to realising BBO is to learn a surrogate model which approximates the target black-box function which can then be solved via white-box optimization methods. In this paper, we present our approach BOX-QUBO, where the surrogate model is a QUBO matrix. However, unlike in previous state-of-the-art approaches, this matrix is not trained entirely by regression, but mostly by classification between 'good' and 'bad' solutions. This better accounts for the low capacity of the QUBO matrix, resulting in significantly better solutions overall. We tested our approach against the state-of-the-art on four domains and in all of them BOX-QUBO showed better results. A second contribution of this paper is the idea to also solve white-box problems, i.e. problems which could be directly formulated as QUBO, by means of black-box optimization in order to reduce the size of the QUBOs to the information-theoretic minimum. Experiments show that this significantly improves the results for MAX-k-SAT.
    Backpropagation on Dynamical Networks. (arXiv:2207.03093v2 [math.DS] UPDATED)
    Dynamical networks are versatile models that can describe a variety of behaviours such as synchronisation and feedback. However, applying these models in real world contexts is difficult as prior information pertaining to the connectivity structure or local dynamics is often unknown and must be inferred from time series observations of network states. Additionally, the influence of coupling interactions between nodes further complicates the isolation of local node dynamics. Given the architectural similarities between dynamical networks and recurrent neural networks (RNN), we propose a network inference method based on the backpropagation through time (BPTT) algorithm commonly used to train recurrent neural networks. This method aims to simultaneously infer both the connectivity structure and local node dynamics purely from observation of node states. An approximation of local node dynamics is first constructed using a neural network. This is alternated with an adapted BPTT algorithm to regress corresponding network weights by minimising prediction errors of the dynamical network based on the previously constructed local models until convergence is achieved. This method was found to be succesful in identifying the connectivity structure for coupled networks of Lorenz, Chua and FitzHugh-Nagumo oscillators. Freerun prediction performance with the resulting local models and weights was found to be comparable to the true system with noisy initial conditions. The method is also extended to non-conventional network couplings such as asymmetric negative coupling.
    Understanding Why Generalized Reweighting Does Not Improve Over ERM. (arXiv:2201.12293v4 [cs.LG] UPDATED)
    Empirical risk minimization (ERM) is known in practice to be non-robust to distributional shift where the training and the test distributions are different. A suite of approaches, such as importance weighting, and variants of distributionally robust optimization (DRO), have been proposed to solve this problem. But a line of recent work has empirically shown that these approaches do not significantly improve over ERM in real applications with distribution shift. The goal of this work is to obtain a comprehensive theoretical understanding of this intriguing phenomenon. We first posit the class of Generalized Reweighting (GRW) algorithms, as a broad category of approaches that iteratively update model parameters based on iterative reweighting of the training samples. We show that when overparameterized models are trained under GRW, the resulting models are close to that obtained by ERM. We also show that adding small regularization which does not greatly affect the empirical training accuracy does not help. Together, our results show that a broad category of what we term GRW approaches are not able to achieve distributionally robust generalization. Our work thus has the following sobering takeaway: to make progress towards distributionally robust generalization, we either have to develop non-GRW approaches, or perhaps devise novel classification/regression loss functions that are adapted to the class of GRW approaches.
    Adaptive Aggregation for Safety-Critical Control. (arXiv:2302.03586v1 [cs.LG])
    Safety has been recognized as the central obstacle to preventing the use of reinforcement learning (RL) for real-world applications. Different methods have been developed to deal with safety concerns in RL. However, learning reliable RL-based solutions usually require a large number of interactions with the environment. Likewise, how to improve the learning efficiency, specifically, how to utilize transfer learning for safe reinforcement learning, has not been well studied. In this work, we propose an adaptive aggregation framework for safety-critical control. Our method comprises two key techniques: 1) we learn to transfer the safety knowledge by aggregating the multiple source tasks and a target task through the attention network; 2) we separate the goal of improving task performance and reducing constraint violations by utilizing a safeguard. Experiment results demonstrate that our algorithm can achieve fewer safety violations while showing better data efficiency compared with several baselines.
    Online Bayesian Meta-Learning for Cognitive Tracking Radar. (arXiv:2207.06917v2 [cs.IT] UPDATED)
    A key component of cognitive radar is the ability to generalize, or achieve consistent performance across a range of sensing environments, since aspects of the physical scene may vary over time. This presents a challenge for learning-based waveform selection approaches, since transmission policies which are effective in one scene may be highly suboptimal in another. We address this problem by strategically biasing a learning algorithm by exploiting high-level structure across tracking instances, referred to as meta-learning. In this work, we develop an online meta-learning approach for waveform-agile tracking. This approach uses information gained from previous target tracks to speed up and enhance learning in new tracking instances. This results in sample-efficient learning across a class of finite state target channels by exploiting inherent similarity across tracking scenes, attributed to common physical elements such as target type or clutter statistics. We formulate the online waveform selection problem within the framework of Bayesian learning, and provide prior-dependent performance bounds for the meta-learning problem using Probability Approximately Correct (PAC)-Bayes theory. We present a computationally feasible meta-posterior sampling algorithm and study the performance in a simulation study consisting of diverse scenes. Finally, we examine the potential performance benefits and practical challenges associated with online meta-learning for waveform-agile tracking.
    Novel Fundus Image Preprocessing for Retcam Images to Improve Deep Learning Classification of Retinopathy of Prematurity. (arXiv:2302.02524v2 [eess.IV] UPDATED)
    Retinopathy of Prematurity (ROP) is a potentially blinding eye disorder because of damage to the eye's retina which can affect babies born prematurely. Screening of ROP is essential for early detection and treatment. This is a laborious and manual process which requires trained physician performing dilated ophthalmological examination which can be subjective resulting in lower diagnosis success for clinically significant disease. Automated diagnostic methods can assist ophthalmologists increase diagnosis accuracy using deep learning. Several research groups have highlighted various approaches. This paper proposes the use of new novel fundus preprocessing methods using pretrained transfer learning frameworks to create hybrid models to give higher diagnosis accuracy. The evaluations show that these novel methods in comparison to traditional imaging processing contribute to higher accuracy in classifying Plus disease, Stages of ROP and Zones. We achieve accuracy of 97.65% for Plus disease, 89.44% for Stage, 90.24% for Zones with limited training dataset.
    Subtyping patients with chronic disease using longitudinal BMI patterns. (arXiv:2111.05385v2 [cs.LG] UPDATED)
    Obesity is a major health problem, increasing the risk of various major chronic diseases, such as diabetes, cancer, and stroke. While the role of obesity identified by cross-sectional BMI recordings has been heavily studied, the role of BMI trajectories is much less explored. In this study, we use a machine-learning approach to subtype individuals' risk of developing 18 major chronic diseases by using their BMI trajectories extracted from a large and geographically diverse EHR dataset capturing the health status of around two million individuals for a period of six years. We define nine new interpretable and evidence-based variables based on the BMI trajectories to cluster the patients into subgroups using the k-means clustering method. We thoroughly review each cluster's characteristics in terms of demographic, socioeconomic, and physiological measurement variables to specify the distinct properties of the patients in the clusters. In our experiments, the direct relationship of obesity with diabetes, hypertension, Alzheimer's, and dementia has been re-established and distinct clusters with specific characteristics for several of the chronic diseases have been found to be conforming or complementary to the existing body of knowledge.
    Iterated Block Particle Filter for High-dimensional Parameter Learning: Beating the Curse of Dimensionality. (arXiv:2110.10745v3 [stat.ML] UPDATED)
    Parameter learning for high-dimensional, partially observed, and nonlinear stochastic processes is a methodological challenge. Spatiotemporal disease transmission systems provide examples of such processes giving rise to open inference problems. We propose the iterated block particle filter (IBPF) algorithm for learning high-dimensional parameters over graphical state space models with general state spaces, measures, transition densities and graph structure. Theoretical performance guarantees are obtained on beating the curse of dimensionality (COD), algorithm convergence, and likelihood maximization. Experiments on a highly nonlinear and non-Gaussian spatiotemporal model for measles transmission reveal that the iterated ensemble Kalman filter algorithm (Li et al. (2020)) is ineffective and the iterated filtering algorithm (Ionides et al. (2015)) suffers from the COD, while our IBPF algorithm beats COD consistently across various experiments with different metrics.
    Mitigating Algorithmic Bias with Limited Annotations. (arXiv:2207.10018v2 [cs.LG] UPDATED)
    Existing work on fairness modeling commonly assumes that sensitive attributes for all instances are fully available, which may not be true in many real-world applications due to the high cost of acquiring sensitive information. When sensitive attributes are not disclosed or available, it is needed to manually annotate a small part of the training data to mitigate bias. However, the skewed distribution across different sensitive groups preserves the skewness of the original dataset in the annotated subset, which leads to non-optimal bias mitigation. To tackle this challenge, we propose Active Penalization Of Discrimination (APOD), an interactive framework to guide the limited annotations towards maximally eliminating the effect of algorithmic bias. The proposed APOD integrates discrimination penalization with active instance selection to efficiently utilize the limited annotation budget, and it is theoretically proved to be capable of bounding the algorithmic bias. According to the evaluation on five benchmark datasets, APOD outperforms the state-of-the-arts baseline methods under the limited annotation budget, and shows comparable performance to fully annotated bias mitigation, which demonstrates that APOD could benefit real-world applications when sensitive information is limited.
    A Lightweight, Efficient and Explainable-by-Design Convolutional Neural Network for Internet Traffic Classification. (arXiv:2202.05535v3 [cs.LG] UPDATED)
    Traffic classification, i.e. the identification of the type of applications flowing in a network, is a strategic task for numerous activities (e.g., intrusion detection, routing). This task faces some critical challenges that current deep learning approaches do not address. The design of current approaches do not take into consideration the fact that networking hardware (e.g., routers) often runs with limited computational resources. Further, they do not meet the need for faithful explainability highlighted by regulatory bodies. Finally, these traffic classifiers are evaluated on small datasets which fail to reflect the diversity of applications in real-world settings. Therefore, this paper introduces a new Lightweight, Efficient and eXplainable-by-design convolutional neural network (LEXNet) for Internet traffic classification, which relies on a new residual block (for lightweight and efficiency purposes) and prototype layer (for explainability). Based on a commercial-grade dataset, our evaluation shows that LEXNet succeeds to maintain the same accuracy as the best performing state-of-the-art neural network, while providing the additional features previously mentioned. Moreover, we illustrate the explainability feature of our approach, which stems from the communication of detected application prototypes to the end-user, and we highlight the faithfulness of LEXNet explanations through a comparison with post hoc methods.
    CHiLS: Zero-Shot Image Classification with Hierarchical Label Sets. (arXiv:2302.02551v2 [cs.CV] UPDATED)
    Open vocabulary models (e.g. CLIP) have shown strong performance on zero-shot classification through their ability generate embeddings for each class based on their (natural language) names. Prior work has focused on improving the accuracy of these models through prompt engineering or by incorporating a small amount of labeled downstream data (via finetuning). However, there has been little focus on improving the richness of the class names themselves, which can pose issues when class labels are coarsely-defined and uninformative. We propose Classification with Hierarchical Label Sets (or CHiLS), an alternative strategy for zero-shot classification specifically designed for datasets with implicit semantic hierarchies. CHiLS proceeds in three steps: (i) for each class, produce a set of subclasses, using either existing label hierarchies or by querying GPT-3; (ii) perform the standard zero-shot CLIP procedure as though these subclasses were the labels of interest; (iii) map the predicted subclass back to its parent to produce the final prediction. Across numerous datasets with underlying hierarchical structure, CHiLS leads to improved accuracy in situations both with and without ground-truth hierarchical information. CHiLS is simple to implement within existing CLIP pipelines and requires no additional training cost. Code is available at: https://github.com/acmi-lab/CHILS.
    Toward Face Biometric De-identification using Adversarial Examples. (arXiv:2302.03657v1 [cs.CV])
    The remarkable success of face recognition (FR) has endangered the privacy of internet users particularly in social media. Recently, researchers turned to use adversarial examples as a countermeasure. In this paper, we assess the effectiveness of using two widely known adversarial methods (BIM and ILLC) for de-identifying personal images. We discovered, unlike previous claims in the literature, that it is not easy to get a high protection success rate (suppressing identification rate) with imperceptible adversarial perturbation to the human visual system. Finally, we found out that the transferability of adversarial examples is highly affected by the training parameters of the network with which they are generated.
    DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing. (arXiv:2212.03597v2 [cs.LG] UPDATED)
    Recent advances on deep learning models come at the price of formidable training cost. The increasing model size is one of the root causes, but another less-emphasized fact is that data scale is actually increasing at a similar speed as model scale, and the training cost is proportional to both of them. Compared to the rapidly evolving model architecture, how to efficiently use the training data (especially for the expensive foundation model pretraining) is both less explored and difficult to realize due to the lack of a convenient framework that focus on data efficiency capabilities. To this end, we present DeepSpeed Data Efficiency, a framework that makes better use of data, increases training efficiency, and improves model quality. Specifically, we propose and combine two novel data efficiency techniques: efficient data sampling via a general curriculum learning library, and efficient data routing via a novel random layerwise token dropping technique. DeepSpeed Data Efficiency also takes extensibility, flexibility and composability into consideration, so that users can easily utilize the framework to compose multiple techniques and apply customized strategies. By applying our solution to GPT-3 1.3B and BERT-large language model pretraining, we can achieve similar model quality with up to 2x less data and 2x less time, or achieve better model quality under similar amount of data and time.
    Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery. (arXiv:2302.03668v1 [cs.LG])
    The strength of modern generative models lies in their ability to be controlled through text-based prompts. Typical "hard" prompts are made from interpretable words and tokens, and must be hand-crafted by humans. There are also "soft" prompts, which consist of continuous feature vectors. These can be discovered using powerful optimization methods, but they cannot be easily interpreted, re-used across models, or plugged into a text-based interface. We describe an approach to robustly optimize hard text prompts through efficient gradient-based optimization. Our approach automatically generates hard text-based prompts for both text-to-image and text-to-text applications. In the text-to-image setting, the method creates hard prompts for diffusion models, allowing API users to easily generate, discover, and mix and match image concepts without prior knowledge on how to prompt the model. In the text-to-text setting, we show that hard prompts can be automatically discovered that are effective in tuning LMs for classification.
    Incentive-aware Contextual Pricing with Non-parametric Market Noise. (arXiv:1911.03508v3 [cs.LG] UPDATED)
    We consider a dynamic pricing problem for repeated contextual second-price auctions with multiple strategic buyers who aim to maximize their long-term time discounted utility. The seller has limited information on buyers' overall demand curves which depends on a non-parametric market-noise distribution, and buyers may potentially submit corrupted bids (relative to true valuations) to manipulate the seller's pricing policy for more favorable reserve prices in the future. We focus on designing the seller's learning policy to set contextual reserve prices where the seller's goal is to minimize regret compared to the revenue of a benchmark clairvoyant policy that has full information of buyers' demand. We propose a policy with a phased-structure that incorporates randomized "isolation" periods, during which a buyer is randomly chosen to solely participate in the auction. We show that this design allows the seller to control the number of periods in which buyers significantly corrupt their bids. We then prove that our policy enjoys a $T$-period regret of $\widetilde{\mathcal{O}}(\sqrt{T})$ facing strategic buyers. Finally, we conduct numerical simulations to compare our proposed algorithm to standard pricing policies. Our numerical results show that our algorithm outperforms these policies under various buyer bidding behavior.
    A Scalable and Efficient Iterative Method for Copying Machine Learning Classifiers. (arXiv:2302.02667v2 [cs.LG] UPDATED)
    Differential replication through copying refers to the process of replicating the decision behavior of a machine learning model using another model that possesses enhanced features and attributes. This process is relevant when external constraints limit the performance of an industrial predictive system. Under such circumstances, copying enables the retention of original prediction capabilities while adapting to new demands. Previous research has focused on the single-pass implementation for copying. This paper introduces a novel sequential approach that significantly reduces the amount of computational resources needed to train or maintain a copy, leading to reduced maintenance costs for companies using machine learning models in production. The effectiveness of the sequential approach is demonstrated through experiments with synthetic and real-world datasets, showing significant reductions in time and resources, while maintaining or improving accuracy.
    TransPolymer: a Transformer-based Language Model for Polymer Property Predictions. (arXiv:2209.01307v3 [cs.LG] UPDATED)
    Accurate and efficient prediction of polymer properties is of great significance in polymer development and design. Conventionally, expensive and time-consuming experiments or simulations are required to evaluate the function of polymers. Recently, Transformer models, equipped with attention mechanisms, have exhibited superior performance in various natural language processing tasks. However, such methods have not been investigated in polymer sciences. Herein, we report TransPolymer, a Transformer-based language model for polymer property prediction. Owing to our proposed polymer tokenizer with chemical awareness, TransPolymer can learn representations directly from polymer sequences. The model learns expressive representations by pretraining on a large unlabeled dataset via masked language modeling, followed by finetuning the model on downstream datasets concerning various polymer properties. TransPolymer achieves superior performance in all ten downstream tasks and surpasses other baselines significantly on most downstream tasks. Moreover, the improvement by the pretrained TransPolymer over supervised TransPolymer and other language models strengthens the significant benefits of pretraining on large unlabeled data in representation learning. Experiment results further demonstrate the important role of the attention mechanism in understanding polymer sequences. We highlight this model as a promising computational tool for promoting rational polymer design and understanding structure-property relationships in a data science view.
    PartitionVAE -- a human-interpretable VAE. (arXiv:2302.03689v1 [cs.LG])
    VAEs, or variational autoencoders, are autoencoders that explicitly learn the distribution of the input image space rather than assuming no prior information about the distribution. This allows it to classify similar samples close to each other in the latent space's distribution. VAEs classically assume the latent space is normally distributed, though many distribution priors work, and they encode this assumption through a K-L divergence term in the loss function. While VAEs learn the distribution of the latent space and naturally make each dimension in the latent space as disjoint from the others as possible, they do not group together similar features -- the image space feature represented by one unit of the representation layer does not necessarily have high correlation with the feature…
    CRU: A Novel Neural Architecture for Improving the Predictive Performance of Time-Series Data. (arXiv:2211.16653v2 [cs.LG] UPDATED)
    The time-series forecasting (TSF) problem is a traditional problem in the field of artificial intelligence. Models such as Recurrent Neural Network (RNN), Long Short Term Memory (LSTM), and GRU (Gate Recurrent Units) have contributed to improving the predictive accuracy of TSF. Furthermore, model structures have been proposed to combine time-series decomposition methods, such as seasonal-trend decomposition using Loess (STL) to ensure improved predictive accuracy. However, because this approach is learned in an independent model for each component, it cannot learn the relationships between time-series components. In this study, we propose a new neural architecture called a correlation recurrent unit (CRU) that can perform time series decomposition within a neural cell and learn correlations (autocorrelation and correlation) between each decomposition component. The proposed neural architecture was evaluated through comparative experiments with previous studies using five univariate time-series datasets and four multivariate time-series data. The results showed that long- and short-term predictive performance was improved by more than 10%. The experimental results show that the proposed CRU is an excellent method for TSF problems compared to other neural architectures.
    Optimizing Audio Recommendations for the Long-Term: A Reinforcement Learning Perspective. (arXiv:2302.03561v1 [cs.LG])
    We study the problem of optimizing a recommender system for outcomes that occur over several weeks or months. We begin by drawing on reinforcement learning to formulate a comprehensive model of users' recurring relationships with a recommender system. Measurement, attribution, and coordination challenges complicate algorithm design. We describe careful modeling -- including a new representation of user state and key conditional independence assumptions -- which overcomes these challenges and leads to simple, testable recommender system prototypes. We apply our approach to a podcast recommender system that makes personalized recommendations to hundreds of millions of listeners. A/B tests demonstrate that purposefully optimizing for long-term outcomes leads to large performance gains over conventional approaches that optimize for short-term proxies.
    Riemannian Flow Matching on General Geometries. (arXiv:2302.03660v1 [cs.LG])
    We propose Riemannian Flow Matching (RFM), a simple yet powerful framework for training continuous normalizing flows on manifolds. Existing methods for generative modeling on manifolds either require expensive simulation, inherently cannot scale to high dimensions, or use approximations to limiting quantities that result in biased objectives. Riemannian Flow Matching bypasses these inconveniences and exhibits multiple benefits over prior approaches: It is completely simulation-free on simple geometries, it does not require divergence computation, and its target vector field is computed in closed form even on general geometries. The key ingredient behind RFM is the construction of a simple kernel function for defining per-sample vector fields, which subsumes existing Euclidean cases. Extending to general geometries, we rely on the use of spectral decompositions to efficiently compute kernel functions. Our method achieves state-of-the-art performance on real-world non-Euclidean datasets, and we showcase, for the first time, tractable training on general geometries, including on triangular meshes and maze-like manifolds with boundaries.
    Local Neural Descriptor Fields: Locally Conditioned Object Representations for Manipulation. (arXiv:2302.03573v1 [cs.RO])
    A robot operating in a household environment will see a wide range of unique and unfamiliar objects. While a system could train on many of these, it is infeasible to predict all the objects a robot will see. In this paper, we present a method to generalize object manipulation skills acquired from a limited number of demonstrations, to novel objects from unseen shape categories. Our approach, Local Neural Descriptor Fields (L-NDF), utilizes neural descriptors defined on the local geometry of the object to effectively transfer manipulation demonstrations to novel objects at test time. In doing so, we leverage the local geometry shared between objects to produce a more general manipulation framework. We illustrate the efficacy of our approach in manipulating novel objects in novel poses -- both in simulation and in the real world.
    Ethical Considerations for Collecting Human-Centric Image Datasets. (arXiv:2302.03629v1 [cs.CV])
    Human-centric image datasets are critical to the development of computer vision technologies. However, recent investigations have foregrounded significant ethical issues related to privacy and bias, which have resulted in the complete retraction, or modification, of several prominent datasets. Recent works have tried to reverse this trend, for example, by proposing analytical frameworks for ethically evaluating datasets, the standardization of dataset documentation and curation practices, privacy preservation methodologies, as well as tools for surfacing and mitigating representational biases. Little attention, however, has been paid to the realities of operationalizing ethical data collection. To fill this gap, we present a set of key ethical considerations and practical recommendations for collecting more ethically-minded human-centric image data. Our research directly addresses issues of privacy and bias by contributing to the research community best practices for ethical data collection, covering purpose, privacy and consent, as well as diversity. We motivate each consideration by drawing on lessons from current practices, dataset withdrawals and audits, and analytical ethical frameworks. Our research is intended to augment recent scholarship, representing an important step toward more responsible data curation practices.
    Feature Necessity & Relevancy in ML Classifier Explanations. (arXiv:2210.15675v2 [cs.LG] UPDATED)
    Given a machine learning (ML) model and a prediction, explanations can be defined as sets of features which are sufficient for the prediction. In some applications, and besides asking for an explanation, it is also critical to understand whether sensitive features can occur in some explanation, or whether a non-interesting feature must occur in all explanations. This paper starts by relating such queries respectively with the problems of relevancy and necessity in logic-based abduction. The paper then proves membership and hardness results for several families of ML classifiers. Afterwards the paper proposes concrete algorithms for two classes of classifiers. The experimental results confirm the scalability of the proposed algorithms.
    Planted Bipartite Graph Detection. (arXiv:2302.03658v1 [cs.DS])
    We consider the task of detecting a hidden bipartite subgraph in a given random graph. Specifically, under the null hypothesis, the graph is a realization of an Erd\H{o}s-R\'{e}nyi random graph over $n$ vertices with edge density $q$. Under the alternative, there exists a planted $k_{\mathsf{R}} \times k_{\mathsf{L}}$ bipartite subgraph with edge density $p>q$. We derive asymptotically tight upper and lower bounds for this detection problem in both the dense regime, where $q,p = \Theta\left(1\right)$, and the sparse regime where $q,p = \Theta\left(n^{-\alpha}\right), \alpha \in \left(0,2\right]$. Moreover, we consider a variant of the above problem, where one can only observe a relatively small part of the graph, by using at most $\mathsf{Q}$ edge queries. For this problem, we derive upper and lower bounds in both the dense and sparse regimes.
    Meta-Learning Biologically Plausible Plasticity Rules with Random Feedback Pathways. (arXiv:2210.16414v5 [q-bio.NC] UPDATED)
    Backpropagation is widely used to train artificial neural networks, but its relationship to synaptic plasticity in the brain is unknown. Some biological models of backpropagation rely on feedback projections that are symmetric with feedforward connections, but experiments do not corroborate the existence of such symmetric backward connectivity. Random feedback alignment offers an alternative model in which errors are propagated backward through fixed, random backward connections. This approach successfully trains shallow models, but learns slowly and does not perform well with deeper models or online learning. In this study, we develop a meta-learning approach to discover interpretable, biologically plausible plasticity rules that improve online learning performance with fixed random feedback connections. The resulting plasticity rules show improved online training of deep models in the low data regime. Our results highlight the potential of meta-learning to discover effective, interpretable learning rules satisfying biological constraints.
    Two Losses Are Better Than One: Faster Optimization Using a Cheaper Proxy. (arXiv:2302.03542v1 [cs.LG])
    We present an algorithm for minimizing an objective with hard-to-compute gradients by using a related, easier-to-access function as a proxy. Our algorithm is based on approximate proximal point iterations on the proxy combined with relatively few stochastic gradients from the objective. When the difference between the objective and the proxy is $\delta$-smooth, our algorithm guarantees convergence at a rate matching stochastic gradient descent on a $\delta$-smooth objective, which can lead to substantially better sample efficiency. Our algorithm has many potential applications in machine learning, and provides a principled means of leveraging synthetic data, physics simulators, mixed public and private data, and more.
    Learning to cooperatively estimate road surface friction. (arXiv:2302.03560v1 [math.OC])
    We present a system for estimating the friction of the pavement surface at any curved road section, by arriving at a consensus estimate, based on data from vehicles that have recently passed through that section. This estimate can help following vehicles. To keep costs down, we depend only on standard automotive sensors, such as the IMU, and sensors for the steering angle and wheel speeds. Our system's workflow consists of: (i) processing of measurements from existing vehicular sensors, to implement a virtual sensor that captures the effect of low friction on the vehicle, (ii) transmitting short kinematic summaries from vehicles to a road side unit (RSU), using V2X communication, and (iii) estimating the friction coefficients, by running a machine learning regressor at the RSU, on summaries from individual vehicles, and then combining several such estimates. In designing and implementing our system over a road network, we face two key questions: (i) should each individual road section have a local friction coefficient regressor, or can we use a global regressor that covers all the possible road sections? and (ii) how accurate are the resulting regressor estimates? We test the performance of design variations of our solution, using simulations on the commercial package Dyna4. We consider a single vehicle type with varying levels of tyre wear, and a range of road friction coefficients. We find that: (a) only a marginal loss of accuracy is incurred in using a global regressor as compared to local regressors, (b) the consensus estimate at the RSU has a worst case error of about ten percent, if the combination is based on at least fifty recently passed vehicles, and (c) our regressors have root mean square (RMS) errors that are less than five percent. The RMS error rate of our system is half as that of a commercial friction estimation service.
    Improving Transfer Learning with a Dual Image and Video Transformer for Multi-label Movie Trailer Genre Classification. (arXiv:2210.07983v3 [cs.CV] UPDATED)
    In this paper, we study the transferability of ImageNet spatial and Kinetics spatio-temporal representations to multi-label Movie Trailer Genre Classification (MTGC). In particular, we present an extensive evaluation of the transferability of ConvNet and Transformer models pretrained on ImageNet and Kinetics to Trailers12k, a new manually-curated movie trailer dataset composed of 12,000 videos labeled with 10 different genres and associated metadata. We analyze different aspects that can influence transferability, such as frame rate, input video extension, and spatio-temporal modeling. In order to reduce the spatio-temporal structure gap between ImageNet/Kinetics and Trailers12k, we propose Dual Image and Video Transformer Architecture (DIViTA), which performs shot detection so as to segment the trailer into highly correlated clips, providing a more cohesive input for pretrained backbones and improving transferability (a 1.83% increase for ImageNet and 3.75% for Kinetics). Our results demonstrate that representations learned on either ImageNet or Kinetics are comparatively transferable to Trailers12k. Moreover, both datasets provide complementary information that can be combined to improve classification performance (a 2.91% gain compared to the top single pretraining). Interestingly, using lightweight ConvNets as pretrained backbones resulted in only a 3.46% drop in classification performance compared with the top Transformer while requiring only 11.82% of its parameters and 0.81% of its FLOPS.
    From Utilitarian to Rawlsian Designs for Algorithmic Fairness. (arXiv:2302.03567v1 [cs.CY])
    There is a lack of consensus within the literature as to how `fairness' of algorithmic systems can be measured, and different metrics can often be at odds. In this paper, we approach this task by drawing on the ethical frameworks of utilitarianism and John Rawls. Informally, these two theories of distributive justice measure the `good' as either a population's sum of utility, or worst-off outcomes, respectively. We present a parameterized class of objective functions that interpolates between these two (possibly) conflicting notions of the `good'. This class is shown to represent a relaxation of the Rawlsian `veil of ignorance', and its sequence of optimal solutions converges to both a utilitarian and Rawlsian optimum. Several other properties of this class are studied, including: 1) a relationship to regularized optimization, 2) feasibility of consistent estimation, and 3) algorithmic cost. In several real-world datasets, we compute optimal solutions and construct the tradeoff between utilitarian and Rawlsian notions of the `good'. Empirically, we demonstrate that increasing model complexity can manifest strict improvements to both measures of the `good'. This work suggests that the proper degree of `fairness' can be informed by a designer's preferences over the space of induced utilitarian and Rawlsian `good'.
    Learning Translation Quality Evaluation on Low Resource Languages from Large Language Models. (arXiv:2302.03491v1 [cs.CL])
    Learned metrics such as BLEURT have in recent years become widely employed to evaluate the quality of machine translation systems. Training such metrics requires data which can be expensive and difficult to acquire, particularly for lower-resource languages. We show how knowledge can be distilled from Large Language Models (LLMs) to improve upon such learned metrics without requiring human annotators, by creating synthetic datasets which can be mixed into existing datasets, requiring only a corpus of text in the target language. We show that the performance of a BLEURT-like model on lower resource languages can be improved in this way.
    Graph Generation with Destination-Driven Diffusion Mixture. (arXiv:2302.03596v1 [cs.LG])
    Generation of graphs is a major challenge for real-world tasks that require understanding the complex nature of their non-Euclidean structures. Although diffusion models have achieved notable success in graph generation recently, they are ill-suited for modeling the structural information of graphs since learning to denoise the noisy samples does not explicitly capture the graph topology. To tackle this limitation, we propose a novel generative process that models the topology of graphs by predicting the destination of the process. Specifically, we design the generative process as a mixture of diffusion processes conditioned on the endpoint in the data distribution, which drives the process toward the probable destination. Further, we introduce new training objectives for learning to predict the destination, and discuss the advantages of our generative framework that can explicitly model the graph topology and exploit the inductive bias of the data. Through extensive experimental validation on general graph and 2D/3D molecular graph generation tasks, we show that our method outperforms previous generative models, generating graphs with correct topology with both continuous and discrete features.
    Robustness Implies Fairness in Casual Algorithmic Recourse. (arXiv:2302.03465v1 [cs.LG])
    Algorithmic recourse aims to disclose the inner workings of the black-box decision process in situations where decisions have significant consequences, by providing recommendations to empower beneficiaries to achieve a more favorable outcome. To ensure an effective remedy, suggested interventions must not only be low-cost but also robust and fair. This goal is accomplished by providing similar explanations to individuals who are alike. This study explores the concept of individual fairness and adversarial robustness in causal algorithmic recourse and addresses the challenge of achieving both. To resolve the challenges, we propose a new framework for defining adversarially robust recourse. The new setting views the protected feature as a pseudometric and demonstrates that individual fairness is a special case of adversarial robustness. Finally, we introduce the fair robust recourse problem to achieve both desirable properties and show how it can be satisfied both theoretically and empirically.
    Multi-Scale Message Passing Neural PDE Solvers. (arXiv:2302.03580v1 [cs.LG])
    We propose a novel multi-scale message passing neural network algorithm for learning the solutions of time-dependent PDEs. Our algorithm possesses both temporal and spatial multi-scale resolution features by incorporating multi-scale sequence models and graph gating modules in the encoder and processor, respectively. Benchmark numerical experiments are presented to demonstrate that the proposed algorithm outperforms baselines, particularly on a PDE with a range of spatial and temporal scales.
    A Bayesian Optimization approach for calibrating large-scale activity-based transport models. (arXiv:2302.03480v1 [cs.LG])
    The use of Agent-Based and Activity-Based modeling in transportation is rising due to the capability of addressing complex applications such as disruptive trends (e.g., remote working and automation) or the design and assessment of disaggregated management strategies. Still, the broad adoption of large-scale disaggregate models is not materializing due to the inherently high complexity and computational needs. Activity-based models focused on behavioral theory, for example, may involve hundreds of parameters that need to be calibrated to match the detailed socio-economical characteristics of the population for any case study. This paper tackles this issue by proposing a novel Bayesian Optimization approach incorporating a surrogate model in the form of an improved Random Forest, designed to automate the calibration process of the behavioral parameters. The proposed method is tested on a case study for the city of Tallinn, Estonia, where the model to be calibrated consists of 477 behavioral parameters, using the SimMobility MT software. Satisfactory performance is achieved in the major indicators defined for the calibration process: the error for the overall number of trips is equal to 4% and the average error in the OD matrix is 15.92 vehicles per day.
    Efficient Parametric Approximations of Neural Network Function Space Distance. (arXiv:2302.03519v1 [cs.LG])
    It is often useful to compactly summarize important properties of model parameters and training data so that they can be used later without storing and/or iterating over the entire dataset. As a specific case, we consider estimating the Function Space Distance (FSD) over a training set, i.e. the average discrepancy between the outputs of two neural networks. We propose a Linearized Activation Function TRick (LAFTR) and derive an efficient approximation to FSD for ReLU neural networks. The key idea is to approximate the architecture as a linear network with stochastic gating. Despite requiring only one parameter per unit of the network, our approach outcompetes other parametric approximations with larger memory requirements. Applied to continual learning, our parametric approximation is competitive with state-of-the-art nonparametric approximations, which require storing many training examples. Furthermore, we show its efficacy in estimating influence functions accurately and detecting mislabeled examples without expensive iterations over the entire dataset.
    Uncoupled Learning of Differential Stackelberg Equilibria with Commitments. (arXiv:2302.03438v1 [cs.LG])
    A natural solution concept for many multiagent settings is the Stackelberg equilibrium, under which a ``leader'' agent selects a strategy that maximizes its own payoff assuming the ``follower'' chooses their best response to this strategy. Recent work has presented asymmetric learning updates that can be shown to converge to the \textit{differential} Stackelberg equilibria of two-player differentiable games. These updates are ``coupled'' in the sense that the leader requires some information about the follower's payoff function. Such coupled learning rules cannot be applied to \textit{ad hoc} interactive learning settings, and can be computationally impractical even in centralized training settings where the follower's payoffs are known. In this work, we present an ``uncoupled'' learning process under which each player's learning update only depends on their observations of the other's behavior. We prove that this process converges to a local Stackelberg equilibrium under similar conditions as previous coupled methods. We conclude with a discussion of the potential applications of our approach to human--AI cooperation and multi-agent reinforcement learning.
    PDPU: An Open-Source Posit Dot-Product Unit for Deep Learning Applications. (arXiv:2302.01876v1 [cs.AR] CROSS LISTED)
    Posit has been a promising alternative to the IEEE-754 floating point format for deep learning applications due to its better trade-off between dynamic range and accuracy. However, hardware implementation of posit arithmetic requires further exploration, especially for the dot-product operations dominated in deep neural networks (DNNs). It has been implemented by either the combination of multipliers and an adder tree or cascaded fused multiply-add units, leading to poor computational efficiency and excessive hardware overhead. To address this issue, we propose an open-source posit dot-product unit, namely PDPU, that facilitates resource-efficient and high-throughput dot-product hardware implementation. PDPU not only features the fused and mixed-precision architecture that eliminates redundant latency and hardware resources, but also has a fine-grained 6-stage pipeline, improving computational efficiency. A configurable PDPU generator is further developed to meet the diverse needs of various DNNs for computational accuracy. Experimental results evaluated under the 28nm CMOS process show that PDPU reduces area, latency, and power by up to 43%, 64%, and 70%, respectively, compared to the existing implementations. Hence, PDPU has great potential as the computing core of posit-based accelerators for deep learning applications.
    OPORP: One Permutation + One Random Projection. (arXiv:2302.03505v1 [stat.ML])
    Consider two $D$-dimensional data vectors (e.g., embeddings): $u, v$. In many embedding-based retrieval (EBR) applications where the vectors are generated from trained models, $D=256\sim 1024$ are common. In this paper, OPORP (one permutation + one random projection) uses a variant of the ``count-sketch'' type of data structures for achieving data reduction/compression. With OPORP, we first apply a permutation on the data vectors. A random vector $r$ is generated i.i.d. with moments: $E(r_i) = 0, E(r_i^2)=1, E(r_i^3) =0, E(r_i^4)=s$. We multiply (as dot product) $r$ with all permuted data vectors. Then we break the $D$ columns into $k$ equal-length bins and aggregate (i.e., sum) the values in each bin to obtain $k$ samples from each data vector. One crucial step is to normalize the $k$ samples to the unit $l_2$ norm. We show that the estimation variance is essentially: $(s-1)A + \frac{D-k}{D-1}\frac{1}{k}\left[ (1-\rho^2)^2 -2A\right]$, where $A\geq 0$ is a function of the data ($u,v$). This formula reveals several key properties: (1) We need $s=1$. (2) The factor $\frac{D-k}{D-1}$ can be highly beneficial in reducing variances. (3) The term $\frac{1}{k}(1-\rho^2)^2$ is actually the asymptotic variance of the classical correlation estimator. We illustrate that by letting the $k$ in OPORP to be $k=1$ and repeat the procedure $m$ times, we exactly recover the work of ``very spars random projections'' (VSRP). This immediately leads to a normalized estimator for VSRP which substantially improves the original estimator of VSRP. In summary, with OPORP, the two key steps: (i) the normalization and (ii) the fixed-length binning scheme, have considerably improved the accuracy in estimating the cosine similarity, which is a routine (and crucial) task in modern embedding-based retrieval (EBR) applications.
    Towards Robust Inductive Graph Incremental Learning via Experience Replay. (arXiv:2302.03534v1 [cs.LG])
    Inductive node-wise graph incremental learning is a challenging task due to the dynamic nature of evolving graphs and the dependencies between nodes. In this paper, we propose a novel experience replay framework, called Structure-Evolution-Aware Experience Replay (SEA-ER), that addresses these challenges by leveraging the topological awareness of GNNs and importance reweighting technique. Our framework effectively addresses the data dependency of node prediction problems in evolving graphs, with a theoretical guarantee that supports its effectiveness. Through empirical evaluation, we demonstrate that our proposed framework outperforms the current state-of-the-art GNN experience replay methods on several benchmark datasets, as measured by metrics such as accuracy and forgetting.
    Towards Better Time Series Contrastive Learning: A Dynamic Bad Pair Mining Approach. (arXiv:2302.03357v1 [cs.LG])
    Not all positive pairs are beneficial to time series contrastive learning. In this paper, we study two types of bad positive pairs that impair the quality of time series representation learned through contrastive learning ($i.e.$, noisy positive pair and faulty positive pair). We show that, with the presence of noisy positive pairs, the model tends to simply learn the pattern of noise (Noisy Alignment). Meanwhile, when faulty positive pairs arise, the model spends considerable efforts aligning non-representative patterns (Faulty Alignment). To address this problem, we propose a Dynamic Bad Pair Mining (DBPM) algorithm, which reliably identifies and suppresses bad positive pairs in time series contrastive learning. DBPM utilizes a memory module to track the training behavior of each positive pair along training process. This allows us to identify potential bad positive pairs at each epoch based on their historical training behaviors. The identified bad pairs are then down-weighted using a transformation module. Our experimental results show that DBPM effectively mitigates the negative impacts of bad pairs, and can be easily used as a plug-in to boost performance of state-of-the-art methods. Codes will be made publicly available.
    Membership Inference Attacks against Diffusion Models. (arXiv:2302.03262v1 [cs.CR])
    Diffusion models have attracted attention in recent years as innovative generative models. In this paper, we investigate whether a diffusion model is resistant to a membership inference attack, which evaluates the privacy leakage of a machine learning model. We primarily discuss the diffusion model from the standpoints of comparison with a generative adversarial network (GAN) as conventional models and hyperparameters unique to the diffusion model, i.e., time steps, sampling steps, and sampling variances. We conduct extensive experiments with DDIM as a diffusion model and DCGAN as a GAN on the CelebA and CIFAR-10 datasets in both white-box and black-box settings and then confirm if the diffusion model is comparably resistant to a membership inference attack as GAN. Next, we demonstrate that the impact of time steps is significant and intermediate steps in a noise schedule are the most vulnerable to the attack. We also found two key insights through further analysis. First, we identify that DDIM is vulnerable to the attack for small sample sizes instead of achieving a lower FID. Second, sampling steps in hyperparameters are important for resistance to the attack, whereas the impact of sampling variances is quite limited.
    Federated Variational Inference Methods for Structured Latent Variable Models. (arXiv:2302.03314v1 [stat.ML])
    Federated learning methods, that is, methods that perform model training using data situated across different sources, whilst simultaneously not having the data leave their original source, are of increasing interest in a number of fields. However, despite this interest, the classes of models for which easily-applicable and sufficiently general approaches are available is limited, excluding many structured probabilistic models. We present a general yet elegant resolution to the aforementioned issue. The approach is based on adopting structured variational inference, an approach widely used in Bayesian machine learning, to the federated setting. Additionally, a communication-efficient variant analogous to the canonical FedAvg algorithm is explored. The effectiveness of the proposed algorithms are demonstrated, and their performance is compared on Bayesian multinomial regression, topic modelling, and mixed model examples.
    Realization of Causal Representation Learning to Adjust Confounding Bias in Latent Space. (arXiv:2211.08573v2 [cs.LG] UPDATED)
    Causal graphs are usually considered in a 2D plane, but it has rarely been noticed that within multiple relatively independent timelines, which is comparatively common in causality machine learning, the individual-level differences may lead to Causal Representation Bias (CRB). More importantly, such a blind spot has brought obstacles to interdisciplinary applications. Deep Learning (DL) methods overlooking CRBs confront the trouble of models' generalizability, while statistical analytics face difficulties in modeling individual-level features without a geometric global view. In this paper, we initially discuss the Geometric Meaning of causal graphs regarding multi-dimensional timelines; and, accordingly, analyze the scheme of CRB and explicitly define causal model generalization and individualization from a geometric perspective. We also spearhead a novel framework, Causal Representation Learning (CRL), to construct a valid learning plane (in latent space) for causal graphs, propose a particular autoencoder architecture to realize it, and experimentally prove the feasibility. Involved causal data includes Electronic Healthcare Records (EHR) to estimate medical effects and a hydrology dataset to forecast the environmentally influenced streamflow.
    A unified recipe for deriving (time-uniform) PAC-Bayes bounds. (arXiv:2302.03421v1 [stat.ML])
    We present a unified framework for deriving PAC-Bayesian generalization bounds. Unlike most previous literature on this topic, our bounds are anytime-valid (i.e., time-uniform), meaning that they hold at all stopping times, not only for a fixed sample size. Our approach combines four tools in the following order: (a) nonnegative supermartingales or reverse submartingales, (b) the method of mixtures, (c) the Donsker-Varadhan formula (or other convex duality principles), and (d) Ville's inequality. We derive time-uniform generalizations of well-known classical PAC-Bayes bounds, such as those of Seeger, McAllester, Maurer, and Catoni, in addition to many recent bounds. We also present several novel bounds and, more importantly, general techniques for constructing them. Despite being anytime-valid, our extensions remain as tight as their fixed-time counterparts. Moreover, they enable us to relax traditional assumptions; in particular, we consider nonstationary loss functions and non-i.i.d. data. In sum, we unify the derivation of past bounds and ease the search for future bounds: one may simply check if our supermartingale or submartingale conditions are met and, if so, be guaranteed a (time-uniform) PAC-Bayes bound.
    Convergence rates for momentum stochastic gradient descent with noise of machine learning type. (arXiv:2302.03550v1 [math.OC])
    We consider the momentum stochastic gradient descent scheme (MSGD) and its continuous-in-time counterpart in the context of non-convex optimization. We show almost sure exponential convergence of the objective function value for target functions that are Lipschitz continuous and satisfy the Polyak-Lojasiewicz inequality on the relevant domain, and under assumptions on the stochastic noise that are motivated by overparameterized supervised learning applications. Moreover, we optimize the convergence rate over the set of friction parameters and show that the MSGD process almost surely converges.
    Towards Skilled Population Curriculum for Multi-Agent Reinforcement Learning. (arXiv:2302.03429v1 [cs.AI])
    Recent advances in multi-agent reinforcement learning (MARL) allow agents to coordinate their behaviors in complex environments. However, common MARL algorithms still suffer from scalability and sparse reward issues. One promising approach to resolving them is automatic curriculum learning (ACL). ACL involves a student (curriculum learner) training on tasks of increasing difficulty controlled by a teacher (curriculum generator). Despite its success, ACL's applicability is limited by (1) the lack of a general student framework for dealing with the varying number of agents across tasks and the sparse reward problem, and (2) the non-stationarity of the teacher's task due to ever-changing student strategies. As a remedy for ACL, we introduce a novel automatic curriculum learning framework, Skilled Population Curriculum (SPC), which adapts curriculum learning to multi-agent coordination. Specifically, we endow the student with population-invariant communication and a hierarchical skill set, allowing it to learn cooperation and behavior skills from distinct tasks with varying numbers of agents. In addition, we model the teacher as a contextual bandit conditioned by student policies, enabling a team of agents to change its size while still retaining previously acquired skills. We also analyze the inherent non-stationarity of this multi-agent automatic curriculum teaching problem and provide a corresponding regret bound. Empirical results show that our method improves the performance, scalability and sample efficiency in several MARL environments.
    Revised Conditional t-SNE: Looking Beyond the Nearest Neighbors. (arXiv:2302.03493v1 [cs.LG])
    Conditional t-SNE (ct-SNE) is a recent extension to t-SNE that allows removal of known cluster information from the embedding, to obtain a visualization revealing structure beyond label information. This is useful, for example, when one wants to factor out unwanted differences between a set of classes. We show that ct-SNE fails in many realistic settings, namely if the data is well clustered over the labels in the original high-dimensional space. We introduce a revised method by conditioning the high-dimensional similarities instead of the low-dimensional similarities and storing within- and across-label nearest neighbors separately. This also enables the use of recently proposed speedups for t-SNE, improving the scalability. From experiments on synthetic data, we find that our proposed method resolves the considered problems and improves the embedding quality. On real data containing batch effects, the expected improvement is not always there. We argue revised ct-SNE is preferable overall, given its improved scalability. The results also highlight new open questions, such as how to handle distance variations between clusters.
    AMFPMC -- An improved method of detecting multiple types of drug-drug interactions using only known drug-drug interactions. (arXiv:2302.03355v1 [cs.LG])
    Adverse drug interactions are largely preventable causes of medical accidents, which frequently result in physician and emergency room encounters. The detection of drug interactions in a lab, prior to a drug's use in medical practice, is essential, however it is costly and time-consuming. Machine learning techniques can provide an efficient and accurate means of predicting possible drug-drug interactions and combat the growing problem of adverse drug interactions. Most existing models for predicting interactions rely on the chemical properties of drugs. While such models can be accurate, the required properties are not always available.
    A Categorical Archive of ChatGPT Failures. (arXiv:2302.03494v1 [cs.CL])
    Large language models have been demonstrated to be valuable in different fields. ChatGPT, developed by OpenAI, has been trained using massive amounts of data and simulates human conversation by comprehending context and generating appropriate responses. It has garnered significant attention due to its ability to effectively answer a broad range of human inquiries, with fluent and comprehensive answers surpassing prior public chatbots in both security and usefulness. However, a comprehensive analysis of ChatGPT's failures is lacking, which is the focus of this study. Ten categories of failures, including reasoning, factual errors, math, coding, and bias, are presented and discussed. The risks, limitations, and societal implications of ChatGPT are also highlighted. The goal of this study is to assist researchers and developers in enhancing future language models and chatbots.
    Sparse GEMINI for Joint Discriminative Clustering and Feature Selection. (arXiv:2302.03391v1 [stat.ML])
    Feature selection in clustering is a hard task which involves simultaneously the discovery of relevant clusters as well as relevant variables with respect to these clusters. While feature selection algorithms are often model-based through optimised model selection or strong assumptions on $p(\pmb{x})$, we introduce a discriminative clustering model trying to maximise a geometry-aware generalisation of the mutual information called GEMINI with a simple $\ell_1$ penalty: the Sparse GEMINI. This algorithm avoids the burden of combinatorial feature subset exploration and is easily scalable to high-dimensional data and large amounts of samples while only designing a clustering model $p_\theta(y|\pmb{x})$. We demonstrate the performances of Sparse GEMINI on synthetic datasets as well as large-scale datasets. Our results show that Sparse GEMINI is a competitive algorithm and has the ability to select relevant subsets of variables with respect to the clustering without using relevance criteria or prior hypotheses.
    Mismatched estimation of non-symmetric rank-one matrices corrupted by structured noise. (arXiv:2302.03306v1 [cs.IT])
    We study the performance of a Bayesian statistician who estimates a rank-one signal corrupted by non-symmetric rotationally invariant noise with a generic distribution of singular values. As the signal-to-noise ratio and the noise structure are unknown, a Gaussian setup is incorrectly assumed. We derive the exact analytic expression for the error of the mismatched Bayes estimator and also provide the analysis of an approximate message passing (AMP) algorithm. The first result exploits the asymptotic behavior of spherical integrals for rectangular matrices and of low-rank matrix perturbations; the second one relies on the design and analysis of an auxiliary AMP. The numerical experiments show that there is a performance gap between the AMP and Bayes estimators, which is due to the incorrect estimation of the signal norm.
    Diverse Probabilistic Trajectory Forecasting with Admissibility Constraints. (arXiv:2302.03462v1 [cs.LG])
    Predicting multiple trajectories for road users is important for automated driving systems: ego-vehicle motion planning indeed requires a clear view of the possible motions of the surrounding agents. However, the generative models used for multiple-trajectory forecasting suffer from a lack of diversity in their proposals. To avoid this form of collapse, we propose a novel method for structured prediction of diverse trajectories. To this end, we complement an underlying pretrained generative model with a diversity component, based on a determinantal point process (DPP). We balance and structure this diversity with the inclusion of knowledge-based quality constraints, independent from the underlying generative model. We combine these two novel components with a gating operation, ensuring that the predictions are both diverse and within the drivable area. We demonstrate on the nuScenes driving dataset the relevance of our compound approach, which yields significant improvements in the diversity and the quality of the generated trajectories.
    Ten Lessons We Have Learned in the New "Sparseland": A Short Handbook for Sparse Neural Network Researchers. (arXiv:2302.02596v2 [cs.LG] UPDATED)
    This article does not propose any novel algorithm or new hardware for sparsity. Instead, it aims to serve the "common good" for the increasingly prosperous Sparse Neural Network (SNN) research community. We attempt to summarize some most common confusions in SNNs, that one may come across in various scenarios such as paper review/rebuttal and talks - many drawn from the authors' own bittersweet experiences! We feel that doing so is meaningful and timely, since the focus of SNN research is notably shifting from traditional pruning to more diverse and profound forms of sparsity before, during, and after training. The intricate relationships between their scopes, assumptions, and approaches lead to misunderstandings, for non-experts or even experts in SNNs. In response, we summarize ten Q\&As of SNNs from many key aspects, including dense vs. sparse, unstructured sparse vs. structured sparse, pruning vs. sparse training, dense-to-sparse training vs. sparse-to-sparse training, static sparsity vs. dynamic sparsity, before-training/during-training vs. post-training sparsity, and many more. We strive to provide proper and generically applicable answers to clarify those confusions to the best extent possible. We hope our summary provides useful general knowledge for people who want to enter and engage with this exciting community; and also provides some "mind of ease" convenience for SNN researchers to explain their work in the right contexts. At the very least (and perhaps as this article's most insignificant target functionality), if you are writing/planning to write a paper or rebuttal in the field of SNNs, we hope some of our answers could help you!
    Deep-OSG: A deep learning approach for approximating a family of operators in semigroup to model unknown autonomous systems. (arXiv:2302.03358v1 [cs.LG])
    This paper proposes a novel deep learning approach for approximating evolution operators and modeling unknown autonomous dynamical systems using time series data collected at varied time lags. It is a sequel to the previous works [T. Qin, K. Wu, and D. Xiu, J. Comput. Phys., 395:620--635, 2019], [K. Wu and D. Xiu, J. Comput. Phys., 408:109307, 2020], and [Z. Chen, V. Churchill, K. Wu, and D. Xiu, J. Comput. Phys., 449:110782, 2022], which focused on learning single evolution operator with a fixed time step. This paper aims to learn a family of evolution operators with variable time steps, which constitute a semigroup for an autonomous system. The semigroup property is very crucial and links the system's evolutionary behaviors across varying time scales, but it was not considered in the previous works. We propose for the first time a framework of embedding the semigroup property into the data-driven learning process, through a novel neural network architecture and new loss functions. The framework is very feasible, can be combined with any suitable neural networks, and is applicable to learning general autonomous ODEs and PDEs. We present the rigorous error estimates and variance analysis to understand the prediction accuracy and robustness of our approach, showing the remarkable advantages of semigroup awareness in our model. Moreover, our approach allows one to arbitrarily choose the time steps for prediction and ensures that the predicted results are well self-matched and consistent. Extensive numerical experiments demonstrate that embedding the semigroup property notably reduces the data dependency of deep learning models and greatly improves the accuracy, robustness, and stability for long-time prediction.
    Towards Meaningful Anomaly Detection: The Effect of Counterfactual Explanations on the Investigation of Anomalies in Multivariate Time Series. (arXiv:2302.03302v1 [cs.LG])
    Detecting rare events is essential in various fields, e.g., in cyber security or maintenance. Often, human experts are supported by anomaly detection systems as continuously monitoring the data is an error-prone and tedious task. However, among the anomalies detected may be events that are rare, e.g., a planned shutdown of a machine, but are not the actual event of interest, e.g., breakdowns of a machine. Therefore, human experts are needed to validate whether the detected anomalies are relevant. We propose to support this anomaly investigation by providing explanations of anomaly detection. Related work only focuses on the technical implementation of explainable anomaly detection and neglects the subsequent human anomaly investigation. To address this research gap, we conduct a behavioral experiment using records of taxi rides in New York City as a testbed. Participants are asked to differentiate extreme weather events from other anomalous events such as holidays or sporting events. Our results show that providing counterfactual explanations do improve the investigation of anomalies, indicating potential for explainable anomaly detection in general.
    Learning Discretized Neural Networks under Ricci Flow. (arXiv:2302.03390v1 [cs.LG])
    In this paper, we consider Discretized Neural Networks (DNNs) consisting of low-precision weights and activations, which suffer from either infinite or zero gradients caused by the non-differentiable discrete function in the training process. In this case, most training-based DNNs use the standard Straight-Through Estimator (STE) to approximate the gradient w.r.t. discrete value. However, the standard STE will cause the gradient mismatch problem, i.e., the approximated gradient direction may deviate from the steepest descent direction. In other words, the gradient mismatch implies the approximated gradient with perturbations. To address this problem, we introduce the duality theory to regard the perturbation of the approximated gradient as the perturbation of the metric in Linearly Nearly Euclidean (LNE) manifolds. Simultaneously, under the Ricci-DeTurck flow, we prove the dynamical stability and convergence of the LNE metric with the $L^2$-norm perturbation, which can provide a theoretical solution for the gradient mismatch problem. In practice, we also present the steepest descent gradient flow for DNNs on LNE manifolds from the viewpoints of the information geometry and mirror descent. The experimental results on various datasets demonstrate that our method achieves better and more stable performance for DNNs than other representative training-based methods.
    Ensemble Value Functions for Efficient Exploration in Multi-Agent Reinforcement Learning. (arXiv:2302.03439v1 [cs.MA])
    Cooperative multi-agent reinforcement learning (MARL) requires agents to explore to learn to cooperate. Existing value-based MARL algorithms commonly rely on random exploration, such as $\epsilon$-greedy, which is inefficient in discovering multi-agent cooperation. Additionally, the environment in MARL appears non-stationary to any individual agent due to the simultaneous training of other agents, leading to highly variant and thus unstable optimisation signals. In this work, we propose ensemble value functions for multi-agent exploration (EMAX), a general framework to extend any value-based MARL algorithm. EMAX trains ensembles of value functions for each agent to address the key challenges of exploration and non-stationarity: (1) The uncertainty of value estimates across the ensemble is used in a UCB policy to guide the exploration of agents to parts of the environment which require cooperation. (2) Average value estimates across the ensemble serve as target values. These targets exhibit lower variance compared to commonly applied target networks and we show that they lead to more stable gradients during the optimisation. We instantiate three value-based MARL algorithms with EMAX, independent DQN, VDN and QMIX, and evaluate them in 21 tasks across four environments. Using ensembles of five value functions, EMAX improves sample efficiency and final evaluation returns of these algorithms by 54%, 55%, and 844%, respectively, averaged all 21 tasks.
    Multi-Task Recommendations with Reinforcement Learning. (arXiv:2302.03328v1 [cs.IR])
    In recent years, Multi-task Learning (MTL) has yielded immense success in Recommender System (RS) applications. However, current MTL-based recommendation models tend to disregard the session-wise patterns of user-item interactions because they are predominantly constructed based on item-wise datasets. Moreover, balancing multiple objectives has always been a challenge in this field, which is typically avoided via linear estimations in existing works. To address these issues, in this paper, we propose a Reinforcement Learning (RL) enhanced MTL framework, namely RMTL, to combine the losses of different recommendation tasks using dynamic weights. To be specific, the RMTL structure can address the two aforementioned issues by (i) constructing an MTL environment from session-wise interactions and (ii) training multi-task actor-critic network structure, which is compatible with most existing MTL-based recommendation models, and (iii) optimizing and fine-tuning the MTL loss function using the weights generated by critic networks. Experiments on two real-world public datasets demonstrate the effectiveness of RMTL with a higher AUC against state-of-the-art MTL-based recommendation models. Additionally, we evaluate and validate RMTL's compatibility and transferability across various MTL models.
    Towards a User Privacy-Aware Mobile Gaming App Installation Prediction Model. (arXiv:2302.03332v1 [cs.LG])
    Over the past decade, programmatic advertising has received a great deal of attention in the online advertising industry. A real-time bidding (RTB) system is rapidly becoming the most popular method to buy and sell online advertising impressions. Within the RTB system, demand-side platforms (DSP) aim to spend advertisers' campaign budgets efficiently while maximizing profit, seeking impressions that result in high user responses, such as clicks or installs. In the current study, we investigate the process of predicting a mobile gaming app installation from the point of view of a particular DSP, while paying attention to user privacy, and exploring the trade-off between privacy preservation and model performance. There are multiple levels of potential threats to user privacy, depending on the privacy leaks associated with the data-sharing process, such as data transformation or de-anonymization. To address these concerns, privacy-preserving techniques were proposed, such as cryptographic approaches, for training privacy-aware machine-learning models. However, the ability to train a mobile gaming app installation prediction model without using user-level data, can prevent these threats and protect the users' privacy, even though the model's ability to predict may be impaired. Additionally, current laws might force companies to declare that they are collecting data, and might even give the user the option to opt out of such data collection, which might threaten companies' business models in digital advertising, which are dependent on the collection and use of user-level data. We conclude that privacy-aware models might still preserve significant capabilities, enabling companies to make better decisions, dependent on the privacy-efficacy trade-off utility function of each case.
    Decentralized Inexact Proximal Gradient Method With Network-Independent Stepsizes for Convex Composite Optimization. (arXiv:2302.03238v1 [math.OC])
    This paper considers decentralized convex composite optimization over undirected and connected networks, where the local loss function contains both smooth and nonsmooth terms. For this problem, a novel CTA (Combine-Then-Adapt)-based decentralized algorithm is proposed under uncoordinated network-independent constant stepsizes. Particularly, the proposed algorithm only needs to approximately solve a sequence of proximal mappings, which benefits the decentralized composite optimization where the proximal mappings of the nonsmooth loss functions may not have analytic solutions. For the general convex case, we prove the O(1/k) convergence rate of the proposed algorithm, which can be improved to o(1/k) if the proximal mappings are solved exactly. Moreover, with metric subregularity, we establish the linear convergence rate. Finally, the numerical experiments demonstrate the efficiency of the algorithm.
    An Informative Path Planning Framework for Active Learning in UAV-based Semantic Mapping. (arXiv:2302.03347v1 [cs.RO])
    Unmanned aerial vehicles (UAVs) are crucial for aerial mapping and general monitoring tasks. Recent progress in deep learning enabled automated semantic segmentation of imagery to facilitate the interpretation of large-scale complex environments. Commonly used supervised deep learning for segmentation relies on large amounts of pixel-wise labelled data, which is tedious and costly to annotate. The domain-specific visual appearance of aerial environments often prevents the usage of models pre-trained on a static dataset. To address this, we propose a novel general planning framework for UAVs to autonomously acquire informative training images for model re-training. We leverage multiple acquisition functions and fuse them into probabilistic terrain maps. Our framework combines the mapped acquisition function information into the UAV's planning objectives. In this way, the UAV adaptively acquires informative aerial images to be manually labelled for model re-training. Experimental results on real-world data and in a photorealistic simulation show that our framework maximises model performance and drastically reduces labelling efforts. Our map-based planners outperform state-of-the-art local planning.
    A conceptual model for leaving the data-centric approach in machine learning. (arXiv:2302.03361v1 [cs.LG])
    For a long time, machine learning (ML) has been seen as the abstract problem of learning relationships from data independent of the surrounding settings. This has recently been challenged, and methods have been proposed to include external constraints in the machine learning models. These methods usually come from application-specific fields, such as de-biasing algorithms in the field of fairness in ML or physical constraints in the fields of physics and engineering. In this paper, we present and discuss a conceptual high-level model that unifies these approaches in a common language. We hope that this will enable and foster exchange between the different fields and their different methods for including external constraints into ML models, and thus leaving purely data-centric approaches.
    Online Reinforcement Learning with Uncertain Episode Lengths. (arXiv:2302.03608v1 [cs.LG])
    Existing episodic reinforcement algorithms assume that the length of an episode is fixed across time and known a priori. In this paper, we consider a general framework of episodic reinforcement learning when the length of each episode is drawn from a distribution. We first establish that this problem is equivalent to online reinforcement learning with general discounting where the learner is trying to optimize the expected discounted sum of rewards over an infinite horizon, but where the discounting function is not necessarily geometric. We show that minimizing regret with this new general discounting is equivalent to minimizing regret with uncertain episode lengths. We then design a reinforcement learning algorithm that minimizes regret with general discounting but acts for the setting with uncertain episode lengths. We instantiate our general bound for different types of discounting, including geometric and polynomial discounting. We also show that we can obtain similar regret bounds even when the uncertainty over the episode lengths is unknown, by estimating the unknown distribution over time. Finally, we compare our learning algorithms with existing value-iteration based episodic RL algorithms in a grid-world environment.
    A Privacy-Preserving Hybrid Federated Learning Framework for Financial Crime Detection. (arXiv:2302.03654v1 [cs.LG])
    The recent decade witnessed a surge of increase in financial crimes across the public and private sectors, with an average cost of scams of \$102m to financial institutions in 2022. Developing a mechanism for battling financial crimes is an impending task that requires in-depth collaboration from multiple institutions, and yet such collaboration imposed significant technical challenges due to the privacy and security requirements of distributed financial data. For example, consider the Society for Worldwide Interbank Financial Telecommunications (SWIFT) system, which generates 42 million transactions per day across its 11,000 global institutions. Training a detection model of fraudulent transactions requires not only secured SWIFT transactions but also the private account activities of those involved in each transaction from corresponding bank systems. The distributed nature of both samples and features prevents most existing learning systems from being directly adopted to handle the data mining task. In this paper, we collectively address these challenges by proposing a hybrid federated learning system that offers secure and privacy-aware learning and inference for financial crime detection. We conduct extensive empirical studies to evaluate the proposed framework's detection performance and privacy-protection capability, evaluating its robustness against common malicious attacks of collaborative learning. We release our source code at https://github.com/illidanlab/HyFL .
    Performative Reinforcement Learning. (arXiv:2207.00046v2 [cs.LG] UPDATED)
    We introduce the framework of performative reinforcement learning where the policy chosen by the learner affects the underlying reward and transition dynamics of the environment. Following the recent literature on performative prediction~\cite{Perdomo et. al., 2020}, we introduce the concept of performatively stable policy. We then consider a regularized version of the reinforcement learning problem and show that repeatedly optimizing this objective converges to a performatively stable policy under reasonable assumptions on the transition dynamics. Our proof utilizes the dual perspective of the reinforcement learning problem and may be of independent interest in analyzing the convergence of other algorithms with decision-dependent environments. We then extend our results for the setting where the learner just performs gradient ascent steps instead of fully optimizing the objective, and for the setting where the learner has access to a finite number of trajectories from the changed environment. For both settings, we leverage the dual formulation of performative reinforcement learning and establish convergence to a stable solution. Finally, through extensive experiments on a grid-world environment, we demonstrate the dependence of convergence on various parameters e.g. regularization, smoothness, and the number of samples.
    Equivariant Representation Learning via Class-Pose Decomposition. (arXiv:2207.03116v3 [cs.LG] UPDATED)
    We introduce a general method for learning representations that are equivariant to symmetries of data. Our central idea is to decompose the latent space into an invariant factor and the symmetry group itself. The components semantically correspond to intrinsic data classes and poses respectively. The learner is trained on a loss encouraging equivariance based on supervision from relative symmetry information. The approach is motivated by theoretical results from group theory and guarantees representations that are lossless, interpretable and disentangled. We provide an empirical investigation via experiments involving datasets with a variety of symmetries. Results show that our representations capture the geometry of data and outperform other equivariant representation learning frameworks.
    Heterophily-Aware Graph Attention Network. (arXiv:2302.03228v1 [cs.LG])
    Graph Neural Networks (GNNs) have shown remarkable success in graph representation learning. Unfortunately, current weight assignment schemes in standard GNNs, such as the calculation based on node degrees or pair-wise representations, can hardly be effective in processing the networks with heterophily, in which the connected nodes usually possess different labels or features. Existing heterophilic GNNs tend to ignore the modeling of heterophily of each edge, which is also a vital part in tackling the heterophily problem. In this paper, we firstly propose a heterophily-aware attention scheme and reveal the benefits of modeling the edge heterophily, i.e., if a GNN assigns different weights to edges according to different heterophilic types, it can learn effective local attention patterns, which enable nodes to acquire appropriate information from distinct neighbors. Then, we propose a novel Heterophily-Aware Graph Attention Network (HA-GAT) by fully exploring and utilizing the local distribution as the underlying heterophily, to handle the networks with different homophily ratios. To demonstrate the effectiveness of the proposed HA-GAT, we analyze the proposed heterophily-aware attention scheme and local distribution exploration, by seeking for an interpretation from their mechanism. Extensive results demonstrate that our HA-GAT achieves state-of-the-art performances on eight datasets with different homophily ratios in both the supervised and semi-supervised node classification tasks.
    On the symmetries in the dynamics of wide two-layer neural networks. (arXiv:2211.08771v3 [cs.LG] UPDATED)
    We consider the idealized setting of gradient flow on the population risk for infinitely wide two-layer ReLU neural networks (without bias), and study the effect of symmetries on the learned parameters and predictors. We first describe a general class of symmetries which, when satisfied by the target function $f^*$ and the input distribution, are preserved by the dynamics. We then study more specific cases. When $f^*$ is odd, we show that the dynamics of the predictor reduces to that of a (non-linearly parameterized) linear predictor, and its exponential convergence can be guaranteed. When $f^*$ has a low-dimensional structure, we prove that the gradient flow PDE reduces to a lower-dimensional PDE. Furthermore, we present informal and numerical arguments that suggest that the input neurons align with the lower-dimensional structure of the problem.
    Elucidating Robust Learning with Uncertainty-Aware Corruption Pattern Estimation. (arXiv:2111.01632v2 [cs.LG] UPDATED)
    Robust learning methods aim to learn a clean target distribution from noisy and corrupted training data where a specific corruption pattern is often assumed a priori. Our proposed method can not only successfully learn the clean target distribution from a dirty dataset but also can estimate the underlying noise pattern. To this end, we leverage a mixture-of-experts model that can distinguish two different types of predictive uncertainty, aleatoric and epistemic uncertainty. We show that the ability to estimate the uncertainty plays a significant role in elucidating the corruption patterns as these two objectives are tightly intertwined. We also present a novel validation scheme for evaluating the performance of the corruption pattern estimation. Our proposed method is extensively assessed in terms of both robustness and corruption pattern estimation through a number of domains, including computer vision and natural language processing.
    Gradient-based Bi-level Optimization for Deep Learning: A Survey. (arXiv:2207.11719v3 [cs.LG] UPDATED)
    Bi-level optimization, especially the gradient-based category, has been widely used in the deep learning community including hyperparameter optimization and meta knowledge extraction. Bi-level optimization embeds one problem within another and the gradient-based category solves the outer level task by computing the hypergradient, which is much more efficient than classical methods such as the evolutionary algorithm. In this survey, we first give a formal definition of the gradient-based bi-level optimization. Secondly, we illustrate how to formulate a research problem as a bi-level optimization problem, which is of great practical use for beginners. More specifically, there are two formulations: the single-task formulation to optimize hyperparameters such as regularization parameters and the distilled data, and the multi-task formulation to extract meta knowledge such as the model initialization. With a bi-level formulation, we then discuss four bi-level optimization solvers to update the outer variable including explicit gradient update, proxy update, implicit function update, and closed-form update. Last but not least, we conclude the survey by pointing out the great potential of gradient-based bi-level optimization on science problems (AI4Science).
    Automatic Sleep Stage Classification with Cross-modal Self-supervised Features from Deep Brain Signals. (arXiv:2302.03227v1 [cs.LG])
    The detection of human sleep stages is widely used in the diagnosis and intervention of neurological and psychiatric diseases. Some patients with deep brain stimulator implanted could have their neural activities recorded from the deep brain. Sleep stage classification based on deep brain recording has great potential to provide more precise treatment for patients. The accuracy and generalizability of existing sleep stage classifiers based on local field potentials are still limited. We proposed an applicable cross-modal transfer learning method for sleep stage classification with implanted devices. This end-to-end deep learning model contained cross-modal self-supervised feature representation, self-attention, and classification framework. We tested the model with deep brain recording data from 12 patients with Parkinson's disease. The best total accuracy reached 83.2% for sleep stage classification. Results showed speech self-supervised features catch the conversion pattern of sleep stages effectively. We provide a new method on transfer learning from acoustic signals to local field potentials. This method supports an effective solution for the insufficient scale of clinical data. This sleep stage classification model could be adapted to chronic and continuous monitor sleep for Parkinson's patients in daily life, and potentially utilized for more precise treatment in deep brain-machine interfaces, such as closed-loop deep brain stimulation.
    Leveraging Demonstrations to Improve Online Learning: Quality Matters. (arXiv:2302.03319v1 [cs.LG])
    We investigate the extent to which offline demonstration data can improve online learning. It is natural to expect some improvement, but the question is how, and by how much? We show that the degree of improvement must depend on the quality of the demonstration data. To generate portable insights, we focus on Thompson sampling (TS) applied to a multi-armed bandit as a prototypical online learning algorithm and model. The demonstration data is generated by an expert with a given competence level, a notion we introduce. We propose an informed TS algorithm that utilizes the demonstration data in a coherent way through Bayes' rule and derive a prior-dependent Bayesian regret bound. This offers insight into how pretraining can greatly improve online performance and how the degree of improvement increases with the expert's competence level. We also develop a practical, approximate informed TS algorithm through Bayesian bootstrapping and show substantial empirical regret reduction through experiments.
    Attacking Cooperative Multi-Agent Reinforcement Learning by Adversarial Minority Influence. (arXiv:2302.03322v1 [cs.LG])
    Cooperative multi-agent reinforcement learning (c-MARL) offers a general paradigm for a group of agents to achieve a shared goal by taking individual decisions, yet is found to be vulnerable to adversarial attacks. Though harmful, adversarial attacks also play a critical role in evaluating the robustness and finding blind spots of c-MARL algorithms. However, existing attacks are not sufficiently strong and practical, which is mainly due to the ignorance of complex influence between agents and cooperative nature of victims in c-MARL. In this paper, we propose adversarial minority influence (AMI), the first practical attack against c-MARL by introducing an adversarial agent. AMI addresses the aforementioned problems by unilaterally influencing other cooperative victims to a targeted worst-case cooperation. Technically, to maximally deviate victim policy under complex agent-wise influence, our unilateral attack characterize and maximize the influence from adversary to victims. This is done by adapting a unilateral agent-wise relation metric derived from mutual information, which filters out the detrimental influence from victims to adversary. To fool victims into a jointly worst-case failure, our targeted attack influence victims to a long-term, cooperatively worst case by distracting each victim to a specific target. Such target is learned by a reinforcement learning agent in a trial-and-error process. Extensive experiments in simulation environments, including discrete control (SMAC), continuous control (MAMujoco) and real-world robot swarm control demonstrate the superiority of our AMI approach. Our codes are available in https://anonymous.4open.science/r/AMI.
    Phase Transitions in the Detection of Correlated Databases. (arXiv:2302.03380v1 [cs.LG])
    We study the problem of detecting the correlation between two Gaussian databases $\mathsf{X}\in\mathbb{R}^{n\times d}$ and $\mathsf{Y}^{n\times d}$, each composed of $n$ users with $d$ features. This problem is relevant in the analysis of social media, computational biology, etc. We formulate this as a hypothesis testing problem: under the null hypothesis, these two databases are statistically independent. Under the alternative, however, there exists an unknown permutation $\sigma$ over the set of $n$ users (or, row permutation), such that $\mathsf{X}$ is $\rho$-correlated with $\mathsf{Y}^\sigma$, a permuted version of $\mathsf{Y}$. We determine sharp thresholds at which optimal testing exhibits a phase transition, depending on the asymptotic regime of $n$ and $d$. Specifically, we prove that if $\rho^2d\to0$, as $d\to\infty$, then weak detection (performing slightly better than random guessing) is statistically impossible, irrespectively of the value of $n$. This compliments the performance of a simple test that thresholds the sum all entries of $\mathsf{X}^T\mathsf{Y}$. Furthermore, when $d$ is fixed, we prove that strong detection (vanishing error probability) is impossible for any $\rho<\rho^\star$, where $\rho^\star$ is an explicit function of $d$, while weak detection is again impossible as long as $\rho^2d\to0$. These results close significant gaps in current recent related studies.
    Reducing SO(3) Convolutions to SO(2) for Efficient Equivariant GNNs. (arXiv:2302.03655v1 [cs.LG])
    Graph neural networks that model 3D data, such as point clouds or atoms, are typically desired to be $SO(3)$ equivariant, i.e., equivariant to 3D rotations. Unfortunately equivariant convolutions, which are a fundamental operation for equivariant networks, increase significantly in computational complexity as higher-order tensors are used. In this paper, we address this issue by reducing the $SO(3)$ convolutions or tensor products to mathematically equivalent convolutions in $SO(2)$ . This is accomplished by aligning the node embeddings' primary axis with the edge vectors, which sparsifies the tensor product and reduces the computational complexity from $O(L^6)$ to $O(L^3)$, where $L$ is the degree of the representation. We demonstrate the potential implications of this improvement by proposing the Equivariant Spherical Channel Network (eSCN), a graph neural network utilizing our novel approach to equivariant convolutions, which achieves state-of-the-art results on the large-scale OC-20 dataset.
    Data augmentation for machine learning of chemical process flowsheets. (arXiv:2302.03379v1 [cs.LG])
    Artificial intelligence has great potential for accelerating the design and engineering of chemical processes. Recently, we have shown that transformer-based language models can learn to auto-complete chemical process flowsheets using the SFILES 2.0 string notation. Also, we showed that language translation models can be used to translate Process Flow Diagrams (PFDs) into Process and Instrumentation Diagrams (P&IDs). However, artificial intelligence methods require big data and flowsheet data is currently limited. To mitigate this challenge of limited data, we propose a new data augmentation methodology for flowsheet data that is represented in the SFILES 2.0 notation. We show that the proposed data augmentation improves the performance of artificial intelligence-based process design models. In our case study flowsheet data augmentation improved the prediction uncertainty of the flowsheet autocompletion model by 14.7%. In the future, our flowsheet data augmentation can be used for other machine learning algorithms on chemical process flowsheets that are based on SFILES notation.
    Reply to: Modern graph neural networks do worse than classical greedy algorithms in solving combinatorial optimization problems like maximum independent set. (arXiv:2302.03602v1 [cond-mat.dis-nn])
    We provide a comprehensive reply to the comment written by Chiara Angelini and Federico Ricci-Tersenghi [arXiv:2206.13211] and argue that the comment singles out one particular non-representative example problem, entirely focusing on the maximum independent set (MIS) on sparse graphs, for which greedy algorithms are expected to perform well. Conversely, we highlight the broader algorithmic development underlying our original work, and (within our original framework) provide additional numerical results showing sizable improvements over our original results, thereby refuting the comment's performance statements. We also provide results showing run-time scaling superior to the results provided by Angelini and Ricci-Tersenghi. Furthermore, we show that the proposed set of random d-regular graphs does not provide a universal set of benchmark instances, nor do greedy heuristics provide a universal algorithmic baseline. Finally, we argue that the internal (parallel) anatomy of graph neural networks is very different from the (sequential) nature of greedy algorithms and emphasize that graph neural networks have demonstrated their potential for superior scalability compared to existing heuristics such as parallel tempering. We conclude by discussing the conceptual novelty of our work and outline some potential extensions.
    A One-Size-Fits-All Solution to Conservative Bandit Problems. (arXiv:2012.07341v4 [cs.LG] UPDATED)
    In this paper, we study a family of conservative bandit problems (CBPs) with sample-path reward constraints, i.e., the learner's reward performance must be at least as well as a given baseline at any time. We propose a One-Size-Fits-All solution to CBPs and present its applications to three encompassed problems, i.e. conservative multi-armed bandits (CMAB), conservative linear bandits (CLB) and conservative contextual combinatorial bandits (CCCB). Different from previous works which consider high probability constraints on the expected reward, we focus on a sample-path constraint on the actually received reward, and achieve better theoretical guarantees ($T$-independent additive regrets instead of $T$-dependent) and empirical performance. Furthermore, we extend the results and consider a novel conservative mean-variance bandit problem (MV-CBP), which measures the learning performance with both the expected reward and variability. For this extended problem, we provide a novel algorithm with $O(1/T)$ normalized additive regrets ($T$-independent in the cumulative form) and validate this result through empirical evaluation.
    Transfer learning for process design with reinforcement learning. (arXiv:2302.03375v1 [cs.LG])
    Process design is a creative task that is currently performed manually by engineers. Artificial intelligence provides new potential to facilitate process design. Specifically, reinforcement learning (RL) has shown some success in automating process design by integrating data-driven models that learn to build process flowsheets with process simulation in an iterative design process. However, one major challenge in the learning process is that the RL agent demands numerous process simulations in rigorous process simulators, thereby requiring long simulation times and expensive computational power. Therefore, typically short-cut simulation methods are employed to accelerate the learning process. Short-cut methods can, however, lead to inaccurate results. We thus propose to utilize transfer learning for process design with RL in combination with rigorous simulation methods. Transfer learning is an established approach from machine learning that stores knowledge gained while solving one problem and reuses this information on a different target domain. We integrate transfer learning in our RL framework for process design and apply it to an illustrative case study comprising equilibrium reactions, azeotropic separation, and recycles, our method can design economically feasible flowsheets with stable interaction with DWSIM. Our results show that transfer learning enables RL to economically design feasible flowsheets with DWSIM, resulting in a flowsheet with an 8% higher revenue. And the learning time can be reduced by a factor of 2.
    Machine learning benchmarks for the classification of equivalent circuit models from solid-state electrochemical impedance spectra. (arXiv:2302.03362v1 [cs.LG])
    Analysis of Electrochemical Impedance Spectroscopy (EIS) data for electrochemical systems often consists of defining an Equivalent Circuit Model (ECM) using expert knowledge and then optimizing the model parameters to deconvolute various resistance, capacitive, inductive, or diffusion responses. For small data sets, this procedure can be conducted manually; however, it is not feasible to manually define a proper ECM for extensive data sets with a wide range of EIS responses. Automatic identification of an ECM would substantially accelerate the analysis of large sets of EIS data. Here, we showcase machine learning methods developed during the BatteryDEV hackathon to classify the ECMs of 9,300 EIS measurements provided by QuantumScape. The best-performing approach is a gradient-boosted tree model utilizing a library to automatically generate features, followed by a random forest model using the raw spectral data. A convolutional neural network using boolean images of Nyquist representations is presented as an alternative, although it achieves a lower accuracy. We publish the data and open source the associated code. The approaches described in this article can serve as benchmarks for further studies. A key remaining challenge is that the labels contain uncertainty and human bias, underlined by the performance of the trained models.
    Med-NCA: Robust and Lightweight Segmentation with Neural Cellular Automata. (arXiv:2302.03473v1 [eess.IV])
    Access to the proper infrastructure is critical when performing medical image segmentation with Deep Learning. This requirement makes it difficult to run state-of-the-art segmentation models in resource-constrained scenarios like primary care facilities in rural areas and during crises. The recently emerging field of Neural Cellular Automata (NCA) has shown that locally interacting one-cell models can achieve competitive results in tasks such as image generation or segmentations in low-resolution inputs. However, they are constrained by high VRAM requirements and the difficulty of reaching convergence for high-resolution images. To counteract these limitations we propose Med-NCA, an end-to-end NCA training pipeline for high-resolution image segmentation. Our method follows a two-step process. Global knowledge is first communicated between cells across the downscaled image. Following that, patch-based segmentation is performed. Our proposed Med-NCA outperforms the classic UNet by 2% and 3% Dice for hippocampus and prostate segmentation, respectively, while also being 500 times smaller. We also show that Med-NCA is by design invariant with respect to image scale, shape and translation, experiencing only slight performance degradation even with strong shifts; and is robust against MRI acquisition artefacts. Med-NCA enables high-resolution medical image segmentation even on a Raspberry Pi B+, arguably the smallest device able to run PyTorch and that can be powered by a standard power bank.
    Representation Theory for Geometric Quantum Machine Learning. (arXiv:2210.07980v2 [quant-ph] UPDATED)
    Recent advances in classical machine learning have shown that creating models with inductive biases encoding the symmetries of a problem can greatly improve performance. Importation of these ideas, combined with an existing rich body of work at the nexus of quantum theory and symmetry, has given rise to the field of Geometric Quantum Machine Learning (GQML). Following the success of its classical counterpart, it is reasonable to expect that GQML will play a crucial role in developing problem-specific and quantum-aware models capable of achieving a computational advantage. Despite the simplicity of the main idea of GQML -- create architectures respecting the symmetries of the data -- its practical implementation requires a significant amount of knowledge of group representation theory. We present an introduction to representation theory tools from the optics of quantum learning, driven by key examples involving discrete and continuous groups. These examples are sewn together by an exposition outlining the formal capture of GQML symmetries via "label invariance under the action of a group representation", a brief (but rigorous) tour through finite and compact Lie group representation theory, a reexamination of ubiquitous tools like Haar integration and twirling, and an overview of some successful strategies for detecting symmetries.  ( 2 min )
    Undersampling and Cumulative Class Re-decision Methods to Improve Detection of Agitation in People with Dementia. (arXiv:2302.03224v1 [cs.LG])
    Agitation is one of the most prevalent symptoms in people with dementia (PwD) that can place themselves and the caregiver's safety at risk. Developing objective agitation detection approaches is important to support health and safety of PwD living in a residential setting. In a previous study, we collected multimodal wearable sensor data from 17 participants for 600 days and developed machine learning models for predicting agitation in one-minute windows. However, there are significant limitations in the dataset, such as imbalance problem and potential imprecise labels as the occurrence of agitation is much rarer in comparison to the normal behaviours. In this paper, we first implement different undersampling methods to eliminate the imbalance problem, and come to the conclusion that only 20% of normal behaviour data are adequate to train a competitive agitation detection model. Then, we design a weighted undersampling method to evaluate the manual labeling mechanism given the ambiguous time interval (ATI) assumption. After that, the postprocessing method of cumulative class re-decision (CCR) is proposed based on the historical sequential information and continuity characteristic of agitation, improving the decision-making performance for the potential application of agitation detection system. The results show that a combination of undersampling and CCR improves best F1-score by 26.6% and other metrics to varying degrees with less training time and data used, and inspires a way to find the potential range of optimal threshold reference for clinical purpose.  ( 2 min )
    Autodecompose: A generative self-supervised model for semantic decomposition. (arXiv:2302.03124v1 [cs.LG])
    We introduce Autodecompose, a novel self-supervised generative model that decomposes data into two semantically independent properties: the desired property, which captures a specific aspect of the data (e.g. the voice in an audio signal), and the context property, which aggregates all other information (e.g. the content of the audio signal), without any labels given. Autodecompose uses two complementary augmentations, one that manipulates the context while preserving the desired property and the other that manipulates the desired property while preserving the context. The augmented variants of the data are encoded by two encoders and reconstructed by a decoder. We prove that one of the encoders embeds the desired property while the other embeds the context property. We apply Autodecompose to audio signals to encode sound source (human voice) and content. We pre-trained the model on YouTube and LibriSpeech datasets and fine-tuned in a self-supervised manner without exposing the labels. Our results showed that, using the sound source encoder of pre-trained Autodecompose, a linear classifier achieves F1 score of 97.6\% in recognizing the voice of 30 speakers using only 10 seconds of labeled samples, compared to 95.7\% for supervised models. Additionally, our experiments showed that Autodecompose is robust against overfitting even when a large model is pre-trained on a small dataset. A large Autodecompose model was pre-trained from scratch on 60 seconds of audio from 3 speakers achieved over 98.5\% F1 score in recognizing those three speakers in other unseen utterances. We finally show that the context encoder embeds information about the content of the speech and ignores the sound source information. Our sample code for training the model, as well as examples for using the pre-trained models are available here: \url{https://github.com/rezabonyadi/autodecompose}  ( 2 min )
    LUT-NN: Towards Unified Neural Network Inference by Table Lookup. (arXiv:2302.03213v1 [cs.LG])
    DNN inference requires huge effort of system development and resource cost. This drives us to propose LUT-NN, the first trial towards empowering deep neural network (DNN) inference by table lookup, to eliminate the diverse computation kernels as well as save running cost. Based on the feature similarity of each layer, LUT-NN can learn the typical features, named centroids, of each layer from the training data, precompute them with model weights, and save the results in tables. For future input, the results of the closest centroids with the input features can be directly read from the table, as the approximation of layer output. We propose the novel centroid learning technique for DNN, which enables centroid learning through backpropagation, and adapts three levels of approximation to minimize the model loss. By this technique, LUT-NN achieves comparable accuracy (<5% difference) with original models on real complex dataset, including CIFAR, ImageNet, and GLUE. LUT-NN simplifies the computing operators to only two: closest centroid search and table lookup. We implement them for Intel and ARM CPUs. The model size is reduced by up to 3.5x for CNN models and 7x for BERT. Latency-wise, the real speedup of LUT-NN is up to 7x for BERT and 2x for ResNet, much lower than theoretical results because of the current unfriendly hardware design for table lookup. We expect firstclass table lookup support in the future to unleash the potential of LUT-NN.  ( 2 min )
    Delving Deep into Simplicity Bias for Long-Tailed Image Recognition. (arXiv:2302.03264v1 [cs.CV])
    Simplicity Bias (SB) is a phenomenon that deep neural networks tend to rely favorably on simpler predictive patterns but ignore some complex features when applied to supervised discriminative tasks. In this work, we investigate SB in long-tailed image recognition and find the tail classes suffer more severely from SB, which harms the generalization performance of such underrepresented classes. We empirically report that self-supervised learning (SSL) can mitigate SB and perform in complementary to the supervised counterpart by enriching the features extracted from tail samples and consequently taking better advantage of such rare samples. However, standard SSL methods are designed without explicitly considering the inherent data distribution in terms of classes and may not be optimal for long-tailed distributed data. To address this limitation, we propose a novel SSL method tailored to imbalanced data. It leverages SSL by triple diverse levels, i.e., holistic-, partial-, and augmented-level, to enhance the learning of predictive complex patterns, which provides the potential to overcome the severe SB on tail data. Both quantitative and qualitative experimental results on five long-tailed benchmark datasets show our method can effectively mitigate SB and significantly outperform the competing state-of-the-arts.  ( 2 min )
    Self-learning Machines based on Hamiltonian Echo Backpropagation. (arXiv:2103.04992v2 [cs.LG] UPDATED)
    A physical self-learning machine can be defined as a nonlinear dynamical system that can be trained on data (similar to artificial neural networks), but where the update of the internal degrees of freedom that serve as learnable parameters happens autonomously. In this way, neither external processing and feedback nor knowledge of (and control of) these internal degrees of freedom is required. We introduce a general scheme for self-learning in any time-reversible Hamiltonian system. We illustrate the training of such a self-learning machine numerically for the case of coupled nonlinear wave fields.  ( 2 min )
    Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR. (arXiv:2302.03201v1 [cs.LG])
    In this paper, we study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$. Starting with multi-arm bandits (MABs), we show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes, and that it is achieved by an Upper Confidence Bound algorithm with a novel Bernstein bonus. For online RL in tabular Markov Decision Processes (MDPs), we show a minimax regret lower bound of $\Omega(\sqrt{\tau^{-1}SAK})$ (with normalized cumulative rewards), where $S$ is the number of states, and we propose a novel bonus-driven Value Iteration procedure. We show that our algorithm achieves the optimal regret of $\widetilde O(\sqrt{\tau^{-1}SAK})$ under a continuity assumption and in general attains a near-optimal regret of $\widetilde O(\tau^{-1}\sqrt{SAK})$, which is minimax-optimal for constant $\tau$. This improves on the best available bounds. By discretizing rewards appropriately, our algorithms are computationally efficient.  ( 2 min )
    Scalable Gaussian process regression enables accurate prediction of protein and small molecule properties with uncertainty quantitation. (arXiv:2302.03294v1 [cs.LG])
    Gaussian process (GP) is a Bayesian model which provides several advantages for regression tasks in machine learning such as reliable quantitation of uncertainty and improved interpretability. Their adoption has been precluded by their excessive computational cost and by the difficulty in adapting them for analyzing sequences (e.g. amino acid and nucleotide sequences) and graphs (e.g. ones representing small molecules). In this study, we develop efficient and scalable approaches for fitting GP models as well as fast convolution kernels which scale linearly with graph or sequence size. We implement these improvements by building an open-source Python library called xGPR. We compare the performance of xGPR with the reported performance of various deep learning models on 20 benchmarks, including small molecule, protein sequence and tabular data. We show that xGRP achieves highly competitive performance with much shorter training time. Furthermore, we also develop new kernels for sequence and graph data and show that xGPR generally outperforms convolutional neural networks on predicting key properties of proteins and small molecules. Importantly, xGPR provides uncertainty information not available from typical deep learning models. Additionally, xGPR provides a representation of the input data that can be used for clustering and data visualization. These results demonstrate that xGPR provides a powerful and generic tool that can be broadly useful in protein engineering and drug discovery.  ( 2 min )
    Data Selection for Language Models via Importance Resampling. (arXiv:2302.03169v1 [cs.CL])
    Selecting a suitable training dataset is crucial for both general-domain (e.g., GPT-3) and domain-specific (e.g., Codex) language models (LMs). We formalize this data selection problem as selecting a subset of a large raw unlabeled dataset to match a desired target distribution, given some unlabeled target samples. Due to the large scale and dimensionality of the raw text data, existing methods use simple heuristics to select data that are similar to a high-quality reference corpus (e.g., Wikipedia), or leverage experts to manually curate data. Instead, we extend the classic importance resampling approach used in low-dimensions for LM data selection. Crucially, we work in a reduced feature space to make importance weight estimation tractable over the space of text. To determine an appropriate feature space, we first show that KL reduction, a data metric that measures the proximity between selected data and the target in a feature space, has high correlation with average accuracy on 8 downstream tasks (r=0.89) when computed with simple n-gram features. From this observation, we present Data Selection with Importance Resampling (DSIR), an efficient and scalable algorithm that estimates importance weights in a reduced feature space (e.g., n-gram features in our instantiation) and selects data with importance resampling according to these weights. When training general-domain models (target is Wikipedia + books), DSIR improves over random selection and heuristic filtering baselines by 2--2.5% on the GLUE benchmark. When performing continued pretraining towards a specific domain, DSIR performs comparably to expert curated data across 8 target distributions.  ( 2 min )
    On the Ideal Number of Groups for Isometric Gradient Propagation. (arXiv:2302.03193v1 [cs.LG])
    Recently, various normalization layers have been proposed to stabilize the training of deep neural networks. Among them, group normalization is a generalization of layer normalization and instance normalization by allowing a degree of freedom in the number of groups it uses. However, to determine the optimal number of groups, trial-and-error-based hyperparameter tuning is required, and such experiments are time-consuming. In this study, we discuss a reasonable method for setting the number of groups. First, we find that the number of groups influences the gradient behavior of the group normalization layer. Based on this observation, we derive the ideal number of groups, which calibrates the gradient scale to facilitate gradient descent optimization. Our proposed number of groups is theoretically grounded, architecture-aware, and can provide a proper value in a layer-wise manner for all layers. The proposed method exhibited improved performance over existing methods in numerous neural network architectures, tasks, and datasets.  ( 2 min )
    SDYN-GANs: Adversarial Learning Methods for Multistep Generative Models for General Order Stochastic Dynamics. (arXiv:2302.03663v1 [cs.LG])
    We introduce adversarial learning methods for data-driven generative modeling of the dynamics of $n^{th}$-order stochastic systems. Our approach builds on Generative Adversarial Networks (GANs) with generative model classes based on stable $m$-step stochastic numerical integrators. We introduce different formulations and training methods for learning models of stochastic dynamics based on observation of trajectory samples. We develop approaches using discriminators based on Maximum Mean Discrepancy (MMD), training protocols using conditional and marginal distributions, and methods for learning dynamic responses over different time-scales. We show how our approaches can be used for modeling physical systems to learn force-laws, damping coefficients, and noise-related parameters. The adversarial learning approaches provide methods for obtaining stable generative models for dynamic tasks including long-time prediction and developing simulations for stochastic systems.  ( 2 min )
    How Reliable is Your Regression Model's Uncertainty Under Real-World Distribution Shifts?. (arXiv:2302.03679v1 [cs.LG])
    Many important computer vision applications are naturally formulated as regression problems. Within medical imaging, accurate regression models have the potential to automate various tasks, helping to lower costs and improve patient outcomes. Such safety-critical deployment does however require reliable estimation of model uncertainty, also under the wide variety of distribution shifts that might be encountered in practice. Motivated by this, we set out to investigate the reliability of regression uncertainty estimation methods under various real-world distribution shifts. To that end, we propose an extensive benchmark of 8 image-based regression datasets with different types of challenging distribution shifts. We then employ our benchmark to evaluate many of the most common uncertainty estimation methods, as well as two state-of-the-art uncertainty scores from the task of out-of-distribution detection. We find that while methods are well calibrated when there is no distribution shift, they all become highly overconfident on many of the benchmark datasets. This uncovers important limitations of current uncertainty estimation methods, and the proposed benchmark therefore serves as a challenge to the research community. We hope that our benchmark will spur more work on how to develop truly reliable regression uncertainty estimation methods. Code is available at https://github.com/fregu856/regression_uncertainty.  ( 2 min )
    Making Intelligence: Ethical Values in IQ and ML Benchmarks. (arXiv:2209.00692v3 [cs.LG] UPDATED)
    In recent years, ML researchers have wrestled with defining and improving machine learning (ML) benchmarks and datasets. In parallel, some have trained a critical lens on the ethics of dataset creation and ML research. In this position paper, we highlight the entanglement of ethics with seemingly ``technical'' or ``scientific'' decisions about the design of ML benchmarks. Our starting point is the existence of multiple overlooked structural similarities between human intelligence benchmarks and ML benchmarks. Both types of benchmarks set standards for describing, evaluating, and comparing performance on tasks relevant to intelligence -- standards that many scholars of human intelligence have long recognized as value-laden. We use perspectives from feminist philosophy of science on IQ benchmarks and thick concepts in social science to argue that values need to be considered and documented when creating ML benchmarks. It is neither possible nor desirable to avoid this choice by creating value-neutral benchmarks. Finally, we outline practical recommendations for ML benchmark research ethics and ethics review.  ( 2 min )
    Mind the Gap! Bridging Explainable Artificial Intelligence and Human Understanding with Luhmann's Functional Theory of Communication. (arXiv:2302.03460v1 [cs.CY])
    Over the past decade explainable artificial intelligence has evolved from a predominantly technical discipline into a field that is deeply intertwined with social sciences. Insights such as human preference for contrastive -- more precisely, counterfactual -- explanations have played a major role in this transition, inspiring and guiding the research in computer science. Other observations, while equally important, have received much less attention. The desire of human explainees to communicate with artificial intelligence explainers through a dialogue-like interaction has been mostly neglected by the community. This poses many challenges for the effectiveness and widespread adoption of such technologies as delivering a single explanation optimised according to some predefined objectives may fail to engender understanding in its recipients and satisfy their unique needs given the diversity of human knowledge and intention. Using insights elaborated by Niklas Luhmann and, more recently, Elena Esposito we apply social systems theory to highlight challenges in explainable artificial intelligence and offer a path forward, striving to reinvigorate the technical research in this direction. This paper aims to demonstrate the potential of systems theoretical approaches to communication in understanding problems and limitations of explainable artificial intelligence.  ( 2 min )
    Can gamification reduce the burden of self-reporting in mHealth applications? Feasibility study using machine learning from smartwatch data to estimate cognitive load. (arXiv:2302.03616v1 [cs.LG])
    The effectiveness of digital treatments can be measured by requiring patients to self-report their mental and physical state through mobile applications. However, self-reporting can be overwhelming and may cause patients to disengage from the intervention. In order to address this issue, we conduct a feasibility study to explore the impact of gamification on the cognitive burden of self-reporting. Our approach involves the creation of a system to assess cognitive burden through the analysis of photoplethysmography (PPG) signals obtained from a smartwatch. The system is built by collecting PPG data during both cognitively demanding tasks and periods of rest. The obtained data is utilized to train a machine learning model to detect cognitive load (CL). Subsequently, we create two versions of health surveys: a gamified version and a traditional version. Our aim is to estimate the cognitive load experienced by participants while completing these surveys using their mobile devices. We find that CL detector performance can be enhanced via pre-training on stress detection tasks and requires capturing of a minimum 30 seconds of PPG signal to work adequately. For 10 out of 13 participants, a personalized cognitive load detector can achieve an F1 score above 0.7. We find no difference between the gamified and non-gamified mobile surveys in terms of time spent in the state of high cognitive load but participants prefer the gamified version. The average time spent on each question is 5.5 for gamified survey vs 6 seconds for the non-gamified version.  ( 2 min )
    On the relationship between multivariate splines and infinitely-wide neural networks. (arXiv:2302.03459v1 [cs.LG])
    We consider multivariate splines and show that they have a random feature expansion as infinitely wide neural networks with one-hidden layer and a homogeneous activation function which is the power of the rectified linear unit. We show that the associated function space is a Sobolev space on a Euclidean ball, with an explicit bound on the norms of derivatives. This link provides a new random feature expansion for multivariate splines that allow efficient algorithms. This random feature expansion is numerically better behaved than usual random Fourier features, both in theory and practice. In particular, in dimension one, we compare the associated leverage scores to compare the two random expansions and show a better scaling for the neural network expansion.  ( 2 min )
    Temporal Robustness against Data Poisoning. (arXiv:2302.03684v1 [cs.LG])
    Data poisoning considers cases when an adversary maliciously inserts and removes training data to manipulate the behavior of machine learning algorithms. Traditional threat models of data poisoning center around a single metric, the number of poisoned samples. In consequence, existing defenses are essentially vulnerable in practice when poisoning more samples remains a feasible option for attackers. To address this issue, we leverage timestamps denoting the birth dates of data, which are often available but neglected in the past. Benefiting from these timestamps, we propose a temporal threat model of data poisoning and derive two novel metrics, earliness and duration, which respectively measure how long an attack started in advance and how long an attack lasted. With these metrics, we define the notions of temporal robustness against data poisoning, providing a meaningful sense of protection even with unbounded amounts of poisoned samples. We present a benchmark with an evaluation protocol simulating continuous data collection and periodic deployments of updated models, thus enabling empirical evaluation of temporal robustness. Lastly, we develop and also empirically verify a baseline defense, namely temporal aggregation, offering provable temporal robustness and highlighting the potential of our temporal modeling of data poisoning.  ( 2 min )
    Averaged Method of Multipliers for Bi-Level Optimization without Lower-Level Strong Convexity. (arXiv:2302.03407v1 [math.OC])
    Gradient methods have become mainstream techniques for Bi-Level Optimization (BLO) in learning fields. The validity of existing works heavily rely on either a restrictive Lower- Level Strong Convexity (LLSC) condition or on solving a series of approximation subproblems with high accuracy or both. In this work, by averaging the upper and lower level objectives, we propose a single loop Bi-level Averaged Method of Multipliers (sl-BAMM) for BLO that is simple yet efficient for large-scale BLO and gets rid of the limited LLSC restriction. We further provide non-asymptotic convergence analysis of sl-BAMM towards KKT stationary points, and the comparative advantage of our analysis lies in the absence of strong gradient boundedness assumption, which is always required by others. Thus our theory safely captures a wider variety of applications in deep learning, especially where the upper-level objective is quadratic w.r.t. the lower-level variable. Experimental results demonstrate the superiority of our method.  ( 2 min )
    Natural Language Processing for Policymaking. (arXiv:2302.03490v1 [cs.CL])
    Language is the medium for many political activities, from campaigns to news reports. Natural language processing (NLP) uses computational tools to parse text into key information that is needed for policymaking. In this chapter, we introduce common methods of NLP, including text classification, topic modeling, event extraction, and text scaling. We then overview how these methods can be used for policymaking through four major applications including data collection for evidence-based policymaking, interpretation of political decisions, policy communication, and investigation of policy effects. Finally, we highlight some potential limitations and ethical concerns when using NLP for policymaking. This text is from Chapter 7 (pages 141-162) of the Handbook of Computational Social Science for Policy (2023). Open Access on Springer: https://doi.org/10.1007/978-3-031-16624-2
    Efficient XAI Techniques: A Taxonomic Survey. (arXiv:2302.03225v1 [cs.LG])
    Recently, there has been a growing demand for the deployment of Explainable Artificial Intelligence (XAI) algorithms in real-world applications. However, traditional XAI methods typically suffer from a high computational complexity problem, which discourages the deployment of real-time systems to meet the time-demanding requirements of real-world scenarios. Although many approaches have been proposed to improve the efficiency of XAI methods, a comprehensive understanding of the achievements and challenges is still needed. To this end, in this paper we provide a review of efficient XAI. Specifically, we categorize existing techniques of XAI acceleration into efficient non-amortized and efficient amortized methods. The efficient non-amortized methods focus on data-centric or model-centric acceleration upon each individual instance. In contrast, amortized methods focus on learning a unified distribution of model explanations, following the predictive, generative, or reinforcement frameworks, to rapidly derive multiple model explanations. We also analyze the limitations of an efficient XAI pipeline from the perspectives of the training phase, the deployment phase, and the use scenarios. Finally, we summarize the challenges of deploying XAI acceleration methods to real-world scenarios, overcoming the trade-off between faithfulness and efficiency, and the selection of different acceleration methods.
    An End-to-End Two-Phase Deep Learning-Based workflow to Segment Man-made Objects Around Reservoirs. (arXiv:2302.03282v1 [cs.CV])
    Reservoirs are fundamental infrastructures for the management of water resources. Constructions around them can negatively impact their quality. Such unauthorized constructions can be monitored by land cover mapping (LCM) remote sensing (RS) images. In this paper, we develop a new approach based on DL and image processing techniques for man-made object segmentation around the reservoirs. In order to segment man-made objects around the reservoirs in an end-to-end procedure, segmenting reservoirs and identifying the region of interest (RoI) around them are essential. In the proposed two-phase workflow, the reservoir is initially segmented using a DL model. A post-processing stage is proposed to remove errors such as floating vegetation. Next, the RoI around the reservoir (RoIaR) is identified using the proposed image processing techniques. Finally, the man-made objects in the RoIaR are segmented using a DL architecture. We trained the proposed workflow using collected Google Earth (GE) images of eight reservoirs in Brazil over two different years. The U-Net-based and SegNet-based architectures are trained to segment the reservoirs. To segment man-made objects in the RoIaR, we trained and evaluated four possible architectures, U-Net, FPN, LinkNet, and PSPNet. Although the collected data has a high diversity (for example, they belong to different states, seasons, resolutions, etc.), we achieved good performances in both phases. Furthermore, applying the proposed post-processing to the output of reservoir segmentation improves the precision in all studied reservoirs except two cases. We validated the prepared workflow with a reservoir dataset outside the training reservoirs. The results show high generalization ability of the prepared workflow.  ( 2 min )
    Unsupervised Deep Learning for IoT Time Series. (arXiv:2302.03284v1 [cs.LG])
    IoT time series analysis has found numerous applications in a wide variety of areas, ranging from health informatics to network security. Nevertheless, the complex spatial temporal dynamics and high dimensionality of IoT time series make the analysis increasingly challenging. In recent years, the powerful feature extraction and representation learning capabilities of deep learning (DL) have provided an effective means for IoT time series analysis. However, few existing surveys on time series have systematically discussed unsupervised DL-based methods. To fill this void, we investigate unsupervised deep learning for IoT time series, i.e., unsupervised anomaly detection and clustering, under a unified framework. We also discuss the application scenarios, public datasets, existing challenges, and future research directions in this area.  ( 2 min )
    IB-UQ: Information bottleneck based uncertainty quantification for neural function regression and neural operator learning. (arXiv:2302.03271v1 [math.NA])
    In this paper, a novel framework is established for uncertainty quantification via information bottleneck (IB-UQ) for scientific machine learning tasks, including deep neural network (DNN) regression and neural operator learning (DeepONet). Specifically, we first employ the General Incompressible-Flow Networks (GIN) model to learn a "wide" distribution fromnoisy observation data. Then, following the information bottleneck objective, we learn a stochastic map from input to some latent representation that can be used to predict the output. A tractable variational bound on the IB objective is constructed with a normalizing flow reparameterization. Hence, we can optimize the objective using the stochastic gradient descent method. IB-UQ can provide both mean and variance in the label prediction by explicitly modeling the representation variables. Compared to most DNN regression methods and the deterministic DeepONet, the proposed model can be trained on noisy data and provide accurate predictions with reliable uncertainty estimates on unseen noisy data. We demonstrate the capability of the proposed IB-UQ framework via several representative examples, including discontinuous function regression, real-world dataset regression and learning nonlinear operators for diffusion-reaction partial differential equation.  ( 2 min )
    Utility-based Perturbed Gradient Descent: An Optimizer for Continual Learning. (arXiv:2302.03281v1 [cs.LG])
    Modern representation learning methods may fail to adapt quickly under non-stationarity since they suffer from the problem of catastrophic forgetting and decaying plasticity. Such problems prevent learners from fast adaptation to changes since they result in increasing numbers of saturated features and forgetting useful features when presented with new experiences. Hence, these methods are rendered ineffective for continual learning. This paper proposes Utility-based Perturbed Gradient Descent (UPGD), an online representation-learning algorithm well-suited for continual learning agents with no knowledge about task boundaries. UPGD protects useful weights or features from forgetting and perturbs less useful ones based on their utilities. Our empirical results show that UPGD alleviates catastrophic forgetting and decaying plasticity, enabling modern representation learning methods to work in the continual learning setting.  ( 2 min )
    Learning to Count Isomorphisms with Graph Neural Networks. (arXiv:2302.03266v1 [cs.LG])
    Subgraph isomorphism counting is an important problem on graphs, as many graph-based tasks exploit recurring subgraph patterns. Classical methods usually boil down to a backtracking framework that needs to navigate a huge search space with prohibitive computational costs. Some recent studies resort to graph neural networks (GNNs) to learn a low-dimensional representation for both the query and input graphs, in order to predict the number of subgraph isomorphisms on the input graph. However, typical GNNs employ a node-centric message passing scheme that receives and aggregates messages on nodes, which is inadequate in complex structure matching for isomorphism counting. Moreover, on an input graph, the space of possible query graphs is enormous, and different parts of the input graph will be triggered to match different queries. Thus, expecting a fixed representation of the input graph to match diversely structured query graphs is unrealistic. In this paper, we propose a novel GNN called Count-GNN for subgraph isomorphism counting, to deal with the above challenges. At the edge level, given that an edge is an atomic unit of encoding graph structures, we propose an edge-centric message passing scheme, where messages on edges are propagated and aggregated based on the edge adjacency to preserve fine-grained structural information. At the graph level, we modulate the input graph representation conditioned on the query, so that the input graph can be adapted to each query individually to improve their matching. Finally, we conduct extensive experiments on a number of benchmark datasets to demonstrate the superior performance of Count-GNN.  ( 2 min )
    Climate Intervention Analysis using AI Model Guided by Statistical Physics Principles. (arXiv:2302.03258v1 [cs.LG])
    The availability of training data remains a significant obstacle for the implementation of machine learning in scientific applications. In particular, estimating how a system might respond to external forcings or perturbations requires specialized labeled data or targeted simulations, which may be computationally intensive to generate at scale. In this study, we propose a novel solution to this challenge by utilizing a principle from statistical physics known as the Fluctuation-Dissipation Theorem (FDT) to discover knowledge using an AI model that can rapidly produce scenarios for different external forcings. By leveraging FDT, we are able to extract information encoded in a large dataset produced by Earth System Models, which includes 8250 years of internal climate fluctuations, to estimate the climate system's response to forcings. Our model, AiBEDO, is capable of capturing the complex, multi-timescale effects of radiation perturbations on global and regional surface climate, allowing for a substantial acceleration of the exploration of the impacts of spatially-heterogenous climate forcers. To demonstrate the utility of AiBEDO, we use the example of a climate intervention technique called Marine Cloud Brightening, with the ultimate goal of optimizing the spatial pattern of cloud brightening to achieve regional climate targets and prevent known climate tipping points. While we showcase the effectiveness of our approach in the context of climate science, it is generally applicable to other scientific disciplines that are limited by the extensive computational demands of domain simulation models. Source code of AiBEDO framework is made available at https://github.com/kramea/kdd_aibedo. A sample dataset is made available at https://doi.org/10.5281/zenodo.7597027. Additional data available upon request.  ( 2 min )
    ClueGAIN: Application of Transfer Learning On Generative Adversarial Imputation Nets (GAIN). (arXiv:2302.03140v1 [cs.LG])
    Many studies have attempted to solve the problem of missing data using various approaches. Among them, Generative Adversarial Imputation Nets (GAIN) was first used to impute data with Generative Adversarial Nets (GAN) and good results were obtained. Subsequent studies have attempted to combine various approaches to address some of its limitations. ClueGAIN is first proposed in this study, which introduces transfer learning into GAIN to solve the problem of poor imputation performance in high missing rate data sets. ClueGAIN can also be used to measure the similarity between data sets to explore their potential connections.  ( 2 min )
    Spatial Functa: Scaling Functa to ImageNet Classification and Generation. (arXiv:2302.03130v1 [cs.LG])
    Neural fields, also known as implicit neural representations, have emerged as a powerful means to represent complex signals of various modalities. Based on this Dupont et al. (2022) introduce a framework that views neural fields as data, termed *functa*, and proposes to do deep learning directly on this dataset of neural fields. In this work, we show that the proposed framework faces limitations when scaling up to even moderately complex datasets such as CIFAR-10. We then propose *spatial functa*, which overcome these limitations by using spatially arranged latent representations of neural fields, thereby allowing us to scale up the approach to ImageNet-1k at 256x256 resolution. We demonstrate competitive performance to Vision Transformers (Steiner et al., 2022) on classification and Latent Diffusion (Rombach et al., 2022) on image generation respectively.  ( 2 min )
    CDANs: Temporal Causal Discovery from Autocorrelated and Non-Stationary Time Series Data. (arXiv:2302.03246v1 [cs.LG])
    This study presents a novel constraint-based causal discovery approach for autocorrelated and non-stationary time series data (CDANs). Our proposed method addresses several limitations of existing causal discovery methods for autocorrelated and non-stationary time series data, such as high dimensionality, the inability to identify lagged causal relationships, and the overlook of changing modules. Our approach identifies both lagged and instantaneous/contemporaneous causal relationships along with changing modules that vary over time. The method optimizes the conditioning sets in a constraint-based search by considering lagged parents instead of conditioning on the entire past that addresses high dimensionality. The changing modules are detected by considering both contemporaneous and lagged parents. The approach first detects the lagged adjacencies, then identifies the changing modules and contemporaneous adjacencies, and finally determines the causal direction. We extensively evaluated the proposed method using synthetic datasets and a real-world clinical dataset and compared its performance with several baseline approaches. The results demonstrate the effectiveness of the proposed method in detecting causal relationships and changing modules in autocorrelated and non-stationary time series data.  ( 2 min )
    Domain Adaptation for Time Series Under Feature and Label Shifts. (arXiv:2302.03133v1 [cs.LG])
    The transfer of models trained on labeled datasets in a source domain to unlabeled target domains is made possible by unsupervised domain adaptation (UDA). However, when dealing with complex time series models, the transferability becomes challenging due to the dynamic temporal structure that varies between domains, resulting in feature shifts and gaps in the time and frequency representations. Furthermore, tasks in the source and target domains can have vastly different label distributions, making it difficult for UDA to mitigate label shifts and recognize labels that only exist in the target domain. We present RAINCOAT, the first model for both closed-set and universal DA on complex time series. RAINCOAT addresses feature and label shifts by considering both temporal and frequency features, aligning them across domains, and correcting for misalignments to facilitate the detection of private labels. Additionally,RAINCOAT improves transferability by identifying label shifts in target domains. Our experiments with 5 datasets and 13 state-of-the-art UDA methods demonstrate that RAINCOAT can achieve an improvement in performance of up to 16.33%, and can effectively handle both closed-set and universal adaptation.  ( 2 min )
    Quantum Recurrent Neural Networks for Sequential Learning. (arXiv:2302.03244v1 [quant-ph])
    Quantum neural network (QNN) is one of the promising directions where the near-term noisy intermediate-scale quantum (NISQ) devices could find advantageous applications against classical resources. Recurrent neural networks are the most fundamental networks for sequential learning, but up to now there is still a lack of canonical model of quantum recurrent neural network (QRNN), which certainly restricts the research in the field of quantum deep learning. In the present work, we propose a new kind of QRNN which would be a good candidate as the canonical QRNN model, where, the quantum recurrent blocks (QRBs) are constructed in the hardware-efficient way, and the QRNN is built by stacking the QRBs in a staggered way that can greatly reduce the algorithm's requirement with regard to the coherent time of quantum devices. That is, our QRNN is much more accessible on NISQ devices. Furthermore, the performance of the present QRNN model is verified concretely using three different kinds of classical sequential data, i.e., meteorological indicators, stock price, and text categorization. The numerical experiments show that our QRNN achieves much better performance in prediction (classification) accuracy against the classical RNN and state-of-the-art QNN models for sequential learning, and can predict the changing details of temporal sequence data. The practical circuit structure and superior performance indicate that the present QRNN is a promising learning model to find quantum advantageous applications in the near term.  ( 2 min )
    DivBO: Diversity-aware CASH for Ensemble Learning. (arXiv:2302.03255v1 [cs.LG])
    The Combined Algorithm Selection and Hyperparameters optimization (CASH) problem is one of the fundamental problems in Automated Machine Learning (AutoML). Motivated by the success of ensemble learning, recent AutoML systems build post-hoc ensembles to output the final predictions instead of using the best single learner. However, while most CASH methods focus on searching for a single learner with the best performance, they neglect the diversity among base learners (i.e., they may suggest similar configurations to previously evaluated ones), which is also a crucial consideration when building an ensemble. To tackle this issue and further enhance the ensemble performance, we propose DivBO, a diversity-aware framework to inject explicit search of diversity into the CASH problems. In the framework, we propose to use a diversity surrogate to predict the pair-wise diversity of two unseen configurations. Furthermore, we introduce a temporary pool and a weighted acquisition function to guide the search of both performance and diversity based on Bayesian optimization. Empirical results on 15 public datasets show that DivBO achieves the best average ranks (1.82 and 1.73) on both validation and test errors among 10 compared methods, including post-hoc designs in recent AutoML systems and state-of-the-art baselines for ensemble learning on CASH problems.  ( 2 min )
    Exact Inference in High-order Structured Prediction. (arXiv:2302.03236v1 [cs.LG])
    In this paper, we study the problem of inference in high-order structured prediction tasks. In the context of Markov random fields, the goal of a high-order inference task is to maximize a score function on the space of labels, and the score function can be decomposed into sum of unary and high-order potentials. We apply a generative model approach to study the problem of high-order inference, and provide a two-stage convex optimization algorithm for exact label recovery. We also provide a new class of hypergraph structural properties related to hyperedge expansion that drives the success in general high-order inference problems. Finally, we connect the performance of our algorithm and the hyperedge expansion property using a novel hypergraph Cheeger-type inequality.  ( 2 min )
    Continual Learning of Language Models. (arXiv:2302.03241v1 [cs.CL])
    Language models (LMs) have been instrumental for the rapid advance of natural language processing. This paper studies continual learning of LMs, in particular, continual domain-adaptive pre-training (or continual DAP-training). Existing research has shown that further pre-training an LM using a domain corpus to adapt the LM to the domain can improve the end-task performance in the domain. This paper proposes a novel method to continually DAP-train an LM with a sequence of unlabeled domain corpora to adapt the LM to these domains to improve their end-task performances. The key novelty of our method is a soft-masking mechanism that directly controls the update to the LM. A novel proxy is also proposed to preserve the general knowledge in the original LM. Additionally, it contrasts the representations of the previously learned domain knowledge (including the general knowledge in the pre-trained LM) and the knowledge from the current full network to achieve knowledge integration. The method not only overcomes catastrophic forgetting, but also achieves knowledge transfer to improve end-task performances. Empirical evaluation demonstrates the effectiveness of the proposed method.  ( 2 min )
    Easy Learning from Label Proportions. (arXiv:2302.03115v1 [cs.LG])
    We consider the problem of Learning from Label Proportions (LLP), a weakly supervised classification setup where instances are grouped into "bags", and only the frequency of class labels at each bag is available. Albeit, the objective of the learner is to achieve low task loss at an individual instance level. Here we propose Easyllp: a flexible and simple-to-implement debiasing approach based on aggregate labels, which operates on arbitrary loss functions. Our technique allows us to accurately estimate the expected loss of an arbitrary model at an individual level. We showcase the flexibility of our approach by applying it to popular learning frameworks, like Empirical Risk Minimization (ERM) and Stochastic Gradient Descent (SGD) with provable guarantees on instance level performance. More concretely, we exhibit a variance reduction technique that makes the quality of LLP learning deteriorate only by a factor of k (k being bag size) in both ERM and SGD setups, as compared to full supervision. Finally, we validate our theoretical results on multiple datasets demonstrating our algorithm performs as well or better than previous LLP approaches in spite of its simplicity.  ( 2 min )
    Linear optimal partial transport embedding. (arXiv:2302.03232v1 [cs.LG])
    Optimal transport (OT) has gained popularity due to its various applications in fields such as machine learning, statistics, and signal processing. However, the balanced mass requirement limits its performance in practical problems. To address these limitations, variants of the OT problem, including unbalanced OT, Optimal partial transport (OPT), and Hellinger Kantorovich (HK), have been proposed. In this paper, we propose the Linear optimal partial transport (LOPT) embedding, which extends the (local) linearization technique on OT and HK to the OPT problem. The proposed embedding allows for faster computation of OPT distance between pairs of positive measures. Besides our theoretical contributions, we demonstrate the LOPT embedding technique in point-cloud interpolation and PCA analysis.  ( 2 min )
    Genetic Programming Based Symbolic Regression for Analytical Solutions to Differential Equations. (arXiv:2302.03175v1 [cs.LG])
    In this paper, we present a machine learning method for the discovery of analytic solutions to differential equations. The method utilizes an inherently interpretable algorithm, genetic programming based symbolic regression. Unlike conventional accuracy measures in machine learning we demonstrate the ability to recover true analytic solutions, as opposed to a numerical approximation. The method is verified by assessing its ability to recover known analytic solutions for two separate differential equations. The developed method is compared to a conventional, purely data-driven genetic programming based symbolic regression algorithm. The reliability of successful evolution of the true solution, or an algebraic equivalent, is demonstrated.  ( 2 min )
    Towards Lightweight Cross-domain Sequential Recommendation via External Attention-enhanced Graph Convolution Network. (arXiv:2302.03221v1 [cs.IR])
    Cross-domain Sequential Recommendation (CSR) is an emerging yet challenging task that depicts the evolution of behavior patterns for overlapped users by modeling their interactions from multiple domains. Existing studies on CSR mainly focus on using composite or in-depth structures that achieve significant improvement in accuracy but bring a huge burden to the model training. Moreover, to learn the user-specific sequence representations, existing works usually adopt the global relevance weighting strategy (e.g., self-attention mechanism), which has quadratic computational complexity. In this work, we introduce a lightweight external attention-enhanced GCN-based framework to solve the above challenges, namely LEA-GCN. Specifically, by only keeping the neighborhood aggregation component and using the Single-Layer Aggregating Protocol (SLAP), our lightweight GCN encoder performs more efficiently to capture the collaborative filtering signals of the items from both domains. To further alleviate the framework structure and aggregate the user-specific sequential pattern, we devise a novel dual-channel External Attention (EA) component, which calculates the correlation among all items via a lightweight linear structure. Extensive experiments are conducted on two real-world datasets, demonstrating that LEA-GCN requires a smaller volume and less training time without affecting the accuracy compared with several state-of-the-art methods.  ( 2 min )
    Optimization using Parallel Gradient Evaluations on Multiple Parameters. (arXiv:2302.03161v1 [cs.LG])
    We propose a first-order method for convex optimization, where instead of being restricted to the gradient from a single parameter, gradients from multiple parameters can be used during each step of gradient descent. This setup is particularly useful when a few processors are available that can be used in parallel for optimization. Our method uses gradients from multiple parameters in synergy to update these parameters together towards the optima. While doing so, it is ensured that the computational and memory complexity is of the same order as that of gradient descent. Empirical results demonstrate that even using gradients from as low as \textit{two} parameters, our method can often obtain significant acceleration and provide robustness to hyper-parameter settings. We remark that the primary goal of this work is less theoretical, and is instead aimed at exploring the understudied case of using multiple gradients during each step of optimization.  ( 2 min )
    Protecting Language Generation Models via Invisible Watermarking. (arXiv:2302.03162v1 [cs.CR])
    Language generation models have been an increasingly powerful enabler for many applications. Many such models offer free or affordable API access, which makes them potentially vulnerable to model extraction attacks through distillation. To protect intellectual property (IP) and ensure fair use of these models, various techniques such as lexical watermarking and synonym replacement have been proposed. However, these methods can be nullified by obvious countermeasures such as "synonym randomization". To address this issue, we propose GINSEW, a novel method to protect text generation models from being stolen through distillation. The key idea of our method is to inject secret signals into the probability vector of the decoding steps for each target token. We can then detect the secret message by probing a suspect model to tell if it is distilled from the protected one. Experimental results show that GINSEW can effectively identify instances of IP infringement with minimal impact on the generation quality of protected APIs. Our method demonstrates an absolute improvement of 19 to 29 points on mean average precision (mAP) in detecting suspects compared to previous methods against watermark removal attacks.  ( 2 min )
    On the Convergence of Federated Averaging with Cyclic Client Participation. (arXiv:2302.03109v1 [cs.LG])
    Federated Averaging (FedAvg) and its variants are the most popular optimization algorithms in federated learning (FL). Previous convergence analyses of FedAvg either assume full client participation or partial client participation where the clients can be uniformly sampled. However, in practical cross-device FL systems, only a subset of clients that satisfy local criteria such as battery status, network connectivity, and maximum participation frequency requirements (to ensure privacy) are available for training at a given time. As a result, client availability follows a natural cyclic pattern. We provide (to our knowledge) the first theoretical framework to analyze the convergence of FedAvg with cyclic client participation with several different client optimizers such as GD, SGD, and shuffled SGD. Our analysis discovers that cyclic client participation can achieve a faster asymptotic convergence rate than vanilla FedAvg with uniform client participation under suitable conditions, providing valuable insights into the design of client sampling protocols.  ( 2 min )
    Exemplars and Counterexemplars Explanations for Image Classifiers, Targeting Skin Lesion Labeling. (arXiv:2302.03033v1 [eess.IV])
    Explainable AI consists in developing mechanisms allowing for an interaction between decision systems and humans by making the decisions of the formers understandable. This is particularly important in sensitive contexts like in the medical domain. We propose a use case study, for skin lesion diagnosis, illustrating how it is possible to provide the practitioner with explanations on the decisions of a state of the art deep neural network classifier trained to characterize skin lesions from examples. Our framework consists of a trained classifier onto which an explanation module operates. The latter is able to offer the practitioner exemplars and counterexemplars for the classification diagnosis thus allowing the physician to interact with the automatic diagnosis system. The exemplars are generated via an adversarial autoencoder. We illustrate the behavior of the system on representative examples.  ( 2 min )
    Learned Accelerator Framework for Angular-Distance-Based High-Dimensional DBSCAN. (arXiv:2302.03136v1 [cs.IR])
    Density-based clustering is a commonly used tool in data science. Today many data science works are utilizing high-dimensional neural embeddings. However, traditional density-based clustering techniques like DBSCAN have a degraded performance on high-dimensional data. In this paper, we propose LAF, a generic learned accelerator framework to speed up the original DBSCAN and the sampling-based variants of DBSCAN on high-dimensional data with angular distance metric. This framework consists of a learned cardinality estimator and a post-processing module. The cardinality estimator can fast predict whether a data point is core or not to skip unnecessary range queries, while the post-processing module detects the false negative predictions and merges the falsely separated clusters. The evaluation shows our LAF-enhanced DBSCAN method outperforms the state-of-the-art efficient DBSCAN variants on both efficiency and quality.  ( 2 min )
    Fair Minimum Representation Clustering. (arXiv:2302.03151v1 [cs.LG])
    Clustering is an unsupervised learning task that aims to partition data into a set of clusters. In many applications, these clusters correspond to real-world constructs (e.g. electoral districts) whose benefit can only be attained by groups when they reach a minimum level of representation (e.g. 50\% to elect their desired candidate). This paper considers the problem of performing k-means clustering while ensuring groups (e.g. demographic groups) have that minimum level of representation in a specified number of clusters. We show that the popular $k$-means algorithm, Lloyd's algorithm, can result in unfair outcomes where certain groups lack sufficient representation past the minimum threshold in a proportional number of clusters. We formulate the problem through a mixed-integer optimization framework and present a variant of Lloyd's algorithm, called MiniReL, that directly incorporates the fairness constraints. We show that incorporating the fairness criteria leads to a NP-Hard sub-problem within Lloyd's algorithm, but we provide computational approaches that make the problem tractable for even large datasets. Numerical results show that the approach is able to create fairer clusters with practically no increase in the k-means clustering cost across standard benchmark datasets.  ( 2 min )
    Predicting Development of Chronic Obstructive Pulmonary Disease and its Risk Factor Analysis. (arXiv:2302.03137v1 [q-bio.QM])
    Chronic Obstructive Pulmonary Disease (COPD) is an irreversible airway obstruction with a high societal burden. Although smoking is known to be the biggest risk factor, additional components need to be considered. In this study, we aim to identify COPD risk factors by applying machine learning models that integrate sociodemographic, clinical, and genetic data to predict COPD development.  ( 2 min )
    State-wise Safe Reinforcement Learning: A Survey. (arXiv:2302.03122v1 [cs.LG])
    Despite the tremendous success of Reinforcement Learning (RL) algorithms in simulation environments, applying RL to real-world applications still faces many challenges. A major concern is safety, in another word, constraint satisfaction. State-wise constraints are one of the most common constraints in real-world applications and one of the most challenging constraints in Safe RL. Enforcing state-wise constraints is necessary and essential to many challenging tasks such as autonomous driving, robot manipulation. This paper provides a comprehensive review of existing approaches that address state-wise constraints in RL. Under the framework of State-wise Constrained Markov Decision Process (SCMDP), we will discuss the connections, differences, and trade-offs of existing approaches in terms of (i) safety guarantee and scalability, (ii) safety and reward performance, and (iii) safety after convergence and during training. We also summarize limitations of current methods and discuss potential future directions.  ( 2 min )
    Importance attribution in neural networks by means of persistence landscapes of time series. (arXiv:2302.03132v1 [cs.LG])
    We propose and implement a method to analyze time series with a neural network using a matrix of area-normalized persistence landscapes obtained through topological data analysis. We include a gating layer in the network's architecture that is able to identify the most relevant landscape levels for the classification task, thus working as an importance attribution system. Next, we perform a matching between the selected landscape functions and the corresponding critical points of the original time series. From this matching we are able to reconstruct an approximate shape of the time series that gives insight into the classification decision. We test this technique with input data from a dataset of electrocardiographic signals.  ( 2 min )
    Efficient and Flexible Topic Modeling using Pretrained Embeddings and Bag of Sentences. (arXiv:2302.03106v1 [cs.CL])
    Pre-trained language models have led to a new state-of-the-art in many NLP tasks. However, for topic modeling, statistical generative models such as LDA are still prevalent, which do not easily allow incorporating contextual word vectors. They might yield topics that do not align very well with human judgment. In this work, we propose a novel topic modeling and inference algorithm. We suggest a bag of sentences (BoS) approach using sentences as the unit of analysis. We leverage pre-trained sentence embeddings by combining generative process models with clustering. We derive a fast inference algorithm based on expectation maximization, hard assignments, and an annealing process. Our evaluation shows that our method yields state-of-the art results with relatively little computational demands. Our methods is more flexible compared to prior works leveraging word embeddings, since it provides the possibility to customize topic-document distributions using priors. Code is at \url{https://github.com/JohnTailor/BertSenClu}.  ( 2 min )
    One-shot Empirical Privacy Estimation for Federated Learning. (arXiv:2302.03098v1 [cs.LG])
    Privacy auditing techniques for differentially private (DP) algorithms are useful for estimating the privacy loss to compare against analytical bounds, or empirically measure privacy in settings where known analytical bounds on the DP loss are not tight. However, existing privacy auditing techniques usually make strong assumptions on the adversary (e.g., knowledge of intermediate model iterates or the training data distribution), are tailored to specific tasks and model architectures, and require retraining the model many times (typically on the order of thousands). These shortcomings make deploying such techniques at scale difficult in practice, especially in federated settings where model training can take days or weeks. In this work, we present a novel "one-shot" approach that can systematically address these challenges, allowing efficient auditing or estimation of the privacy loss of a model during the same, single training run used to fit model parameters. Our privacy auditing method for federated learning does not require a priori knowledge about the model architecture or task. We show that our method provides provably correct estimates for privacy loss under the Gaussian mechanism, and we demonstrate its performance on a well-established FL benchmark dataset under several adversarial models.  ( 2 min )
    Memory-Based Meta-Learning on Non-Stationary Distributions. (arXiv:2302.03067v1 [cs.LG])
    Memory-based meta-learning is a technique for approximating Bayes-optimal predictors. Under fairly general conditions, minimizing sequential prediction error, measured by the log loss, leads to implicit meta-learning. The goal of this work is to investigate how far this interpretation can be realized by current sequence prediction models and training regimes. The focus is on piecewise stationary sources with unobserved switching-points, which arguably capture an important characteristic of natural language and action-observation sequences in partially observable environments. We show that various types of memory-based neural models, including Transformers, LSTMs, and RNNs can learn to accurately approximate known Bayes-optimal algorithms and behave as if performing Bayesian inference over the latent switching-points and the latent parameters governing the data distribution within each segment.  ( 2 min )
    DITTO: Offline Imitation Learning with World Models. (arXiv:2302.03086v1 [cs.LG])
    We propose DITTO, an offline imitation learning algorithm which uses world models and on-policy reinforcement learning to addresses the problem of covariate shift, without access to an oracle or any additional online interactions. We discuss how world models enable offline, on-policy imitation learning, and propose a simple intrinsic reward defined in the world model latent space that induces imitation learning by reinforcement learning. Theoretically, we show that our formulation induces a divergence bound between expert and learner, in turn bounding the difference in reward. We test our method on difficult Atari environments from pixels alone, and achieve state-of-the-art performance in the offline setting.  ( 2 min )
    Five policy uses of algorithmic explainability. (arXiv:2302.03080v1 [cs.LG])
    The notion that algorithmic systems should be "explainable" is common in the many statements of consensus principles developed by governments, companies, and advocacy organizations. But what exactly do these policy and legal actors want from explainability, and how do their desiderata compare with the explainability techniques developed in the machine learning literature? We explore this question in hopes of better connecting the policy and technical communities. We outline five settings in which policymakers seek to use explainability: complying with specific requirements for explanation; helping to obtain regulatory approval in highly regulated settings; enabling or interfacing with liability; flexibly managing risk as part of a self-regulatory process; and providing model and data transparency. We illustrate each setting with an in-depth case study contextualizing the purpose and role of explanation. Drawing on these case studies, we discuss common factors limiting policymakers' use of explanation and promising ways in which explanation can be used in policy. We conclude with recommendations for researchers and policymakers.  ( 2 min )
    Evaluating Self-Supervised Learning via Risk Decomposition. (arXiv:2302.03068v1 [cs.LG])
    Self-supervised learning (SSL) pipelines differ in many design choices such as the architecture, augmentations, or pretraining data. Yet SSL is typically evaluated using a single metric: linear probing on ImageNet. This does not provide much insight into why or when a model is better, now how to improve it. To address this, we propose an SSL risk decomposition, which generalizes the classical supervised approximation-estimation decomposition by considering errors arising from the representation learning step. Our decomposition consists of four error components: approximation, representation usability, probe generalization, and encoder generalization. We provide efficient estimators for each component and use them to analyze the effect of 30 design choices on 169 SSL vision models evaluated on ImageNet. Our analysis gives valuable insights for designing and using SSL models. For example, it highlights the main sources of error and shows how to improve SSL in specific settings (full- vs few-shot) by trading off error components. All results and pretrained models are at https://github.com/YannDubs/SSL-Risk-Decomposition.  ( 2 min )
    Single Cells Are Spatial Tokens: Transformers for Spatial Transcriptomic Data Imputation. (arXiv:2302.03038v1 [q-bio.GN])
    Spatially resolved transcriptomics brings exciting breakthroughs to single-cell analysis by providing physical locations along with gene expression. However, as a cost of the extremely high spatial resolution, the cellular level spatial transcriptomic data suffer significantly from missing values. While a standard solution is to perform imputation on the missing values, most existing methods either overlook spatial information or only incorporate localized spatial context without the ability to capture long-range spatial information. Using multi-head self-attention mechanisms and positional encoding, transformer models can readily grasp the relationship between tokens and encode location information. In this paper, by treating single cells as spatial tokens, we study how to leverage transformers to facilitate spatial tanscriptomics imputation. In particular, investigate the following two key questions: (1) $\textit{how to encode spatial information of cells in transformers}$, and (2) $\textit{ how to train a transformer for transcriptomic imputation}$. By answering these two questions, we present a transformer-based imputation framework, SpaFormer, for cellular-level spatial transcriptomic data. Extensive experiments demonstrate that SpaFormer outperforms existing state-of-the-art imputation algorithms on three large-scale datasets.  ( 2 min )
    LiteVR: Interpretable and Lightweight Cybersickness Detection using Explainable AI. (arXiv:2302.03037v1 [cs.HC])
    Cybersickness is a common ailment associated with virtual reality (VR) user experiences. Several automated methods exist based on machine learning (ML) and deep learning (DL) to detect cybersickness. However, most of these cybersickness detection methods are perceived as computationally intensive and black-box methods. Thus, those techniques are neither trustworthy nor practical for deploying on standalone energy-constrained VR head-mounted devices (HMDs). In this work, we present an explainable artificial intelligence (XAI)-based framework, LiteVR, for cybersickness detection, explaining the model's outcome and reducing the feature dimensions and overall computational costs. First, we develop three cybersickness DL models based on long-term short-term memory (LSTM), gated recurrent unit (GRU), and multilayer perceptron (MLP). Then, we employed a post-hoc explanation, such as SHapley Additive Explanations (SHAP), to explain the results and extract the most dominant features of cybersickness. Finally, we retrain the DL models with the reduced number of features. Our results show that eye-tracking features are the most dominant for cybersickness detection. Furthermore, based on the XAI-based feature ranking and dimensionality reduction, we significantly reduce the model's size by up to 4.3x, training time by up to 5.6x, and its inference time by up to 3.8x, with higher cybersickness detection accuracy and low regression error (i.e., on Fast Motion Scale (FMS)). Our proposed lite LSTM model obtained an accuracy of 94% in classifying cybersickness and regressing (i.e., FMS 1-10) with a Root Mean Square Error (RMSE) of 0.30, which outperforms the state-of-the-art. Our proposed LiteVR framework can help researchers and practitioners analyze, detect, and deploy their DL-based cybersickness detection models in standalone VR HMDs.  ( 2 min )
  • Open

    Understanding Why Generalized Reweighting Does Not Improve Over ERM. (arXiv:2201.12293v4 [cs.LG] UPDATED)
    Empirical risk minimization (ERM) is known in practice to be non-robust to distributional shift where the training and the test distributions are different. A suite of approaches, such as importance weighting, and variants of distributionally robust optimization (DRO), have been proposed to solve this problem. But a line of recent work has empirically shown that these approaches do not significantly improve over ERM in real applications with distribution shift. The goal of this work is to obtain a comprehensive theoretical understanding of this intriguing phenomenon. We first posit the class of Generalized Reweighting (GRW) algorithms, as a broad category of approaches that iteratively update model parameters based on iterative reweighting of the training samples. We show that when overparameterized models are trained under GRW, the resulting models are close to that obtained by ERM. We also show that adding small regularization which does not greatly affect the empirical training accuracy does not help. Together, our results show that a broad category of what we term GRW approaches are not able to achieve distributionally robust generalization. Our work thus has the following sobering takeaway: to make progress towards distributionally robust generalization, we either have to develop non-GRW approaches, or perhaps devise novel classification/regression loss functions that are adapted to the class of GRW approaches.  ( 2 min )
    Generalization Bounds of Nonconvex-(Strongly)-Concave Stochastic Minimax Optimization. (arXiv:2205.14278v2 [math.OC] UPDATED)
    This paper takes an initial step to systematically investigate the generalization bounds of algorithms for solving nonconvex-(strongly)-concave (NC-SC/NC-C) stochastic minimax optimization measured by the stationarity of primal functions. We first establish algorithm-agnostic generalization bounds via uniform convergence between the empirical minimax problem and the population minimax problem. The sample complexities for achieving $\epsilon$-generalization are $\tilde{\mathcal{O}}(d\kappa^2\epsilon^{-2})$ and $\tilde{\mathcal{O}}(d\epsilon^{-4})$ for NC-SC and NC-C settings, respectively, where $d$ is the dimension and $\kappa$ is the condition number. We further study the algorithm-dependent generalization bounds via stability arguments of algorithms. In particular, we introduce a novel stability notion for minimax problems and build a connection between generalization bounds and the stability notion. As a result, we establish algorithm-dependent generalization bounds for stochastic gradient descent ascent (SGDA) algorithm and the more general sampling-determined algorithms.  ( 2 min )
    A nonparametric extension of randomized response for private confidence sets. (arXiv:2202.08728v2 [stat.ME] UPDATED)
    This work derives methods for performing nonparametric, nonasymptotic statistical inference for population means under the constraint of local differential privacy (LDP). Given bounded observations $(X_1, \dots, X_n)$ with mean $\mu^\star$ that are privatized into $(Z_1, \dots, Z_n)$, we present confidence intervals (CI) and time-uniform confidence sequences (CS) for $\mu^\star$ when only given access to the privatized data. To achieve this, we introduce a nonparametric and sequentially interactive generalization of Warner's famous ``randomized response'' mechanism, satisfying LDP for arbitrary bounded random variables, and then provide CIs and CSs for their means given access to the resulting privatized observations. For example, our results yield private analogues of Hoeffding's inequality in both fixed-time and time-uniform regimes. We extend these Hoeffding-type CSs to capture time-varying (non-stationary) means, and conclude by illustrating how these methods can be used to conduct private online A/B tests.  ( 2 min )
    Revised Conditional t-SNE: Looking Beyond the Nearest Neighbors. (arXiv:2302.03493v1 [cs.LG])
    Conditional t-SNE (ct-SNE) is a recent extension to t-SNE that allows removal of known cluster information from the embedding, to obtain a visualization revealing structure beyond label information. This is useful, for example, when one wants to factor out unwanted differences between a set of classes. We show that ct-SNE fails in many realistic settings, namely if the data is well clustered over the labels in the original high-dimensional space. We introduce a revised method by conditioning the high-dimensional similarities instead of the low-dimensional similarities and storing within- and across-label nearest neighbors separately. This also enables the use of recently proposed speedups for t-SNE, improving the scalability. From experiments on synthetic data, we find that our proposed method resolves the considered problems and improves the embedding quality. On real data containing batch effects, the expected improvement is not always there. We argue revised ct-SNE is preferable overall, given its improved scalability. The results also highlight new open questions, such as how to handle distance variations between clusters.  ( 2 min )
    Iterated Block Particle Filter for High-dimensional Parameter Learning: Beating the Curse of Dimensionality. (arXiv:2110.10745v3 [stat.ML] UPDATED)
    Parameter learning for high-dimensional, partially observed, and nonlinear stochastic processes is a methodological challenge. Spatiotemporal disease transmission systems provide examples of such processes giving rise to open inference problems. We propose the iterated block particle filter (IBPF) algorithm for learning high-dimensional parameters over graphical state space models with general state spaces, measures, transition densities and graph structure. Theoretical performance guarantees are obtained on beating the curse of dimensionality (COD), algorithm convergence, and likelihood maximization. Experiments on a highly nonlinear and non-Gaussian spatiotemporal model for measles transmission reveal that the iterated ensemble Kalman filter algorithm (Li et al. (2020)) is ineffective and the iterated filtering algorithm (Ionides et al. (2015)) suffers from the COD, while our IBPF algorithm beats COD consistently across various experiments with different metrics.  ( 2 min )
    On the symmetries in the dynamics of wide two-layer neural networks. (arXiv:2211.08771v3 [cs.LG] UPDATED)
    We consider the idealized setting of gradient flow on the population risk for infinitely wide two-layer ReLU neural networks (without bias), and study the effect of symmetries on the learned parameters and predictors. We first describe a general class of symmetries which, when satisfied by the target function $f^*$ and the input distribution, are preserved by the dynamics. We then study more specific cases. When $f^*$ is odd, we show that the dynamics of the predictor reduces to that of a (non-linearly parameterized) linear predictor, and its exponential convergence can be guaranteed. When $f^*$ has a low-dimensional structure, we prove that the gradient flow PDE reduces to a lower-dimensional PDE. Furthermore, we present informal and numerical arguments that suggest that the input neurons align with the lower-dimensional structure of the problem.  ( 2 min )
    Accelerated Nonnegative Tensor Completion via Integer Programming. (arXiv:2211.15770v2 [cs.LG] UPDATED)
    The problem of tensor completion has applications in healthcare, computer vision, and other domains. However, past approaches to tensor completion have faced a tension in that they either have polynomial-time computation but require exponentially more samples than the information-theoretic rate, or they use fewer samples but require solving NP-hard problems for which there are no known practical algorithms. A recent approach, based on integer programming, resolves this tension for nonnegative tensor completion. It achieves the information-theoretic sample complexity rate and deploys the Blended Conditional Gradients algorithm, which requires a linear (in numerical tolerance) number of oracle steps to converge to the global optimum. The tradeoff in this approach is that, in the worst case, the oracle step requires solving an integer linear program. Despite this theoretical limitation, numerical experiments show that this algorithm can, on certain instances, scale up to 100 million entries while running on a personal computer. The goal of this paper is to further enhance this algorithm, with the intention to expand both the breadth and scale of instances that can be solved. We explore several variants that can maintain the same theoretical guarantees as the algorithm, but offer potentially faster computation. We consider different data structures, acceleration of gradient descent steps, and the use of the Blended Pairwise Conditional Gradients algorithm. We describe the original approach and these variants, and conduct numerical experiments in order to explore various tradeoffs in these algorithmic design choices.  ( 2 min )
    Deep-OSG: A deep learning approach for approximating a family of operators in semigroup to model unknown autonomous systems. (arXiv:2302.03358v1 [cs.LG])
    This paper proposes a novel deep learning approach for approximating evolution operators and modeling unknown autonomous dynamical systems using time series data collected at varied time lags. It is a sequel to the previous works [T. Qin, K. Wu, and D. Xiu, J. Comput. Phys., 395:620--635, 2019], [K. Wu and D. Xiu, J. Comput. Phys., 408:109307, 2020], and [Z. Chen, V. Churchill, K. Wu, and D. Xiu, J. Comput. Phys., 449:110782, 2022], which focused on learning single evolution operator with a fixed time step. This paper aims to learn a family of evolution operators with variable time steps, which constitute a semigroup for an autonomous system. The semigroup property is very crucial and links the system's evolutionary behaviors across varying time scales, but it was not considered in the previous works. We propose for the first time a framework of embedding the semigroup property into the data-driven learning process, through a novel neural network architecture and new loss functions. The framework is very feasible, can be combined with any suitable neural networks, and is applicable to learning general autonomous ODEs and PDEs. We present the rigorous error estimates and variance analysis to understand the prediction accuracy and robustness of our approach, showing the remarkable advantages of semigroup awareness in our model. Moreover, our approach allows one to arbitrarily choose the time steps for prediction and ensures that the predicted results are well self-matched and consistent. Extensive numerical experiments demonstrate that embedding the semigroup property notably reduces the data dependency of deep learning models and greatly improves the accuracy, robustness, and stability for long-time prediction.  ( 2 min )
    Temporal Robustness against Data Poisoning. (arXiv:2302.03684v1 [cs.LG])
    Data poisoning considers cases when an adversary maliciously inserts and removes training data to manipulate the behavior of machine learning algorithms. Traditional threat models of data poisoning center around a single metric, the number of poisoned samples. In consequence, existing defenses are essentially vulnerable in practice when poisoning more samples remains a feasible option for attackers. To address this issue, we leverage timestamps denoting the birth dates of data, which are often available but neglected in the past. Benefiting from these timestamps, we propose a temporal threat model of data poisoning and derive two novel metrics, earliness and duration, which respectively measure how long an attack started in advance and how long an attack lasted. With these metrics, we define the notions of temporal robustness against data poisoning, providing a meaningful sense of protection even with unbounded amounts of poisoned samples. We present a benchmark with an evaluation protocol simulating continuous data collection and periodic deployments of updated models, thus enabling empirical evaluation of temporal robustness. Lastly, we develop and also empirically verify a baseline defense, namely temporal aggregation, offering provable temporal robustness and highlighting the potential of our temporal modeling of data poisoning.  ( 2 min )
    Revisiting Discriminative vs. Generative Classifiers: Theory and Implications. (arXiv:2302.02334v1 [cs.LG] CROSS LISTED)
    A large-scale deep model pre-trained on massive labeled or unlabeled data transfers well to downstream tasks. Linear evaluation freezes parameters in the pre-trained model and trains a linear classifier separately, which is efficient and attractive for transfer. However, little work has investigated the classifier in linear evaluation except for the default logistic regression. Inspired by the statistical efficiency of naive Bayes, the paper revisits the classical topic on discriminative vs. generative classifiers. Theoretically, the paper considers the surrogate loss instead of the zero-one loss in analyses and generalizes the classical results from binary cases to multiclass ones. We show that, under mild assumptions, multiclass naive Bayes requires $O(\log n)$ samples to approach its asymptotic error while the corresponding multiclass logistic regression requires $O(n)$ samples, where $n$ is the feature dimension. To establish it, we present a multiclass $\mathcal{H}$-consistency bound framework and an explicit bound for logistic loss, which are of independent interests. Simulation results on a mixture of Gaussian validate our theoretical findings. Experiments on various pre-trained deep vision models show that naive Bayes consistently converges faster as the number of data increases. Besides, naive Bayes shows promise in few-shot cases and we observe the ``two regimes'' phenomenon in pre-trained supervised models. Our code is available at https://github.com/ML-GSAI/Revisiting-Dis-vs-Gen-Classifiers.  ( 2 min )
    Representation Theory for Geometric Quantum Machine Learning. (arXiv:2210.07980v2 [quant-ph] UPDATED)
    Recent advances in classical machine learning have shown that creating models with inductive biases encoding the symmetries of a problem can greatly improve performance. Importation of these ideas, combined with an existing rich body of work at the nexus of quantum theory and symmetry, has given rise to the field of Geometric Quantum Machine Learning (GQML). Following the success of its classical counterpart, it is reasonable to expect that GQML will play a crucial role in developing problem-specific and quantum-aware models capable of achieving a computational advantage. Despite the simplicity of the main idea of GQML -- create architectures respecting the symmetries of the data -- its practical implementation requires a significant amount of knowledge of group representation theory. We present an introduction to representation theory tools from the optics of quantum learning, driven by key examples involving discrete and continuous groups. These examples are sewn together by an exposition outlining the formal capture of GQML symmetries via "label invariance under the action of a group representation", a brief (but rigorous) tour through finite and compact Lie group representation theory, a reexamination of ubiquitous tools like Haar integration and twirling, and an overview of some successful strategies for detecting symmetries.  ( 2 min )
    Matrix Estimation for Individual Fairness. (arXiv:2302.02096v1 [cs.LG] CROSS LISTED)
    In recent years, multiple notions of algorithmic fairness have arisen. One such notion is individual fairness (IF), which requires that individuals who are similar receive similar treatment. In parallel, matrix estimation (ME) has emerged as a natural paradigm for handling noisy data with missing values. In this work, we connect the two concepts. We show that pre-processing data using ME can improve an algorithm's IF without sacrificing performance. Specifically, we show that using a popular ME method known as singular value thresholding (SVT) to pre-process the data provides a strong IF guarantee under appropriate conditions. We then show that, under analogous conditions, SVT pre-processing also yields estimates that are consistent and approximately minimax optimal. As such, the ME pre-processing step does not, under the stated conditions, increase the prediction error of the base algorithm, i.e., does not impose a fairness-performance trade-off. We verify these results on synthetic and real data.  ( 2 min )
    Analyzing Tree Architectures in Ensembles via Neural Tangent Kernel. (arXiv:2205.12904v2 [cs.LG] UPDATED)
    A soft tree is an actively studied variant of a decision tree that updates splitting rules using the gradient method. Although soft trees can take various architectures, their impact is not theoretically well known. In this paper, we formulate and analyze the Neural Tangent Kernel (NTK) induced by soft tree ensembles for arbitrary tree architectures. This kernel leads to the remarkable finding that only the number of leaves at each depth is relevant for the tree architecture in ensemble learning with an infinite number of trees. In other words, if the number of leaves at each depth is fixed, the training behavior in function space and the generalization performance are exactly the same across different tree architectures, even if they are not isomorphic. We also show that the NTK of asymmetric trees like decision lists does not degenerate when they get infinitely deep. This is in contrast to the perfect binary trees, whose NTK is known to degenerate and leads to worse generalization performance for deeper trees.  ( 2 min )
    Deep Linear Networks can Benignly Overfit when Shallow Ones Do. (arXiv:2209.09315v2 [cs.LG] UPDATED)
    We bound the excess risk of interpolating deep linear networks trained using gradient flow. In a setting previously used to establish risk bounds for the minimum $\ell_2$-norm interpolant, we show that randomly initialized deep linear networks can closely approximate or even match known bounds for the minimum $\ell_2$-norm interpolant. Our analysis also reveals that interpolating deep linear models have exactly the same conditional variance as the minimum $\ell_2$-norm solution. Since the noise affects the excess risk only through the conditional variance, this implies that depth does not improve the algorithm's ability to "hide the noise". Our simulations verify that aspects of our bounds reflect typical behavior for simple data distributions. We also find that similar phenomena are seen in simulations with ReLU networks, although the situation there is more nuanced.  ( 2 min )
    Variance-Aware Sparse Linear Bandits. (arXiv:2205.13450v3 [cs.LG] UPDATED)
    It is well-known that for sparse linear bandits, when ignoring the dependency on sparsity which is much smaller than the ambient dimension, the worst-case minimax regret is $\widetilde{\Theta}\left(\sqrt{dT}\right)$ where $d$ is the ambient dimension and $T$ is the number of rounds. On the other hand, in the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve $\widetilde{\mathcal O}(1)$ regret, which is (nearly) independent of $d$ and $T$. In this paper, we present the first variance-aware regret guarantee for sparse linear bandits: $\widetilde{\mathcal O}\left(\sqrt{d\sum_{t=1}^T \sigma_t^2} + 1\right)$, where $\sigma_t^2$ is the variance of the noise at the $t$-th round. This bound naturally interpolates the regret bounds for the worst-case constant-variance regime (i.e., $\sigma_t \equiv \Omega(1)$) and the benign deterministic regimes (i.e., $\sigma_t \equiv 0$). To achieve this variance-aware regret guarantee, we develop a general framework that converts any variance-aware linear bandit algorithm to a variance-aware algorithm for sparse linear bandits in a "black-box" manner. Specifically, we take two recent algorithms as black boxes to illustrate that the claimed bounds indeed hold, where the first algorithm can handle unknown-variance cases and the second one is more efficient.  ( 2 min )
    Incentive-aware Contextual Pricing with Non-parametric Market Noise. (arXiv:1911.03508v3 [cs.LG] UPDATED)
    We consider a dynamic pricing problem for repeated contextual second-price auctions with multiple strategic buyers who aim to maximize their long-term time discounted utility. The seller has limited information on buyers' overall demand curves which depends on a non-parametric market-noise distribution, and buyers may potentially submit corrupted bids (relative to true valuations) to manipulate the seller's pricing policy for more favorable reserve prices in the future. We focus on designing the seller's learning policy to set contextual reserve prices where the seller's goal is to minimize regret compared to the revenue of a benchmark clairvoyant policy that has full information of buyers' demand. We propose a policy with a phased-structure that incorporates randomized "isolation" periods, during which a buyer is randomly chosen to solely participate in the auction. We show that this design allows the seller to control the number of periods in which buyers significantly corrupt their bids. We then prove that our policy enjoys a $T$-period regret of $\widetilde{\mathcal{O}}(\sqrt{T})$ facing strategic buyers. Finally, we conduct numerical simulations to compare our proposed algorithm to standard pricing policies. Our numerical results show that our algorithm outperforms these policies under various buyer bidding behavior.  ( 2 min )
    Linear Partial Monitoring for Sequential Decision-Making: Algorithms, Regret Bounds and Applications. (arXiv:2302.03683v1 [cs.LG])
    Partial monitoring is an expressive framework for sequential decision-making with an abundance of applications, including graph-structured and dueling bandits, dynamic pricing and transductive feedback models. We survey and extend recent results on the linear formulation of partial monitoring that naturally generalizes the standard linear bandit setting. The main result is that a single algorithm, information-directed sampling (IDS), is (nearly) worst-case rate optimal in all finite-action games. We present a simple and unified analysis of stochastic partial monitoring, and further extend the model to the contextual and kernelized setting.  ( 2 min )
    A distribution-free mixed-integer optimization approach to hierarchical modelling of clustered and longitudinal data. (arXiv:2302.03157v1 [stat.ME])
    We create a mixed-integer optimization (MIO) approach for doing cluster-aware regression, i.e. linear regression that takes into account the inherent clustered structure of the data. We compare to the linear mixed effects regression (LMEM) which is the most used current method, and design simulation experiments to show superior performance to LMEM in terms of both predictive and inferential metrics in silico. Furthermore, we show how our method is formulated in a very interpretable way; LMEM cannot generalize and make cluster-informed predictions when the cluster of new data points is unknown, but we solve this problem by training an interpretable classification tree that can help decide cluster effects for new data points, and demonstrate the power of this generalizability on a real protein expression dataset.  ( 2 min )
    Identification of Power System Oscillation Modes using Blind Source Separation based on Copula Statistic. (arXiv:2302.03633v1 [eess.SP])
    The dynamics of a power system with large penetration of renewable energy resources are becoming more nonlinear due to the intermittence of these resources and the switching of their power electronic devices. Therefore, it is crucial to accurately identify the dynamical modes of oscillation of such a power system when it is subject to disturbances to initiate appropriate preventive or corrective control actions. In this paper, we propose a high-order blind source identification (HOBI) algorithm based on the copula statistic to address these non-linear dynamics in modal analysis. The method combined with Hilbert transform (HOBI-HT) and iteration procedure (HOBMI) can identify all the modes as well as the model order from the observation signals obtained from the number of channels as low as one. We access the performance of the proposed method on numerical simulation signals and recorded data from a simulation of time domain analysis on the classical 11-Bus 4-Machine test system. Our simulation results outperform the state-of-the-art method in accuracy and effectiveness.  ( 2 min )
    Optimizing Audio Recommendations for the Long-Term: A Reinforcement Learning Perspective. (arXiv:2302.03561v1 [cs.LG])
    We study the problem of optimizing a recommender system for outcomes that occur over several weeks or months. We begin by drawing on reinforcement learning to formulate a comprehensive model of users' recurring relationships with a recommender system. Measurement, attribution, and coordination challenges complicate algorithm design. We describe careful modeling -- including a new representation of user state and key conditional independence assumptions -- which overcomes these challenges and leads to simple, testable recommender system prototypes. We apply our approach to a podcast recommender system that makes personalized recommendations to hundreds of millions of listeners. A/B tests demonstrate that purposefully optimizing for long-term outcomes leads to large performance gains over conventional approaches that optimize for short-term proxies.  ( 2 min )
    Evaluating Self-Supervised Learning via Risk Decomposition. (arXiv:2302.03068v1 [cs.LG])
    Self-supervised learning (SSL) pipelines differ in many design choices such as the architecture, augmentations, or pretraining data. Yet SSL is typically evaluated using a single metric: linear probing on ImageNet. This does not provide much insight into why or when a model is better, now how to improve it. To address this, we propose an SSL risk decomposition, which generalizes the classical supervised approximation-estimation decomposition by considering errors arising from the representation learning step. Our decomposition consists of four error components: approximation, representation usability, probe generalization, and encoder generalization. We provide efficient estimators for each component and use them to analyze the effect of 30 design choices on 169 SSL vision models evaluated on ImageNet. Our analysis gives valuable insights for designing and using SSL models. For example, it highlights the main sources of error and shows how to improve SSL in specific settings (full- vs few-shot) by trading off error components. All results and pretrained models are at https://github.com/YannDubs/SSL-Risk-Decomposition.  ( 2 min )
    Breaking the Curse of Multiagents in a Large State Space: RL in Markov Games with Independent Linear Function Approximation. (arXiv:2302.03673v1 [cs.LG])
    We propose a new model, independent linear Markov game, for multi-agent reinforcement learning with a large state space and a large number of agents. This is a class of Markov games with independent linear function approximation, where each agent has its own function approximation for the state-action value functions that are marginalized by other players' policies. We design new algorithms for learning the Markov coarse correlated equilibria (CCE) and Markov correlated equilibria (CE) with sample complexity bounds that only scale polynomially with each agent's own function class complexity, thus breaking the curse of multiagents. In contrast, existing works for Markov games with function approximation have sample complexity bounds scale with the size of the \emph{joint action space} when specialized to the canonical tabular Markov game setting, which is exponentially large in the number of agents. Our algorithms rely on two key technical innovations: (1) utilizing policy replay to tackle non-stationarity incurred by multiple agents and the use of function approximation; (2) separating learning Markov equilibria and exploration in the Markov games, which allows us to use the full-information no-regret learning oracle instead of the stronger bandit-feedback no-regret learning oracle used in the tabular setting. Furthermore, we propose an iterative-best-response type algorithm that can learn pure Markov Nash equilibria in independent linear Markov potential games. In the tabular case, by adapting the policy replay mechanism for independent linear Markov games, we propose an algorithm with $\widetilde{O}(\epsilon^{-2})$ sample complexity to learn Markov CCE, which improves the state-of-the-art result $\widetilde{O}(\epsilon^{-3})$ in Daskalakis et al. 2022, where $\epsilon$ is the desired accuracy, and also significantly improves other problem parameters.  ( 3 min )
    Approximate message passing from random initialization with applications to $\mathbb{Z}_{2}$ synchronization. (arXiv:2302.03682v1 [math.ST])
    This paper is concerned with the problem of reconstructing an unknown rank-one matrix with prior structural information from noisy observations. While computing the Bayes-optimal estimator seems intractable in general due to its nonconvex nature, Approximate Message Passing (AMP) emerges as an efficient first-order method to approximate the Bayes-optimal estimator. However, the theoretical underpinnings of AMP remain largely unavailable when it starts from random initialization, a scheme of critical practical utility. Focusing on a prototypical model called $\mathbb{Z}_{2}$ synchronization, we characterize the finite-sample dynamics of AMP from random initialization, uncovering its rapid global convergence. Our theory provides the first non-asymptotic characterization of AMP in this model without requiring either an informative initialization (e.g., spectral initialization) or sample splitting.  ( 2 min )
    SDYN-GANs: Adversarial Learning Methods for Multistep Generative Models for General Order Stochastic Dynamics. (arXiv:2302.03663v1 [cs.LG])
    We introduce adversarial learning methods for data-driven generative modeling of the dynamics of $n^{th}$-order stochastic systems. Our approach builds on Generative Adversarial Networks (GANs) with generative model classes based on stable $m$-step stochastic numerical integrators. We introduce different formulations and training methods for learning models of stochastic dynamics based on observation of trajectory samples. We develop approaches using discriminators based on Maximum Mean Discrepancy (MMD), training protocols using conditional and marginal distributions, and methods for learning dynamic responses over different time-scales. We show how our approaches can be used for modeling physical systems to learn force-laws, damping coefficients, and noise-related parameters. The adversarial learning approaches provide methods for obtaining stable generative models for dynamic tasks including long-time prediction and developing simulations for stochastic systems.  ( 2 min )
    Riemannian Flow Matching on General Geometries. (arXiv:2302.03660v1 [cs.LG])
    We propose Riemannian Flow Matching (RFM), a simple yet powerful framework for training continuous normalizing flows on manifolds. Existing methods for generative modeling on manifolds either require expensive simulation, inherently cannot scale to high dimensions, or use approximations to limiting quantities that result in biased objectives. Riemannian Flow Matching bypasses these inconveniences and exhibits multiple benefits over prior approaches: It is completely simulation-free on simple geometries, it does not require divergence computation, and its target vector field is computed in closed form even on general geometries. The key ingredient behind RFM is the construction of a simple kernel function for defining per-sample vector fields, which subsumes existing Euclidean cases. Extending to general geometries, we rely on the use of spectral decompositions to efficiently compute kernel functions. Our method achieves state-of-the-art performance on real-world non-Euclidean datasets, and we showcase, for the first time, tractable training on general geometries, including on triangular meshes and maze-like manifolds with boundaries.  ( 2 min )
    Multi-Scale Message Passing Neural PDE Solvers. (arXiv:2302.03580v1 [cs.LG])
    We propose a novel multi-scale message passing neural network algorithm for learning the solutions of time-dependent PDEs. Our algorithm possesses both temporal and spatial multi-scale resolution features by incorporating multi-scale sequence models and graph gating modules in the encoder and processor, respectively. Benchmark numerical experiments are presented to demonstrate that the proposed algorithm outperforms baselines, particularly on a PDE with a range of spatial and temporal scales.  ( 2 min )
    Convergence rates for momentum stochastic gradient descent with noise of machine learning type. (arXiv:2302.03550v1 [math.OC])
    We consider the momentum stochastic gradient descent scheme (MSGD) and its continuous-in-time counterpart in the context of non-convex optimization. We show almost sure exponential convergence of the objective function value for target functions that are Lipschitz continuous and satisfy the Polyak-Lojasiewicz inequality on the relevant domain, and under assumptions on the stochastic noise that are motivated by overparameterized supervised learning applications. Moreover, we optimize the convergence rate over the set of friction parameters and show that the MSGD process almost surely converges.  ( 2 min )
    Efficient Parametric Approximations of Neural Network Function Space Distance. (arXiv:2302.03519v1 [cs.LG])
    It is often useful to compactly summarize important properties of model parameters and training data so that they can be used later without storing and/or iterating over the entire dataset. As a specific case, we consider estimating the Function Space Distance (FSD) over a training set, i.e. the average discrepancy between the outputs of two neural networks. We propose a Linearized Activation Function TRick (LAFTR) and derive an efficient approximation to FSD for ReLU neural networks. The key idea is to approximate the architecture as a linear network with stochastic gating. Despite requiring only one parameter per unit of the network, our approach outcompetes other parametric approximations with larger memory requirements. Applied to continual learning, our parametric approximation is competitive with state-of-the-art nonparametric approximations, which require storing many training examples. Furthermore, we show its efficacy in estimating influence functions accurately and detecting mislabeled examples without expensive iterations over the entire dataset.  ( 2 min )
    OPORP: One Permutation + One Random Projection. (arXiv:2302.03505v1 [stat.ML])
    Consider two $D$-dimensional data vectors (e.g., embeddings): $u, v$. In many embedding-based retrieval (EBR) applications where the vectors are generated from trained models, $D=256\sim 1024$ are common. In this paper, OPORP (one permutation + one random projection) uses a variant of the ``count-sketch'' type of data structures for achieving data reduction/compression. With OPORP, we first apply a permutation on the data vectors. A random vector $r$ is generated i.i.d. with moments: $E(r_i) = 0, E(r_i^2)=1, E(r_i^3) =0, E(r_i^4)=s$. We multiply (as dot product) $r$ with all permuted data vectors. Then we break the $D$ columns into $k$ equal-length bins and aggregate (i.e., sum) the values in each bin to obtain $k$ samples from each data vector. One crucial step is to normalize the $k$ samples to the unit $l_2$ norm. We show that the estimation variance is essentially: $(s-1)A + \frac{D-k}{D-1}\frac{1}{k}\left[ (1-\rho^2)^2 -2A\right]$, where $A\geq 0$ is a function of the data ($u,v$). This formula reveals several key properties: (1) We need $s=1$. (2) The factor $\frac{D-k}{D-1}$ can be highly beneficial in reducing variances. (3) The term $\frac{1}{k}(1-\rho^2)^2$ is actually the asymptotic variance of the classical correlation estimator. We illustrate that by letting the $k$ in OPORP to be $k=1$ and repeat the procedure $m$ times, we exactly recover the work of ``very spars random projections'' (VSRP). This immediately leads to a normalized estimator for VSRP which substantially improves the original estimator of VSRP. In summary, with OPORP, the two key steps: (i) the normalization and (ii) the fixed-length binning scheme, have considerably improved the accuracy in estimating the cosine similarity, which is a routine (and crucial) task in modern embedding-based retrieval (EBR) applications.  ( 2 min )
    On the relationship between multivariate splines and infinitely-wide neural networks. (arXiv:2302.03459v1 [cs.LG])
    We consider multivariate splines and show that they have a random feature expansion as infinitely wide neural networks with one-hidden layer and a homogeneous activation function which is the power of the rectified linear unit. We show that the associated function space is a Sobolev space on a Euclidean ball, with an explicit bound on the norms of derivatives. This link provides a new random feature expansion for multivariate splines that allow efficient algorithms. This random feature expansion is numerically better behaved than usual random Fourier features, both in theory and practice. In particular, in dimension one, we compare the associated leverage scores to compare the two random expansions and show a better scaling for the neural network expansion.  ( 2 min )
    Sparse GEMINI for Joint Discriminative Clustering and Feature Selection. (arXiv:2302.03391v1 [stat.ML])
    Feature selection in clustering is a hard task which involves simultaneously the discovery of relevant clusters as well as relevant variables with respect to these clusters. While feature selection algorithms are often model-based through optimised model selection or strong assumptions on $p(\pmb{x})$, we introduce a discriminative clustering model trying to maximise a geometry-aware generalisation of the mutual information called GEMINI with a simple $\ell_1$ penalty: the Sparse GEMINI. This algorithm avoids the burden of combinatorial feature subset exploration and is easily scalable to high-dimensional data and large amounts of samples while only designing a clustering model $p_\theta(y|\pmb{x})$. We demonstrate the performances of Sparse GEMINI on synthetic datasets as well as large-scale datasets. Our results show that Sparse GEMINI is a competitive algorithm and has the ability to select relevant subsets of variables with respect to the clustering without using relevance criteria or prior hypotheses.  ( 2 min )
    Federated Variational Inference Methods for Structured Latent Variable Models. (arXiv:2302.03314v1 [stat.ML])
    Federated learning methods, that is, methods that perform model training using data situated across different sources, whilst simultaneously not having the data leave their original source, are of increasing interest in a number of fields. However, despite this interest, the classes of models for which easily-applicable and sufficiently general approaches are available is limited, excluding many structured probabilistic models. We present a general yet elegant resolution to the aforementioned issue. The approach is based on adopting structured variational inference, an approach widely used in Bayesian machine learning, to the federated setting. Additionally, a communication-efficient variant analogous to the canonical FedAvg algorithm is explored. The effectiveness of the proposed algorithms are demonstrated, and their performance is compared on Bayesian multinomial regression, topic modelling, and mixed model examples.  ( 2 min )
    A unified recipe for deriving (time-uniform) PAC-Bayes bounds. (arXiv:2302.03421v1 [stat.ML])
    We present a unified framework for deriving PAC-Bayesian generalization bounds. Unlike most previous literature on this topic, our bounds are anytime-valid (i.e., time-uniform), meaning that they hold at all stopping times, not only for a fixed sample size. Our approach combines four tools in the following order: (a) nonnegative supermartingales or reverse submartingales, (b) the method of mixtures, (c) the Donsker-Varadhan formula (or other convex duality principles), and (d) Ville's inequality. We derive time-uniform generalizations of well-known classical PAC-Bayes bounds, such as those of Seeger, McAllester, Maurer, and Catoni, in addition to many recent bounds. We also present several novel bounds and, more importantly, general techniques for constructing them. Despite being anytime-valid, our extensions remain as tight as their fixed-time counterparts. Moreover, they enable us to relax traditional assumptions; in particular, we consider nonstationary loss functions and non-i.i.d. data. In sum, we unify the derivation of past bounds and ease the search for future bounds: one may simply check if our supermartingale or submartingale conditions are met and, if so, be guaranteed a (time-uniform) PAC-Bayes bound.  ( 2 min )
    Leveraging Demonstrations to Improve Online Learning: Quality Matters. (arXiv:2302.03319v1 [cs.LG])
    We investigate the extent to which offline demonstration data can improve online learning. It is natural to expect some improvement, but the question is how, and by how much? We show that the degree of improvement must depend on the quality of the demonstration data. To generate portable insights, we focus on Thompson sampling (TS) applied to a multi-armed bandit as a prototypical online learning algorithm and model. The demonstration data is generated by an expert with a given competence level, a notion we introduce. We propose an informed TS algorithm that utilizes the demonstration data in a coherent way through Bayes' rule and derive a prior-dependent Bayesian regret bound. This offers insight into how pretraining can greatly improve online performance and how the degree of improvement increases with the expert's competence level. We also develop a practical, approximate informed TS algorithm through Bayesian bootstrapping and show substantial empirical regret reduction through experiments.  ( 2 min )
    Algorithmically Designed Artificial Neural Networks (ADANNs): Higher order deep operator learning for parametric partial differential equations. (arXiv:2302.03286v1 [math.NA])
    In this article we propose a new deep learning approach to solve parametric partial differential equations (PDEs) approximately. In particular, we introduce a new strategy to design specific artificial neural network (ANN) architectures in conjunction with specific ANN initialization schemes which are tailor-made for the particular scientific computing approximation problem under consideration. In the proposed approach we combine efficient classical numerical approximation techniques such as higher-order Runge-Kutta schemes with sophisticated deep (operator) learning methodologies such as the recently introduced Fourier neural operators (FNOs). Specifically, we introduce customized adaptions of existing standard ANN architectures together with specialized initializations for these ANN architectures so that at initialization we have that the ANNs closely mimic a chosen efficient classical numerical algorithm for the considered approximation problem. The obtained ANN architectures and their initialization schemes are thus strongly inspired by numerical algorithms as well as by popular deep learning methodologies from the literature and in that sense we refer to the introduced ANNs in conjunction with their tailor-made initialization schemes as Algorithmically Designed Artificial Neural Networks (ADANNs). We numerically test the proposed ADANN approach in the case of some parametric PDEs. In the tested numerical examples the ADANN approach significantly outperforms existing traditional approximation algorithms as well as existing deep learning methodologies from the literature.  ( 2 min )
    Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR. (arXiv:2302.03201v1 [cs.LG])
    In this paper, we study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$. Starting with multi-arm bandits (MABs), we show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes, and that it is achieved by an Upper Confidence Bound algorithm with a novel Bernstein bonus. For online RL in tabular Markov Decision Processes (MDPs), we show a minimax regret lower bound of $\Omega(\sqrt{\tau^{-1}SAK})$ (with normalized cumulative rewards), where $S$ is the number of states, and we propose a novel bonus-driven Value Iteration procedure. We show that our algorithm achieves the optimal regret of $\widetilde O(\sqrt{\tau^{-1}SAK})$ under a continuity assumption and in general attains a near-optimal regret of $\widetilde O(\tau^{-1}\sqrt{SAK})$, which is minimax-optimal for constant $\tau$. This improves on the best available bounds. By discretizing rewards appropriately, our algorithms are computationally efficient.  ( 2 min )
    Exact Inference in High-order Structured Prediction. (arXiv:2302.03236v1 [cs.LG])
    In this paper, we study the problem of inference in high-order structured prediction tasks. In the context of Markov random fields, the goal of a high-order inference task is to maximize a score function on the space of labels, and the score function can be decomposed into sum of unary and high-order potentials. We apply a generative model approach to study the problem of high-order inference, and provide a two-stage convex optimization algorithm for exact label recovery. We also provide a new class of hypergraph structural properties related to hyperedge expansion that drives the success in general high-order inference problems. Finally, we connect the performance of our algorithm and the hyperedge expansion property using a novel hypergraph Cheeger-type inequality.  ( 2 min )
    Easy Learning from Label Proportions. (arXiv:2302.03115v1 [cs.LG])
    We consider the problem of Learning from Label Proportions (LLP), a weakly supervised classification setup where instances are grouped into "bags", and only the frequency of class labels at each bag is available. Albeit, the objective of the learner is to achieve low task loss at an individual instance level. Here we propose Easyllp: a flexible and simple-to-implement debiasing approach based on aggregate labels, which operates on arbitrary loss functions. Our technique allows us to accurately estimate the expected loss of an arbitrary model at an individual level. We showcase the flexibility of our approach by applying it to popular learning frameworks, like Empirical Risk Minimization (ERM) and Stochastic Gradient Descent (SGD) with provable guarantees on instance level performance. More concretely, we exhibit a variance reduction technique that makes the quality of LLP learning deteriorate only by a factor of k (k being bag size) in both ERM and SGD setups, as compared to full supervision. Finally, we validate our theoretical results on multiple datasets demonstrating our algorithm performs as well or better than previous LLP approaches in spite of its simplicity.  ( 2 min )
    Memory-Based Meta-Learning on Non-Stationary Distributions. (arXiv:2302.03067v1 [cs.LG])
    Memory-based meta-learning is a technique for approximating Bayes-optimal predictors. Under fairly general conditions, minimizing sequential prediction error, measured by the log loss, leads to implicit meta-learning. The goal of this work is to investigate how far this interpretation can be realized by current sequence prediction models and training regimes. The focus is on piecewise stationary sources with unobserved switching-points, which arguably capture an important characteristic of natural language and action-observation sequences in partially observable environments. We show that various types of memory-based neural models, including Transformers, LSTMs, and RNNs can learn to accurately approximate known Bayes-optimal algorithms and behave as if performing Bayesian inference over the latent switching-points and the latent parameters governing the data distribution within each segment.  ( 2 min )

  • Open

    I want to train a CNN to perform Stool analysis - where can I find the datatset?
    ^ submitted by /u/Ok_Gur_2696 [link] [comments]  ( 40 min )
    Best Books to Learn Neural Networks in 2023 for Beginners
    submitted by /u/Lakshmireddys [link] [comments]  ( 40 min )
    Understanding Attention Mechanism in Transformer Neural Networks
    submitted by /u/keghn [link] [comments]  ( 40 min )
    Neural networks with non-instant neuron firing?
    Wondering if you guys could give me some terms to Google. Biological neurons have a refractory period after firing in which they cannot fire again. Additionally, charges/neurotransmitters take a certain amount of time to build up in the target cell. Firings tend to follow this voltage over time graph. Is there a type of neural network with a more time-delayed or first/second order time-derivative firing behavior? One benefit I can think of is that if you've got a more ad-hoc network structure, a traditional NN may have two neurons feeding into each other which would result in a 1 "tick" oscillation or cascading activations since there is no limit on firing rate. EDIT: Finally found it. They're called "Spiking Neural Networks" (SNNs). submitted by /u/JDude13 [link] [comments]  ( 41 min )
  • Open

    How do I use Thompson Sampling with non-binary rewards?
    Any suggestions and/or resources to understand and implement this? submitted by /u/Blasphemer666 [link] [comments]  ( 41 min )
    I Was At The Legendary Man Vs Machine Chess Match - It Was Similar To The AlphaGo AI Match. (Podcast Clip)
    submitted by /u/joemurray1994 [link] [comments]  ( 41 min )
    "An Invitation to Imitation", Bagnell 2015 (tutorial on imitation learning, DAGGer etc)
    submitted by /u/gwern [link] [comments]  ( 41 min )
    Does a bigger model or inclusion of an specialized preprocessing unit result in a more stable learning losses?
    Hello guys, I am trying to fit a DQN on price data. I know its virtually impossible and not profitable in live trading. BUT, the model I am training is currently plagued with rather unstable profits, after like 5 hours of training on an A100. It's clear that is learning something, but the profits are still rather unpredictable. I wanted to know which remedies you recommend to improving its stability? Larger network? Or an auto encoder or something like that for data preprocessing? Thank you submitted by /u/Kiizmod0 [link] [comments]  ( 41 min )
    Motivation behind asynchronous updates in A3C
    I'm currently reading Mnih's monumental paper on asynchronous updates in on policy RL algorithms. When he introduces it, he states that the worker agents experience copies of the environment and this parallelism De-correlates the agent's data into a more stationary process, since at any time step the agent would be experiencing a wide variety of states. I only have an intuition of what this could mean but I'm not sure if I'm correct. I believe that the decorrelation has to do with how updates are made in an on policy algorithm such as REINFORCE. We end up increasing the probability of actions based on the return. In subsequent updates, we are more likely to take this action and this would only increase the probability of taking it further. As a result the updates are correlated. I'm not sure how multiple parallel experiences are connected to stationarity. If we have a global policy that is evolving and all of the experiences are initialised with the global policy, don't they all have the same, shifting distribution? I'd appreciate any insight on this or anything I mentioned. Thanks! submitted by /u/theanswerisnt42 [link] [comments]  ( 43 min )
  • Open

    Will you be switching to Bing for its chatGPT integration?
    View Poll submitted by /u/arnolds112 [link] [comments]  ( 41 min )
    Using A.I. to create Audio documents/books
    Hello, I have no idea where else to ask this. I'm not very tech savvy either but I've listened to the recent A.I. voice recordings and heard that they use input text to work (if I'm not mistaken?). So with that in mind would it be possible to copy-paste a files text in it and have the A.I. read it into a recording? The reason I want to do this is because often I have to read a lot for my projects and would sometimes like to listen to information rather than being on my screen for 10 hours locked inside my home so if I could self create these form of audio files it would benefit me extremely. submitted by /u/Googlefriend1 [link] [comments]  ( 42 min )
    Athene AI show
    Is there like a public version of what he uses? Or something similar it’s so interesting submitted by /u/x2hunter [link] [comments]  ( 41 min )
    Awesome No/low-code AI App Builder
    ​ https://reddit.com/link/10xbrxo/video/5bvwl3qv91ha1/player Hey everyone! Wanted to share a platform a few others & myself have been working on: an easy way to build apps on top of AI Large Language Models such as GPT-3. This video shows the simplest app you can build -- just take one of your great ChatGPT prompts and quickly turn it into a shareable app with an interface anyone can use. You can also chain LLMs together, feed in data from outside sources, and a whole lot more. We have found a lot of value in this community, and so wanted to give you all early access before we release it on a wider scale: Agent Beta Access. Would love any feedback! Btw, here is the app in the video (it's a fun one!): King James Bible Verse Generator. submitted by /u/leopuli [link] [comments]  ( 41 min )
    An oddity while generating celebrity images.
    So, I was testing out how different AI image generators make famous people(and human faces in general). So I tried many free ones like DallE mini, and Imagine, etc as well as some premium ones like Midjourney, Wombo Dream, Stable Diffusion, DallE 2 (Granted Dall E didn't allow it most of the time). I generated peole like RDJ, Michael Jackson, Scarlett Johansson, Madison Beer, and even some you tubers. All of them turned out fine even with different art styles. With the exception of one person... Billie Eilish. Every time I generated her face, I laughed hard. In every image generator, she looked like someone with both High function autism and extreme downs syndrome. Not saying that that you should laugh at those people. But those pictures were funny af. Any clue as to why it happens with her face? submitted by /u/offensivepanzerIV [link] [comments]  ( 41 min )
    The Threat of AI-Powered Chatbots in Spreading Misinformation: New Concerns Raised
    submitted by /u/Flaky_Preparation_50 [link] [comments]  ( 40 min )
    Regulating AI tools like ChatGPT could be “problematic”
    General purpose artificial intelligence tools will provide new challenges for regulators, which they may struggle to meet. I spoke to several experts to get a better idea of how they think it should be regulated. The general feeling is that for the most part it should be “light touch” and based on risk - so asking medical questions is high risk and tightly regulated, asking for a poem is low risk. There is also concerns GDPR could come into play and a company like OpenAI could be forced to delete its model and training data if enough people complain. https://techmonitor.ai/technology/ai-and-automation/ai-regulation-chatgpt-bard submitted by /u/upyourego [link] [comments]  ( 42 min )
    AI Dream 131 - Transcendence Visualized by AI - Weird Wednesday
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Best Artificial Intelligence courses for Healthcare You should learn
    submitted by /u/Lakshmireddys [link] [comments]  ( 40 min )
    How Can I Become an Early Adopter of AI?
    I've been keeping tabs on what's going on in the AI space for the last year probably and have been fascinated by the possibilities for ChatGPT and Bard. I'm not a huge techie and can't program, but I am intelligent and know my way around a computer. How could someone like me start using AI to make their life easier? What uses does it have in my job, my day to day life, or how could I use it to streamline things I already do? I've been searching online for answers to these questions but really haven't found any. I see huge potential here (as does almost everyone in the world), I'm just not sure what I can do with it at this point in time. submitted by /u/Hot_Ropes_Of_Gum [link] [comments]  ( 42 min )
    Discover and learn faster with Perplexity AI. For any question you have, AI will now link to new topics you might be curious about next. Whether you’re reading about people, places, or products, you can now explore and read more simply by clicking.
    submitted by /u/rafs2006 [link] [comments]  ( 41 min )
    Google won't penalize AI content, introduces new AI content guidelines
    submitted by /u/henlo_there_fren [link] [comments]  ( 40 min )
    Anyone know a programm that finds a location of a picture through AI?
    submitted by /u/Pierruno [link] [comments]  ( 40 min )
    Create an AI powered Gmail Add-on
    AI powered Gmail addon that helps you write 10X faster email saving your time for more important things. I worked in corporate and had to deal with unhealthy amount of emails everyday. This left me no time to work and I would end up working overtime. We have built this Gmail addon to help people who face similar problems to mine. We are also offering pay per use payment model to help you keep track of where you spend your money. Here's the download link : https://workspace.google.com/marketplace/app/buzz_mail/650469784389 submitted by /u/Spiritual-Fact-4059 [link] [comments]  ( 41 min )
    Bing And Edge Get An Upgrade With ChatGPT AI
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    Is there a Local Text Generation AI?
    Greetings! I started out testing AI with the Crayon AI Image generator but I really had a lot of fun when I downloaded Stable Diffusion locally, using the Web-UI to generate images. Now I am having a lot of fun testing ChatGPT, mostly seeing what it can do writing creative short stories, but it can be a pain with the high capacity it goes through, and I really do like having things save and store locally as I generate them. Is this a thing? Anything on the level of ChatGPT at all? Thanks in advance! submitted by /u/Aldrnarii [link] [comments]  ( 41 min )
    I Was At The Legendary Man Vs Machine Chess Match - It Was Similar To The AlphaGo AI Match. (Podcast Clip)
    submitted by /u/joemurray1994 [link] [comments]  ( 40 min )
    2,000+ ChatGPT/AI tools. https://www.aisearchtool.com/
    AI Tools Directory submitted by /u/HornOkPlease9596 [link] [comments]  ( 40 min )
    Predict who will have the best AI search in 1 year's time:
    View Poll submitted by /u/tlkop123 [link] [comments]  ( 42 min )
    Health Care Bias Is Dangerous. But So Are ‘Fairness’ Algorithms
    submitted by /u/Fantomas77 [link] [comments]  ( 40 min )
    How Artificial Intelligence Will Help Find Your Purpose
    submitted by /u/derstarkerwille [link] [comments]  ( 41 min )
    Could AI Search Engines cause a chain reaction that results in the loss of hundreds of thousands of websites?
    So if you search something now you usually end up on a website that runs ads to pay for servers, editors etc. However, if ChatGPT and Bard will be fully implemented into Bing and Google, there is no need to visit these websites anymore because you get the answer you need right away. Wouldn't that result in a shitton of websites closing down, which then results in AIs having a harder time to get their hand on correct information? submitted by /u/neco_dota [link] [comments]  ( 48 min )
    Generate quizzes from any text in one click using Artificial Intelligence inside Google Form
    submitted by /u/theindianappguy [link] [comments]  ( 41 min )
    Google Bard, ChatGPT competitor with Artificial Intelligence, available soon to everyone
    submitted by /u/nickkgar [link] [comments]  ( 40 min )
  • Open

    Optimize your machine learning deployments with auto scaling on Amazon SageMaker
    Machine learning (ML) has become ubiquitous. Our customers are employing ML in every aspect of their business, including the products and services they build, and for drawing insights about their customers. To build an ML-based application, you have to first build the ML model that serves your business requirement. Building ML models involves preparing the […]  ( 15 min )
  • Open

    Unsupervised and semi-supervised anomaly detection with data-centric ML
    Posted by Jinsung Yoon and Sercan O. Arik, Research Scientists, Google Research, Cloud AI Team Anomaly detection (AD), the task of distinguishing anomalies from normal data, plays a vital role in many real-world applications, such as detecting faulty products from vision sensors in manufacturing, fraudulent behaviors in financial transactions, or network security threats. Depending on the availability of the type of data — negative (normal) vs. positive (anomalous) and the availability of their labels — the task of AD involves different challenges. (a) Fully supervised anomaly detection, (b) normal-only anomaly detection, (c, d, e) semi-supervised anomaly detection, (f) unsupervised anomaly detection. While most previous works were shown to be effective for cases with fully-lab…  ( 92 min )
  • Open

    [R] pix2pixzero - Zero-shot Image-to-Image Translation
    submitted by /u/gecko39 [link] [comments]  ( 43 min )
    [Discussion] Cognitive science inspired AI research
    I came across a few comments on this community about researchers developing AI algorithms inspired by ideas from neuroscience/cognition. I'd like to know how successful this approach has been in terms of coming up with new perspectives on problems. What are some of the key issues researchers are trying to address this way? What are some future directions in which research may progress? I have a rough idea that this could be one way to inspire sample efficient RL but I'd love to hear about other work that goes on in this area submitted by /u/theanswerisnt42 [link] [comments]  ( 43 min )
    [D] List of RL Papers
    Hi, I want to open a thread about RL (non-deep and deep) What are the papers/books that are "must read" to have strong foundation? submitted by /u/C_l3b [link] [comments]  ( 43 min )
    [P] Scripts/Programs to collect Baseline Logs
    Abit of a weird question. So I'm required to make & collect some clean (baseline) logs and dirty (malicious) logs for some mini-ML project I'm doing. So my question is, is there any scripts or programs out there, Linux or Windows, that allows the automation of mimicking an office staff doing work (ie. opening Outlook, sending emails, surfing the web, watching YouTube, opening and editing Word/Excel files, etc.) for the purpose of collecting baseline logs? I'm relative new to this kind of thing, if you guys have better suggestion on a more better/efficient way to do this, feel free to suggest! submitted by /u/Dweeberbob [link] [comments]  ( 43 min )
    [N] New Book on Synthetic Data​: Version 3.0 Just Released
    The book has considerably grown since version 1.0. It started with synthetic data as one of the main components, but also diving into explainable AI, intuitive / interpretable machine learning, and generative AI. Now with 272 pages (up from 156 in the first version), the focus is clearly on synthetic data. Of course, I still discuss explainable and generative AI: these concepts are strongly related to data synthetization. Agent-based modeling in action However many new chapters have been added, covering various aspects of synthetic data — in particular working with more diversified real datasets, how to synthetize them, how to generate high quality random numbers with a very fast algorithm based on digits of irrational numbers, with visual illustrations and Python code in all chapters. I…  ( 47 min )
  • Open

    Research Focus: Week of February 6, 2023
    Welcome to Research Focus, a new series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft. Behind the Tech podcast with Tobi Lütke: CEO and Founder, Shopify In the latest episode of Behind the Tech, Microsoft CTO Kevin Scott is joined by Tobi […] The post Research Focus: Week of February 6, 2023 appeared first on Microsoft Research.  ( 9 min )
  • Open

    New NVIDIA Studio Laptops Powered by GeForce RTX 4090, 4080 Laptop GPUs Unleash Creativity
    The first NVIDIA Studio laptops powered by GeForce RTX 40 Series Laptop GPUs are now available, starting with systems from MSI and Razer — with many more to come.  ( 8 min )
  • Open

    Hénon’s dynamical system
    This post will reproduce a three plots from a paper of Hénon on dynamical systems from 1969 [1]. Let α be a constant, and pick some starting point in the plane, (x0, y0), then update x and y according to xn+1 = xn cos α − (yn − xn²) sin α yn+1 = xn sin […] Hénon’s dynamical system first appeared on John D. Cook.  ( 6 min )
  • Open

    Robust Fine-Tuning of Deep Neural Networks with Hessian-based Generalization Guarantees. (arXiv:2206.02659v4 [cs.LG] UPDATED)
    We consider transfer learning approaches that fine-tune a pretrained deep neural network on a target task. We study the generalization properties of fine-tuning to understand the problem of overfitting, which commonly occurs in practice. Previous works have shown that constraining the distance from the initialization of fine-tuning improves generalization. Using a PAC-Bayesian analysis, we observe that besides distance from initialization, Hessians affect generalization through the noise stability of deep neural networks against noise injections. Motivated by the observation, we develop Hessian distance-based generalization bounds for a wide range of fine-tuning methods. Additionally, we study the robustness of fine-tuning in the presence of noisy labels. We design an algorithm incorporating consistent losses and distance-based regularization for fine-tuning, along with a generalization error guarantee under class conditional independent noise in the training set labels. We perform a detailed empirical study of our algorithm on various noisy environments and architectures. On six image classification tasks whose training labels are generated with programmatic labeling, we find a 3.26% accuracy gain over prior fine-tuning methods. Meanwhile, the Hessian distance measure of the fine-tuned model decreases by six times more than existing approaches.  ( 2 min )
    Representational dissimilarity metric spaces for stochastic neural networks. (arXiv:2211.11665v2 [cs.LG] UPDATED)
    Quantifying similarity between neural representations -- e.g. hidden layer activation vectors -- is a perennial problem in deep learning and neuroscience research. Existing methods compare deterministic responses (e.g. artificial networks that lack stochastic layers) or averaged responses (e.g., trial-averaged firing rates in biological data). However, these measures of _deterministic_ representational similarity ignore the scale and geometric structure of noise, both of which play important roles in neural computation. To rectify this, we generalize previously proposed shape metrics (Williams et al. 2021) to quantify differences in _stochastic_ representations. These new distances satisfy the triangle inequality, and thus can be used as a rigorous basis for many supervised and unsupervised analyses. Leveraging this novel framework, we find that the stochastic geometries of neurobiological representations of oriented visual gratings and naturalistic scenes respectively resemble untrained and trained deep network representations. Further, we are able to more accurately predict certain network attributes (e.g. training hyperparameters) from its position in stochastic (versus deterministic) shape space.  ( 2 min )
    Robust Active Distillation. (arXiv:2210.01213v2 [cs.LG] UPDATED)
    Distilling knowledge from a large teacher model to a lightweight one is a widely successful approach for generating compact, powerful models in the semi-supervised learning setting where a limited amount of labeled data is available. In large-scale applications, however, the teacher tends to provide a large number of incorrect soft-labels that impairs student performance. The sheer size of the teacher additionally constrains the number of soft-labels that can be queried due to prohibitive computational and/or financial costs. The difficulty in achieving simultaneous \emph{efficiency} (i.e., minimizing soft-label queries) and \emph{robustness} (i.e., avoiding student inaccuracies due to incorrect labels) hurts the widespread application of knowledge distillation to many modern tasks. In this paper, we present a parameter-free approach with provable guarantees to query the soft-labels of points that are simultaneously informative and correctly labeled by the teacher. At the core of our work lies a game-theoretic formulation that explicitly considers the inherent trade-off between the informativeness and correctness of input instances. We establish bounds on the expected performance of our approach that hold even in worst-case distillation instances. We present empirical evaluations on popular benchmarks that demonstrate the improved distillation performance enabled by our work relative to that of state-of-the-art active learning and active distillation methods.  ( 2 min )
    Weakly-Supervised 3D Medical Image Segmentation using Geometric Prior and Contrastive Similarity. (arXiv:2302.02125v1 [eess.IV])
    Medical image segmentation is almost the most important pre-processing procedure in computer-aided diagnosis but is also a very challenging task due to the complex shapes of segments and various artifacts caused by medical imaging, (i.e., low-contrast tissues, and non-homogenous textures). In this paper, we propose a simple yet effective segmentation framework that incorporates the geometric prior and contrastive similarity into the weakly-supervised segmentation framework in a loss-based fashion. The proposed geometric prior built on point cloud provides meticulous geometry to the weakly-supervised segmentation proposal, which serves as better supervision than the inherent property of the bounding-box annotation (i.e., height and width). Furthermore, we propose contrastive similarity to encourage organ pixels to gather around in the contrastive embedding space, which helps better distinguish low-contrast tissues. The proposed contrastive embedding space can make up for the poor representation of the conventionally-used gray space. Extensive experiments are conducted to verify the effectiveness and the robustness of the proposed weakly-supervised segmentation framework. The proposed framework is superior to state-of-the-art weakly-supervised methods on the following publicly accessible datasets: LiTS 2017 Challenge, KiTS 2021 Challenge, and LPBA40. We also dissect our method and evaluate the performance of each component.  ( 2 min )
    Augmenting Interpretable Knowledge Tracing by Ability Attribute and Attention Mechanism. (arXiv:2302.02146v1 [cs.LG])
    Knowledge tracing aims to model students' past answer sequences to track the change in their knowledge acquisition during exercise activities and to predict their future learning performance. Most existing approaches ignore the fact that students' abilities are constantly changing or vary between individuals, and lack the interpretability of model predictions. To this end, in this paper, we propose a novel model based on ability attributes and attention mechanism. We first segment the interaction sequences and captures students' ability attributes, then dynamically assign students to groups with similar abilities, and quantify the relevance of the exercises to the skill by calculating the attention weights between the exercises and the skill to enhance the interpretability of the model. We conducted extensive experiments and evaluate real online education datasets. The results confirm that the proposed model is better at predicting performance than five well-known representative knowledge tracing models, and the model prediction results are explained through an inference path.  ( 2 min )
    Online Learning via Offline Greedy Algorithms: Applications in Market Design and Optimization. (arXiv:2102.11050v4 [cs.LG] UPDATED)
    Motivated by online decision-making in time-varying combinatorial environments, we study the problem of transforming offline algorithms to their online counterparts. We focus on offline combinatorial problems that are amenable to a constant factor approximation using a greedy algorithm that is robust to local errors. For such problems, we provide a general framework that efficiently transforms offline robust greedy algorithms to online ones using Blackwell approachability. We show that the resulting online algorithms have $O(\sqrt{T})$ (approximate) regret under the full information setting. We further introduce a bandit extension of Blackwell approachability that we call Bandit Blackwell approachability. We leverage this notion to transform greedy robust offline algorithms into a $O(T^{2/3})$ (approximate) regret in the bandit setting. Demonstrating the flexibility of our framework, we apply our offline-to-online transformation to several problems at the intersection of revenue management, market design, and online optimization, including product ranking optimization in online platforms, reserve price optimization in auctions, and submodular maximization. We also extend our reduction to greedy-like first order methods used in continuous optimization, such as those used for maximizing continuous strong DR monotone submodular functions subject to convex constraints. We show that our transformation, when applied to these applications, leads to new regret bounds or improves the current known bounds. We complement our theoretical studies by conducting numerical simulations for two of our applications, in both of which we observe that the numerical performance of our transformations outperforms the theoretical guarantees in practical instances.  ( 3 min )
    Model Stitching and Visualization How GAN Generators can Invert Networks in Real-Time. (arXiv:2302.02181v1 [cs.CV])
    Critical applications, such as in the medical field, require the rapid provision of additional information to interpret decisions made by deep learning methods. In this work, we propose a fast and accurate method to visualize activations of classification and semantic segmentation networks by stitching them with a GAN generator utilizing convolutions. We test our approach on images of animals from the AFHQ wild dataset and real-world digital pathology scans of stained tissue samples. Our method provides comparable results to established gradient descent methods on these datasets while running about two orders of magnitude faster.  ( 2 min )
    Equivariance with Learned Canonicalization Functions. (arXiv:2211.06489v2 [cs.LG] UPDATED)
    Symmetry-based neural networks often constrain the architecture in order to achieve invariance or equivariance to a group of transformations. In this paper, we propose an alternative that avoids this architectural constraint by learning to produce a canonical representation of the data. These canonicalization functions can readily be plugged into non-equivariant backbone architectures. We offer explicit ways to implement them for many groups of interest. We show that this approach enjoys universality while providing interpretable insights. Our main hypothesis is that learning a neural network to perform canonicalization is better than using predefined heuristics. Our results show that learning the canonicalization function indeed leads to better results and that the approach achieves excellent performance in practice.  ( 2 min )
    VR-LENS: Super Learning-based Cybersickness Detection and Explainable AI-Guided Deployment in Virtual Reality. (arXiv:2302.01985v1 [cs.LG])
    A plethora of recent research has proposed several automated methods based on machine learning (ML) and deep learning (DL) to detect cybersickness in Virtual reality (VR). However, these detection methods are perceived as computationally intensive and black-box methods. Thus, those techniques are neither trustworthy nor practical for deploying on standalone VR head-mounted displays (HMDs). This work presents an explainable artificial intelligence (XAI)-based framework VR-LENS for developing cybersickness detection ML models, explaining them, reducing their size, and deploying them in a Qualcomm Snapdragon 750G processor-based Samsung A52 device. Specifically, we first develop a novel super learning-based ensemble ML model for cybersickness detection. Next, we employ a post-hoc explanation method, such as SHapley Additive exPlanations (SHAP), Morris Sensitivity Analysis (MSA), Local Interpretable Model-Agnostic Explanations (LIME), and Partial Dependence Plot (PDP) to explain the expected results and identify the most dominant features. The super learner cybersickness model is then retrained using the identified dominant features. Our proposed method identified eye tracking, player position, and galvanic skin/heart rate response as the most dominant features for the integrated sensor, gameplay, and bio-physiological datasets. We also show that the proposed XAI-guided feature reduction significantly reduces the model training and inference time by 1.91X and 2.15X while maintaining baseline accuracy. For instance, using the integrated sensor dataset, our reduced super learner model outperforms the state-of-the-art works by classifying cybersickness into 4 classes (none, low, medium, and high) with an accuracy of 96% and regressing (FMS 1-10) with a Root Mean Square Error (RMSE) of 0.03.  ( 2 min )
    The Predictive Forward-Forward Algorithm. (arXiv:2301.01452v2 [cs.LG] UPDATED)
    We propose the predictive forward-forward (PFF) algorithm for conducting credit assignment in neural systems. Specifically, we design a novel, dynamic recurrent neural system that learns a directed generative circuit jointly and simultaneously with a representation circuit, integrating learnable lateral competition and elements of predictive coding, an emerging and viable neurobiological process theory of cortical function, with the forward-forward (FF) adaptation scheme. Furthermore, PFF efficiently learns to propagate learning signals and updates synapses with forward passes only, eliminating key structural and computational constraints imposed by a backpropagation-based scheme. Besides computational advantages, the PFF process could prove useful for understanding the learning mechanisms behind biological neurons that use local signals despite missing feedback connections. We run experiments on image data and demonstrate that the PFF procedure works as well as backpropagation of errors, offering a promising brain-inspired learning algorithm for classifying, reconstructing, and synthesizing data patterns.  ( 2 min )
    Locally Constrained Policy Optimization for Online Reinforcement Learning in Non-Stationary Input-Driven Environments. (arXiv:2302.02182v1 [cs.LG])
    We study online Reinforcement Learning (RL) in non-stationary input-driven environments, where a time-varying exogenous input process affects the environment dynamics. Online RL is challenging in such environments due to catastrophic forgetting (CF). The agent tends to forget prior knowledge as it trains on new experiences. Prior approaches to mitigate this issue assume task labels (which are often not available in practice) or use off-policy methods that can suffer from instability and poor performance. We present Locally Constrained Policy Optimization (LCPO), an on-policy RL approach that combats CF by anchoring policy outputs on old experiences while optimizing the return on current experiences. To perform this anchoring, LCPO locally constrains policy optimization using samples from experiences that lie outside of the current input distribution. We evaluate LCPO in two gym and computer systems environments with a variety of synthetic and real input traces, and find that it outperforms state-of-the-art on-policy and off-policy RL methods in the online setting, while achieving results on-par with an offline agent pre-trained on the whole input trace.  ( 2 min )
    LExecutor: Learning-Guided Execution. (arXiv:2302.02343v1 [cs.SE])
    Executing code is essential for various program analysis tasks, e.g., to detect bugs that manifest through exceptions or to obtain execution traces for further dynamic analysis. However, executing an arbitrary piece of code is often difficult in practice, e.g., because of missing variable definitions, missing user inputs, and missing third-party dependencies. This paper presents LExecutor, a learning-guided approach for executing arbitrary code snippets in an underconstrained way. The key idea is to let a neural model predict missing values that otherwise would cause the program to get stuck, and to inject these values into the execution. For example, LExecutor injects likely values for otherwise undefined variables and likely return values of calls to otherwise missing functions. We evaluate the approach on Python code from popular open-source projects and on code snippets extracted from Stack Overflow. The neural model predicts realistic values with an accuracy between 80.1% and 94.2%, allowing LExecutor to closely mimic real executions. As a result, the approach successfully executes significantly more code than any available technique, such as simply executing the code as-is. For example, executing the open-source code snippets as-is covers only 4.1% of all lines, because the code crashes early on, whereas LExecutor achieves a coverage of 50.1%.  ( 2 min )
    Efficient Gradient Approximation Method for Constrained Bilevel Optimization. (arXiv:2302.01970v1 [cs.LG])
    Bilevel optimization has been developed for many machine learning tasks with large-scale and high-dimensional data. This paper considers a constrained bilevel optimization problem, where the lower-level optimization problem is convex with equality and inequality constraints and the upper-level optimization problem is non-convex. The overall objective function is non-convex and non-differentiable. To solve the problem, we develop a gradient-based approach, called gradient approximation method, which determines the descent direction by computing several representative gradients of the objective function inside a neighborhood of the current estimate. We show that the algorithm asymptotically converges to the set of Clarke stationary points, and demonstrate the efficacy of the algorithm by the experiments on hyperparameter optimization and meta-learning.  ( 2 min )
    Knowledge Distillation $\approx$ Label Smoothing: Fact or Fallacy?. (arXiv:2301.12609v2 [cs.LG] UPDATED)
    Contrary to its original interpretation as a facilitator of knowledge transfer from one model to another, some recent studies have suggested that knowledge distillation (KD) is instead a form of regularization. Perhaps the strongest support of all for this claim is found in its apparent similarities with label smoothing (LS). This paper investigates the stated equivalence of these two methods by examining the predictive uncertainties of the models they train. Experiments on four text classification tasks involving teachers and students of different capacities show that: (a) In most settings, KD and LS drive model uncertainty (entropy) in completely opposite directions, and (b) In KD, the student's predictive uncertainty is a direct function of that of its teacher, reinforcing the knowledge transfer view.  ( 2 min )
    FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours. (arXiv:2203.00854v3 [cs.LG] UPDATED)
    Protein structure prediction helps to understand gene translation and protein function, which is of growing interest and importance in structural biology. The AlphaFold model, which used transformer architecture to achieve atomic-level accuracy in protein structure prediction, was a significant breakthrough. However, training and inference of the AlphaFold model are challenging due to its high computation and memory cost. In this work, we present FastFold, an efficient implementation of AlphaFold for both training and inference. We propose Dynamic Axial Parallelism and Duality Async Operations to improve the scaling efficiency of model parallelism. Besides, AutoChunk is proposed to reduce memory cost by over 80% during inference by automatically determining the chunk strategy. Experimental results show that FastFold reduces overall training time from 11 days to 67 hours and achieves 7.5X - 9.5X speedup for long-sequence inference. Furthermore, we scale FastFold to 512 GPUs and achieve an aggregate throughput of 6.02 PetaFLOP/s with 90.1% parallel efficiency.
    Bayesian Optimization-based Combinatorial Assignment. (arXiv:2208.14698v4 [cs.LG] UPDATED)
    We study the combinatorial assignment domain, which includes combinatorial auctions and course allocation. The main challenge in this domain is that the bundle space grows exponentially in the number of items. To address this, several papers have recently proposed machine learning-based preference elicitation algorithms that aim to elicit only the most important information from agents. However, the main shortcoming of this prior work is that it does not model a mechanism's uncertainty over values for not yet elicited bundles. In this paper, we address this shortcoming by presenting a Bayesian optimization-based combinatorial assignment (BOCA) mechanism. Our key technical contribution is to integrate a method for capturing model uncertainty into an iterative combinatorial auction mechanism. Concretely, we design a new method for estimating an upper uncertainty bound that can be used to define an acquisition function to determine the next query to the agents. This enables the mechanism to properly explore (and not just exploit) the bundle space during its preference elicitation phase. We run computational experiments in several spectrum auction domains to evaluate BOCA's performance. Our results show that BOCA achieves higher allocative efficiency than state-of-the-art approaches.
    A Concept Knowledge Graph for User Next Intent Prediction at Alipay. (arXiv:2301.00503v2 [cs.CL] UPDATED)
    This paper illustrates the technologies of user next intent prediction with a concept knowledge graph. The system has been deployed on the Web at Alipay, serving more than 100 million daily active users. To explicitly characterize user intent, we propose \textbf{AlipayKG}, which is an offline concept knowledge graph in the Life-Service domain modeling the historical behaviors of users, the rich content interacted by users and the relations between them. We further introduce a Transformer-based model which integrates expert rules from the knowledge graph to infer the online user's next intent. Experimental results demonstrate that the proposed system can effectively enhance the performance of the downstream tasks while retaining explainability.
    Neural Time Series Analysis with Fourier Transform: A Survey. (arXiv:2302.02173v1 [cs.LG])
    Recently, Fourier transform has been widely introduced into deep neural networks to further advance the state-of-the-art regarding both accuracy and efficiency of time series analysis. The advantages of the Fourier transform for time series analysis, such as efficiency and global view, have been rapidly explored and exploited, exhibiting a promising deep learning paradigm for time series analysis. However, although increasing attention has been attracted and research is flourishing in this emerging area, there lacks a systematic review of the variety of existing studies in the area. To this end, in this paper, we provide a comprehensive review of studies on neural time series analysis with Fourier transform. We aim to systematically investigate and summarize the latest research progress. Accordingly, we propose a novel taxonomy to categorize existing neural time series analysis methods from four perspectives, including characteristics, usage paradigms, network design, and applications. We also share some new research directions in this vibrant area.  ( 2 min )
    SE(3) diffusion model with application to protein backbone generation. (arXiv:2302.02277v1 [cs.LG])
    The design of novel protein structures remains a challenge in protein engineering for applications across biomedicine and chemistry. In this line of work, a diffusion model over rigid bodies in 3D (referred to as frames) has shown success in generating novel, functional protein backbones that have not been observed in nature. However, there exists no principled methodological framework for diffusion on SE(3), the space of orientation preserving rigid motions in R3, that operates on frames and confers the group invariance. We address these shortcomings by developing theoretical foundations of SE(3) invariant diffusion models on multiple frames followed by a novel framework, FrameDiff, for learning the SE(3) equivariant score over multiple frames. We apply FrameDiff on monomer backbone generation and find it can generate designable monomers up to 500 amino acids without relying on a pretrained protein structure prediction network that has been integral to previous methods. We find our samples are capable of generalizing beyond any known protein structure.
    TAP: The Attention Patch for Cross-Modal Knowledge Transfer from Unlabeled Data. (arXiv:2302.02224v1 [cs.LG])
    This work investigates the intersection of cross modal learning and semi supervised learning, where we aim to improve the supervised learning performance of the primary modality by borrowing missing information from an unlabeled modality. We investigate this problem from a Nadaraya Watson (NW) kernel regression perspective and show that this formulation implicitly leads to a kernelized cross attention module. To this end, we propose The Attention Patch (TAP), a simple neural network plugin that allows data level knowledge transfer from the unlabeled modality. We provide numerical simulations on three real world datasets to examine each aspect of TAP and show that a TAP integration in a neural network can improve generalization performance using the unlabeled modality.  ( 2 min )
    Detecting Security Patches via Behavioral Data in Code Repositories. (arXiv:2302.02112v1 [cs.CR])
    The absolute majority of software today is developed collaboratively using collaborative version control tools such as Git. It is a common practice that once a vulnerability is detected and fixed, the developers behind the software issue a Common Vulnerabilities and Exposures or CVE record to alert the user community of the security hazard and urge them to integrate the security patch. However, some companies might not disclose their vulnerabilities and just update their repository. As a result, users are unaware of the vulnerability and may remain exposed. In this paper, we present a system to automatically identify security patches using only the developer behavior in the Git repository without analyzing the code itself or the remarks that accompanied the fix (commit message). We showed we can reveal concealed security patches with an accuracy of 88.3% and F1 Score of 89.8%. This is the first time that a language-oblivious solution for this problem is presented.  ( 2 min )
    Machine Learning Methods for Evaluating Public Crisis: Meta-Analysis. (arXiv:2302.02267v1 [cs.LG])
    This study examines machine learning methods used in crisis management. Analyzing detected patterns from a crisis involves the collection and evaluation of historical or near-real-time datasets through automated means. This paper utilized the meta-review method to analyze scientific literature that utilized machine learning techniques to evaluate human actions during crises. Selected studies were condensed into themes and emerging trends using a systematic literature evaluation of published works accessed from three scholarly databases. Results show that data from social media was prominent in the evaluated articles with 27% usage, followed by disaster management, health (COVID) and crisis informatics, amongst many other themes. Additionally, the supervised machine learning method, with an application of 69% across the board, was predominant. The classification technique stood out among other machine learning tasks with 41% usage. The algorithms that played major roles were the Support Vector Machine, Neural Networks, Naive Bayes, and Random Forest, with 23%, 16%, 15%, and 12% contributions, respectively.
    Learn Proportional Derivative Controllable Latent Space from Pixels. (arXiv:2110.08239v2 [cs.LG] UPDATED)
    Recent advances in latent space dynamics model from pixels show promising progress in vision-based model predictive control (MPC). However, executing MPC in real time can be challenging due to its intensive computational cost in each timestep. We propose to introduce additional learning objectives to enforce that the learned latent space is proportional derivative controllable. In execution time, the simple PD-controller can be applied directly to the latent space encoded from pixels, to produce simple and effective control to systems with visual observations. We show that our method outperforms baseline methods to produce robust goal reaching and trajectory tracking in various environments.
    Neighboring state-based RL Exploration. (arXiv:2212.10712v2 [cs.LG] UPDATED)
    Reinforcement Learning is a powerful tool to model decision-making processes. However, it relies on an exploration-exploitation trade-off that remains an open challenge for many tasks. In this work, we study neighboring state-based, model-free exploration led by the intuition that, for an early-stage agent, considering actions derived from a bounded region of nearby states may lead to better actions when exploring. We propose two algorithms that choose exploratory actions based on a survey of nearby states, and find that one of our methods, ${\rho}$-explore, consistently outperforms the Double DQN baseline in an discrete environment by 49\% in terms of Eval Reward Return.
    Neural Optimal Transport with General Cost Functionals. (arXiv:2205.15403v2 [cs.LG] UPDATED)
    Neural optimal transport techniques mostly use Euclidean cost functions, such as $\ell^1$ or $\ell^2$. These costs are suitable for translation tasks between related domains, but they are hardly applicable to problems where a specific non-Euclidean optimality of the mapping is required such as dataset transfer. To tackle this issue, we introduce a novel neural network-based algorithm to compute optimal transport plans and maps for general cost functionals. Such functionals provide more flexibility and allow using auxiliary information, such as class labels, to construct the required transport map. Our method is based on a saddle point reformulation of the optimal transport problem and generalizes prior methods for weak and strong transport cost functionals. As an application, we construct a functional to map data distributions with preserving the class-wise structure of data.
    Sample Dropout: A Simple yet Effective Variance Reduction Technique in Deep Policy Optimization. (arXiv:2302.02299v1 [cs.LG])
    Recent success in Deep Reinforcement Learning (DRL) methods has shown that policy optimization with respect to an off-policy distribution via importance sampling is effective for sample reuse. In this paper, we show that the use of importance sampling could introduce high variance in the objective estimate. Specifically, we show in a principled way that the variance of importance sampling estimate grows quadratically with importance ratios and the large ratios could consequently jeopardize the effectiveness of surrogate objective optimization. We then propose a technique called sample dropout to bound the estimation variance by dropping out samples when their ratio deviation is too high. We instantiate this sample dropout technique on representative policy optimization algorithms, including TRPO, PPO, and ESPO, and demonstrate that it consistently boosts the performance of those DRL algorithms on both continuous and discrete action controls, including MuJoCo, DMControl and Atari video games. Our code is open-sourced at \url{https://github.com/LinZichuan/sdpo.git}.
    Using Intermediate Forward Iterates for Intermediate Generator Optimization. (arXiv:2302.02336v1 [cs.LG])
    Score-based models have recently been introduced as a richer framework to model distributions in high dimensions and are generally more suitable for generative tasks. In score-based models, a generative task is formulated using a parametric model (such as a neural network) to directly learn the gradient of such high dimensional distributions, instead of the density functions themselves, as is done traditionally. From the mathematical point of view, such gradient information can be utilized in reverse by stochastic sampling to generate diverse samples. However, from a computational perspective, existing score-based models can be efficiently trained only if the forward or the corruption process can be computed in closed form. By using the relationship between the process and layers in a feed-forward network, we derive a backpropagation-based procedure which we call Intermediate Generator Optimization to utilize intermediate iterates of the process with negligible computational overhead. The main advantage of IGO is that it can be incorporated into any standard autoencoder pipeline for the generative task. We analyze the sample complexity properties of IGO to solve downstream tasks like Generative PCA. We show applications of the IGO on two dense predictive tasks viz., image extrapolation, and point cloud denoising. Our experiments indicate that obtaining an ensemble of generators for various time points is possible using first-order methods.
    Multi-Center Federated Learning: Clients Clustering for Better Personalization. (arXiv:2005.01026v3 [cs.LG] UPDATED)
    Federated learning has received great attention for its capability to train a large-scale model in a decentralized manner without needing to access user data directly. It helps protect the users' private data from centralized collecting. Unlike distributed machine learning, federated learning aims to tackle non-IID data from heterogeneous sources in various real-world applications, such as those on smartphones. Existing federated learning approaches usually adopt a single global model to capture the shared knowledge of all users by aggregating their gradients, regardless of the discrepancy between their data distributions. However, due to the diverse nature of user behaviors, assigning users' gradients to different global models (i.e., centers) can better capture the heterogeneity of data distributions across users. Our paper proposes a novel multi-center aggregation mechanism for federated learning, which learns multiple global models from the non-IID user data and simultaneously derives the optimal matching between users and centers. We formulate the problem as a joint optimization that can be efficiently solved by a stochastic expectation maximization (EM) algorithm. Our experimental results on benchmark datasets show that our method outperforms several popular federated learning methods.
    Learning in quantum games. (arXiv:2302.02333v1 [cs.GT])
    In this paper, we introduce a class of learning dynamics for general quantum games, that we call "follow the quantum regularized leader" (FTQL), in reference to the classical "follow the regularized leader" (FTRL) template for learning in finite games. We show that the induced quantum state dynamics decompose into (i) a classical, commutative component which governs the dynamics of the system's eigenvalues in a way analogous to the evolution of mixed strategies under FTRL; and (ii) a non-commutative component for the system's eigenvectors which has no classical counterpart. Despite the complications that this non-classical component entails, we find that the FTQL dynamics incur no more than constant regret in all quantum games. Moreover, adjusting classical notions of stability to account for the nonlinear geometry of the state space of quantum games, we show that only pure quantum equilibria can be stable and attracting under FTQL while, as a partial converse, pure equilibria that satisfy a certain "variational stability" condition are always attracting. Finally, we show that the FTQL dynamics are Poincar\'e recurrent in quantum min-max games, extending in this way a very recent result for the quantum replicator dynamics.
    DiGress: Discrete Denoising diffusion for graph generation. (arXiv:2209.14734v3 [cs.LG] UPDATED)
    This work introduces DiGress, a discrete denoising diffusion model for generating graphs with categorical node and edge attributes. Our model utilizes a discrete diffusion process that progressively edits graphs with noise, through the process of adding or removing edges and changing the categories. A graph transformer network is trained to revert this process, simplifying the problem of distribution learning over graphs into a sequence of node and edge classification tasks. We further improve sample quality by introducing a Markovian noise model that preserves the marginal distribution of node and edge types during diffusion, and by incorporating auxiliary graph-theoretic features. A procedure for conditioning the generation on graph-level features is also proposed. DiGress achieves state-of-the-art performance on molecular and non-molecular datasets, with up to 3x validity improvement on a planar graph dataset. It is also the first model to scale to the large GuacaMol dataset containing 1.3M drug-like molecules without the use of molecule-specific representations.
    The Missing Indicator Method: From Low to High Dimensions. (arXiv:2211.09259v2 [cs.LG] UPDATED)
    Missing data is common in applied data science, particularly for tabular data sets found in healthcare, social sciences, and natural sciences. Most supervised learning methods only work on complete data, thus requiring preprocessing such as missing value imputation to work on incomplete data sets. However, imputation alone does not encode useful information about the missing values themselves. For data sets with informative missing patterns, the Missing Indicator Method (MIM), which adds indicator variables to indicate the missing pattern, can be used in conjunction with imputation to improve model performance. While commonly used in data science, MIM is surprisingly understudied from an empirical and especially theoretical perspective. In this paper, we show empirically and theoretically that MIM improves performance for informative missing values, and we prove that MIM does not hurt linear models asymptotically for uninformative missing values. Additionally, we find that for high-dimensional data sets with many uninformative indicators, MIM can induce model overfitting and thus test performance. To address this issue, we introduce Selective MIM (SMIM), a novel MIM extension that adds missing indicators only for features that have informative missing patterns. We show empirically that SMIM performs at least as well as MIM in general, and improves MIM for high-dimensional data. Lastly, to demonstrate the utility of MIM on real-world data science tasks, we demonstrate the effectiveness of MIM and SMIM on clinical tasks generated from the MIMIC-III database of electronic health records.
    Revisiting Discriminative vs. Generative Classifiers: Theory and Implications. (arXiv:2302.02334v1 [cs.LG])
    A large-scale deep model pre-trained on massive labeled or unlabeled data transfers well to downstream tasks. Linear evaluation freezes parameters in the pre-trained model and trains a linear classifier separately, which is efficient and attractive for transfer. However, little work has investigated the classifier in linear evaluation except for the default logistic regression. Inspired by the statistical efficiency of naive Bayes, the paper revisits the classical topic on discriminative vs. generative classifiers. Theoretically, the paper considers the surrogate loss instead of the zero-one loss in analyses and generalizes the classical results from binary cases to multiclass ones. We show that, under mild assumptions, multiclass naive Bayes requires $O(\log n)$ samples to approach its asymptotic error while the corresponding multiclass logistic regression requires $O(n)$ samples, where $n$ is the feature dimension. To establish it, we present a multiclass $\mathcal{H}$-consistency bound framework and an explicit bound for logistic loss, which are of independent interests. Simulation results on a mixture of Gaussian validate our theoretical findings. Experiments on various pre-trained deep vision models show that naive Bayes consistently converges faster as the number of data increases. Besides, naive Bayes shows promise in few-shot cases and we observe the ``two regimes'' phenomenon in pre-trained supervised models. Our code is available at https://github.com/ML-GSAI/Revisiting-Dis-vs-Gen-Classifiers.
    Heterogeneous Federated Knowledge Graph Embedding Learning and Unlearning. (arXiv:2302.02069v1 [cs.LG])
    Federated Learning (FL) recently emerges as a paradigm to train a global machine learning model across distributed clients without sharing raw data. Knowledge Graph (KG) embedding represents KGs in a continuous vector space, serving as the backbone of many knowledge-driven applications. As a promising combination, federated KG embedding can fully take advantage of knowledge learned from different clients while preserving the privacy of local data. However, realistic problems such as data heterogeneity and knowledge forgetting still remain to be concerned. In this paper, we propose FedLU, a novel FL framework for heterogeneous KG embedding learning and unlearning. To cope with the drift between local optimization and global convergence caused by data heterogeneity, we propose mutual knowledge distillation to transfer local knowledge to global, and absorb global knowledge back. Moreover, we present an unlearning method based on cognitive neuroscience, which combines retroactive interference and passive decay to erase specific knowledge from local clients and propagate to the global model by reusing knowledge distillation. We construct new datasets for assessing realistic performance of the state-of-the-arts. Extensive experiments show that FedLU achieves superior results in both link prediction and knowledge forgetting.
    How Many and Which Training Points Would Need to be Removed to Flip this Prediction?. (arXiv:2302.02169v1 [cs.LG])
    We consider the problem of identifying a minimal subset of training data $\mathcal{S}_t$ such that if the instances comprising $\mathcal{S}_t$ had been removed prior to training, the categorization of a given test point $x_t$ would have been different. Identifying such a set may be of interest for a few reasons. First, the cardinality of $\mathcal{S}_t$ provides a measure of robustness (if $|\mathcal{S}_t|$ is small for $x_t$, we might be less confident in the corresponding prediction), which we show is correlated with but complementary to predicted probabilities. Second, interrogation of $\mathcal{S}_t$ may provide a novel mechanism for contesting a particular model prediction: If one can make the case that the points in $\mathcal{S}_t$ are wrongly labeled or irrelevant, this may argue for overturning the associated prediction. Identifying $\mathcal{S}_t$ via brute-force is intractable. We propose comparatively fast approximation methods to find $\mathcal{S}_t$ based on influence functions, and find that -- for simple convex text classification models -- these approaches can often successfully identify relatively small sets of training examples which, if removed, would flip the prediction. To our knowledge, this is the first work in to investigate the problem of identifying a minimal training set necessary to flip a given prediction in the context of machine learning.
    Unsupervised Learning for Pilot-free Transmission in 3GPP MIMO Systems. (arXiv:2302.02191v1 [cs.IT])
    Reference signals overhead reduction has recently evolved as an effective solution for improving the system spectral efficiency. This paper introduces a new downlink data structure that is free from demodulation reference signals (DM-RS), and hence does not require any channel estimation at the receiver. The new proposed data transmission structure involves a simple repetition step of part of the user data across the different sub-bands. Exploiting the repetition structure at the user side, it is shown that reliable recovery is possible via canonical correlation analysis. This paper also proposes two effective mechanisms for boosting the CCA performance in OFDM systems; one for repetition pattern selection and another to deal with the severe frequency selectivity issues. The proposed approach exhibits favorable complexity-performance tradeoff, rendering it appealing for practical implementation. Numerical results, using a 3GPP link-level testbench, demonstrate the superiority of the proposed approach relative to the state-of-the-art methods.
    Euclidean-Norm-Induced Schatten-p Quasi-Norm Regularization for Low-Rank Tensor Completion and Tensor Robust Principal Component Analysis. (arXiv:2012.03436v4 [cs.LG] UPDATED)
    The nuclear norm and Schatten-$p$ quasi-norm are popular rank proxies in low-rank matrix recovery. However, computing the nuclear norm or Schatten-$p$ quasi-norm of a tensor is hard in both theory and practice, hindering their application to low-rank tensor completion (LRTC) and tensor robust principal component analysis (TRPCA). In this paper, we propose a new class of tensor rank regularizers based on the Euclidean norms of the CP component vectors of a tensor and show that these regularizers are monotonic transformations of tensor Schatten-$p$ quasi-norm. This connection enables us to minimize the Schatten-$p$ quasi-norm in LRTC and TRPCA implicitly via the component vectors. The method scales to big tensors and provides an arbitrarily sharper rank proxy for low-rank tensor recovery compared to the nuclear norm. On the other hand, we study the generalization abilities of LRTC with the Schatten-$p$ quasi-norm regularizer and LRTC with the proposed regularizers. The theorems show that a relatively sharper regularizer leads to a tighter error bound, which is consistent with our numerical results. Particularly, we prove that for LRTC with Schatten-$p$ quasi-norm regularizer on $d$-order tensors, $p=1/d$ is always better than any $p>1/d$ in terms of the generalization ability. We also provide a recovery error bound to verify the usefulness of small $p$ in the Schatten-$p$ quasi-norm for TRPCA. Numerical results on synthetic data and real data demonstrate the effectiveness of the regularization methods and theorems.
    Efficient Human-in-the-loop System for Guiding DNNs Attention. (arXiv:2206.05981v3 [cs.CV] UPDATED)
    Attention guidance is an approach to addressing dataset bias in deep learning, where the model relies on incorrect features to make decisions. Focusing on image classification tasks, we propose an efficient human-in-the-loop system to interactively direct the attention of classifiers to the regions specified by users, thereby reducing the influence of co-occurrence bias and improving the transferability and interpretability of a DNN. Previous approaches for attention guidance require the preparation of pixel-level annotations and are not designed as interactive systems. We present a new interactive method to allow users to annotate images with simple clicks, and study a novel active learning strategy to significantly reduce the number of annotations. We conducted both a numerical evaluation and a user study to evaluate the proposed system on multiple datasets. Compared to the existing non-active-learning approach which usually relies on huge amounts of polygon-based segmentation masks to fine-tune or train the DNNs, our system can save lots of labor and money and obtain a fine-tuned network that works better even when the dataset is biased. The experiment results indicate that the proposed system is efficient, reasonable, and reliable.
    Global Optimization with Parametric Function Approximation. (arXiv:2211.09100v2 [cs.LG] UPDATED)
    We consider the problem of global optimization with noisy zeroth order oracles - a well-motivated problem useful for various applications ranging from hyper-parameter tuning for deep learning to new material design. Existing work relies on Gaussian processes or other non-parametric family, which suffers from the curse of dimensionality. In this paper, we propose a new algorithm GO-UCB that leverages a parametric family of functions (e.g., neural networks) instead. Under a realizable assumption and a few other mild geometric conditions, we show that GO-UCB achieves a cumulative regret of $\tilde{O}(\sqrt{T})$ where $T$ is the time horizon. At the core of GO-UCB is a carefully designed uncertainty set over parameters based on gradients that allows optimistic exploration. Synthetic and real-world experiments illustrate GO-UCB works better than Bayesian optimization approaches in high dimensional cases, even if the model is misspecified.
    Predicting the power grid frequency of European islands. (arXiv:2209.15414v2 [stat.AP] UPDATED)
    Modelling, forecasting and overall understanding of the dynamics of the power grid and its frequency are essential for the safe operation of existing and future power grids. Much previous research was focused on large continental areas, while small systems, such as islands are less well-studied. These natural island systems are ideal testing environments for microgrid proposals and artificially islanded grid operation. In the present paper, we utilize measurements of the power grid frequency obtained in European islands: the Faroe Islands, Ireland, the Balearic Islands and Iceland and investigate how their frequency can be predicted, compared to the Nordic power system, acting as a reference. The Balearic islands are found to be particularly deterministic and easy to predict in contrast to hard-to-predict Iceland. Furthermore, we show that typically 2-4 weeks of data are needed to improve prediction performance beyond simple benchmarks.
    Learning Solution Manifolds for Control Problems via Energy Minimization. (arXiv:2203.03432v2 [cs.RO] UPDATED)
    A variety of control tasks such as inverse kinematics (IK), trajectory optimization (TO), and model predictive control (MPC) are commonly formulated as energy minimization problems. Numerical solutions to such problems are well-established. However, these are often too slow to be used directly in real-time applications. The alternative is to learn solution manifolds for control problems in an offline stage. Although this distillation process can be trivially formulated as a behavioral cloning (BC) problem in an imitation learning setting, our experiments highlight a number of significant shortcomings arising due to incompatible local minima, interpolation artifacts, and insufficient coverage of the state space. In this paper, we propose an alternative to BC that is efficient and numerically robust. We formulate the learning of solution manifolds as a minimization of the energy terms of a control objective integrated over the space of problems of interest. We minimize this energy integral with a novel method that combines Monte Carlo-inspired adaptive sampling strategies with the derivatives used to solve individual instances of the control task. We evaluate the performance of our formulation on a series of robotic control problems of increasing complexity, and we highlight its benefits through comparisons against traditional methods such as behavioral cloning and Dataset aggregation (Dagger).
    EGC2: Enhanced Graph Classification with Easy Graph Compression. (arXiv:2107.07737v2 [cs.LG] UPDATED)
    Graph classification is crucial in network analyses. Networks face potential security threats, such as adversarial attacks. Some defense methods may trade off the algorithm complexity for robustness, such as adversarial training, whereas others may trade off clean example performance, such as smoothingbased defense. Most suffer from high complexity or low transferability. To address this problem, we proposed EGC2, an enhanced graph classification model with easy graph compression. EGC2 captures the relationship between the features of different nodes by constructing feature graphs and improving the aggregation of the node-level representations. To achieve lower-complexity defense applied to graph classification models, EGC2 utilizes a centrality-based edge-importance index to compress the graphs, filtering out trivial structures and adversarial perturbations in the input graphs, thus improving the model's robustness. Experiments on ten benchmark datasets demonstrate that the proposed feature read-out and graph compression mechanisms enhance the robustness of multiple basic models, resulting in a state-of-the-art performance in terms of accuracy and robustness against various adversarial attacks.
    On the complexity of nonsmooth automatic differentiation. (arXiv:2206.01730v2 [math.NA] UPDATED)
    Using the notion of conservative gradient, we provide a simple model to estimate the computational costs of the backward and forward modes of algorithmic differentiation for a wide class of nonsmooth programs. The overhead complexity of the backward mode turns out to be independent of the dimension when using programs with locally Lipschitz semi-algebraic or definable elementary functions. This considerably extends Baur-Strassen's smooth cheap gradient principle. We illustrate our results by establishing fast backpropagation results of conservative gradients through feedforward neural networks with standard activation and loss functions. Nonsmooth backpropagation's cheapness contrasts with concurrent forward approaches, which have, to this day, dimensional-dependent worst-case overhead estimates. We provide further results suggesting the superiority of backward propagation of conservative gradients. Indeed, we relate the complexity of computing a large number of directional derivatives to that of matrix multiplication, and we show that finding two subgradients in the Clarke subdifferential of a function is an NP-hard problem.
    Selecting the Best Optimizers for Deep Learning based Medical Image Segmentation. (arXiv:2302.02289v1 [eess.IV])
    The goal of this work is to identify the best optimizers for deep learning in the context of cardiac image segmentation and to provide guidance on how to design segmentation networks with effective optimization strategies. Adaptive learning helps with fast convergence by starting with a larger learning rate (LR) and gradually decreasing it. Momentum optimizers are particularly effective at quickly optimizing neural networks within the accelerated schemes category. By revealing the potential interplay between these two types of algorithms (LR and momentum optimizers or momentum rate (MR) in short), in this article, we explore the two variants of SGD algorithms in a single setting. We suggest using cyclic learning as the base optimizer and integrating optimal values of learning rate and momentum rate. We investigated the relationship of LR and MR under an important problem of medical image segmentation of cardiac structures from MRI and CT scans. We conducted experiments using the cardiac imaging dataset from the ACDC challenge of MICCAI 2017, and four different architectures shown to be successful for cardiac image segmentation problems. Our comprehensive evaluations demonstrated that the proposed optimizer achieved better results (over a 2\% improvement in the dice metric) than other optimizers in deep learning literature with similar or lower computational cost in both single and multi-object segmentation settings. We hypothesized that combination of accelerated and adaptive optimization methods can have a drastic effect in medical image segmentation performances. To this end, we proposed a new cyclic optimization method (\textit{CLMR}) to address the efficiency and accuracy problems in deep learning based medical image segmentation. The proposed strategy yielded better generalization in comparison to adaptive optimizers.
    Proper Learning of Linear Dynamical Systems as a Non-Commutative Polynomial Optimisation Problem. (arXiv:2002.01444v4 [math.OC] UPDATED)
    There has been much recent progress in forecasting the next observation of a linear dynamical system (LDS), which is known as the improper learning, as well as in the estimation of its system matrices, which is known as the proper learning of LDS. We present an approach to proper learning of LDS, which in spite of the non-convexity of the problem, guarantees global convergence of numerical solutions to a least-squares estimator. We present promising computational results.
    Exploring validation metrics for offline model-based optimisation. (arXiv:2211.10747v2 [stat.ML] UPDATED)
    In offline model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of desirability through an expensive but real-world scoring process. Offline MBO tries to approximate this expensive scoring function and use that to evaluate generated designs, however evaluation is non-exact because one approximation is being evaluated with another. Instead, we ask ourselves: if we did have the real world scoring function at hand, what cheap-to-compute validation metrics would correlate best with this? Since the real-world scoring function is available for simulated MBO datasets, insights obtained from this can be transferred over to real-world offline MBO tasks where the real-world scoring function is expensive to compute. To address this, we propose a conceptual evaluation framework that is amenable to measuring extrapolation, and apply this to conditional denoising diffusion models. Empirically, we find that two validation metrics -- agreement and Frechet distance -- correlate quite well with the ground truth. When there is high variability in conditional generation, feedback is required in the form of an approximated version of the real-world scoring function. Furthermore, we find that generating high-scoring samples may require heavily weighting the generative model in favour of sample quality, potentially at the cost of sample diversity.
    Reducing Nearest Neighbor Training Sets Optimally and Exactly. (arXiv:2302.02132v1 [cs.CG])
    In nearest-neighbor classification, a training set $P$ of points in $\mathbb{R}^d$ with given classification is used to classify every point in $\mathbb{R}^d$: Every point gets the same classification as its nearest neighbor in $P$. Recently, Eppstein [SOSA'22] developed an algorithm to detect the relevant training points, those points $p\in P$, such that $P$ and $P\setminus\{p\}$ induce different classifications. We investigate the problem of finding the minimum cardinality reduced training set $P'\subseteq P$ such that $P$ and $P'$ induce the same classification. We show that the set of relevant points is such a minimum cardinality reduced training set if $P$ is in general position. Furthermore, we show that finding a minimum cardinality reduced training set for possibly degenerate $P$ is in P for $d=1$, and NP-complete for $d\geq 2$.
    Maximizing Global Model Appeal in Federated Learning. (arXiv:2205.14840v2 [cs.LG] UPDATED)
    Federated learning typically considers collaboratively training a global model using local data at edge clients. Clients may have their own individual requirements, such as having a minimal training loss threshold, which they expect to be met by the global model. However, due to client heterogeneity, the global model may not meet each client's requirements, and only a small subset may find the global model appealing. In this work, we explore the problem of the global model lacking appeal to the clients due to not being able to satisfy local requirements. We propose MaxFL, which aims to maximize the number of clients that find the global model appealing. We show that having a high global model appeal is important to maintain an adequate pool of clients for training, and can directly improve the test accuracy on both seen and unseen clients. We provide convergence guarantees for MaxFL and show that MaxFL achieves a $22$-$40\%$ and $18$-$50\%$ test accuracy improvement for the training clients and unseen clients respectively, compared to a wide range of FL modeling approaches, including those that tackle data heterogeneity, aim to incentivize clients, and learn personalized or fair models.
    Robust Empirical Risk Minimization with Tolerance. (arXiv:2210.00635v2 [cs.LG] UPDATED)
    Developing simple, sample-efficient learning algorithms for robust classification is a pressing issue in today's tech-dominated world, and current theoretical techniques requiring exponential sample complexity and complicated improper learning rules fall far from answering the need. In this work we study the fundamental paradigm of (robust) $\textit{empirical risk minimization}$ (RERM), a simple process in which the learner outputs any hypothesis minimizing its training error. RERM famously fails to robustly learn VC classes (Montasser et al., 2019a), a bound we show extends even to `nice' settings such as (bounded) halfspaces. As such, we study a recent relaxation of the robust model called $\textit{tolerant}$ robust learning (Ashtiani et al., 2022) where the output classifier is compared to the best achievable error over slightly larger perturbation sets. We show that under geometric niceness conditions, a natural tolerant variant of RERM is indeed sufficient for $\gamma$-tolerant robust learning VC classes over $\mathbb{R}^d$, and requires only $\tilde{O}\left( \frac{VC(H)d\log \frac{D}{\gamma\delta}}{\epsilon^2}\right)$ samples for robustness regions of (maximum) diameter $D$.
    Improving Fair Training under Correlation Shifts. (arXiv:2302.02323v1 [cs.LG])
    Model fairness is an essential element for Trustworthy AI. While many techniques for model fairness have been proposed, most of them assume that the training and deployment data distributions are identical, which is often not true in practice. In particular, when the bias between labels and sensitive groups changes, the fairness of the trained model is directly influenced and can worsen. We make two contributions for solving this problem. First, we analytically show that existing in-processing fair algorithms have fundamental limits in accuracy and group fairness. We introduce the notion of correlation shifts, which can explicitly capture the change of the above bias. Second, we propose a novel pre-processing step that samples the input data to reduce correlation shifts and thus enables the in-processing approaches to overcome their limitations. We formulate an optimization problem for adjusting the data ratio among labels and sensitive groups to reflect the shifted correlation. A key benefit of our approach lies in decoupling the roles of pre- and in-processing approaches: correlation adjustment via pre-processing and unfairness mitigation on the processed data via in-processing. Experiments show that our framework effectively improves existing in-processing fair algorithms w.r.t. accuracy and fairness, both on synthetic and real datasets.
    A Characteristic Function for Shapley-Value-Based\\Attribution of Anomaly Scores. (arXiv:2004.04464v2 [cs.LG] UPDATED)
    In anomaly detection, the degree of irregularity is often summarized as a real-valued anomaly score. We address the problem of attributing such anomaly scores to input features for interpreting the results of anomaly detection. We particularly investigate the use of the Shapley value for attributing anomaly scores of semi-supervised detection methods. We propose a characteristic function specifically designed for attributing anomaly scores. The idea is to approximate the absence of some features by locally minimizing the anomaly score with regard to the to-be-absent features. We examine the applicability of the proposed characteristic function and other general approaches for interpreting anomaly scores on multiple datasets and multiple anomaly detection methods. The results indicate the potential utility of the attribution methods including the proposed one.
    Hierarchical Sliced Wasserstein Distance. (arXiv:2209.13570v5 [stat.ML] UPDATED)
    Sliced Wasserstein (SW) distance has been widely used in different application scenarios since it can be scaled to a large number of supports without suffering from the curse of dimensionality. The value of sliced Wasserstein distance is the average of transportation cost between one-dimensional representations (projections) of original measures that are obtained by Radon Transform (RT). Despite its efficiency in the number of supports, estimating the sliced Wasserstein requires a relatively large number of projections in high-dimensional settings. Therefore, for applications where the number of supports is relatively small compared with the dimension, e.g., several deep learning applications where the mini-batch approaches are utilized, the complexities from matrix multiplication of Radon Transform become the main computational bottleneck. To address this issue, we propose to derive projections by linearly and randomly combining a smaller number of projections which are named bottleneck projections. We explain the usage of these projections by introducing Hierarchical Radon Transform (HRT) which is constructed by applying Radon Transform variants recursively. We then formulate the approach into a new metric between measures, named Hierarchical Sliced Wasserstein (HSW) distance. By proving the injectivity of HRT, we derive the metricity of HSW. Moreover, we investigate the theoretical properties of HSW including its connection to SW variants and its computational and sample complexities. Finally, we compare the computational cost and generative quality of HSW with the conventional SW on the task of deep generative modeling using various benchmark datasets including CIFAR10, CelebA, and Tiny ImageNet.
    Linear-Time Gromov Wasserstein Distances using Low Rank Couplings and Costs. (arXiv:2106.01128v2 [cs.LG] UPDATED)
    The ability to align points across two related yet incomparable point clouds (e.g. living in different spaces) plays an important role in machine learning. The Gromov-Wasserstein (GW) framework provides an increasingly popular answer to such problems, by seeking a low-distortion, geometry-preserving assignment between these points. As a non-convex, quadratic generalization of optimal transport (OT), GW is NP-hard. While practitioners often resort to solving GW approximately as a nested sequence of entropy-regularized OT problems, the cubic complexity (in the number $n$ of samples) of that approach is a roadblock. We show in this work how a recent variant of the OT problem that restricts the set of admissible couplings to those having a low-rank factorization is remarkably well suited to the resolution of GW: when applied to GW, we show that this approach is not only able to compute a stationary point of the GW problem in time $O(n^2)$, but also uniquely positioned to benefit from the knowledge that the initial cost matrices are low-rank, to yield a linear time $O(n)$ GW approximation. Our approach yields similar results, yet orders of magnitude faster computation than the SoTA entropic GW approaches, on both simulated and real data.
    Deep Graph-Level Clustering Using Pseudo-Label-Guided Mutual Information Maximization Network. (arXiv:2302.02369v1 [cs.LG])
    In this work, we study the problem of partitioning a set of graphs into different groups such that the graphs in the same group are similar while the graphs in different groups are dissimilar. This problem was rarely studied previously, although there have been a lot of work on node clustering and graph classification. The problem is challenging because it is difficult to measure the similarity or distance between graphs. One feasible approach is using graph kernels to compute a similarity matrix for the graphs and then performing spectral clustering, but the effectiveness of existing graph kernels in measuring the similarity between graphs is very limited. To solve the problem, we propose a novel method called Deep Graph-Level Clustering (DGLC). DGLC utilizes a graph isomorphism network to learn graph-level representations by maximizing the mutual information between the representations of entire graphs and substructures, under the regularization of a clustering module that ensures discriminative representations via pseudo labels. DGLC achieves graph-level representation learning and graph-level clustering in an end-to-end manner. The experimental results on six benchmark datasets of graphs show that our DGLC has state-of-the-art performance in comparison to many baselines.
    Sparse GCA and Thresholded Gradient Descent. (arXiv:2107.00371v2 [stat.ML] UPDATED)
    Generalized correlation analysis (GCA) is concerned with uncovering linear relationships across multiple datasets. It generalizes canonical correlation analysis that is designed for two datasets. We study sparse GCA when there are potentially multiple generalized correlation tuples in data and the loading matrix has a small number of nonzero rows. It includes sparse CCA and sparse PCA of correlation matrices as special cases. We first formulate sparse GCA as generalized eigenvalue problems at both population and sample levels via a careful choice of normalization constraints. Based on a Lagrangian form of the sample optimization problem, we propose a thresholded gradient descent algorithm for estimating GCA loading vectors and matrices in high dimensions. We derive tight estimation error bounds for estimators generated by the algorithm with proper initialization. We also demonstrate the prowess of the algorithm on a number of synthetic datasets.
    CCSL: A Causal Structure Learning Method from Multiple Unknown Environments. (arXiv:2111.09666v2 [cs.LG] UPDATED)
    Most existing causal structure learning methods assume data collected from one environment and independent and identically distributed (i.i.d.). In some cases, data are collected from different subjects from multiple environments, which provides more information but might make the data non-identical or non-independent distribution. Some previous efforts try to learn causal structure from this type of data in two independent stages, i.e., first discovering i.i.d. groups from non-i.i.d. samples, then learning the causal structures from different groups. This straightforward solution ignores the intrinsic connections between the two stages, that is both the clustering stage and the learning stage should be guided by the same causal mechanism. Towards this end, we propose a unified Causal Cluster Structures Learning (named CCSL) method for causal discovery from non-i.i.d. data. This method simultaneously integrates the following two tasks: 1) clustering samples of the subjects with the same causal mechanism into different groups; 2) learning causal structures from the samples within the group. Specifically, for the former, we provide a Causality-related Chinese Restaurant Process to cluster samples based on the similarity of the causal structure; for the latter, we introduce a variational-inference-based approach to learn the causal structures. Theoretical results provide identification of the causal model and the clustering model under the linear non-Gaussian assumption. Experimental results on both simulated and real-world data further validate the correctness and effectiveness of the proposed method.
    Human alignment of neural network representations. (arXiv:2211.01201v3 [cs.CV] UPDATED)
    Today's computer vision models achieve human or near-human level performance across a wide variety of vision tasks. However, their architectures, data, and learning algorithms differ in numerous ways from those that give rise to human vision. In this paper, we investigate the factors that affect the alignment between the representations learned by neural networks and human mental representations inferred from behavioral responses. We find that model scale and architecture have essentially no effect on the alignment with human behavioral responses, whereas the training dataset and objective function both have a much larger impact. These findings are consistent across three datasets of human similarity judgments collected using two different tasks. Linear transformations of neural network representations learned from behavioral responses from one dataset substantially improve alignment with human similarity judgments on the other two datasets. In addition, we find that some human concepts such as food and animals are well-represented by neural networks whereas others such as royal or sports-related objects are not. Overall, although models trained on larger, more diverse datasets achieve better alignment with humans than models trained on ImageNet alone, our results indicate that scaling alone is unlikely to be sufficient to train neural networks with conceptual representations that match those used by humans.
    Bayesian Fixed-Budget Best-Arm Identification. (arXiv:2211.08572v2 [cs.LG] UPDATED)
    Fixed-budget best-arm identification (BAI) is a bandit problem where the agent maximizes the probability of identifying the optimal arm within a fixed budget of observations. In this work, we study this problem in the Bayesian setting. We propose a Bayesian elimination algorithm and derive an upper bound on its probability of misidentifying the optimal arm. The bound reflects the quality of the prior and is the first distribution-dependent bound in this setting. We prove it using a frequentist-like argument, where we carry the prior through, and then integrate out the bandit instance at the end. We also provide the first lower bound on the probability of misidentification in a $2$-armed Bayesian bandit and show that our upper bound (almost) matches the lower bound. Our experiments show that Bayesian elimination is superior to frequentist methods and competitive with the state-of-the-art Bayesian algorithms that have no guarantees in our setting.
    Equi-Tuning: Group Equivariant Fine-Tuning of Pretrained Models. (arXiv:2210.06475v2 [cs.LG] UPDATED)
    We introduce equi-tuning, a novel fine-tuning method that transforms (potentially non-equivariant) pretrained models into group equivariant models while incurring minimum $L_2$ loss between the feature representations of the pretrained and the equivariant models. Large pretrained models can be equi-tuned for different groups to satisfy the needs of various downstream tasks. Equi-tuned models benefit from both group equivariance as an inductive bias and semantic priors from pretrained models. We provide applications of equi-tuning on three different tasks: image classification, compositional generalization in language, and fairness in natural language generation (NLG). We also provide a novel group-theoretic definition for fairness in NLG. The effectiveness of this definition is shown by testing it against a standard empirical method of fairness in NLG. We provide experimental results for equi-tuning using a variety of pretrained models: Alexnet, Resnet, VGG, and Densenet for image classification; RNNs, GRUs, and LSTMs for compositional generalization; and GPT2 for fairness in NLG. We test these models on benchmark datasets across all considered tasks to show the generality and effectiveness of the proposed method.
    Multi-Domain Long-Tailed Learning by Augmenting Disentangled Representations. (arXiv:2210.14358v2 [cs.LG] UPDATED)
    There is an inescapable long-tailed class-imbalance issue in many real-world classification problems. Existing long-tailed classification methods focus on the single-domain setting, where all examples are drawn from the same distribution. However, real-world scenarios often involve multiple domains with distinct imbalanced class distributions. We study this multi-domain long-tailed learning problem and aim to produce a model that generalizes well across all classes and domains. Towards that goal, we introduce TALLY, which produces invariant predictors by balanced augmenting hidden representations over domains and classes. Built upon a proposed selective balanced sampling strategy, TALLY achieves this by mixing the semantic representation of one example with the domain-associated nuisances of another, producing a new representation for use as data augmentation. To improve the disentanglement of semantic representations, TALLY further utilizes a domain-invariant class prototype that averages out domain-specific effects. We evaluate TALLY on four long-tailed variants of classical domain generalization benchmarks and two real-world imbalanced multi-domain datasets. The results indicate that TALLY consistently outperforms other state-of-the-art methods in both subpopulation shift and domain shift.
    Global Context Vision Transformers. (arXiv:2206.09959v4 [cs.CV] UPDATED)
    We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision tasks. The core of the novel model are global context self-attention modules, joint with standard local self-attention, to effectively yet efficiently model both long and short-range spatial interactions, as an alternative to complex operations such as an attention masks or local windows shifting. While the local self-attention modules are responsible for modeling short-range information, the global query tokens are shared across all global self-attention modules to interact with local key and values. In addition, we address the lack of inductive bias in ViTs and improve the modeling of inter-channel dependencies by proposing a novel downsampler which leverages a parameter-efficient fused inverted residual block. The proposed GC ViT achieves new state-of-the-art performance across image classification, object detection and semantic segmentation tasks. On ImageNet-1K dataset for classification, GC ViT models with 51M, 90M and 201M parameters achieve 84.3%, 84.9% and 85.6% Top-1 accuracy, respectively, surpassing comparably-sized prior art such as CNN-based ConvNeXt and ViT-based Swin Transformer. Pre-trained GC ViT backbones in downstream tasks of object detection, instance segmentation, and semantic segmentation on MS COCO and ADE20K datasets outperform prior work consistently, sometimes by large margins.
    Gender Bias in Fake News: An Analysis. (arXiv:2209.11984v3 [cs.CY] UPDATED)
    Data science research into fake news has gathered much momentum in recent years, arguably facilitated by the emergence of large public benchmark datasets. While it has been well-established within media studies that gender bias is an issue that pervades news media, there has been very little exploration into the relationship between gender bias and fake news. In this work, we provide the first empirical analysis of gender bias vis-a-vis fake news, leveraging simple and transparent lexicon-based methods over public benchmark datasets. Our analysis establishes the increased prevalance of gender bias in fake news across three facets viz., abundance, affect and proximal words. The insights from our analysis provide a strong argument that gender bias needs to be an important consideration in research into fake news.
    A Permutation-free Kernel Two-Sample Test. (arXiv:2211.14908v2 [stat.ME] UPDATED)
    The kernel Maximum Mean Discrepancy~(MMD) is a popular multivariate distance metric between distributions that has found utility in two-sample testing. The usual kernel-MMD test statistic is a degenerate U-statistic under the null, and thus it has an intractable limiting distribution. Hence, to design a level-$\alpha$ test, one usually selects the rejection threshold as the $(1-\alpha)$-quantile of the permutation distribution. The resulting nonparametric test has finite-sample validity but suffers from large computational cost, since every permutation takes quadratic time. We propose the cross-MMD, a new quadratic-time MMD test statistic based on sample-splitting and studentization. We prove that under mild assumptions, the cross-MMD has a limiting standard Gaussian distribution under the null. Importantly, we also show that the resulting test is consistent against any fixed alternative, and when using the Gaussian kernel, it has minimax rate-optimal power against local alternatives. For large sample sizes, our new cross-MMD provides a significant speedup over the MMD, for only a slight loss in power.
    Deep Latent State Space Models for Time-Series Generation. (arXiv:2212.12749v3 [stat.ML] UPDATED)
    Methods based on ordinary differential equations (ODEs) are widely used to build generative models of time-series. In addition to high computational overhead due to explicitly computing hidden states recurrence, existing ODE-based models fall short in learning sequence data with sharp transitions - common in many real-world systems - due to numerical challenges during optimization. In this work, we propose LS4, a generative model for sequences with latent variables evolving according to a state space ODE to increase modeling capacity. Inspired by recent deep state space models (S4), we achieve speedups by leveraging a convolutional representation of LS4 which bypasses the explicit evaluation of hidden states. We show that LS4 significantly outperforms previous continuous-time generative models in terms of marginal distribution, classification, and prediction scores on real-world datasets in the Monash Forecasting Repository, and is capable of modeling highly stochastic data with sharp temporal transitions. LS4 sets state-of-the-art for continuous-time latent generative models, with significant improvement of mean squared error and tighter variational lower bounds on irregularly-sampled datasets, while also being x100 faster than other baselines on long sequences.
    Uncovering Adversarial Risks of Test-Time Adaptation. (arXiv:2301.12576v2 [cs.LG] UPDATED)
    Recently, test-time adaptation (TTA) has been proposed as a promising solution for addressing distribution shifts. It allows a base model to adapt to an unforeseen distribution during inference by leveraging the information from the batch of (unlabeled) test data. However, we uncover a novel security vulnerability of TTA based on the insight that predictions on benign samples can be impacted by malicious samples in the same batch. To exploit this vulnerability, we propose Distribution Invading Attack (DIA), which injects a small fraction of malicious data into the test batch. DIA causes models using TTA to misclassify benign and unperturbed test data, providing an entirely new capability for adversaries that is infeasible in canonical machine learning pipelines. Through comprehensive evaluations, we demonstrate the high effectiveness of our attack on multiple benchmarks across six TTA methods. In response, we investigate two countermeasures to robustify the existing insecure TTA implementations, following the principle of "security by design". Together, we hope our findings can make the community aware of the utility-security tradeoffs in deploying TTA and provide valuable insights for developing robust TTA approaches.
    Self-supervised Semi-implicit Graph Variational Auto-encoders with Masking. (arXiv:2301.12458v2 [cs.LG] UPDATED)
    Generative graph self-supervised learning (SSL) aims to learn node representations by reconstructing the input graph data. However, most existing methods focus on unsupervised learning tasks only and very few work has shown its superiority over the state-of-the-art graph contrastive learning (GCL) models, especially on the classification task. While a very recent model has been proposed to bridge the gap, its performance on unsupervised learning tasks is still unknown. In this paper, to comprehensively enhance the performance of generative graph SSL against other GCL models on both unsupervised and supervised learning tasks, we propose the SeeGera model, which is based on the family of self-supervised variational graph auto-encoder (VGAE). Specifically, SeeGera adopts the semi-implicit variational inference framework, a hierarchical variational framework, and mainly focuses on feature reconstruction and structure/feature masking. On the one hand, SeeGera co-embeds both nodes and features in the encoder and reconstructs both links and features in the decoder. Since feature embeddings contain rich semantic information on features, they can be combined with node embeddings to provide fine-grained knowledge for feature reconstruction. On the other hand, SeeGera adds an additional layer for structure/feature masking to the hierarchical variational framework, which boosts the model generalizability. We conduct extensive experiments comparing SeeGera with 9 other state-of-the-art competitors. Our results show that SeeGera can compare favorably against other state-of-the-art GCL methods in a variety of unsupervised and supervised learning tasks.
    Beyond NaN: Resiliency of Optimization Layers in The Face of Infeasibility. (arXiv:2202.06242v2 [cs.LG] UPDATED)
    Prior work has successfully incorporated optimization layers as the last layer in neural networks for various problems, thereby allowing joint learning and planning in one neural network forward pass. In this work, we identify a weakness in such a set-up where inputs to the optimization layer lead to undefined output of the neural network. Such undefined decision outputs can lead to possible catastrophic outcomes in critical real time applications. We show that an adversary can cause such failures by forcing rank deficiency on the matrix fed to the optimization layer which results in the optimization failing to produce a solution. We provide a defense for the failure cases by controlling the condition number of the input matrix. We study the problem in the settings of synthetic data, Jigsaw Sudoku, and in speed planning for autonomous driving, building on top of prior frameworks in end-to-end learning and optimization. We show that our proposed defense effectively prevents the framework from failing with undefined output. Finally, we surface a number of edge cases which lead to serious bugs in popular equation and optimization solvers which can be abused as well.
    FairMILE: A Multi-Level Framework for Fair and Scalable Graph Representation Learning. (arXiv:2211.09925v2 [cs.LG] UPDATED)
    Graph representation learning models have been deployed for making decisions in multiple high-stakes scenarios. It is therefore critical to ensure that these models are fair. Prior research has shown that graph neural networks can inherit and reinforce the bias present in graph data. Researchers have begun to examine ways to mitigate the bias in such models. However, existing efforts are restricted by their inefficiency, limited applicability, and the constraints they place on sensitive attributes. To address these issues, we present FairMILE a general framework for fair and scalable graph representation learning. FairMILE is a multi-level framework that allows contemporary unsupervised graph embedding methods to scale to large graphs in an agnostic manner. FairMILE learns both fair and high-quality node embeddings where the fairness constraints are incorporated in each phase of the framework. Our experiments across two distinct tasks demonstrate that FairMILE can learn node representations that often achieve superior fairness scores and high downstream performance while significantly outperforming all the baselines in terms of efficiency.
    Benchmarking optimality of time series classification methods in distinguishing diffusions. (arXiv:2301.13112v2 [stat.ML] UPDATED)
    Performance benchmarking is a crucial component of time series classification (TSC) algorithm design, and a fast-growing number of datasets have been established for empirical benchmarking. However, the empirical benchmarks are costly and do not guarantee statistical optimality. This study proposes to benchmark the optimality of TSC algorithms in distinguishing diffusion processes by the likelihood ratio test (LRT). The LRT is optimal in the sense of the Neyman-Pearson lemma: it has the smallest false positive rate among classifiers with a controlled level of false negative rate. The LRT requires the likelihood ratio of the time series to be computable. The diffusion processes from stochastic differential equations provide such time series and are flexible in design for generating linear or nonlinear time series. We demonstrate the benchmarking with three scalable state-of-the-art TSC algorithms: random forest, ResNet, and ROCKET. Test results show that they can achieve LRT optimality for univariate time series and multivariate Gaussian processes. However, these model-agnostic algorithms are suboptimal in classifying nonlinear multivariate time series from high-dimensional stochastic interacting particle systems. Additionally, the LRT benchmark provides tools to analyze the dependence of classification accuracy on the time length, dimension, temporal sampling frequency, and randomness of the time series. Thus, the LRT with diffusion processes can systematically and efficiently benchmark the optimality of TSC algorithms and may guide their future improvements.
    Harnessing Simulation for Molecular Embeddings. (arXiv:2302.02055v1 [cs.LG])
    While deep learning has unlocked advances in computational biology once thought to be decades away, extending deep learning techniques to the molecular domain has proven challenging, as labeled data is scarce and the benefit from self-supervised learning can be negligible in many cases. In this work, we explore a different approach. Inspired by methods in deep reinforcement learning and robotics, we explore harnessing physics-based molecular simulation to develop molecular embeddings. By fitting a Graph Neural Network to simulation data, molecules that display similar interactions with biological targets under simulation develop similar representations in the embedding space. These embeddings can then be used to initialize the feature space of down-stream models trained on real-world data to encode information learned during simulation into a molecular prediction task. Our experimental findings indicate this approach improves the performance of existing deep learning models on real-world molecular prediction tasks by as much as 38% with minimal modification to the downstream model and no hyperparameter tuning.
    Neural Optimal Transport. (arXiv:2201.12220v2 [cs.LG] UPDATED)
    We present a novel neural-networks-based algorithm to compute optimal transport maps and plans for strong and weak transport costs. To justify the usage of neural networks, we prove that they are universal approximators of transport plans between probability distributions. We evaluate the performance of our optimal transport algorithm on toy examples and on the unpaired image-to-image translation.
    Introspective Experience Replay: Look Back When Surprised. (arXiv:2206.03171v4 [cs.LG] UPDATED)
    In reinforcement learning (RL), experience replay-based sampling techniques play a crucial role in promoting convergence by eliminating spurious correlations. However, widely used methods such as uniform experience replay (UER) and prioritized experience replay (PER) have been shown to have sub-optimal convergence and high seed sensitivity respectively. To address these issues, we propose a novel approach called IntrospectiveExperience Replay (IER) that selectively samples batches of data points prior to surprising events. Our method builds upon the theoretically sound reverse experience replay (RER) technique, which has been shown to reduce bias in the output of Q-learning-type algorithms with linear function approximation. However, this approach is not always practical or reliable when using neural function approximation. Through empirical evaluations, we demonstrate that IER with neural function approximation yields reliable and superior performance compared toUER, PER, and hindsight experience replay (HER) across most tasks.
    Confidence-Ranked Reconstruction of Census Microdata from Published Statistics. (arXiv:2211.03128v2 [cs.CY] UPDATED)
    A reconstruction attack on a private dataset $D$ takes as input some publicly accessible information about the dataset and produces a list of candidate elements of $D$. We introduce a new class of data reconstruction attacks based on randomized methods for non-convex optimization. We empirically demonstrate that our attacks can not only reconstruct full rows of $D$ from aggregate query statistics $Q(D)\in \mathbb{R}^m$, but can do so in a way that reliably ranks reconstructed rows by their odds of appearing in the private data, providing a signature that could be used for prioritizing reconstructed rows for further actions such as identify theft or hate crime. We also design a sequence of baselines for evaluating reconstruction attacks. Our attacks significantly outperform those that are based only on access to a public distribution or population from which the private dataset $D$ was sampled, demonstrating that they are exploiting information in the aggregate statistics $Q(D)$, and not simply the overall structure of the distribution. In other words, the queries $Q(D)$ are permitting reconstruction of elements of this dataset, not the distribution from which $D$ was drawn. These findings are established both on 2010 U.S. decennial Census data and queries and Census-derived American Community Survey datasets. Taken together, our methods and experiments illustrate the risks in releasing numerically precise aggregate statistics of a large dataset, and provide further motivation for the careful application of provably private techniques such as differential privacy.
    Reliable Conditioning of Behavioral Cloning for Offline Reinforcement Learning. (arXiv:2210.05158v2 [cs.LG] UPDATED)
    Behavioral cloning (BC) provides a straightforward solution to offline RL by mimicking offline trajectories via supervised learning. Recent advances (Chen et al., 2021; Janner et al., 2021; Emmons et al., 2021) have shown that by conditioning on desired future returns, BC can perform competitively to their value-based counterparts, while enjoying much more simplicity and training stability. While promising, we show that these methods can be unreliable, as their performance may degrade significantly when conditioned on high, out-of-distribution (ood) returns. This is crucial in practice, as we often expect the policy to perform better than the offline dataset by conditioning on an ood value. We show that this unreliability arises from both the suboptimality of training data and model architectures. We propose ConserWeightive Behavioral Cloning (CWBC), a simple and effective method for improving the reliability of conditional BC with two key components: trajectory weighting and conservative regularization. Trajectory weighting upweights the high-return trajectories to reduce the train-test gap for BC methods, while conservative regularizer encourages the policy to stay close to the data distribution for ood conditioning. We study CWBC in the context of RvS (Emmons et al., 2021) and Decision Transformers (Chen et al., 2021), and show that CWBC significantly boosts their performance on various benchmarks.
    Re-parameterizing Your Optimizers rather than Architectures. (arXiv:2205.15242v3 [cs.LG] UPDATED)
    The well-designed structures in neural networks reflect the prior knowledge incorporated into the models. However, though different models have various priors, we are used to training them with model-agnostic optimizers such as SGD. In this paper, we propose to incorporate model-specific prior knowledge into optimizers by modifying the gradients according to a set of model-specific hyper-parameters. Such a methodology is referred to as Gradient Re-parameterization, and the optimizers are named RepOptimizers. For the extreme simplicity of model structure, we focus on a VGG-style plain model and showcase that such a simple model trained with a RepOptimizer, which is referred to as RepOpt-VGG, performs on par with or better than the recent well-designed models. From a practical perspective, RepOpt-VGG is a favorable base model because of its simple structure, high inference speed and training efficiency. Compared to Structural Re-parameterization, which adds priors into models via constructing extra training-time structures, RepOptimizers require no extra forward/backward computations and solve the problem of quantization. We hope to spark further research beyond the realms of model structure design. Code and models \url{https://github.com/DingXiaoH/RepOptimizers}.
    Flow-matching -- efficient coarse-graining of molecular dynamics without forces. (arXiv:2203.11167v4 [physics.comp-ph] UPDATED)
    Coarse-grained (CG) molecular simulations have become a standard tool to study molecular processes on time- and length-scales inaccessible to all-atom simulations. Parameterizing CG force fields to match all-atom simulations has mainly relied on force-matching or relative entropy minimization, which require many samples from costly simulations with all-atom or CG resolutions, respectively. Here we present flow-matching, a new training method for CG force fields that combines the advantages of both methods by leveraging normalizing flows, a generative deep learning method. Flow-matching first trains a normalizing flow to represent the CG probability density, which is equivalent to minimizing the relative entropy without requiring iterative CG simulations. Subsequently, the flow generates samples and forces according to the learned distribution in order to train the desired CG free energy model via force matching. Even without requiring forces from the all-atom simulations, flow-matching outperforms classical force-matching by an order of magnitude in terms of data efficiency, and produces CG models that can capture the folding and unfolding transitions of small proteins.
    Tensor Decomposition of Large-scale Clinical EEGs Reveals Interpretable Patterns of Brain Physiology. (arXiv:2211.13793v2 [eess.SP] UPDATED)
    Identifying abnormal patterns in electroencephalography (EEG) remains the cornerstone of diagnosing several neurological diseases. The current clinical EEG review process relies heavily on expert visual review, which is unscalable and error-prone. In an effort to augment the expert review process, there is a significant interest in mining population-level EEG patterns using unsupervised approaches. Current approaches rely either on two-dimensional decompositions (e.g., principal and independent component analyses) or deep representation learning (e.g., auto-encoders, self-supervision). However, most approaches do not leverage the natural multi-dimensional structure of EEGs and lack interpretability. In this study, we propose a tensor decomposition approach using the canonical polyadic decomposition to discover a parsimonious set of population-level EEG patterns, retaining the natural multi-dimensional structure of EEGs (time x space x frequency). We then validate their clinical value using a cohort of patients including varying stages of cognitive impairment. Our results show that the discovered patterns reflect physiologically meaningful features and accurately classify the stages of cognitive impairment (healthy vs mild cognitive impairment vs Alzheimer's dementia) with substantially fewer features compared to classical and deep learning-based baselines. We conclude that the decomposition of population-level EEG tensors recovers expert-interpretable EEG patterns that can aid in the study of smaller specialized clinical cohorts.
    Individual Privacy Accounting for Differentially Private Stochastic Gradient Descent. (arXiv:2206.02617v5 [cs.LG] UPDATED)
    Differentially private stochastic gradient descent (DP-SGD) is the workhorse algorithm for recent advances in private deep learning. It provides a single privacy guarantee to all datapoints in the dataset. We propose output-specific $(\varepsilon,\delta)$-DP to characterize privacy guarantees for individual examples when releasing models trained by DP-SGD. We also design an efficient algorithm to investigate individual privacy across a number of datasets. We find that most examples enjoy stronger privacy guarantees than the worst-case bound. We further discover that the training loss and the privacy parameter of an example are well-correlated. This implies groups that are underserved in terms of model utility simultaneously experience weaker privacy guarantees. For example, on CIFAR-10, the average $\varepsilon$ of the class with the lowest test accuracy is 44.2% higher than that of the class with the highest accuracy.
    Continuous Forecasting via Neural Eigen Decomposition. (arXiv:2202.00117v3 [cs.LG] UPDATED)
    Neural differential equations predict the derivative of a stochastic process. This allows irregular forecasting with arbitrary time-steps. However, the expressive temporal flexibility often comes with a high sensitivity to noise. In addition, current methods model measurements and control together, limiting generalization to different control policies. These properties severely limit applicability to medical treatment problems, which require reliable forecasting given high noise, limited data and changing treatment policies. We introduce the Neural Eigen-SDE algorithm (NESDE), which relies on piecewise linear dynamics modeling with spectral representation. NESDE provides control over the expressiveness level; decoupling of control from measurements; and closed-form continuous prediction in inference. NESDE is demonstrated to provide robust forecasting in both synthetic and real high-noise medical problems. Finally, we use the learned dynamics models to publish simulated medical gym environments.
    Projecting Non-Fungible Token (NFT) Collections: A Contextual Generative Approach. (arXiv:2210.15493v2 [q-fin.CP] UPDATED)
    Non-fungible tokens (NFTs) are digital assets stored on a blockchain representing real-world objects such as art or collectibles. An NFT collection comprises numerous tokens; each token can be transacted multiple times. It is a multibillion-dollar market where the number of collections has more than doubled in 2022. In this paper, we want to obtain a generative model that, given the early transactions history (first quarter Q1) of a newly minted collection, generates subsequent transactions (quarters Q2, Q3, Q4), where the generative model is trained using the transaction history of a few mature collections. The goal is to use the generated transactions to project the potential market value of this newly minted collection over the next few quarters. A technical challenge exists in that different collections have diverse characteristics, and the generative model should generate based on the appropriate "contexts" of the collection. Our method takes a two-step approach. First, it employs unsupervised learning on the early transactions to extract characteristics (which we call contexts) of NFT collections. Next, it generates future transactions of each token based on these contexts and the early transactions, projecting the target collection's potential market value. Comprehensive experiments demonstrate our contextual generative approach's NFT projection capabilities.
    Achieve the Minimum Width of Neural Networks for Universal Approximation. (arXiv:2209.11395v2 [cs.LG] UPDATED)
    The universal approximation property (UAP) of neural networks is fundamental for deep learning, and it is well known that wide neural networks are universal approximators of continuous functions within both the $L^p$ norm and the continuous/uniform norm. However, the exact minimum width, $w_{\min}$, for the UAP has not been studied thoroughly. Recently, using a decoder-memorizer-encoder scheme, \citet{Park2021Minimum} found that $w_{\min} = \max(d_x+1,d_y)$ for both the $L^p$-UAP of ReLU networks and the $C$-UAP of ReLU+STEP networks, where $d_x,d_y$ are the input and output dimensions, respectively. In this paper, we consider neural networks with an arbitrary set of activation functions. We prove that both $C$-UAP and $L^p$-UAP for functions on compact domains share a universal lower bound of the minimal width; that is, $w^*_{\min} = \max(d_x,d_y)$. In particular, the critical width, $w^*_{\min}$, for $L^p$-UAP can be achieved by leaky-ReLU networks, provided that the input or output dimension is larger than one. Our construction is based on the approximation power of neural ordinary differential equations and the ability to approximate flow maps by neural networks. The nonmonotone or discontinuous activation functions case and the one-dimensional case are also discussed.
    Offline Equilibrium Finding. (arXiv:2207.05285v2 [cs.AI] UPDATED)
    Offline reinforcement learning (offline RL) is an emerging field that has recently begun gaining attention across various application domains due to its ability to learn strategies from earlier collected datasets. Offline RL proved very successful, paving a path to solving previously intractable real-world problems, and we aim to generalize this paradigm to a multiplayer-game setting. To this end, we introduce a problem of offline equilibrium finding (OEF) and construct multiple types of datasets across a wide range of games using several established methods. To solve the OEF problem, we design a model-based framework that can directly apply any online equilibrium finding algorithm to the OEF setting while making minimal changes. The three most prominent contemporary online equilibrium finding algorithms are adapted to the context of OEF, creating three model-based variants: OEF-PSRO and OEF-CFR, which generalize the widely-used algorithms PSRO and Deep CFR to compute Nash equilibria (NEs), and OEF-JPSRO, which generalizes the JPSRO to calculate (Coarse) Correlated equilibria ((C)CEs). We also combine the behavior cloning policy with the model-based policy to further improve the performance and provide a theoretical guarantee of the solution quality. Extensive experimental results demonstrate the superiority of our approach over offline RL algorithms and the importance of using model-based methods for OEF problems. We hope our work will contribute to advancing research in large-scale equilibrium finding.
    Efficient Adaptive Activation Rounding for Post-Training Quantization. (arXiv:2208.11945v2 [cs.LG] UPDATED)
    Post-training quantization (PTQ) attracts increasing attention due to its convenience in deploying quantized neural networks. Rounding is the primary source of quantization error, for which previous works adopt the rounding-to-nearest scheme with a constant border of 0.5. This work demonstrates that optimizing rounding schemes can improve model accuracy. By replacing the constant border with a simple border function, we can obtain the minimal error for multiplying two numbers and eliminate the bias of its expected value, which further benefits model accuracy. Based on this insight, we approximate the border function to make the incurred overhead negligible. We also jointly optimize propagated errors and global errors. We finally propose our AQuant framework, which can learn the border function automatically. Extensive experiments show that AQuant achieves noticeable improvements compared with state-of-the-art works and pushes the accuracy of ResNet-18 up to 60.31% under the 2-bit weight and activation post-training quantization.
    Bidirectional Language Models Are Also Few-shot Learners. (arXiv:2209.14500v2 [cs.LG] UPDATED)
    Large language models such as GPT-3 (Brown et al., 2020) can perform arbitrary tasks without undergoing fine-tuning after being prompted with only a few labeled examples. An arbitrary task can be reformulated as a natural language prompt, and a language model can be asked to generate the completion, indirectly performing the task in a paradigm known as prompt-based learning. To date, emergent prompt-based learning capabilities have mainly been demonstrated for unidirectional language models. However, bidirectional language models pre-trained on denoising objectives such as masked language modeling produce stronger learned representations for transfer learning. This motivates the possibility of prompting bidirectional models, but their pre-training objectives have made them largely incompatible with the existing prompting paradigm. We present SAP (Sequential Autoregressive Prompting), a technique that enables the prompting of bidirectional models. Utilizing the machine translation task as a case study, we prompt the bidirectional mT5 model (Xue et al., 2021) with SAP and demonstrate its few-shot and zero-shot translations outperform the few-shot translations of unidirectional models like GPT-3 and XGLM (Lin et al., 2021), despite mT5's approximately 50% fewer parameters. We further show SAP is effective on question answering and summarization. For the first time, our results demonstrate prompt-based learning is an emergent property of a broader class of language models, rather than only unidirectional models.
    Can Stochastic Gradient Langevin Dynamics Provide Differential Privacy for Deep Learning?. (arXiv:2110.05057v5 [cs.LG] UPDATED)
    Bayesian learning via Stochastic Gradient Langevin Dynamics (SGLD) has been suggested for differentially private learning. While previous research provides differential privacy bounds for SGLD at the initial steps of the algorithm or when close to convergence, the question of what differential privacy guarantees can be made in between remains unanswered. This interim region is of great importance, especially for Bayesian neural networks, as it is hard to guarantee convergence to the posterior. This paper shows that using SGLD might result in unbounded privacy loss for this interim region, even when sampling from the posterior is as differentially private as desired.
    On Best-Arm Identification with a Fixed Budget in Non-Parametric Multi-Armed Bandits. (arXiv:2210.00895v2 [cs.LG] UPDATED)
    We lay the foundations of a non-parametric theory of best-arm identification in multi-armed bandits with a fixed budget T. We consider general, possibly non-parametric, models D for distributions over the arms; an overarching example is the model D = P(0,1) of all probability distributions over [0,1]. We propose upper bounds on the average log-probability of misidentifying the optimal arm based on information-theoretic quantities that correspond to infima over Kullback-Leibler divergences between some distributions in D and a given distribution. This is made possible by a refined analysis of the successive-rejects strategy of Audibert, Bubeck, and Munos (2010). We finally provide lower bounds on the same average log-probability, also in terms of the same new information-theoretic quantities; these lower bounds are larger when the (natural) assumptions on the considered strategies are stronger. All these new upper and lower bounds generalize existing bounds based, e.g., on gaps between distributions.
    Adaptive Perturbation-Based Gradient Estimation for Discrete Latent Variable Models. (arXiv:2209.04862v2 [cs.LG] UPDATED)
    The integration of discrete algorithmic components in deep learning architectures has numerous applications. Recently, Implicit Maximum Likelihood Estimation (IMLE, Niepert, Minervini, and Franceschi 2021), a class of gradient estimators for discrete exponential family distributions, was proposed by combining implicit differentiation through perturbation with the path-wise gradient estimator. However, due to the finite difference approximation of the gradients, it is especially sensitive to the choice of the finite difference step size, which needs to be specified by the user. In this work, we present Adaptive IMLE (AIMLE), the first adaptive gradient estimator for complex discrete distributions: it adaptively identifies the target distribution for IMLE by trading off the density of gradient information with the degree of bias in the gradient estimates. We empirically evaluate our estimator on synthetic examples, as well as on Learning to Explain, Discrete Variational Auto-Encoders, and Neural Relational Inference tasks. In our experiments, we show that our adaptive gradient estimator can produce faithful estimates while requiring orders of magnitude fewer samples than other gradient estimators.
    Run-Off Election: Improved Provable Defense against Data Poisoning Attacks. (arXiv:2302.02300v1 [cs.LG])
    In data poisoning attacks, an adversary tries to change a model's prediction by adding, modifying, or removing samples in the training data. Recently, ensemble-based approaches for obtaining provable defenses against data poisoning have been proposed where predictions are done by taking a majority vote across multiple base models. In this work, we show that merely considering the majority vote in ensemble defenses is wasteful as it does not effectively utilize available information in the logits layers of the base models. Instead, we propose Run-Off Election (ROE), a novel aggregation method based on a two-round election across the base models: In the first round, models vote for their preferred class and then a second, Run-Off election is held between the top two classes in the first round. Based on this approach, we propose DPA+ROE and FA+ROE defense methods based on Deep Partition Aggregation (DPA) and Finite Aggregation (FA) approaches from prior work. We show how to obtain robustness for these methods using ideas inspired by dynamic programming and duality. We evaluate our methods on MNIST, CIFAR-10, and GTSRB and obtain improvements in certified accuracy by up to 4.73%, 3.63%, and 3.54%, respectively, establishing a new state-of-the-art in (pointwise) certified robustness against data poisoning. In many cases, our approach outperforms the state-of-the-art, even when using 32 times less computational power.
    Transferable E(3) equivariant parameterization for Hamiltonian of molecules and solids. (arXiv:2210.16190v2 [physics.comp-ph] UPDATED)
    Using the message-passing mechanism in machine learning (ML) instead of self-consistent iterations to directly build the mapping from structures to electronic Hamiltonian matrices will greatly improve the efficiency of density functional theory (DFT) calculations. In this work, we proposed a general analytic Hamiltonian representation in an E(3) equivariant framework, which can fit the ab initio Hamiltonian of molecules and solids by a complete data-driven method and are equivariant under rotation, space inversion, and time reversal operations. Our model reached state-of-the-art precision in the benchmark test and accurately predicted the electronic Hamiltonian matrices and related properties of various periodic and aperiodic systems, showing high transferability and generalization ability. This framework provides a general transferable model that can be used to accelerate the electronic structure calculations on different large systems with the same network weights trained on small structures.
    Representation Learning in Continuous-Time Dynamic Signed Networks. (arXiv:2207.03408v3 [cs.SI] UPDATED)
    Signed networks allow us to model conflicting relationships and interactions, such as friend/enemy and support/oppose. These signed interactions happen in real-time. Modeling such dynamics of signed networks is crucial to understanding the evolution of polarization in the network and enabling effective prediction of the signed structure (i.e., link signs and signed weights) in the future. However, existing works have modeled either (static) signed networks or dynamic (unsigned) networks but not dynamic signed networks. Since both sign and dynamics inform the graph structure in different ways, it is non-trivial to model how to combine the two features. In this work, we propose a new Graph Neural Network (GNN)-based approach to model dynamic signed networks, named SEMBA: Signed link's Evolution using Memory modules and Balanced Aggregation. Here, the idea is to incorporate the signs of temporal interactions using separate modules guided by balance theory and to evolve the embeddings from a higher-order neighborhood. Experiments on 4 real-world datasets and 4 different tasks demonstrate that SEMBA consistently and significantly outperforms the baselines by up to $80\%$ on the tasks of predicting signs of future links while matching the state-of-the-art performance on predicting the existence of these links in the future. We find that this improvement is due specifically to the superior performance of SEMBA on the minority negative class.
    GRANDE: a neural model over directed multigraphs with application to anti-money laundering. (arXiv:2302.02101v1 [cs.LG])
    The application of graph representation learning techniques to the area of financial risk management (FRM) has attracted significant attention recently. However, directly modeling transaction networks using graph neural models remains challenging: Firstly, transaction networks are directed multigraphs by nature, which could not be properly handled with most of the current off-the-shelf graph neural networks (GNN). Secondly, a crucial problem in FRM scenarios like anti-money laundering (AML) is to identify risky transactions and is most naturally cast into an edge classification problem with rich edge-level features, which are not fully exploited by the prevailing GNN design that follows node-centric message passing protocols. In this paper, we present a systematic investigation of design aspects of neural models over directed multigraphs and develop a novel GNN protocol that overcomes the above challenges via efficiently incorporating directional information, as well as proposing an enhancement that targets edge-related tasks using a novel message passing scheme over an extension of edge-to-node dual graph. A concrete GNN architecture called GRANDE is derived using the proposed protocol, with several further improvements and generalizations to temporal dynamic graphs. We apply the GRANDE model to both a real-world anti-money laundering task and public datasets. Experimental evaluations show the superiority of the proposed GRANDE architecture over recent state-of-the-art models on dynamic graph modeling and directed graph modeling.
    ReDi: Efficient Learning-Free Diffusion Inference via Trajectory Retrieval. (arXiv:2302.02285v1 [cs.CV])
    Diffusion models show promising generation capability for a variety of data. Despite their high generation quality, the inference for diffusion models is still time-consuming due to the numerous sampling iterations required. To accelerate the inference, we propose ReDi, a simple yet learning-free Retrieval-based Diffusion sampling framework. From a precomputed knowledge base, ReDi retrieves a trajectory similar to the partially generated trajectory at an early stage of generation, skips a large portion of intermediate steps, and continues sampling from a later step in the retrieved trajectory. We theoretically prove that the generation performance of ReDi is guaranteed. Our experiments demonstrate that ReDi improves the model inference efficiency by 2x speedup. Furthermore, ReDi is able to generalize well in zero-shot cross-domain image generation such as image stylization.
    Polynomial-time sparse measure recovery. (arXiv:2204.07879v3 [cs.LG] UPDATED)
    Many problems in computer science reduce to the recovery of an $n$-sparse measure from its (generalized) moments. Sparse measure recovery has been the research focus in super-resolution, tensor decomposition, and learning neural networks. The existing methods use either convex relaxations or overparameterization for recovery. Here, we propose recovery with non-convex optimization without overparameterization. Our algorithm is a (sub)gradient descent method optimizing a non-convex energy function studied in physics. We establish the global convergence of gradient descent on the energy function. This result enables us to solve super-resolution in $O(n^2)$ time, which significantly improves upon $O(n^3)$ time for solving convex relaxations. For a particular neural network, we prove the global convergence of subgradient descent on the population loss without overparameterization. The studied network has zero-one activations, and inputs drawn from the unit sphere.
    MTEB: Massive Text Embedding Benchmark. (arXiv:2210.07316v2 [cs.CL] UPDATED)
    Text embeddings are commonly evaluated on a small set of datasets from a single task not covering their possible applications to other tasks. It is unclear whether state-of-the-art embeddings on semantic textual similarity (STS) can be equally well applied to other tasks like clustering or reranking. This makes progress in the field difficult to track, as various models are constantly being proposed without proper evaluation. To solve this problem, we introduce the Massive Text Embedding Benchmark (MTEB). MTEB spans 8 embedding tasks covering a total of 58 datasets and 112 languages. Through the benchmarking of 33 models on MTEB, we establish the most comprehensive benchmark of text embeddings to date. We find that no particular text embedding method dominates across all tasks. This suggests that the field has yet to converge on a universal text embedding method and scale it up sufficiently to provide state-of-the-art results on all embedding tasks. MTEB comes with open-source code and a public leaderboard at https://github.com/embeddings-benchmark/mteb.
    NeuRI: Diversifying DNN Generation via Inductive Rule Inference. (arXiv:2302.02261v1 [cs.SE])
    Deep Learning (DL) is prevalently used in various industries to improve decision-making and automate processes, driven by the ever-evolving DL libraries and compilers. The correctness of DL systems is crucial for trust in DL applications. As such, the recent wave of research has been studying the automated synthesis of test-cases (i.e., DNN models and their inputs) for fuzzing DL systems. However, existing model generators only subsume a limited number of operators, for lacking the ability to pervasively model operator constraints. To address this challenge, we propose NeuRI, a fully automated approach for generating valid and diverse DL models composed of hundreds of types of operators. NeuRI adopts a three-step process: (i) collecting valid and invalid API traces from various sources; (ii) applying inductive program synthesis over the traces to infer the constraints for constructing valid models; and (iii) performing hybrid model generation by incorporating both symbolic and concrete operators concolically. Our evaluation shows that NeuRI improves branch coverage of TensorFlow and PyTorch by 51% and 15% over the state-of-the-art. Within four months, NeuRI finds 87 new bugs for PyTorch and TensorFlow, with 64 already fixed or confirmed, and 8 high-priority bugs labeled by PyTorch, constituting 10% of all high-priority bugs of the period. Additionally, open-source developers regard error-inducing models reported by us as "high-quality" and "common in practice".
    Recurrence With Correlation Network for Medical Image Registration. (arXiv:2302.02283v1 [cs.CV])
    We present Recurrence with Correlation Network (RWCNet), a medical image registration network with multi-scale features and a cost volume layer. We demonstrate that these architectural features improve medical image registration accuracy in two image registration datasets prepared for the MICCAI 2022 Learn2Reg Workshop Challenge. On the large-displacement National Lung Screening Test (NLST) dataset, RWCNet is able to achieve a total registration error (TRE) of 2.11mm between corresponding keypoints without instance fine-tuning. On the OASIS brain MRI dataset, RWCNet is able to achieve an average dice overlap of 81.7% for 35 different anatomical labels. It outperforms another multi-scale network, the Laplacian Image Registration Network (LapIRN), on both datasets. Ablation experiments are performed to highlight the contribution of the various architectural features. While multi-scale features improved validation accuracy for both datasets, the cost volume layer and number of recurrent steps only improved performance on the large-displacement NLST dataset. This result suggests that cost volume layer and iterative refinement using RNN provide good support for optimization and generalization in large-displacement medical image registration. The code for RWCNet is available at https://github.com/vigsivan/optimization-based-registration.
    Clustering with Neural Network and Index. (arXiv:2212.03853v2 [cs.LG] UPDATED)
    A new model called Clustering with Neural Network and Index (CNNI) is introduced. CNNI uses a Neural Network to cluster data points. Training of the Neural Network mimics supervised learning, with an internal clustering evaluation index acting as the loss function. An experiment is conducted to test the feasibility of the new model, and compared with results of other clustering models like K-means and Gaussian Mixture Model (GMM). The result shows CNNI can work properly for clustering data; CNNI equipped with MMJ-SC, achieves the first parametric (inductive) clustering model that can deal with non-convex shaped (non-flat geometry) data.
    Decentralized Differentially Private Without-Replacement Stochastic Gradient Descent. (arXiv:1809.02727v4 [cs.LG] UPDATED)
    While machine learning has achieved remarkable results in a wide variety of domains, the training of models often requires large datasets that may need to be collected from different individuals. As sensitive information may be contained in the individual's dataset, sharing training data may lead to severe privacy concerns. Therefore, there is a compelling need to develop privacy-aware machine learning methods, for which one effective approach is to leverage the generic framework of differential privacy. Considering that stochastic gradient descent (SGD) is one of the most commonly adopted methods for large-scale machine learning problems, a decentralized differentially private SGD algorithm is proposed in this work. Particularly, we focus on SGD without replacement due to its favorable structure for practical implementation. Both privacy and convergence analysis are provided for the proposed algorithm. Finally, extensive experiments are performed to demonstrate the effectiveness of the proposed method.
    Input Invex Neural Network. (arXiv:2106.08748v3 [cs.LG] UPDATED)
    Connected decision boundaries are useful in several tasks like image segmentation, clustering, alpha-shape or defining a region in nD-space. However, the machine learning literature lacks methods for generating connected decision boundaries using neural networks. Thresholding an invex function, a generalization of a convex function, generates such decision boundaries. This paper presents two methods for constructing invex functions using neural networks. The first approach is based on constraining a neural network with Gradient Clipped-Gradient Penality (GCGP), where we clip and penalise the gradients. In contrast, the second one is based on the relationship of the invex function to the composition of invertible and convex functions. We employ connectedness as a basic interpretation method and create connected region-based classifiers. We show that multiple connected set based classifiers can approximate any classification function. In the experiments section, we use our methods for classification tasks using an ensemble of 1-vs-all models as well as using a single multiclass model on larger-scale datasets. The experiments show that connected set-based classifiers do not pose any disadvantage over ordinary neural network classifiers, but rather, enhance their interpretability. We also did an extensive study on the properties of invex function and connected sets for interpretability and network morphism with experiments on simulated and real-world data sets. Our study suggests that invex function is fundamental to understanding and applying locality and connectedness of input space which is useful for various downstream tasks.
    Direct Advantage Estimation. (arXiv:2109.06093v3 [cs.LG] UPDATED)
    The predominant approach in reinforcement learning is to assign credit to actions based on the expected return. However, we show that the return may depend on the policy in a way which could lead to excessive variance in value estimation and slow down learning. Instead, we show that the advantage function can be interpreted as causal effects and shares similar properties with causal representations. Based on this insight, we propose Direct Advantage Estimation (DAE), a novel method that can model the advantage function and estimate it directly from on-policy data while simultaneously minimizing the variance of the return without requiring the (action-)value function. We also relate our method to Temporal Difference methods by showing how value functions can be seamlessly integrated into DAE. The proposed method is easy to implement and can be readily adapted by modern actor-critic methods. We evaluate DAE empirically on three discrete control domains and show that it can outperform generalized advantage estimation (GAE), a strong baseline for advantage estimation, on a majority of the environments when applied to policy optimization.
    Translate First Reorder Later: Leveraging Monotonicity in Semantic Parsing. (arXiv:2210.04878v2 [cs.CL] UPDATED)
    Prior work in semantic parsing has shown that conventional seq2seq models fail at compositional generalization tasks. This limitation led to a resurgence of methods that model alignments between sentences and their corresponding meaning representations, either implicitly through latent variables or explicitly by taking advantage of alignment annotations. We take the second direction and propose TPOL, a two-step approach that first translates input sentences monotonically and then reorders them to obtain the correct output. This is achieved with a modular framework comprising a Translator and a Reorderer component. We test our approach on two popular semantic parsing datasets. Our experiments show that by means of the monotonic translations, TPOL can learn reliable lexico-logical patterns from aligned data, significantly improving compositional generalization both over conventional seq2seq models, as well as over other approaches that exploit gold alignments.
    Synthesising Realistic Calcium Traces of Neuronal Populations Using GAN. (arXiv:2009.02707v3 [q-bio.NC] UPDATED)
    Calcium imaging has become a powerful and popular technique to monitor the activity of large populations of neurons in vivo. However, for ethical considerations and despite recent technical developments, recordings are still constrained to a limited number of trials and animals. This limits the amount of data available from individual experiments and hinders the development of analysis techniques and models for more realistic sizes of neuronal populations. The ability to artificially synthesize realistic neuronal calcium signals could greatly alleviate this problem by scaling up the number of trials. Here, we propose a Generative Adversarial Network (GAN) model to generate realistic calcium signals as seen in neuronal somata with calcium imaging. To this end, we propose CalciumGAN, a model based on the WaveGAN architecture and train it on calcium fluorescent signals with the Wasserstein distance. We test the model on artificial data with known ground-truth and show that the distribution of the generated signals closely resembles the underlying data distribution. Then, we train the model on real calcium traces recorded from the primary visual cortex of behaving mice and confirm that the deconvolved spike trains match the statistics of the recorded data. Together, these results demonstrate that our model can successfully generate realistic calcium traces, thereby providing the means to augment existing datasets of neuronal activity for enhanced data exploration and modelling.
    Boosting Exploration in Multi-Task Reinforcement Learning using Adversarial Networks. (arXiv:2201.11783v3 [cs.LG] UPDATED)
    Advancements in reinforcement learning (RL) have been remarkable in recent years. However, the limitations of traditional training methods have become increasingly evident, particularly in meta-RL settings where agents face new, unseen tasks. Conventional training approaches are susceptible to failure in such situations as they need more robustness to adversity. Our proposed adversarial training regime for Multi-Task Reinforcement Learning (MT-RL) addresses the limitations of conventional training methods in RL, especially in meta-RL environments where the agent faces new tasks. The adversarial component challenges the agent, forcing it to improve its decision-making abilities in dynamic and unpredictable situations. This component operates without relying on manual intervention or domain-specific knowledge, making it a highly versatile solution. Experiments conducted in multiple MT-RL environments demonstrate that adversarial training leads to better exploration and a deeper understanding of the environment. The adversarial training regime for MT-RL presents a new perspective on training and development for RL agents and is a valuable contribution to the field.
    Transformation-Based Models of Video Sequences. (arXiv:1701.08435v3 [cs.LG] UPDATED)
    In this work we propose a simple unsupervised approach for next frame prediction in video. Instead of directly predicting the pixels in a frame given past frames, we predict the transformations needed for generating the next frame in a sequence, given the transformations of the past frames. This leads to sharper results, while using a smaller prediction model. In order to enable a fair comparison between different video frame prediction models, we also propose a new evaluation protocol. We use generated frames as input to a classifier trained with ground truth sequences. This criterion guarantees that models scoring high are those producing sequences which preserve discriminative features, as opposed to merely penalizing any deviation, plausible or not, from the ground truth. Our proposed approach compares favourably against more sophisticated ones on the UCF-101 data set, while also being more efficient in terms of the number of parameters and computational cost.
    Multiscale Graph Comparison via the Embedded Laplacian Discrepancy. (arXiv:2201.12064v2 [stat.ML] UPDATED)
    Laplacian eigenvectors capture natural community structures on graphs and are widely used in spectral clustering and manifold learning. The use of Laplacian eigenvectors as embeddings for the purpose of multiscale graph comparison has however been limited. Here we propose the Embedded Laplacian Discrepancy (ELD) as a simple and fast approach to compare graphs (of potentially different sizes) based on the similarity of the graphs' community structures. The ELD operates by representing graphs as point clouds in a common, low-dimensional space, on which a natural Wasserstein-based distance can be efficiently computed. A main challenge in comparing graphs through any eigenvector-based approaches is the potential ambiguity that could arise due to sign-flips and basis symmetries. The ELD leverages a simple symmetrization trick to bypass any sign ambiguities. For comparing graphs that do not have any ambiguities due to basis symmetries (i.e. the spectrums are simple), we show that the ELD becomes a natural pseudo-metric that enjoys nice properties such as invariance under graph isomorphism. For comparing graphs with non-simple spectrums, we propose a procedure to approximate the ELD via a simple perturbation technique to resolve any ambiguity from basis symmetries. We show that such perturbations are stable using matrix perturbation theory under mild assumptions that are straightforward to verify in practice. We demonstrate the excellent applicability of the ELD approach on both simulated and real datasets.
    Strong Consistency and Rate of Convergence of Switched Least Squares System Identification for Autonomous Markov Jump Linear Systems. (arXiv:2112.10753v2 [cs.LG] UPDATED)
    In this paper, we investigate the problem of system identification for autonomous Markov jump linear systems (MJS) with complete state observations. We propose switched least squares method for identification of MJS, show that this method is strongly consistent, and derive data-dependent and data-independent rates of convergence. In particular, our data-independent rate of convergence shows that, almost surely, the system identification error is $\mathcal{O}\big(\sqrt{\log(T)/T} \big)$ where $T$ is the time horizon. These results show that switched least squares method for MJS has the same rate of convergence as least squares method for autonomous linear systems. We derive our results by imposing a general stability assumption on the model called stability in the average sense. We show that stability in the average sense is a weaker form of stability compared to the stability assumptions commonly imposed in the literature. We present numerical examples to illustrate the performance of the proposed method.
    IGRF-RFE: A Hybrid Feature Selection Method for MLP-based Network Intrusion Detection on UNSW-NB15 Dataset. (arXiv:2203.16365v2 [cs.LG] UPDATED)
    The effectiveness of machine learning models is significantly affected by the size of the dataset and the quality of features as redundant and irrelevant features can radically degrade the performance. This paper proposes IGRF-RFE: a hybrid feature selection method tasked for multi-class network anomalies using a Multilayer perceptron (MLP) network. IGRF-RFE can be considered as a feature reduction technique based on both the filter feature selection method and the wrapper feature selection method. In our proposed method, we use the filter feature selection method, which is the combination of Information Gain and Random Forest Importance, to reduce the feature subset search space. Then, we apply recursive feature elimination(RFE) as a wrapper feature selection method to further eliminate redundant features recursively on the reduced feature subsets. Our experimental results obtained based on the UNSW-NB15 dataset confirm that our proposed method can improve the accuracy of anomaly detection while reducing the feature dimension. The results show that the feature dimension is reduced from 42 to 23 while the multi-classification accuracy of MLP is improved from 82.25% to 84.24%.
    Open Problems and Modern Solutions for Deep Reinforcement Learning. (arXiv:2302.02298v1 [cs.LG])
    Deep Reinforcement Learning (DRL) has achieved great success in solving complicated decision-making problems. Despite the successes, DRL is frequently criticized for many reasons, e.g., data inefficient, inflexible and intractable reward design. In this paper, we review two publications that investigate the mentioned issues of DRL and propose effective solutions. One designs the reward for human-robot collaboration by combining the manually designed extrinsic reward with a parameterized intrinsic reward function via the deterministic policy gradient, which improves the task performance and guarantees a stronger obstacle avoidance. The other one applies selective attention and particle filters to rapidly and flexibly attend to and select crucial pre-learned features for DRL using approximate inference instead of backpropagation, thereby improving the efficiency and flexibility of DRL. Potential avenues for future work in both domains are discussed in this paper.
    Approximate Newton policy gradient algorithms. (arXiv:2110.02398v4 [cs.LG] UPDATED)
    Policy gradient algorithms have been widely applied to Markov decision processes and reinforcement learning problems in recent years. Regularization with various entropy functions is often used to encourage exploration and improve stability. This paper proposes an approximate Newton method for the policy gradient algorithm with entropy regularization. In the case of Shannon entropy, the resulting algorithm reproduces the natural policy gradient algorithm. For other entropy functions, this method results in brand-new policy gradient algorithms. We prove that all these algorithms enjoy Newton-type quadratic convergence and that the corresponding gradient flow converges globally to the optimal solution. Using synthetic and industrial-scale examples, we demonstrate that the proposed approximate Newton method typically converges in single-digit iterations, often orders of magnitude faster than other state-of-the-art algorithms.
    A Theoretical Framework for AI Models Explainability. (arXiv:2212.14447v3 [cs.AI] UPDATED)
    EXplainable Artificial Intelligence (XAI) is a vibrant research topic in the artificial intelligence community, with growing interest across methods and domains. Much has been written about the subject, yet XAI still lacks shared terminology and a framework capable of providing structural soundness to explanations. In our work, we address these issues by proposing a novel definition of explanation that is a synthesis of what can be found in the literature. We recognize that explanations are not atomic but the combination of evidence stemming from the model and its input-output mapping, and the human interpretation of this evidence. Furthermore, we fit explanations into the properties of faithfulness (i.e., the explanation being a true description of the model's inner workings and decision-making process) and plausibility (i.e., how much the explanation looks convincing to the user). Using our proposed theoretical framework simplifies how these properties are operationalized and it provides new insight into common explanation methods that we analyze as case studies.
    Joint Reasoning on Hybrid-knowledge sources for Task-Oriented Dialog. (arXiv:2210.07295v2 [cs.CL] UPDATED)
    Traditional systems designed for task oriented dialog utilize knowledge present only in structured knowledge sources to generate responses. However, relevant information required to generate responses may also reside in unstructured sources, such as documents. Recent state of the art models such as HyKnow and SeKnow aimed at overcoming these challenges make limiting assumptions about the knowledge sources. For instance, these systems assume that certain types of information, such as a phone number, is always present in a structured knowledge base (KB) while information about aspects such as entrance ticket prices, would always be available in documents. In this paper, we create a modified version of the MutliWOZ-based dataset prepared by SeKnow to demonstrate how current methods have significant degradation in performance when strict assumptions about the source of information are removed. Then, in line with recent work exploiting pre-trained language models, we fine-tune a BART based model using prompts for the tasks of querying knowledge sources, as well as, for response generation, without making assumptions about the information present in each knowledge source. Through a series of experiments, we demonstrate that our model is robust to perturbations to knowledge modality (source of information), and that it can fuse information from structured as well as unstructured knowledge to generate responses.
    Adversarial Bandits with Knapsacks. (arXiv:1811.11881v10 [cs.DS] UPDATED)
    We consider Bandits with Knapsacks (henceforth, BwK), a general model for multi-armed bandits under supply/budget constraints. In particular, a bandit algorithm needs to solve a well-known knapsack problem: find an optimal packing of items into a limited-size knapsack. The BwK problem is a common generalization of numerous motivating examples, which range from dynamic pricing to repeated auctions to dynamic ad allocation to network routing and scheduling. While the prior work on BwK focused on the stochastic version, we pioneer the other extreme in which the outcomes can be chosen adversarially. This is a considerably harder problem, compared to both the stochastic version and the "classic" adversarial bandits, in that regret minimization is no longer feasible. Instead, the objective is to minimize the competitive ratio: the ratio of the benchmark reward to the algorithm's reward. We design an algorithm with competitive ratio O(log T) relative to the best fixed distribution over actions, where T is the time horizon; we also prove a matching lower bound. The key conceptual contribution is a new perspective on the stochastic version of the problem. We suggest a new algorithm for the stochastic version, which builds on the framework of regret minimization in repeated games and admits a substantially simpler analysis compared to prior work. We then analyze this algorithm for the adversarial version and use it as a subroutine to solve the latter.
    DeepPSL: End-to-end perception and reasoning. (arXiv:2109.13662v4 [eess.SY] UPDATED)
    We introduce DeepPSL a variant of probabilistic soft logic (PSL) to produce an end-to-end trainable system that integrates reasoning and perception. PSL represents first-order logic in terms of a convex graphical model -- hinge-loss Markov random fields (HL-MRFs). PSL stands out among probabilistic logic frameworks due to its tractability having been applied to systems of more than 1 billion ground rules. The key to our approach is to represent predicates in first-order logic using deep neural networks and then to approximately back-propagate through the HL-MRF and thus train every aspect of the first-order system being represented. We believe that this approach represents an interesting direction for the integration of deep learning and reasoning techniques with applications to knowledge base learning, multi-task learning, and explainability. Evaluation on three different tasks demonstrates that DeepPSL significantly outperforms state-of-the-art neuro-symbolic methods on scalability while achieving comparable or better accuracy.
    Random Forest Weighted Local Fr\'echet Regression. (arXiv:2202.04912v2 [stat.ML] UPDATED)
    Statistical analysis is increasingly confronted with complex data from metric spaces. Petersen and M\"uller (2019) established a general paradigm of Fr\'echet regression with complex metric space valued responses and Euclidean predictors. However, the local approach therein involves nonparametric kernel smoothing and suffers from the curse of dimensionality. To address this issue, we in this paper propose a novel random forest weighted local Fr\'echet regression paradigm. The main mechanism of our approach relies on a locally adaptive kernel generated by random forests. Our first method utilizes these weights as the local average to solve the conditional Fr\'echet mean, while the second method performs local linear Fr\'echet regression, both significantly improving existing Fr\'echet regression methods. Based on the theory of infinite order U-processes and infinite order Mmn -estimator, we establish the consistency, rate of convergence, and asymptotic normality for our local constant estimator, which covers the current large sample theory of random forests with Euclidean responses as a special case. Numerical studies show the superiority of our methods with several commonly encountered types of responses such as distribution functions, symmetric positive-definite matrices, and sphere data. The practical merits of our proposals are also demonstrated through the application to human mortality distribution data.
    PIXEL: Physics-Informed Cell Representations for Fast and Accurate PDE Solvers. (arXiv:2207.12800v2 [cs.LG] UPDATED)
    With the increases in computational power and advances in machine learning, data-driven learning-based methods have gained significant attention in solving PDEs. Physics-informed neural networks (PINNs) have recently emerged and succeeded in various forward and inverse PDE problems thanks to their excellent properties, such as flexibility, mesh-free solutions, and unsupervised training. However, their slower convergence speed and relatively inaccurate solutions often limit their broader applicability in many science and engineering domains. This paper proposes a new kind of data-driven PDEs solver, physics-informed cell representations (PIXEL), elegantly combining classical numerical methods and learning-based approaches. We adopt a grid structure from the numerical methods to improve accuracy and convergence speed and overcome the spectral bias presented in PINNs. Moreover, the proposed method enjoys the same benefits in PINNs, e.g., using the same optimization frameworks to solve both forward and inverse PDE problems and readily enforcing PDE constraints with modern automatic differentiation techniques. We provide experimental results on various challenging PDEs that the original PINNs have struggled with and show that PIXEL achieves fast convergence speed and high accuracy. Project page: https://namgyukang.github.io/PIXEL/
    PGNAA Spectral Classification of Metal with Density Estimations. (arXiv:2208.13836v2 [cs.LG] UPDATED)
    For environmental, sustainable economic and political reasons, recycling processes are becoming increasingly important, aiming at a much higher use of secondary raw materials. Currently, for the copper and aluminium industries, no method for the non-destructive online analysis of heterogeneous materials are available. The Prompt Gamma Neutron Activation Analysis (PGNAA) has the potential to overcome this challenge. A difficulty when using PGNAA for online classification arises from the small amount of noisy data, due to short-term measurements. In this case, classical evaluation methods using detailed peak by peak analysis fail. Therefore, we propose to view spectral data as probability distributions. Then, we can classify material using maximum log-likelihood with respect to kernel density estimation and use discrete sampling to optimize hyperparameters. For measurements of pure aluminium alloys we achieve near perfect classification of aluminium alloys under 0.25 second.
    Performance and utility trade-off in interpretable sleep staging. (arXiv:2211.03282v3 [eess.SP] UPDATED)
    Recent advances in deep learning have led to the development of models approaching the human level of accuracy. However, healthcare remains an area lacking in widespread adoption. The safety-critical nature of healthcare results in a natural reticence to put these black-box deep learning models into practice. This paper explores interpretable methods for a clinical decision support system called sleep staging, an essential step in diagnosing sleep disorders. Clinical sleep staging is an arduous process requiring manual annotation for each 30s of sleep using physiological signals such as electroencephalogram (EEG). Recent work has shown that sleep staging using simple models and an exhaustive set of features can perform nearly as well as deep learning approaches but only for some specific datasets. Moreover, the utility of those features from a clinical standpoint is ambiguous. On the other hand, the proposed framework, NormIntSleep demonstrates exceptional performance across different datasets by representing deep learning embeddings using normalized features. NormIntSleep performs 4.5% better than the exhaustive feature-based approach and 1.5% better than other representation learning approaches. An empirical comparison between the utility of the interpretations of these models highlights the improved alignment with clinical expectations when performance is traded-off slightly. NormIntSleep paired with a clinically meaningful set of features can best balance this trade-off by providing reliable, clinically relevant interpretation with robust performance.
    Learning Variational Models with Unrolling and Bilevel Optimization. (arXiv:2209.12651v3 [stat.ML] UPDATED)
    In this paper we consider the problem learning of variational models in the context of supervised learning via risk minimization. Our goal is to provide a deeper understanding of the two approaches of learning of variational models via bilevel optimization and via algorithm unrolling. The former considers the variational model as a lower level optimization problem below the risk minimization problem, while the latter replaces the lower level optimization problem by an algorithm that solves said problem approximately. Both approaches are used in practice, but, unrolling is much simpler from a computational point of view. To analyze and compare the two approaches, we consider a simple toy model, and compute all risks and the respective estimators explicitly. We show that unrolling can be better than the bilevel optimization approach, but also that the performance of unrolling can depend significantly on further parameters, sometimes in unexpected ways: While the stepsize of the unrolled algorithm matters a lot, the number of unrolled iterations only matters if the number is even or odd, and these two cases are notably different.
    Fruit Ripeness Classification: a Survey. (arXiv:2212.14441v2 [cs.CV] UPDATED)
    Fruit is a key crop in worldwide agriculture feeding millions of people. The standard supply chain of fruit products involves quality checks to guarantee freshness, taste, and, most of all, safety. An important factor that determines fruit quality is its stage of ripening. This is usually manually classified by experts in the field, which makes it a labor-intensive and error-prone process. Thus, there is an arising need for automation in the process of fruit ripeness classification. Many automatic methods have been proposed that employ a variety of feature descriptors for the food item to be graded. Machine learning and deep learning techniques dominate the top-performing methods. Furthermore, deep learning can operate on raw data and thus relieve the users from having to compute complex engineered features, which are often crop-specific. In this survey, we review the latest methods proposed in the literature to automatize fruit ripeness classification, highlighting the most common feature descriptors they operate on.
    Side Effects of Learning from Low-dimensional Data Embedded in a Euclidean Space. (arXiv:2203.00614v5 [cs.LG] UPDATED)
    The low-dimensional manifold hypothesis posits that the data found in many applications, such as those involving natural images, lie (approximately) on low-dimensional manifolds embedded in a high-dimensional Euclidean space. In this setting, a typical neural network defines a function that takes a finite number of vectors in the embedding space as input. However, one often needs to consider evaluating the optimized network at points outside the training distribution. This paper considers the case in which the training data is distributed in a linear subspace of $\mathbb R^d$. We derive estimates on the variation of the learning function, defined by a neural network, in the direction transversal to the subspace. We study the potential regularization effects associated with the network's depth and noise in the codimension of the data manifold. We also present additional side effects in training due to the presence of noise.
    Physics-Guided, Physics-Informed, and Physics-Encoded Neural Networks in Scientific Computing. (arXiv:2211.07377v2 [cs.LG] UPDATED)
    Recent breakthroughs in computing power have made it feasible to use machine learning and deep learning to advance scientific computing in many fields, including fluid mechanics, solid mechanics, materials science, etc. Neural networks, in particular, play a central role in this hybridization. Due to their intrinsic architecture, conventional neural networks cannot be successfully trained and scoped when data is sparse, which is the case in many scientific and engineering domains. Nonetheless, neural networks provide a solid foundation to respect physics-driven or knowledge-based constraints during training. Generally speaking, there are three distinct neural network frameworks to enforce the underlying physics: (i) physics-guided neural networks (PgNNs), (ii) physics-informed neural networks (PiNNs), and (iii) physics-encoded neural networks (PeNNs). These methods provide distinct advantages for accelerating the numerical modeling of complex multiscale multi-physics phenomena. In addition, the recent developments in neural operators (NOs) add another dimension to these new simulation paradigms, especially when the real-time prediction of complex multi-physics systems is required. All these models also come with their own unique drawbacks and limitations that call for further fundamental research. This study aims to present a review of the four neural network frameworks (i.e., PgNNs, PiNNs, PeNNs, and NOs) used in scientific computing research. The state-of-the-art architectures and their applications are reviewed, limitations are discussed, and future research opportunities in terms of improving algorithms, considering causalities, expanding applications, and coupling scientific and deep learning solvers are presented. This critical review provides researchers and engineers with a solid starting point to comprehend how to integrate different layers of physics into neural networks.
    Convolutional Neural Generative Coding: Scaling Predictive Coding to Natural Images. (arXiv:2211.12047v2 [cs.CV] UPDATED)
    In this work, we develop convolutional neural generative coding (Conv-NGC), a generalization of predictive coding to the case of convolution/deconvolution-based computation. Specifically, we concretely implement a flexible neurobiologically-motivated algorithm that progressively refines latent state feature maps in order to dynamically form a more accurate internal representation/reconstruction model of natural images. The performance of the resulting sensory processing system is evaluated on complex datasets such as Color-MNIST, CIFAR-10, and Street House View Numbers (SVHN). We study the effectiveness of our brain-inspired model on the tasks of reconstruction and image denoising and find that it is competitive with convolutional auto-encoding systems trained by backpropagation of errors and outperforms them with respect to out-of-distribution reconstruction (including the full 90k CINIC-10 test set).
    Topology-aware Generalization of Decentralized SGD. (arXiv:2206.12680v4 [cs.LG] UPDATED)
    This paper studies the algorithmic stability and generalizability of decentralized stochastic gradient descent (D-SGD). We prove that the consensus model learned by D-SGD is $\mathcal{O}{(N^{-1}+m^{-1} +\lambda^2)}$-stable in expectation in the non-convex non-smooth setting, where $N$ is the total sample size, $m$ is the worker number, and $1+\lambda$ is the spectral gap that measures the connectivity of the communication topology. These results then deliver an $\mathcal{O}{(N^{-(1+\alpha)/2}+ m^{-(1+\alpha)/2}+\lambda^{1+\alpha} + \phi_{\mathcal{S}})}$ in-average generalization bound, which is non-vacuous even when $\lambda$ is closed to $1$, in contrast to vacuous as suggested by existing literature on the projected version of D-SGD. Our theory indicates that the generalizability of D-SGD is positively correlated with the spectral gap, and can explain why consensus control in initial training phase can ensure better generalization. Experiments of VGG-11 and ResNet-18 on CIFAR-10, CIFAR-100 and Tiny-ImageNet justify our theory. To our best knowledge, this is the first work on the topology-aware generalization of vanilla D-SGD. Code is available at https://github.com/Raiden-Zhu/Generalization-of-DSGD.
    CHIMLE: Conditional Hierarchical IMLE for Multimodal Conditional Image Synthesis. (arXiv:2211.14286v2 [cs.CV] UPDATED)
    A persistent challenge in conditional image synthesis has been to generate diverse output images from the same input image despite only one output image being observed per input image. GAN-based methods are prone to mode collapse, which leads to low diversity. To get around this, we leverage Implicit Maximum Likelihood Estimation (IMLE) which can overcome mode collapse fundamentally. IMLE uses the same generator as GANs but trains it with a different, non-adversarial objective which ensures each observed image has a generated sample nearby. Unfortunately, to generate high-fidelity images, prior IMLE-based methods require a large number of samples, which is expensive. In this paper, we propose a new method to get around this limitation, which we dub Conditional Hierarchical IMLE (CHIMLE), which can generate high-fidelity images without requiring many samples. We show CHIMLE significantly outperforms the prior best IMLE, GAN and diffusion-based methods in terms of image fidelity and mode coverage across four tasks, namely night-to-day, 16x single image super-resolution, image colourization and image decompression. Quantitatively, our method improves Fr\'echet Inception Distance (FID) by 36.9% on average compared to the prior best IMLE-based method, and by 27.5% on average compared to the best non-IMLE-based general-purpose methods.
    Real-to-Sim: Predicting Residual Errors of Robotic Systems with Sparse Data using a Learning-based Unscented Kalman Filter. (arXiv:2209.03210v2 [cs.RO] UPDATED)
    Achieving highly accurate dynamic or simulator models that are close to the real robot can facilitate model-based controls (e.g., model predictive control or linear-quadradic regulators), model-based trajectory planning (e.g., trajectory optimization), and decrease the amount of learning time necessary for reinforcement learning methods. Thus, the objective of this work is to learn the residual errors between a dynamic and/or simulator model and the real robot. This is achieved using a neural network, where the parameters of a neural network are updated through an Unscented Kalman Filter (UKF) formulation. Using this method, we model these residual errors with only small amounts of data -- a necessity as we improve the simulator/dynamic model by learning directly from real-world operation. We demonstrate our method on robotic hardware (e.g., manipulator arm, and a wheeled robot), and show that with the learned residual errors, we can further close the reality gap between dynamic models, simulations, and actual hardware.
    The Dual PC Algorithm and the Role of Gaussianity for Structure Learning of Bayesian Networks. (arXiv:2112.09036v4 [stat.ML] UPDATED)
    Learning the graphical structure of Bayesian networks is key to describing data-generating mechanisms in many complex applications but poses considerable computational challenges. Observational data can only identify the equivalence class of the directed acyclic graph underlying a Bayesian network model, and a variety of methods exist to tackle the problem. Under certain assumptions, the popular PC algorithm can consistently recover the correct equivalence class by reverse-engineering the conditional independence (CI) relationships holding in the variable distribution. The dual PC algorithm is a novel scheme to carry out the CI tests within the PC algorithm by leveraging the inverse relationship between covariance and precision matrices. By exploiting block matrix inversions we can simultaneously perform tests on partial correlations of complementary (or dual) conditioning sets. The multiple CI tests of the dual PC algorithm proceed by first considering marginal and full-order CI relationships and progressively moving to central-order ones. Simulation studies show that the dual PC algorithm outperforms the classic PC algorithm both in terms of run time and in recovering the underlying network structure, even in the presence of deviations from Gaussianity. Additionally, we show that the dual PC algorithm applies for Gaussian copula models, and demonstrate its performance in that setting.
    Asymmetric Certified Robustness via Feature-Convex Neural Networks. (arXiv:2302.01961v1 [cs.LG])
    Recent works have introduced input-convex neural networks (ICNNs) as learning models with advantageous training, inference, and generalization properties linked to their convex structure. In this paper, we propose a novel feature-convex neural network architecture as the composition of an ICNN with a Lipschitz feature map in order to achieve adversarial robustness. We consider the asymmetric binary classification setting with one "sensitive" class, and for this class we prove deterministic, closed-form, and easily-computable certified robust radii for arbitrary $\ell_p$-norms. We theoretically justify the use of these models by characterizing their decision region geometry, extending the universal approximation theorem for ICNN regression to the classification setting, and proving a lower bound on the probability that such models perfectly fit even unstructured uniformly distributed data in sufficiently high dimensions. Experiments on Malimg malware classification and subsets of MNIST, Fashion-MNIST, and CIFAR-10 datasets show that feature-convex classifiers attain state-of-the-art certified $\ell_1$-radii as well as substantial $\ell_2$- and $\ell_{\infty}$-radii while being far more computationally efficient than any competitive baseline.
    Dynamical Equations With Bottom-up Self-Organizing Properties Learn Accurate Dynamical Hierarchies Without Any Loss Function. (arXiv:2302.02140v1 [cs.AI])
    Self-organization is ubiquitous in nature and mind. However, machine learning and theories of cognition still barely touch the subject. The hurdle is that general patterns are difficult to define in terms of dynamical equations and designing a system that could learn by reordering itself is still to be seen. Here, we propose a learning system, where patterns are defined within the realm of nonlinear dynamics with positive and negative feedback loops, allowing attractor-repeller pairs to emerge for each pattern observed. Experiments reveal that such a system can map temporal to spatial correlation, enabling hierarchical structures to be learned from sequential data. The results are accurate enough to surpass state-of-the-art unsupervised learning algorithms in seven out of eight experiments as well as two real-world problems. Interestingly, the dynamic nature of the system makes it inherently adaptive, giving rise to phenomena similar to phase transitions in chemistry/thermodynamics when the input structure changes. Thus, the work here sheds light on how self-organization can allow for pattern recognition and hints at how intelligent behavior might emerge from simple dynamic equations without any objective/loss function.
    PandA: Unsupervised Learning of Parts and Appearances in the Feature Maps of GANs. (arXiv:2206.00048v2 [cs.CV] UPDATED)
    Recent advances in the understanding of Generative Adversarial Networks (GANs) have led to remarkable progress in visual editing and synthesis tasks, capitalizing on the rich semantics that are embedded in the latent spaces of pre-trained GANs. However, existing methods are often tailored to specific GAN architectures and are limited to either discovering global semantic directions that do not facilitate localized control, or require some form of supervision through manually provided regions or segmentation masks. In this light, we present an architecture-agnostic approach that jointly discovers factors representing spatial parts and their appearances in an entirely unsupervised fashion. These factors are obtained by applying a semi-nonnegative tensor factorization on the feature maps, which in turn enables context-aware local image editing with pixel-level control. In addition, we show that the discovered appearance factors correspond to saliency maps that localize concepts of interest, without using any labels. Experiments on a wide range of GAN architectures and datasets show that, in comparison to the state of the art, our method is far more efficient in terms of training time and, most importantly, provides much more accurate localized control. Our code is available at: https://github.com/james-oldfield/PandA.
    PubGraph: A Large Scale Scientific Temporal Knowledge Graph. (arXiv:2302.02231v1 [cs.AI])
    Research publications are the primary vehicle for sharing scientific progress in the form of new discoveries, methods, techniques, and insights. Publications have been studied from the perspectives of both content analysis and bibliometric structure, but a barrier to more comprehensive studies of scientific research is a lack of publicly accessible large-scale data and resources. In this paper, we present PubGraph, a new resource for studying scientific progress that takes the form of a large-scale temporal knowledge graph (KG). It contains more than 432M nodes and 15.49B edges mapped to the popular Wikidata ontology. We extract three KGs with varying sizes from PubGraph to allow experimentation at different scales. Using these KGs, we introduce a new link prediction benchmark for transductive and inductive settings with temporally-aligned training, validation, and testing partitions. Moreover, we develop two new inductive learning methods better suited to PubGraph, operating on unseen nodes without explicit features, scaling to large KGs, and outperforming existing models. Our results demonstrate that structural features of past citations are sufficient to produce high-quality predictions about new publications. We also identify new challenges for KG models, including an adversarial community-based link prediction setting, zero-shot inductive learning, and large-scale learning.
    A Singular Woodbury and Pseudo-Determinant Matrix Identities and Application to Gaussian Process Regression. (arXiv:2207.08038v2 [math.ST] UPDATED)
    We study a matrix that arises from a singular form of the Woodbury matrix identity. We present generalized inverse and pseudo-determinant identities for this matrix, which have direct applications for Gaussian process regression, specifically its likelihood representation and precision matrix. We extend the definition of the precision matrix to the Bott-Duffin inverse of the covariance matrix, preserving properties related to conditional independence, conditional precision, and marginal precision. We also provide an efficient algorithm and numerical analysis for the presented determinant identities and demonstrate their advantages under specific conditions relevant to computing log-determinant terms in likelihood functions of Gaussian process regression.
    Improved Regret Analysis for Variance-Adaptive Linear Bandits and Horizon-Free Linear Mixture MDPs. (arXiv:2111.03289v4 [stat.ML] UPDATED)
    In online learning problems, exploiting low variance plays an important role in obtaining tight performance guarantees yet is challenging because variances are often not known a priori. Recently, considerable progress has been made by Zhang et al. (2021) where they obtain a variance-adaptive regret bound for linear bandits without knowledge of the variances and a horizon-free regret bound for linear mixture Markov decision processes (MDPs). In this paper, we present novel analyses that improve their regret bounds significantly. For linear bandits, we achieve $\tilde O(\min\{d\sqrt{K}, d^{1.5}\sqrt{\sum_{k=1}^K \sigma_k^2}\} + d^2)$ where $d$ is the dimension of the features, $K$ is the time horizon, and $\sigma_k^2$ is the noise variance at time step $k$, and $\tilde O$ ignores polylogarithmic dependence, which is a factor of $d^3$ improvement. For linear mixture MDPs with the assumption of maximum cumulative reward in an episode being in $[0,1]$, we achieve a horizon-free regret bound of $\tilde O(d \sqrt{K} + d^2)$ where $d$ is the number of base models and $K$ is the number of episodes. This is a factor of $d^{3.5}$ improvement in the leading term and $d^7$ in the lower order term. Our analysis critically relies on a novel peeling-based regret analysis that leverages the elliptical potential `count' lemma.
    Inorganic synthesis recommendation by machine learning materials similarity from scientific literature. (arXiv:2302.02303v1 [cond-mat.mtrl-sci])
    Synthesis prediction is a key accelerator for the rapid design of advanced materials. However, determining synthesis variables such as the choice of precursor materials, operations, and conditions is challenging for inorganic materials because the sequence of reactions during heating is not well understood. In this work, we use a knowledge base of 29,900 solid-state synthesis recipes, text-mined from the scientific literature, to automatically learn which precursors to recommend for the synthesis of a novel target material. The data-driven approach learns chemical similarity of materials and refers the synthesis of a new target to precedent synthesis procedures of similar materials, mimicking human synthesis design. When proposing five precursor sets for each of 2,654 unseen test target materials, the recommendation strategy achieves a success rate of at least 82%. Our approach captures decades of heuristic synthesis data in a mathematical form, making it accessible for use in recommendation engines and autonomous laboratories.
    Temporally Layered Architecture for Adaptive, Distributed and Continuous Control. (arXiv:2301.00723v2 [cs.NE] UPDATED)
    We present temporally layered architecture (TLA), a biologically inspired system for temporally adaptive distributed control. TLA layers a fast and a slow controller together to achieve temporal abstraction that allows each layer to focus on a different time-scale. Our design is biologically inspired and draws on the architecture of the human brain which executes actions at different timescales depending on the environment's demands. Such distributed control design is widespread across biological systems because it increases survivability and accuracy in certain and uncertain environments. We demonstrate that TLA can provide many advantages over existing approaches, including persistent exploration, adaptive control, explainable temporal behavior, compute efficiency and distributed control. We present two different algorithms for training TLA: (a) Closed-loop control, where the fast controller is trained over a pre-trained slow controller, allowing better exploration for the fast controller and closed-loop control where the fast controller decides whether to "act-or-not" at each timestep; and (b) Partially open loop control, where the slow controller is trained over a pre-trained fast controller, allowing for open loop-control where the slow controller picks a temporally extended action or defers the next n-actions to the fast controller. We evaluated our method on a suite of continuous control tasks and demonstrate the advantages of TLA over several strong baselines.
    Learning to Shape Rewards using a Game of Two Partners. (arXiv:2103.09159v5 [cs.LG] UPDATED)
    Reward shaping (RS) is a powerful method in reinforcement learning (RL) for overcoming the problem of sparse or uninformative rewards. However, RS typically relies on manually engineered shaping-reward functions whose construction is time-consuming and error-prone. It also requires domain knowledge which runs contrary to the goal of autonomous learning. We introduce Reinforcement Learning Optimising Shaping Algorithm (ROSA), an automated reward shaping framework in which the shaping-reward function is constructed in a Markov game between two agents. A reward-shaping agent (Shaper) uses switching controls to determine which states to add shaping rewards for more efficient learning while the other agent (Controller) learns the optimal policy for the task using these shaped rewards. We prove that ROSA, which adopts existing RL algorithms, learns to construct a shaping-reward function that is beneficial to the task thus ensuring efficient convergence to high performance policies. We demonstrate ROSA's properties in three didactic experiments and show its superior performance against state-of-the-art RS algorithms in challenging sparse reward environments.
    Certified Robust Control under Adversarial Perturbations. (arXiv:2302.02208v1 [cs.LG])
    Autonomous systems increasingly rely on machine learning techniques to transform high-dimensional raw inputs into predictions that are then used for decision-making and control. However, it is often easy to maliciously manipulate such inputs and, as a result, predictions. While effective techniques have been proposed to certify the robustness of predictions to adversarial input perturbations, such techniques have been disembodied from control systems that make downstream use of the predictions. We propose the first approach for composing robustness certification of predictions with respect to raw input perturbations with robust control to obtain certified robustness of control to adversarial input perturbations. We use a case study of adaptive vehicle control to illustrate our approach and show the value of the resulting end-to-end certificates through extensive experiments.
    Koopman Operator Learning: Sharp Spectral Rates and Spurious Eigenvalues. (arXiv:2302.02004v1 [cs.LG])
    Non-linear dynamical systems can be handily described by the associated Koopman operator, whose action evolves every observable of the system forward in time. Learning the Koopman operator from data is enabled by a number of algorithms. In this work we present nonasymptotic learning bounds for the Koopman eigenvalues and eigenfunctions estimated by two popular algorithms: Extended Dynamic Mode Decomposition (EDMD) and Reduced Rank Regression (RRR). We focus on time-reversal-invariant Markov chains, implying that the Koopman operator is self-adjoint. This includes important examples of stochastic dynamical systems, notably Langevin dynamics. Our spectral learning bounds are driven by the simultaneous control of the operator norm risk of the estimators and a metric distortion associated to the corresponding eigenfunctions. Our analysis indicates that both algorithms have similar variance, but EDMD suffers from a larger bias which might be detrimental to its learning rate. We further argue that a large metric distortion may lead to spurious eigenvalues, a phenomenon which has been empirically observed, and note that metric distortion can be estimated from data. Numerical experiments complement the theoretical findings.
    Towards Improving the Generation Quality of Autoregressive Slot VAEs. (arXiv:2206.01370v2 [cs.CV] UPDATED)
    Unconditional scene inference and generation are challenging to learn jointly with a single compositional model. Despite encouraging progress on models that extract object-centric representations ("slots") from images, unconditional generation of scenes from slots has received less attention. This is primarily because learning the multi-object relations necessary to imagine coherent scenes is difficult. We hypothesize that most existing slot-based models have a limited ability to learn object correlations. We propose two improvements that strengthen slot correlation learning. The first is to condition the slots on a global, scene-level variable that captures higher-order correlations between slots. Second, we address the fundamental lack of a canonical order for objects by proposing to learn a consistent order to use for the autoregressive generation of scene objects. Specifically, we train an autoregressive slot prior to sequentially generate scene objects following the learned order. Slot inference entails estimating a randomly ordered set of slots using existing approaches for extracting slots from images, then aligning those slots to ordered slots generated autoregressively with the prior. Our experiments across three multi-object environments demonstrate clear gains in scene generation quality. Detailed ablation studies are also provided that validate the two proposed improvements.
    Sequential pattern mining in educational data: The application context, potential, strengths, and limitations. (arXiv:2302.01932v1 [cs.LG])
    Increasingly, researchers have suggested the benefits of temporal analysis to improve our understanding of the learning process. Sequential pattern mining (SPM), as a pattern recognition technique, has the potential to reveal the temporal aspects of learning and can be a valuable tool in educational data science. However, its potential is not well understood and exploited. This chapter addresses this gap by reviewing work that utilizes sequential pattern mining in educational contexts. We identify that SPM is suitable for mining learning behaviors, analyzing and enriching educational theories, evaluating the efficacy of instructional interventions, generating features for prediction models, and building educational recommender systems. SPM can contribute to these purposes by discovering similarities and differences in learners' activities and revealing the temporal change in learning behaviors. As a sequential analysis method, SPM can reveal unique insights about learning processes and be powerful for self-regulated learning research. It is more flexible in capturing the relative arrangement of learning events than the other sequential analysis methods. Future research may improve its utility in educational data science by developing tools for counting pattern occurrences as well as identifying and removing unreliable patterns. Future work needs to establish a systematic guideline for data preprocessing, parameter setting, and interpreting sequential patterns.
    Joint Learning of Reward Machines and Policies in Environments with Partially Known Semantics. (arXiv:2204.11833v2 [cs.LG] UPDATED)
    We study the problem of reinforcement learning for a task encoded by a reward machine. The task is defined over a set of properties in the environment, called atomic propositions, and represented by Boolean variables. One unrealistic assumption commonly used in the literature is that the truth values of these propositions are accurately known. In real situations, however, these truth values are uncertain since they come from sensors that suffer from imperfections. At the same time, reward machines can be difficult to model explicitly, especially when they encode complicated tasks. We develop a reinforcement-learning algorithm that infers a reward machine that encodes the underlying task while learning how to execute it, despite the uncertainties of the propositions' truth values. In order to address such uncertainties, the algorithm maintains a probabilistic estimate about the truth value of the atomic propositions; it updates this estimate according to new sensory measurements that arrive from the exploration of the environment. Additionally, the algorithm maintains a hypothesis reward machine, which acts as an estimate of the reward machine that encodes the task to be learned. As the agent explores the environment, the algorithm updates the hypothesis reward machine according to the obtained rewards and the estimate of the atomic propositions' truth value. Finally, the algorithm uses a Q-learning procedure for the states of the hypothesis reward machine to determine the policy that accomplishes the task. We prove that the algorithm successfully infers the reward machine and asymptotically learns a policy that accomplishes the respective task.
    Efficient Variational Bayes Learning of Graphical Models with Smooth Structural Changes. (arXiv:2009.07703v3 [stat.ML] UPDATED)
    Estimating time-varying graphical models are of paramount importance in various social, financial, biological, and engineering systems, since the evolution of such networks can be utilized for example to spot trends, detect anomalies, predict vulnerability, and evaluate the impact of interventions. Existing methods require extensive tuning of parameters that control the graph sparsity and temporal smoothness. Furthermore, these methods are computationally burdensome with time complexity $O(NP^3)$ for $P$ variables and $N$ time points. As a remedy, we propose a low-complexity tuning-free Bayesian approach, named BASS. Specifically, we impose temporally-dependent spike-and-slab priors on the graphs such that they are sparse and varying smoothly across time. A variational inference algorithm is then derived to learn the graph structures from the data automatically. Owning to the pseudo-likelihood and the mean-field approximation, the time complexity of BASS is only $O(NP^2)$. Additionally, by identifying the frequency-domain resemblance to the time-varying graphical models, we show that BASS can be extended to learning frequency-varying inverse spectral density matrices, and yields graphical models for multivariate stationary time series. Numerical results on both synthetic and real data show that that BASS can better recover the underlying true graphs, while being more efficient than the existing methods, especially for high-dimensional cases.
    SMGRL: Scalable Multi-resolution Graph Representation Learning. (arXiv:2201.12670v2 [cs.LG] UPDATED)
    Graph convolutional networks (GCNs) allow us to learn topologically-aware node embeddings, which can be useful for classification or link prediction. However, they are unable to capture long-range dependencies without adding additional layers -- which in turn leads to over-smoothing and increased time and space complexity. Further, the complex dependencies between nodes make mini-batching challenging, limiting their applicability to large graphs. We propose a Scalable Multi-resolution Graph Representation Learning (SMGRL) framework that enables us to learn multi-resolution node embeddings efficiently. Our framework is model-agnostic and can be applied to any existing GCN model. We dramatically reduce training costs by training only on a reduced-dimension coarsening of the original graph, then exploit self-similarity to apply the resulting algorithm at multiple resolutions. The resulting multi-resolution embeddings can be aggregated to yield high-quality node embeddings that capture both long- and short-range dependencies. Our experiments show that this leads to improved classification accuracy, without incurring high computational costs.
    Censored Quantile Regression Neural Networks for Distribution-Free Survival Analysis. (arXiv:2205.13496v4 [stat.ML] UPDATED)
    This paper considers doing quantile regression on censored data using neural networks (NNs). This adds to the survival analysis toolkit by allowing direct prediction of the target variable, along with a distribution-free characterisation of uncertainty, using a flexible function approximator. We begin by showing how an algorithm popular in linear models can be applied to NNs. However, the resulting procedure is inefficient, requiring sequential optimisation of an individual NN at each desired quantile. Our major contribution is a novel algorithm that simultaneously optimises a grid of quantiles output by a single NN. To offer theoretical insight into our algorithm, we show firstly that it can be interpreted as a form of expectation-maximisation, and secondly that it exhibits a desirable `self-correcting' property. Experimentally, the algorithm produces quantiles that are better calibrated than existing methods on 10 out of 12 real datasets.
    Fair Spatial Indexing: A paradigm for Group Spatial Fairness. (arXiv:2302.02306v1 [cs.LG])
    Machine learning (ML) is playing an increasing role in decision-making tasks that directly affect individuals, e.g., loan approvals, or job applicant screening. Significant concerns arise that, without special provisions, individuals from under-privileged backgrounds may not get equitable access to services and opportunities. Existing research studies fairness with respect to protected attributes such as gender, race or income, but the impact of location data on fairness has been largely overlooked. With the widespread adoption of mobile apps, geospatial attributes are increasingly used in ML, and their potential to introduce unfair bias is significant, given their high correlation with protected attributes. We propose techniques to mitigate location bias in machine learning. Specifically, we consider the issue of miscalibration when dealing with geospatial attributes. We focus on spatial group fairness and we propose a spatial indexing algorithm that accounts for fairness. Our KD-tree inspired approach significantly improves fairness while maintaining high learning accuracy, as shown by extensive experimental results on real data.
    Feature-based Individual Fairness in k-Clustering. (arXiv:2109.04554v2 [cs.LG] UPDATED)
    Ensuring fairness in machine learning algorithms is a challenging and essential task. We consider the problem of clustering a set of points while satisfying fairness constraints. While there have been several attempts to capture group fairness in the $k$-clustering problem, fairness at an individual level is relatively less explored. We introduce a new notion of individual fairness in $k$-clustering based on features not necessarily used for clustering. We show that this problem is NP-hard and does not admit a constant factor approximation. Therefore, we design a randomized algorithm that guarantees approximation both in terms of minimizing the clustering distance objective and individual fairness under natural restrictions on the distance metric and fairness constraints. Finally, our experimental results against six competing baselines validate that our algorithm produces individually fairer clusters than the fairest baseline by 12.5% on average while also being less costly in terms of the clustering objective than the best baseline by 34.5% on average.
    HardSATGEN: Understanding the Difficulty of Hard SAT Formula Generation and A Strong Structure-Hardness-Aware Baseline. (arXiv:2302.02104v1 [cs.AI])
    Industrial SAT formula generation is a critical yet challenging task for heuristic development and the surging learning-based methods in practical SAT applications. Existing SAT generation approaches can hardly simultaneously capture the global structural properties and maintain plausible computational hardness, which can be hazardous for the various downstream engagements. To this end, we first present an in-depth analysis for the limitation of previous learning methods in reproducing the computational hardness of original instances, which may stem from the inherent homogeneity in their adopted split-merge procedure. On top of the observations that industrial formulae exhibit clear community structure and oversplit substructures lead to the difficulty in semantic formation of logical structures, we propose HardSATGEN, which introduces a fine-grained control mechanism to the neural split-merge paradigm for SAT formula generation to better recover the structural and computational properties of the industrial benchmarks. Experimental results including evaluations on private corporate data and hyperparameter tuning over solvers in practical use show the significant superiority of HardSATGEN being the only method to successfully augments formulae maintaining similar computational hardness and capturing the global structural properties simultaneously. Compared to the best previous methods to our best knowledge, the average performance gains achieve 38.5% in structural statistics, 88.4% in computational metrics, and over 140.7% in the effectiveness of guiding solver development tuned by our generated instances.
    Optimal lower bounds for Quantum Learning via Information Theory. (arXiv:2301.02227v2 [quant-ph] UPDATED)
    Although a concept class may be learnt more efficiently using quantum samples as compared with classical samples in certain scenarios, Arunachalam and de Wolf (JMLR, 2018) proved that quantum learners are asymptotically no more efficient than classical ones in the quantum PAC and Agnostic learning models. They established lower bounds on sample complexity via quantum state identification and Fourier analysis. In this paper, we derive optimal lower bounds for quantum sample complexity in both the PAC and agnostic models via an information-theoretic approach. The proofs are arguably simpler, and the same ideas can potentially be used to derive optimal bounds for other problems in quantum learning theory. We then turn to a quantum analogue of the Coupon Collector problem, a classic problem from probability theory also of importance in the study of PAC learning. Arunachalam, Belovs, Childs, Kothari, Rosmanis, and de Wolf (TQC, 2020) characterized the quantum sample complexity of this problem up to constant factors. First, we show that the information-theoretic approach mentioned above provably does not yield the optimal lower bound. As a by-product, we get a natural ensemble of pure states in arbitrarily high dimensions which are not easily (simultaneously) distinguishable, while the ensemble has close to maximal Holevo information. Second, we discover that the information-theoretic approach yields an asymptotically optimal bound for an approximation variant of the problem. Finally, we derive a sharp lower bound for the Quantum Coupon Collector problem, with the exact leading order term, via the generalized Holevo-Curlander bounds on the distinguishability of an ensemble. All the aspects of the Quantum Coupon Collector problem we study rest on properties of the spectrum of the associated Gram matrix, which may be of independent interest.
    Domain Adaptation via Rebalanced Sub-domain Alignment. (arXiv:2302.02009v1 [cs.LG])
    Unsupervised domain adaptation (UDA) is a technique used to transfer knowledge from a labeled source domain to a different but related unlabeled target domain. While many UDA methods have shown success in the past, they often assume that the source and target domains must have identical class label distributions, which can limit their effectiveness in real-world scenarios. To address this limitation, we propose a novel generalization bound that reweights source classification error by aligning source and target sub-domains. We prove that our proposed generalization bound is at least as strong as existing bounds under realistic assumptions, and we empirically show that it is much stronger on real-world data. We then propose an algorithm to minimize this novel generalization bound. We demonstrate by numerical experiments that this approach improves performance in shifted class distribution scenarios compared to state-of-the-art methods.
    Multivariate Time Series Anomaly Detection via Dynamic Graph Forecasting. (arXiv:2302.02051v1 [cs.LG])
    Anomalies in univariate time series often refer to abnormal values and deviations from the temporal patterns from majority of historical observations. In multivariate time series, anomalies also refer to abnormal changes in the inter-series relationship, such as correlation, over time. Existing studies have been able to model such inter-series relationships through graph neural networks. However, most works settle on learning a static graph globally or within a context window to assist a time series forecasting task or a reconstruction task, whose objective is not tailored to explicitly detect the abnormal relationship. Some other works detect anomalies based on reconstructing or forecasting a list of inter-series graphs, which inadvertently weakens their power to capture temporal patterns within the data due to the discrete nature of graphs. In this study, we propose DyGraphAD, a multivariate time series anomaly detection framework based upon a list of dynamic inter-series graphs. The core idea is to detect anomalies based on the deviation of inter-series relationships and intra-series temporal patterns from normal to anomalous states, by leveraging the evolving nature of the graphs in order to assist a graph forecasting task and a time series forecasting task simultaneously. Our numerical experiments on real-world datasets demonstrate that DyGraphAD has superior performance than baseline anomaly detection approaches.
    Unsupervised Ensemble Methods for Anomaly Detection in PLC-based Process Control. (arXiv:2302.02097v1 [cs.LG])
    Programmable logic controller (PLC) based industrial control systems (ICS) are used to monitor and control critical infrastructure. Integration of communication networks and an Internet of Things approach in ICS has increased ICS vulnerability to cyber-attacks. This work proposes novel unsupervised machine learning ensemble methods for anomaly detection in PLC-based ICS. The work presents two broad approaches to anomaly detection: a weighted voting ensemble approach with a learning algorithm based on coefficient of determination and a stacking-based ensemble approach using isolation forest meta-detector. The two ensemble methods were analyzed via an open-source PLC-based ICS subjected to multiple attack scenarios as a case study. The work considers four different learning models for the weighted voting ensemble method. Comparative performance analyses of five ensemble methods driven diverse base detectors are presented. Results show that stacking-based ensemble method using isolation forest meta-detector achieves superior performance to previous work on all performance metrics. Results also suggest that effective unsupervised ensemble methods, such as stacking-based ensemble having isolation forest meta-detector, can robustly detect anomalies in arbitrary ICS datasets. Finally, the presented results were validated by using statistical hypothesis tests.
    AUTOLYCUS: Exploiting Explainable AI (XAI) for Model Extraction Attacks against Decision Tree Models. (arXiv:2302.02162v1 [cs.LG])
    Model extraction attack is one of the most prominent adversarial techniques to target machine learning models along with membership inference attack and model inversion attack. On the other hand, Explainable Artificial Intelligence (XAI) is a set of techniques and procedures to explain the decision making process behind AI. XAI is a great tool to understand the reasoning behind AI models but the data provided for such revelation creates security and privacy vulnerabilities. In this poster, we propose AUTOLYCUS, a model extraction attack that exploits the explanations provided by LIME to infer the decision boundaries of decision tree models and create extracted surrogate models that behave similar to a target model.
    RRNet: Towards ReLU-Reduced Neural Network for Two-party Computation Based Private Inference. (arXiv:2302.02292v1 [cs.CR])
    The proliferation of deep learning (DL) has led to the emergence of privacy and security concerns. To address these issues, secure Two-party computation (2PC) has been proposed as a means of enabling privacy-preserving DL computation. However, in practice, 2PC methods often incur high computation and communication overhead, which can impede their use in large-scale systems. To address this challenge, we introduce RRNet, a systematic framework that aims to jointly reduce the overhead of MPC comparison protocols and accelerate computation through hardware acceleration. Our approach integrates the hardware latency of cryptographic building blocks into the DNN loss function, resulting in improved energy efficiency, accuracy, and security guarantees. Furthermore, we propose a cryptographic hardware scheduler and corresponding performance model for Field Programmable Gate Arrays (FPGAs) to further enhance the efficiency of our framework. Experiments show RRNet achieved a much higher ReLU reduction performance than all SOTA works on CIFAR-10 dataset.
    REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers. (arXiv:2302.02041v1 [cs.LG])
    Tabular data is a common form of organizing data. Multiple models are available to generate synthetic tabular datasets where observations are independent, but few have the ability to produce relational datasets. Modeling relational data is challenging as it requires modeling both a "parent" table and its relationships across tables. We introduce REaLTabFormer (Realistic Relational and Tabular Transformer), a tabular and relational synthetic data generation model. It first creates a parent table using an autoregressive GPT-2 model, then generates the relational dataset conditioned on the parent table using a sequence-to-sequence (Seq2Seq) model. We implement target masking to prevent data copying and propose the $Q_{\delta}$ statistic and statistical bootstrapping to detect overfitting. Experiments using real-world datasets show that REaLTabFormer captures the relational structure better than a baseline model. REaLTabFormer also achieves state-of-the-art results on prediction tasks, "out-of-the-box", for large non-relational datasets without needing fine-tuning.
    How Bad is Top-$K$ Recommendation under Competing Content Creators?. (arXiv:2302.01971v1 [cs.GT])
    Content creators compete for exposure on recommendation platforms, and such strategic behavior leads to a dynamic shift over the content distribution. However, how the creators' competition impacts user welfare and how the relevance-driven recommendation influences the dynamics in the long run are still largely unknown. This work provides theoretical insights into these research questions. We model the creators' competition under the assumptions that: 1) the platform employs an innocuous top-$K$ recommendation policy; 2) user decisions follow the Random Utility model; 3) content creators compete for user engagement and, without knowing their utility function in hindsight, apply arbitrary no-regret learning algorithms to update their strategies. We study the user welfare guarantee through the lens of Price of Anarchy and show that the fraction of user welfare loss due to creator competition is always upper bounded by a small constant depending on $K$ and randomness in user decisions; we also prove the tightness of this bound. Our result discloses an intrinsic merit of the myopic approach to the recommendation, i.e., relevance-driven matching performs reasonably well in the long run, as long as users' decisions involve randomness and the platform provides reasonably many alternatives to its users.
    Conformalized semi-supervised random forest for classification and abnormality detection. (arXiv:2302.02237v1 [cs.LG])
    Traditional classifiers infer labels under the premise that the training and test samples are generated from the same distribution. This assumption can be problematic for safety-critical applications such as medical diagnosis and network attack detection. In this paper, we consider the multi-class classification problem when the training data and the test data may have different distributions. We propose conformalized semi-supervised random forest (CSForest), which constructs set-valued predictions $C(x)$ to include the correct class label with desired probability while detecting outliers efficiently. We compare the proposed method to other state-of-art methods in both a synthetic example and a real data application to demonstrate the strength of our proposal.
    Hierarchical Graph Neural Networks for Causal Discovery and Root Cause Localization. (arXiv:2302.01987v1 [cs.LG])
    In this paper, we propose REASON, a novel framework that enables the automatic discovery of both intra-level (i.e., within-network) and inter-level (i.e., across-network) causal relationships for root cause localization. REASON consists of Topological Causal Discovery and Individual Causal Discovery. The Topological Causal Discovery component aims to model the fault propagation in order to trace back to the root causes. To achieve this, we propose novel hierarchical graph neural networks to construct interdependent causal networks by modeling both intra-level and inter-level non-linear causal relations. Based on the learned interdependent causal networks, we then leverage random walks with restarts to model the network propagation of a system fault. The Individual Causal Discovery component focuses on capturing abrupt change patterns of a single system entity. This component examines the temporal patterns of each entity's metric data (i.e., time series), and estimates its likelihood of being a root cause based on the Extreme Value theory. Combining the topological and individual causal scores, the top K system entities are identified as root causes. Extensive experiments on three real-world datasets with case studies demonstrate the effectiveness and superiority of the proposed framework.
    On the Analysis of Correlation Between Nominal Data and Numerical Data. (arXiv:2302.02007v1 [cs.LG])
    The article investigates the possibility of measuring the strength of a linear correlation relationship between nominal data and numerical data. Correlation coefficients for variables coded with real numbers as well as for variables coded with complex numbers were studied. For variables coded with real numbers, unambiguous measures of real linear correlation were obtained. In the case of complex coding, it has been observed that the obtained complex correlation coefficients change with the permutation of the phases in the complex numbers used to code classes of elements with equal cardinalities. It was found that a necessary condition for linear correlation is the possibility of linear ordering of a set with data. Since linear order is not possible in the set of complex numbers, complex correlation coefficients cannot be used as a measure of linear correlation. In the event of such a situation, a substitute action was suggested that would prevent equal cardinality of classes of identical elements contained in the set with nominal data. This action would consist in the correction of data, analogous to the correction during preprocessing or cleaning of data containing missing or outlier values.
    Reinforcement Learning with History-Dependent Dynamic Contexts. (arXiv:2302.02061v1 [cs.LG])
    We introduce Dynamic Contextual Markov Decision Processes (DCMDPs), a novel reinforcement learning framework for history-dependent environments that generalizes the contextual MDP framework to handle non-Markov environments, where contexts change over time. We consider special cases of the model, with a focus on logistic DCMDPs, which break the exponential dependence on history length by leveraging aggregation functions to determine context transitions. This special structure allows us to derive an upper-confidence-bound style algorithm for which we establish regret bounds. Motivated by our theoretical results, we introduce a practical model-based algorithm for logistic DCMDPs that plans in a latent space and uses optimism over history-dependent features. We demonstrate the efficacy of our approach on a recommendation task (using MovieLens data) where user behavior dynamics evolve in response to recommendations.
    Semantic-Guided Image Augmentation with Pre-trained Models. (arXiv:2302.02070v1 [cs.CV])
    Image augmentation is a common mechanism to alleviate data scarcity in computer vision. Existing image augmentation methods often apply pre-defined transformations or mixup to augment the original image, but only locally vary the image. This makes them struggle to find a balance between maintaining semantic information and improving the diversity of augmented images. In this paper, we propose a Semantic-guided Image augmentation method with Pre-trained models (SIP). Specifically, SIP constructs prompts with image labels and captions to better guide the image-to-image generation process of the pre-trained Stable Diffusion model. The semantic information contained in the original images can be well preserved, and the augmented images still maintain diversity. Experimental results show that SIP can improve two commonly used backbones, i.e., ResNet-50 and ViT, by 12.60% and 2.07% on average over seven datasets, respectively. Moreover, SIP not only outperforms the best image augmentation baseline RandAugment by 4.46% and 1.23% on two backbones, but also further improves the performance by integrating naturally with the baseline. A detailed analysis of SIP is presented, including the diversity of augmented images, an ablation study on textual prompts, and a case study on the generated images.
    Federated Temporal Difference Learning with Linear Function Approximation under Environmental Heterogeneity. (arXiv:2302.02212v1 [cs.LG])
    We initiate the study of federated reinforcement learning under environmental heterogeneity by considering a policy evaluation problem. Our setup involves $N$ agents interacting with environments that share the same state and action space but differ in their reward functions and state transition kernels. Assuming agents can communicate via a central server, we ask: Does exchanging information expedite the process of evaluating a common policy? To answer this question, we provide the first comprehensive finite-time analysis of a federated temporal difference (TD) learning algorithm with linear function approximation, while accounting for Markovian sampling, heterogeneity in the agents' environments, and multiple local updates to save communication. Our analysis crucially relies on several novel ingredients: (i) deriving perturbation bounds on TD fixed points as a function of the heterogeneity in the agents' underlying Markov decision processes (MDPs); (ii) introducing a virtual MDP to closely approximate the dynamics of the federated TD algorithm; and (iii) using the virtual MDP to make explicit connections to federated optimization. Putting these pieces together, we rigorously prove that in a low-heterogeneity regime, exchanging model estimates leads to linear convergence speedups in the number of agents.
    Reinforcement Learning in Low-Rank MDPs with Density Features. (arXiv:2302.02252v1 [cs.LG])
    MDPs with low-rank transitions -- that is, the transition matrix can be factored into the product of two matrices, left and right -- is a highly representative structure that enables tractable learning. The left matrix enables expressive function approximation for value-based learning and has been studied extensively. In this work, we instead investigate sample-efficient learning with density features, i.e., the right matrix, which induce powerful models for state-occupancy distributions. This setting not only sheds light on leveraging unsupervised learning in RL, but also enables plug-in solutions for convex RL. In the offline setting, we propose an algorithm for off-policy estimation of occupancies that can handle non-exploratory data. Using this as a subroutine, we further devise an online algorithm that constructs exploratory data distributions in a level-by-level manner. As a central technical challenge, the additive error of occupancy estimation is incompatible with the multiplicative definition of data coverage. In the absence of strong assumptions like reachability, this incompatibility easily leads to exponential error blow-up, which we overcome via novel technical tools. Our results also readily extend to the representation learning setting, when the density features are unknown and must be learned from an exponentially large candidate set.
    Measuring The Impact Of Programming Language Distribution. (arXiv:2302.01973v1 [cs.LG])
    Current benchmarks for evaluating neural code models focus on only a small subset of programming languages, excluding many popular languages such as Go or Rust. To ameliorate this issue, we present the BabelCode framework for execution-based evaluation of any benchmark in any language. BabelCode enables new investigations into the qualitative performance of models' memory, runtime, and individual test case results. Additionally, we present a new code translation dataset called Translating Python Programming Puzzles (TP3) from the Python Programming Puzzles (Schuster et al. 2021) benchmark that involves translating expert-level python functions to any language. With both BabelCode and the TP3 benchmark, we investigate if balancing the distributions of 14 languages in a training dataset improves a large language model's performance on low-resource languages. Training a model on a balanced corpus results in, on average, 12.34% higher $pass@k$ across all tasks and languages compared to the baseline. We find that this strategy achieves 66.48% better $pass@k$ on low-resource languages at the cost of only a 12.94% decrease to high-resource languages. In our three translation tasks, this strategy yields, on average, 30.77% better low-resource $pass@k$ while having 19.58% worse high-resource $pass@k$.
    Directed Acyclic Graphs With Tears. (arXiv:2302.02160v1 [cs.AI])
    Bayesian network is a frequently-used method for fault detection and diagnosis in industrial processes. The basis of Bayesian network is structure learning which learns a directed acyclic graph (DAG) from data. However, the search space will scale super-exponentially with the increase of process variables, which makes the data-driven structure learning a challenging problem. To this end, the DAGs with NOTEARs methods are being well studied not only for their conversion of the discrete optimization into continuous optimization problem but also their compatibility with deep learning framework. Nevertheless, there still remain challenges for NOTEAR-based methods: 1) the infeasible solution results from the gradient descent-based optimization paradigm; 2) the truncation operation to promise the learned graph acyclic. In this work, the reason for challenge 1) is analyzed theoretically, and a novel method named DAGs with Tears method is proposed based on mix-integer programming to alleviate challenge 2). In addition, prior knowledge is able to incorporate into the new proposed method, making structure learning more practical and useful in industrial processes. Finally, a numerical example and an industrial example are adopted as case studies to demonstrate the superiority of the developed method.
    A neural operator-based surrogate solver for free-form electromagnetic inverse design. (arXiv:2302.01934v1 [physics.comp-ph])
    Neural operators have emerged as a powerful tool for solving partial differential equations in the context of scientific machine learning. Here, we implement and train a modified Fourier neural operator as a surrogate solver for electromagnetic scattering problems and compare its data efficiency to existing methods. We further demonstrate its application to the gradient-based nanophotonic inverse design of free-form, fully three-dimensional electromagnetic scatterers, an area that has so far eluded the application of deep learning techniques.
    SPARLING: Learning Latent Representations with Extremely Sparse Activations. (arXiv:2302.01976v1 [cs.LG])
    Real-world processes often contain intermediate state that can be modeled as an extremely sparse tensor. We introduce Sparling, a new kind of informational bottleneck that explicitly models this state by enforcing extreme activation sparsity. We additionally demonstrate that this technique can be used to learn the true intermediate representation with no additional supervision (i.e., from only end-to-end labeled examples), and thus improve the interpretability of the resulting models. On our DigitCircle domain, we are able to get an intermediate state prediction accuracy of 98.84%, even as we only train end-to-end.
    Decision-Aware Conditional GANs for Time Series Data. (arXiv:2009.12682v4 [cs.LG] UPDATED)
    We introduce the decision-aware time-series conditional generative adversarial network (DAT-CGAN) as a method for time-series generation. The framework adopts a multi-Wasserstein loss on structured decision-related quantities, capturing the heterogeneity of decision-related data and providing new effectiveness in supporting the decision processes of end users. We improve sample efficiency through an overlapped block-sampling method, and provide a theoretical characterization of the generalization properties of DAT-CGAN. The framework is demonstrated on financial time series for a multi-time-step portfolio choice problem. We demonstrate better generative quality in regard to underlying data and different decision-related quantities than strong, GAN-based baselines.
    Interpolation for Robust Learning: Data Augmentation on Geodesics. (arXiv:2302.02092v1 [cs.LG])
    We propose to study and promote the robustness of a model as per its performance through the interpolation of training data distributions. Specifically, (1) we augment the data by finding the worst-case Wasserstein barycenter on the geodesic connecting subpopulation distributions of different categories. (2) We regularize the model for smoother performance on the continuous geodesic path connecting subpopulation distributions. (3) Additionally, we provide a theoretical guarantee of robustness improvement and investigate how the geodesic location and the sample size contribute, respectively. Experimental validations of the proposed strategy on four datasets, including CIFAR-100 and ImageNet, establish the efficacy of our method, e.g., our method improves the baselines' certifiable robustness on CIFAR10 up to $7.7\%$, with $16.8\%$ on empirical robustness on CIFAR-100. Our work provides a new perspective of model robustness through the lens of Wasserstein geodesic-based interpolation with a practical off-the-shelf strategy that can be combined with existing robust training methods.
    MOMA:Distill from Self-Supervised Teachers. (arXiv:2302.02089v1 [cs.CV])
    Contrastive Learning and Masked Image Modelling have demonstrated exceptional performance on self-supervised representation learning, where Momentum Contrast (i.e., MoCo) and Masked AutoEncoder (i.e., MAE) are the state-of-the-art, respectively. In this work, we propose MOMA to distill from pre-trained MoCo and MAE in a self-supervised manner to collaborate the knowledge from both paradigms. We introduce three different mechanisms of knowledge transfer in the propsoed MOMA framework. : (1) Distill pre-trained MoCo to MAE. (2) Distill pre-trained MAE to MoCo (3) Distill pre-trained MoCo and MAE to a random initialized student. During the distillation, the teacher and the student are fed with original inputs and masked inputs, respectively. The learning is enabled by aligning the normalized representations from the teacher and the projected representations from the student. This simple design leads to efficient computation with extremely high mask ratio and dramatically reduced training epochs, and does not require extra considerations on the distillation target. The experiments show MOMA delivers compact student models with comparable performance to existing state-of-the-art methods, combining the power of both self-supervised learning paradigms. It presents competitive results against different benchmarks in computer vision. We hope our method provides an insight on transferring and adapting the knowledge from large-scale pre-trained models in a computationally efficient way.
    Multi-Source Diffusion Models for Simultaneous Music Generation and Separation. (arXiv:2302.02257v1 [cs.SD])
    In this work, we define a diffusion-based generative model capable of both music synthesis and source separation by learning the score of the joint probability density of sources sharing a context. Alongside the classic total inference tasks (i.e. generating a mixture, separating the sources), we also introduce and experiment on the partial inference task of source imputation, where we generate a subset of the sources given the others (e.g., play a piano track that goes well with the drums). Additionally, we introduce a novel inference method for the separation task. We train our model on Slakh2100, a standard dataset for musical source separation, provide qualitative results in the generation settings, and showcase competitive quantitative results in the separation setting. Our method is the first example of a single model that can handle both generation and separation tasks, thus representing a step toward general audio models.
    FedSpectral+: Spectral Clustering using Federated Learning. (arXiv:2302.02137v1 [cs.LG])
    Clustering in graphs has been a well-known research problem, particularly because most Internet and social network data is in the form of graphs. Organizations widely use spectral clustering algorithms to find clustering in graph datasets. However, applying spectral clustering to a large dataset is challenging due to computational overhead. While the distributed spectral clustering algorithm exists, they face the problem of data privacy and increased communication costs between the clients. Thus, in this paper, we propose a spectral clustering algorithm using federated learning (FL) to overcome these issues. FL is a privacy-protecting algorithm that accumulates model parameters from each local learner rather than collecting users' raw data, thus providing both scalability and data privacy. We developed two approaches: FedSpectral and FedSpectral+. FedSpectral is a baseline approach that uses local spectral clustering labels to aggregate the global spectral clustering by creating a similarity graph. FedSpectral+, a state-of-the-art approach, uses the power iteration method to learn the global spectral embedding by incorporating the entire graph data without access to the raw information distributed among the clients. We further designed our own similarity metric to check the clustering quality of the distributed approach to that of the original/non-FL clustering. The proposed approach FedSpectral+ obtained a similarity of 98.85% and 99.8%, comparable to that of global clustering on the ego-Facebook and email-Eu-core dataset.
    Dual Self-Awareness Value Decomposition Framework without Individual Global Max for Cooperative Multi-Agent Reinforcement Learning. (arXiv:2302.02180v1 [cs.MA])
    Value decomposition methods have gradually become popular in the cooperative multi-agent reinforcement learning field. However, almost all value decomposition methods follow the Individual Global Max (IGM) principle or its variants, which restricts the range of issues that value decomposition methods can resolve. Inspired by the notion of dual self-awareness in psychology, we propose a dual self-awareness value decomposition framework that entirely rejects the IGM premise. Each agent consists of an ego policy that carries out actions and an alter ego value function that takes part in credit assignment. The value function factorization can ignore the IGM assumption by using an explicit search procedure. We also suggest a novel anti-ego exploration mechanism to avoid the algorithm becoming stuck in a local optimum. As the first fully IGM-free value decomposition method, our proposed framework achieves desirable performance in various cooperative tasks.
    Diversity Induced Environment Design via Self-Play. (arXiv:2302.02119v1 [cs.AI])
    Recent work on designing an appropriate distribution of environments has shown promise for training effective generally capable agents. Its success is partly because of a form of adaptive curriculum learning that generates environment instances (or levels) at the frontier of the agent's capabilities. However, such an environment design framework often struggles to find effective levels in challenging design spaces and requires costly interactions with the environment. In this paper, we aim to introduce diversity in the Unsupervised Environment Design (UED) framework. Specifically, we propose a task-agnostic method to identify observed/hidden states that are representative of a given level. The outcome of this method is then utilized to characterize the diversity between two levels, which as we show can be crucial to effective performance. In addition, to improve sampling efficiency, we incorporate the self-play technique that allows the environment generator to automatically generate environments that are of great benefit to the training agent. Quantitatively, our approach, Diversity-induced Environment Design via Self-Play (DivSP), shows compelling performance over existing methods.
    Robust Budget Pacing with a Single Sample. (arXiv:2302.02006v1 [cs.LG])
    Major Internet advertising platforms offer budget pacing tools as a standard service for advertisers to manage their ad campaigns. Given the inherent non-stationarity in an advertiser's value and also competing advertisers' values over time, a commonly used approach is to learn a target expenditure plan that specifies a target spend as a function of time, and then run a controller that tracks this plan. This raises the question: how many historical samples are required to learn a good expenditure plan? We study this question by considering an advertiser repeatedly participating in $T$ second-price auctions, where the tuple of her value and the highest competing bid is drawn from an unknown time-varying distribution. The advertiser seeks to maximize her total utility subject to her budget constraint. Prior work has shown the sufficiency of $T\log T$ samples per distribution to achieve the optimal $O(\sqrt{T})$-regret. We dramatically improve this state-of-the-art and show that just one sample per distribution is enough to achieve the near-optimal $\tilde O(\sqrt{T})$-regret, while still being robust to noise in the sampling distributions.
    Structural Explanations for Graph Neural Networks using HSIC. (arXiv:2302.02139v1 [cs.LG])
    Graph neural networks (GNNs) are a type of neural model that tackle graphical tasks in an end-to-end manner. Recently, GNNs have been receiving increased attention in machine learning and data mining communities because of the higher performance they achieve in various tasks, including graph classification, link prediction, and recommendation. However, the complicated dynamics of GNNs make it difficult to understand which parts of the graph features contribute more strongly to the predictions. To handle the interpretability issues, recently, various GNN explanation methods have been proposed. In this study, a flexible model agnostic explanation method is proposed to detect significant structures in graphs using the Hilbert-Schmidt independence criterion (HSIC), which captures the nonlinear dependency between two variables through kernels. More specifically, we extend the GraphLIME method for node explanation with a group lasso and a fused lasso-based node explanation method. The group and fused regularization with GraphLIME enables the interpretation of GNNs in substructure units. Then, we show that the proposed approach can be used for the explanation of sequential graph classification tasks. Through experiments, it is demonstrated that our method can identify crucial structures in a target graph in various settings.
    Oscillation-free Quantization for Low-bit Vision Transformers. (arXiv:2302.02210v1 [cs.CV])
    Weight oscillation is an undesirable side effect of quantization-aware training, in which quantized weights frequently jump between two quantized levels, resulting in training instability and a sub-optimal final model. We discover that the learnable scaling factor, a widely-used $\textit{de facto}$ setting in quantization aggravates weight oscillation. In this study, we investigate the connection between the learnable scaling factor and quantized weight oscillation and use ViT as a case driver to illustrate the findings and remedies. In addition, we also found that the interdependence between quantized weights in $\textit{query}$ and $\textit{key}$ of a self-attention layer makes ViT vulnerable to oscillation. We, therefore, propose three techniques accordingly: statistical weight quantization ($\rm StatsQ$) to improve quantization robustness compared to the prevalent learnable-scale-based method; confidence-guided annealing ($\rm CGA$) that freezes the weights with $\textit{high confidence}$ and calms the oscillating weights; and $\textit{query}$-$\textit{key}$ reparameterization ($\rm QKR$) to resolve the query-key intertwined oscillation and mitigate the resulting gradient misestimation. Extensive experiments demonstrate that these proposed techniques successfully abate weight oscillation and consistently achieve substantial accuracy improvement on ImageNet. Specifically, our 2-bit DeiT-T/DeiT-S algorithms outperform the previous state-of-the-art by 9.8% and 7.7%, respectively. The code is included in the supplementary material and will be released.
    Zero-Shot Robot Manipulation from Passive Human Videos. (arXiv:2302.02011v1 [cs.RO])
    Can we learn robot manipulation for everyday tasks, only by watching videos of humans doing arbitrary tasks in different unstructured settings? Unlike widely adopted strategies of learning task-specific behaviors or direct imitation of a human video, we develop a a framework for extracting agent-agnostic action representations from human videos, and then map it to the agent's embodiment during deployment. Our framework is based on predicting plausible human hand trajectories given an initial image of a scene. After training this prediction model on a diverse set of human videos from the internet, we deploy the trained model zero-shot for physical robot manipulation tasks, after appropriate transformations to the robot's embodiment. This simple strategy lets us solve coarse manipulation tasks like opening and closing drawers, pushing, and tool use, without access to any in-domain robot manipulation trajectories. Our real-world deployment results establish a strong baseline for action prediction information that can be acquired from diverse arbitrary videos of human activities, and be useful for zero-shot robotic manipulation in unseen scenes.
    Backdoor Attacks on Time Series: A Generative Approach. (arXiv:2211.07915v5 [cs.LG] UPDATED)
    Backdoor attacks have emerged as one of the major security threats to deep learning models as they can easily control the model's test-time predictions by pre-injecting a backdoor trigger into the model at training time. While backdoor attacks have been extensively studied on images, few works have investigated the threat of backdoor attacks on time series data. To fill this gap, in this paper we present a novel generative approach for time series backdoor attacks against deep learning based time series classifiers. Backdoor attacks have two main goals: high stealthiness and high attack success rate. We find that, compared to images, it can be more challenging to achieve the two goals on time series. This is because time series have fewer input dimensions and lower degrees of freedom, making it hard to achieve a high attack success rate without compromising stealthiness. Our generative approach addresses this challenge by generating trigger patterns that are as realistic as real-time series patterns while achieving a high attack success rate without causing a significant drop in clean accuracy. We also show that our proposed attack is resistant to potential backdoor defenses. Furthermore, we propose a novel universal generator that can poison any type of time series with a single generator that allows universal attacks without the need to fine-tune the generative model for new time series datasets.
    Federated deep transfer learning for EEG decoding using multiple BCI tasks. (arXiv:2211.10976v3 [eess.SP] UPDATED)
    Deep learning has been successful in BCI decoding. However, it is very data-hungry and requires pooling data from multiple sources. EEG data from various sources decrease the decoding performance due to negative transfer. Recently, transfer learning for EEG decoding has been suggested as a remedy and become subject to recent BCI competitions (e.g. BEETL), but there are two complications in combining data from many subjects. First, privacy is not protected as highly personal brain data needs to be shared (and copied across increasingly tight information governance boundaries). Moreover, BCI data are collected from different sources and are often based on different BCI tasks, which has been thought to limit their reusability. Here, we demonstrate a federated deep transfer learning technique, the Multi-dataset Federated Separate-Common-Separate Network (MF-SCSN) based on our previous work of SCSN, which integrates privacy-preserving properties into deep transfer learning to utilise data sets with different tasks. This framework trains a BCI decoder using different source data sets obtained from different imagery tasks (e.g. some data sets with hands and feet, vs others with single hands and tongue, etc). Therefore, by introducing privacy-preserving transfer learning techniques, we unlock the reusability and scalability of existing BCI data sets. We evaluated our federated transfer learning method on the NeurIPS 2021 BEETL competition BCI task. The proposed architecture outperformed the baseline decoder by 3%. Moreover, compared with the baseline and other transfer learning algorithms, our method protects the privacy of the brain data from different data centres.
    Interaction Order Prediction for Temporal Graphs. (arXiv:2302.02128v1 [cs.SI])
    Link prediction in graphs is a task that has been widely investigated. It has been applied in various domains such as knowledge graph completion, content/item recommendation, social network recommendations and so on. The initial focus of most research was on link prediction in static graphs. However, there has recently been abundant work on modeling temporal graphs, and consequently one of the tasks that has been researched is link prediction in temporal graphs. However, most of the existing work does not focus on the order of link formation, and only predicts the existence of links. In this study, we aim to predict the order of node interactions.
    Representation Deficiency in Masked Language Modeling. (arXiv:2302.02060v1 [cs.CL])
    Masked Language Modeling (MLM) has been one of the most prominent approaches for pretraining bidirectional text encoders due to its simplicity and effectiveness. One notable concern about MLM is that the special $\texttt{[MASK]}$ symbol causes a discrepancy between pretraining data and downstream data as it is present only in pretraining but not in fine-tuning. In this work, we offer a new perspective on the consequence of such a discrepancy: We demonstrate empirically and theoretically that MLM pretraining allocates some model dimensions exclusively for representing $\texttt{[MASK]}$ tokens, resulting in a representation deficiency for real tokens and limiting the pretrained model's expressiveness when it is adapted to downstream data without $\texttt{[MASK]}$ tokens. Motivated by the identified issue, we propose MAE-LM, which pretrains the Masked Autoencoder architecture with MLM where $\texttt{[MASK]}$ tokens are excluded from the encoder. Empirically, we show that MAE-LM improves the utilization of model dimensions for real token representations, and MAE-LM consistently outperforms MLM-pretrained models across different pretraining settings and model sizes when fine-tuned on the GLUE and SQuAD benchmarks.
    Transformers as Algorithms: Generalization and Stability in In-context Learning. (arXiv:2301.07067v2 [cs.LG] UPDATED)
    In-context learning (ICL) is a type of prompting where a transformer model operates on a sequence of (input, output) examples and performs inference on-the-fly. In this work, we formalize in-context learning as an algorithm learning problem where a transformer model implicitly constructs a hypothesis function at inference-time. We first explore the statistical aspects of this abstraction through the lens of multitask learning: We obtain generalization bounds for ICL when the input prompt is (1) a sequence of i.i.d. (input, label) pairs or (2) a trajectory arising from a dynamical system. The crux of our analysis is relating the excess risk to the stability of the algorithm implemented by the transformer. We characterize when transformer/attention architecture provably obeys the stability condition and also provide empirical verification. For generalization on unseen tasks, we identify an inductive bias phenomenon in which the transfer learning risk is governed by the task complexity and the number of MTL tasks in a highly predictable manner. Finally, we provide numerical evaluations that (1) demonstrate transformers can indeed implement near-optimal algorithms on classical regression problems with i.i.d. and dynamic data, (2) provide insights on stability, and (3) verify our theoretical predictions.
    A Non-monotonic Self-terminating Language Model. (arXiv:2210.00660v2 [cs.LG] UPDATED)
    Recent large-scale neural autoregressive sequence models have shown impressive performances on a variety of natural language generation tasks. However, their generated sequences often exhibit degenerate properties such as non-termination, undesirable repetition, and premature termination, when generated with decoding algorithms such as greedy search, beam search, top-$k$ sampling, and nucleus sampling. In this paper, we focus on the problem of non-terminating sequences resulting from an incomplete decoding algorithm. We first define an incomplete probable decoding algorithm which includes greedy search, top-$k$ sampling, and nucleus sampling, beyond the incomplete decoding algorithm originally put forward by Welleck et al. (2020). We then propose a non-monotonic self-terminating language model, which significantly relaxes the constraint of monotonically increasing termination probability in the originally proposed self-terminating language model by Welleck et al. (2020), to address the issue of non-terminating sequences when using incomplete probable decoding algorithms. We prove that our proposed model prevents non-terminating sequences when using not only incomplete probable decoding algorithms but also beam search. We empirically validate our model on sequence completion tasks with various architectures.
    This Intestine Does Not Exist: Multiscale Residual Variational Autoencoder for Realistic Wireless Capsule Endoscopy Image Generation. (arXiv:2302.02150v1 [cs.CV])
    Medical image synthesis has emerged as a promising solution to address the limited availability of annotated medical data needed for training machine learning algorithms in the context of image-based Clinical Decision Support (CDS) systems. To this end, Generative Adversarial Networks (GANs) have been mainly applied to support the algorithm training process by generating synthetic images for data augmentation. However, in the field of Wireless Capsule Endoscopy (WCE), the limited content diversity and size of existing publicly available annotated datasets, adversely affect both the training stability and synthesis performance of GANs. Aiming to a viable solution for WCE image synthesis, a novel Variational Autoencoder architecture is proposed, namely "This Intestine Does not Exist" (TIDE). The proposed architecture comprises multiscale feature extraction convolutional blocks and residual connections, which enable the generation of high-quality and diverse datasets even with a limited number of training images. Contrary to the current approaches, which are oriented towards the augmentation of the available datasets, this study demonstrates that using TIDE, real WCE datasets can be fully substituted by artificially generated ones, without compromising classification performance. Furthermore, qualitative and user evaluation studies by experienced WCE specialists, validate from a medical viewpoint that both the normal and abnormal WCE images synthesized by TIDE are sufficiently realistic.
    Hierarchical Learning with Unsupervised Skill Discovery for Highway Merging Applications. (arXiv:2302.02179v1 [cs.LG])
    Driving in dense traffic with human and autonomous drivers is a challenging task that requires high level planning and reasoning along with the ability to react quickly to changes in a dynamic environment. In this study, we propose a hierarchical learning approach that uses learned motion primitives as actions. Motion primitives are obtained using unsupervised skill discovery without a predetermined reward function, allowing them to be reused in different scenarios. This can reduce the total training time for applications that need to obtain multiple models with varying behavior. Simulation results demonstrate that the proposed approach yields driver models that achieve higher performance with less training compared to baseline reinforcement learning methods.
    Model-Aware Contrastive Learning: Towards Escaping the Dilemmas. (arXiv:2207.07874v3 [cs.LG] UPDATED)
    Contrastive learning (CL) continuously achieves significant breakthroughs across multiple domains. However, the most common InfoNCE based methods suffer from some existing dilemmas, e.g., uniformity-tolerance dilemma (UTD) and the gradient reduction. It has been identified that UTD can lead to unexpected performance degradation. We argue that the fixity of temperature is to blame for UTD. To tackle this challenge, we enrich the CL loss family by presenting a Model-Aware Contrastive Learning (MACL) strategy, whose temperature is adaptive to the magnitude of alignment that reflects the basic confidence of the instance discrimination task, then enables CL loss to adjust the penalty strength for hard negatives adaptively. Regarding another dilemma, the gradient reduction issue, we derive the limits of an involved gradient scaling factor, which allows us to explain from a unified perspective why some recent approaches are effective with fewer negative samples, and summarily present a gradient reweighting to escape this dilemma. Extensive remarkable empirical results in vision, sentence, and graph modality validate our approach's general improvement for representation learning and downstream tasks.
    Extracting the gamma-ray source-count distribution below the Fermi-LAT detection limit with deep learning. (arXiv:2302.01947v1 [astro-ph.CO])
    We reconstruct the extra-galactic gamma-ray source-count distribution, or $dN/dS$, of resolved and unresolved sources by adopting machine learning techniques. Specifically, we train a convolutional neural network on synthetic 2-dimensional sky-maps, which are built by varying parameters of underlying source-counts models and incorporate the Fermi-LAT instrumental response functions. The trained neural network is then applied to the Fermi-LAT data, from which we estimate the source count distribution down to flux levels a factor of 50 below the Fermi-LAT threshold. We perform our analysis using 14 years of data collected in the $(1,10)$ GeV energy range. The results we obtain show a source count distribution which, in the resolved regime, is in excellent agreement with the one derived from catalogued sources, and then extends as $dN/dS \sim S^{-2}$ in the unresolved regime, down to fluxes of $5 \cdot 10^{-12}$ cm$^{-2}$ s$^{-1}$. The neural network architecture and the devised methodology have the flexibility to enable future analyses to study the energy dependence of the source-count distribution.
    An Asymptotically Optimal Algorithm for the One-Dimensional Convex Hull Feasibility Problem. (arXiv:2302.02033v1 [stat.ML])
    This work studies the pure-exploration setting for the convex hull feasibility (CHF) problem where one aims to efficiently and accurately determine if a given point lies in the convex hull of means of a finite set of distributions. We give a complete characterization of the sample complexity of the CHF problem in the one-dimensional setting. We present the first asymptotically optimal algorithm called Thompson-CHF, whose modular design consists of a stopping rule and a sampling rule. In addition, we provide an extension of the algorithm that generalizes several important problems in the multi-armed bandit literature. Finally, we further investigate the Gaussian bandit case with unknown variances and address how the Thompson-CHF algorithm can be adjusted to be asymptotically optimal in this setting.
    Matrix Estimation for Individual Fairness. (arXiv:2302.02096v1 [cs.LG])
    In recent years, multiple notions of algorithmic fairness have arisen. One such notion is individual fairness (IF), which requires that individuals who are similar receive similar treatment. In parallel, matrix estimation (ME) has emerged as a natural paradigm for handling noisy data with missing values. In this work, we connect the two concepts. We show that pre-processing data using ME can improve an algorithm's IF without sacrificing performance. Specifically, we show that using a popular ME method known as singular value thresholding (SVT) to pre-process the data provides a strong IF guarantee under appropriate conditions. We then show that, under analogous conditions, SVT pre-processing also yields estimates that are consistent and approximately minimax optimal. As such, the ME pre-processing step does not, under the stated conditions, increase the prediction error of the base algorithm, i.e., does not impose a fairness-performance trade-off. We verify these results on synthetic and real data.
    GAN-based federated learning for label protection in binary classification. (arXiv:2302.02245v1 [cs.LG])
    As an emerging technique, vertical federated learning collaborates with different data sources to jointly train a machine learning model without data exchange. However, federated learning is computationally expensive and inefficient in modeling due to complex encryption algorithms and secure computation protocols. Split learning offers an alternative solution to circumvent these challenges. Despite this, vanilla split learning still suffers privacy leakage. Here, we propose the Generative Adversarial Federated Model (GAFM), which integrates the vanilla split learning framework with the Generative Adversarial Network (GAN) for protection against label leakage from gradients in binary classification tasks. We compare our proposal to existing models, including Marvell, Max Norm, and SplitNN, on three publicly available datasets, where GAFM shows significant improvement regarding the trade-off between classification accuracy and label privacy protection. We also provide heuristic justification for why GAFM can improve over baselines and demonstrate that GAFM offers label protection through gradient perturbation compared to SplitNN.
    Mean-field analysis for heavy ball methods: Dropout-stability, connectivity, and global convergence. (arXiv:2210.06819v2 [cs.LG] UPDATED)
    The stochastic heavy ball method (SHB), also known as stochastic gradient descent (SGD) with Polyak's momentum, is widely used in training neural networks. However, despite the remarkable success of such algorithm in practice, its theoretical characterization remains limited. In this paper, we focus on neural networks with two and three layers and provide a rigorous understanding of the properties of the solutions found by SHB: \emph{(i)} stability after dropping out part of the neurons, \emph{(ii)} connectivity along a low-loss path, and \emph{(iii)} convergence to the global optimum. To achieve this goal, we take a mean-field view and relate the SHB dynamics to a certain partial differential equation in the limit of large network widths. This mean-field perspective has inspired a recent line of work focusing on SGD while, in contrast, our paper considers an algorithm with momentum. More specifically, after proving existence and uniqueness of the limit differential equations, we show convergence to the global optimum and give a quantitative bound between the mean-field limit and the SHB dynamics of a finite-width network. Armed with this last bound, we are able to establish the dropout-stability and connectivity of SHB solutions.
    A System for Morphology-Task Generalization via Unified Representation and Behavior Distillation. (arXiv:2211.14296v2 [cs.LG] UPDATED)
    The rise of generalist large-scale models in natural language and vision has made us expect that a massive data-driven approach could achieve broader generalization in other domains such as continuous control. In this work, we explore a method for learning a single policy that manipulates various forms of agents to solve various tasks by distilling a large amount of proficient behavioral data. In order to align input-output (IO) interface among multiple tasks and diverse agent morphologies while preserving essential 3D geometric relations, we introduce morphology-task graph, which treats observations, actions and goals/task in a unified graph representation. We also develop MxT-Bench for fast large-scale behavior generation, which supports procedural generation of diverse morphology-task combinations with a minimal blueprint and hardware-accelerated simulator. Through efficient representation and architecture selection on MxT-Bench, we find out that a morphology-task graph representation coupled with Transformer architecture improves the multi-task performances compared to other baselines including recent discrete tokenization, and provides better prior knowledge for zero-shot transfer or sample efficiency in downstream multi-task imitation learning. Our work suggests large diverse offline datasets, unified IO representation, and policy representation and architecture selection through supervised learning form a promising approach for studying and advancing morphology-task generalization.
    GEDI: GEnerative and DIscriminative Training for Self-Supervised Learning. (arXiv:2212.13425v3 [cs.LG] UPDATED)
    Self-supervised learning is a popular and powerful method for utilizing large amounts of unlabeled data, for which a wide variety of training objectives have been proposed in the literature. In this study, we perform a Bayesian analysis of state-of-the-art self-supervised learning objectives and propose a unified formulation based on likelihood learning. Our analysis suggests a simple method for integrating self-supervised learning with generative models, allowing for the joint training of these two seemingly distinct approaches. We refer to this combined framework as GEDI, which stands for GEnerative and DIscriminative training. Additionally, we demonstrate an instantiation of the GEDI framework by integrating an energy-based model with a cluster-based self-supervised learning model. Through experiments on synthetic and real-world data, including SVHN, CIFAR10, and CIFAR100, we show that GEDI outperforms existing self-supervised learning strategies in terms of clustering performance by a wide margin. We also demonstrate that GEDI can be integrated into a neural-symbolic framework to address tasks in the small data regime, where it can use logical constraints to further improve clustering and classification performance.
    Physics-informed Neural Networks approach to solve the Blasius function. (arXiv:2301.00106v2 [cs.LG] UPDATED)
    Deep learning techniques with neural networks have been used effectively in computational fluid dynamics (CFD) to obtain solutions to nonlinear differential equations. This paper presents a physics-informed neural network (PINN) approach to solve the Blasius function. This method eliminates the process of changing the non-linear differential equation to an initial value problem. Also, it tackles the convergence issue arising in the conventional series solution. It is seen that this method produces results that are at par with the numerical and conventional methods. The solution is extended to the negative axis to show that PINNs capture the singularity of the function at $\eta=-5.69$
    A survey on knowledge-enhanced multimodal learning. (arXiv:2211.12328v2 [cs.LG] UPDATED)
    Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation. Especially in the area of visiolinguistic (VL) learning multiple models and techniques have been developed, targeting a variety of tasks that involve images and text. VL models have reached unprecedented performances by extending the idea of Transformers, so that both modalities can learn from each other. Massive pre-training procedures enable VL models to acquire a certain level of real-world understanding, although many gaps can be identified: the limited comprehension of commonsense, factual, temporal and other everyday knowledge aspects questions the extendability of VL tasks. Knowledge graphs and other knowledge sources can fill those gaps by explicitly providing missing information, unlocking novel capabilities of VL models. In the same time, knowledge graphs enhance explainability, fairness and validity of decision making, issues of outermost importance for such complex implementations. The current survey aims to unify the fields of VL representation learning and knowledge graphs, and provides a taxonomy and analysis of knowledge-enhanced VL models.
    Inferencing the earth moving equipment-environment interaction in open pit mining. (arXiv:2302.02130v1 [cs.LG])
    In mining, grade control generally focuses on blast hole sampling and the estimation of ore control block models with little or no attention given to how the materials are being excavated from the ground. In the process of loading trucks, the underlying variability of the individual bucket load will determine the variability of truck payload. Hence, accurate material movement demands a good knowledge of the excavation process and the buckets interaction with the environment. However, equipment frequently goes into off nominal states due to unexpected delays, disturbances or faults. The large amount of such disturbances causes information loss that reduces the statistical power and biases estimates, leading to increased uncertainty in the production. A reliable method that inferences the missing knowledge about the interaction between the machine and the environment from the available data sources, is vital to accurately model the material movement. In this study, a twostep method was implemented that performed unsupervised clustering and then predicted the missing information. The first method is DBSCAN based spatial clustering which divides the diggers and buckets positional data into connected loading segments. Clear patterns of segmented bucket dig positions were observed. The second model utilized Gaussian process regression which was trained with the clustered data and the model was then used to infer the mean locations of the test clusters. Bucket dig locations were then simulated at the inferred mean locations for different durations and compared against the known bucket dig locations. This method was tested at an open pit mine in the Pilbara of Western Australia. The results demonstrate the advantage of the proposed method in inferencing the missing information of bucket environment interactions and therefore enables miners to continuously track the material movement.
    LAP: An Attention-Based Module for Faithful Interpretation and Knowledge Injection in Convolutional Neural Networks. (arXiv:2201.11808v4 [cs.CV] UPDATED)
    Despite the state-of-the-art performance of deep convolutional neural networks, they are susceptible to bias and malfunction in unseen situations. The complex computation behind their reasoning is not sufficiently human-understandable to develop trust. External explainer methods have tried to interpret the network decisions in a human-understandable way, but they are accused of fallacies due to their assumptions and simplifications. On the other side, the inherent self-interpretability of models, while being more robust to the mentioned fallacies, cannot be applied to the already trained models. In this work, we propose a new attention-based pooling layer, called Local Attention Pooling (LAP), that accomplishes self-interpretability and the possibility for knowledge injection while improving the model's performance. Moreover, several weakly-supervised knowledge injection methodologies are provided to enhance the process of training. We verified our claims by evaluating several LAP-extended models on three different datasets, including Imagenet. The proposed framework offers more valid human-understandable and more faithful-to-the-model interpretations than the commonly used white-box explainer methods.
    CECT: Controllable Ensemble CNN and Transformer for COVID-19 image classification by capturing both local and global image features. (arXiv:2302.02314v1 [eess.IV])
    Purpose: Most computer vision models are developed based on either convolutional neural network (CNN) or transformer, while the former (latter) method captures local (global) features. To relieve model performance limitations due to the lack of global (local) features, we develop a novel classification network named CECT by controllable ensemble CNN and transformer. Methods: The proposed CECT is composed of a CNN-based encoder block, a deconvolution-ensemble decoder block, and a transformer-based classification block. Different from conventional CNN- or transformer-based methods, our CECT can capture features at both multi-local and global scales, and the contribution of local features at different scales can be controlled with the proposed ensemble coefficients. Results: We evaluate CECT on two public COVID-19 datasets and it outperforms other state-of-the-art methods on all evaluation metrics. Conclusion: With remarkable feature capture ability, we believe CECT can also be used in other medical image classification scenarios to assist the diagnosis.
    A Theory of Link Prediction via Relational Weisfeiler-Leman. (arXiv:2302.02209v1 [cs.LG])
    Graph neural networks are prominent models for representation learning over graph-structured data. While the capabilities and limitations of these models are well-understood for simple graphs, our understanding remains highly incomplete in the context of knowledge graphs. The goal of this work is to provide a systematic understanding of the landscape of graph neural networks for knowledge graphs pertaining the prominent task of link prediction. Our analysis entails a unifying perspective on seemingly unrelated models, and unlocks a series of other models. The expressive power of various models is characterized via a corresponding relational Weisfeiler-Leman algorithm with different initialization regimes. This analysis is extended to provide a precise logical characterization of the class of functions captured by a class of graph neural networks. Our theoretical findings explain the benefits of some widely employed practical design choices, which are validated empirically.
    An Uncertainty-aware Loss Function for Training Neural Networks with Calibrated Predictions. (arXiv:2110.03260v2 [cs.LG] UPDATED)
    Uncertainty quantification of machine learning and deep learning methods plays an important role in enhancing trust to the obtained result. In recent years, a numerous number of uncertainty quantification methods have been introduced. Monte Carlo dropout (MC-Dropout) is one of the most well-known techniques to quantify uncertainty in deep learning methods. In this study, we propose two new loss functions by combining cross entropy with Expected Calibration Error (ECE) and Predictive Entropy (PE). The obtained results clearly show that the new proposed loss functions lead to having a calibrated MC-Dropout method. Our results confirmed the great impact of the new hybrid loss functions for minimising the overlap between the distributions of uncertainty estimates for correct and incorrect predictions without sacrificing the model's overall performance.
    Extending Bootstrap AMG for Clustering of Attributed Graphs. (arXiv:2109.09367v4 [cs.LG] UPDATED)
    In this paper we propose a new approach to detect clusters in undirected graphs with attributed vertices. We incorporate structural and attribute similarities between the vertices in an augmented graph by creating additional vertices and edges as proposed in [1, 2]. The augmented graph is then embedded in a Euclidean space associated to its Laplacian and we cluster vertices via a modified K-means algorithm, using a new vector-valued distance in the embedding space. Main novelty of our method, which can be classified as an early fusion method, i.e., a method in which additional information on vertices are fused to the structure information before applying clustering, is the interpretation of attributes as new realizations of graph vertices, which can be dealt with as coordinate vectors in a related Euclidean space. This allows us to extend a scalable generalized spectral clustering procedure which substitutes graph Laplacian eigenvectors with some vectors, named algebraically smooth vectors, obtained by a linear-time complexity Algebraic MultiGrid (AMG) method. We discuss the performance of our proposed clustering method by comparison with recent literature approaches and public available results. Extensive experiments on different types of synthetic datasets and real-world attributed graphs show that our new algorithm, embedding attributes information in the clustering, outperforms structure-only-based methods, when the attributed network has an ambiguous structure. Furthermore, our new method largely outperforms the method which originally proposed the graph augmentation, showing that our embedding strategy and vector-valued distance are very effective in taking advantages from the augmented-graph representation.
    Parallelizing Contextual Bandits. (arXiv:2105.10590v2 [stat.ML] UPDATED)
    Standard approaches to decision-making under uncertainty focus on sequential exploration of the space of decisions. However, \textit{simultaneously} proposing a batch of decisions, which leverages available resources for parallel experimentation, has the potential to rapidly accelerate exploration. We present a family of (parallel) contextual bandit algorithms applicable to problems with bounded eluder dimension whose regret is nearly identical to their perfectly sequential counterparts -- given access to the same total number of oracle queries -- up to a lower-order ``burn-in" term. We further show these algorithms can be specialized to the class of linear reward functions where we introduce and analyze several new linear bandit algorithms which explicitly introduce diversity into their action selection. Finally, we also present an empirical evaluation of these parallel algorithms in several domains, including materials discovery and biological sequence design problems, to demonstrate the utility of parallelized bandits in practical settings.
    Adversarial Learning Data Augmentation for Graph Contrastive Learning in Recommendation. (arXiv:2302.02317v1 [cs.IR])
    Recently, Graph Neural Networks (GNNs) achieve remarkable success in Recommendation. To reduce the influence of data sparsity, Graph Contrastive Learning (GCL) is adopted in GNN-based CF methods for enhancing performance. Most GCL methods consist of data augmentation and contrastive loss (e.g., InfoNCE). GCL methods construct the contrastive pairs by hand-crafted graph augmentations and maximize the agreement between different views of the same node compared to that of other nodes, which is known as the InfoMax principle. However, improper data augmentation will hinder the performance of GCL. InfoMin principle, that the good set of views shares minimal information and gives guidelines to design better data augmentation. In this paper, we first propose a new data augmentation (i.e., edge-operating including edge-adding and edge-dropping). Then, guided by InfoMin principle, we propose a novel theoretical guiding contrastive learning framework, named Learnable Data Augmentation for Graph Contrastive Learning (LDA-GCL). Our methods include data augmentation learning and graph contrastive learning, which follow the InfoMin and InfoMax principles, respectively. In implementation, our methods optimize the adversarial loss function to learn data augmentation and effective representations of users and items. Extensive experiments on four public benchmark datasets demonstrate the effectiveness of LDA-GCL.
    Bag-of-Vectors Autoencoders for Unsupervised Conditional Text Generation. (arXiv:2110.07002v2 [cs.CL] UPDATED)
    Text autoencoders are often used for unsupervised conditional text generation by applying mappings in the latent space to change attributes to the desired values. Recently, Mai et al. (2020) proposed Emb2Emb, a method to learn these mappings in the embedding space of an autoencoder. However, their method is restricted to autoencoders with a single-vector embedding, which limits how much information can be retained. We address this issue by extending their method to Bag-of-Vectors Autoencoders (BoV-AEs), which encode the text into a variable-size bag of vectors that grows with the size of the text, as in attention-based models. This allows to encode and reconstruct much longer texts than standard autoencoders. Analogous to conventional autoencoders, we propose regularization techniques that facilitate learning meaningful operations in the latent space. Finally, we adapt Emb2Emb for a training scheme that learns to map an input bag to an output bag, including a novel loss function and neural architecture. Our empirical evaluations on unsupervised sentiment transfer show that our method performs substantially better than a standard autoencoder.
    Counterfactual Risk Assessments under Unmeasured Confounding. (arXiv:2212.09844v2 [econ.EM] UPDATED)
    Statistical risk assessments inform consequential decisions, such as pretrial release in criminal justice and loan approvals in consumer finance, by counterfactually predicting an outcome under a proposed decision (e.g., would the applicant default if we approved this loan?). There may, however, have been unmeasured confounders that jointly affected decisions and outcomes in the historical data. We propose a mean outcome sensitivity model that bounds the extent to which unmeasured confounders could affect outcomes on average. The mean outcome sensitivity model partially identifies the conditional likelihood of the outcome under the proposed decision, popular predictive performance metrics, and predictive disparities. We derive their identified sets and develop procedures for the confounding-robust learning and evaluation of statistical risk assessments. We propose a nonparametric regression procedure for the bounds on the conditional likelihood of the outcome under the proposed decision, and estimators for the bounds on predictive performance and disparities. Applying our methods to a real-world credit-scoring task, we show how varying assumptions on unmeasured confounding lead to substantive changes in the credit score's predictions and evaluations of its predictive disparities.
    On a continuous time model of gradient descent dynamics and instability in deep learning. (arXiv:2302.01952v1 [stat.ML])
    The recipe behind the success of deep learning has been the combination of neural networks and gradient-based optimization. Understanding the behavior of gradient descent however, and particularly its instability, has lagged behind its empirical success. To add to the theoretical tools available to study gradient descent we propose the principal flow (PF), a continuous time flow that approximates gradient descent dynamics. To our knowledge, the PF is the only continuous flow that captures the divergent and oscillatory behaviors of gradient descent, including escaping local minima and saddle points. Through its dependence on the eigendecomposition of the Hessian the PF sheds light on the recently observed edge of stability phenomena in deep learning. Using our new understanding of instability we propose a learning rate adaptation method which enables us to control the trade-off between training stability and test set evaluation performance.
    Guaranteed Tensor Recovery Fused Low-rankness and Smoothness. (arXiv:2302.02155v1 [cs.LG])
    The tensor data recovery task has thus attracted much research attention in recent years. Solving such an ill-posed problem generally requires to explore intrinsic prior structures underlying tensor data, and formulate them as certain forms of regularization terms for guiding a sound estimate of the restored tensor. Recent research have made significant progress by adopting two insightful tensor priors, i.e., global low-rankness (L) and local smoothness (S) across different tensor modes, which are always encoded as a sum of two separate regularization terms into the recovery models. However, unlike the primary theoretical developments on low-rank tensor recovery, these joint L+S models have no theoretical exact-recovery guarantees yet, making the methods lack reliability in real practice. To this crucial issue, in this work, we build a unique regularization term, which essentially encodes both L and S priors of a tensor simultaneously. Especially, by equipping this single regularizer into the recovery models, we can rigorously prove the exact recovery guarantees for two typical tensor recovery tasks, i.e., tensor completion (TC) and tensor robust principal component analysis (TRPCA). To the best of our knowledge, this should be the first exact-recovery results among all related L+S methods for tensor recovery. Significant recovery accuracy improvements over many other SOTA methods in several TC and TRPCA tasks with various kinds of visual tensor data are observed in extensive experiments. Typically, our method achieves a workable performance when the missing rate is extremely large, e.g., 99.5%, for the color image inpainting task, while all its peers totally fail in such challenging case.
    Dueling RL: Reinforcement Learning with Trajectory Preferences. (arXiv:2111.04850v3 [cs.LG] UPDATED)
    We consider the problem of preference based reinforcement learning (PbRL), where, unlike traditional reinforcement learning, an agent receives feedback only in terms of a 1 bit (0/1) preference over a trajectory pair instead of absolute rewards for them. The success of the traditional RL framework crucially relies on the underlying agent-reward model, which, however, depends on how accurately a system designer can express an appropriate reward function and often a non-trivial task. The main novelty of our framework is the ability to learn from preference-based trajectory feedback that eliminates the need to hand-craft numeric reward models. This paper sets up a formal framework for the PbRL problem with non-markovian rewards, where the trajectory preferences are encoded by a generalized linear model of dimension $d$. Assuming the transition model is known, we then propose an algorithm with almost optimal regret guarantee of $\tilde {\mathcal{O}}\left( SH d \log (T / \delta) \sqrt{T} \right)$. We further, extend the above algorithm to the case of unknown transition dynamics, and provide an algorithm with near optimal regret guarantee $\widetilde{\mathcal{O}}((\sqrt{d} + H^2 + |\mathcal{S}|)\sqrt{dT} +\sqrt{|\mathcal{S}||\mathcal{A}|TH} )$. To the best of our knowledge, our work is one of the first to give tight regret guarantees for preference based RL problems with trajectory preferences.
    IoT Botnet Detection Using an Economic Deep Learning Model. (arXiv:2302.02013v1 [cs.CR])
    The rapid progress in technology innovation usage and distribution has increased in the last decade. The rapid growth of the Internet of Things (IoT) systems worldwide has increased network security challenges created by malicious third parties. Thus, reliable intrusion detection and network forensics systems that consider security concerns and IoT systems limitations are essential to protect such systems. IoT botnet attacks are one of the significant threats to enterprises and individuals. Thus, this paper proposed an economic deep learning-based model for detecting IoT botnet attacks along with different types of attacks. The proposed model achieved higher accuracy than the state-of-the-art detection models using a smaller implementation budget and accelerating the training and detecting processes.
    Self-Supervised Transformer Architecture for Change Detection in Radio Access Networks. (arXiv:2302.02025v1 [cs.LG])
    Radio Access Networks (RANs) for telecommunications represent large agglomerations of interconnected hardware consisting of hundreds of thousands of transmitting devices (cells). Such networks undergo frequent and often heterogeneous changes caused by network operators, who are seeking to tune their system parameters for optimal performance. The effects of such changes are challenging to predict and will become even more so with the adoption of 5G/6G networks. Therefore, RAN monitoring is vital for network operators. We propose a self-supervised learning framework that leverages self-attention and self-distillation for this task. It works by detecting changes in Performance Measurement data, a collection of time-varying metrics which reflect a set of diverse measurements of the network performance at the cell level. Experimental results show that our approach outperforms the state of the art by 4% on a real-world based dataset consisting of about hundred thousands timeseries. It also has the merits of being scalable and generalizable. This allows it to provide deep insight into the specifics of mode of operation changes while relying minimally on expert knowledge.
    Generalization of Deep Reinforcement Learning for Jammer-Resilient Frequency and Power Allocation. (arXiv:2302.02250v1 [cs.NI])
    We tackle the problem of joint frequency and power allocation while emphasizing the generalization capability of a deep reinforcement learning model. Most of the existing methods solve reinforcement learning-based wireless problems for a specific pre-determined wireless network scenario. The performance of a trained agent tends to be very specific to the network and deteriorates when used in a different network operating scenario (e.g., different in size, neighborhood, and mobility, among others). We demonstrate our approach to enhance training to enable a higher generalization capability during inference of the deployed model in a distributed multi-agent setting in a hostile jamming environment. With all these, we show the improved training and inference performance of the proposed methods when tested on previously unseen simulated wireless networks of different sizes and architectures. More importantly, to prove practical impact, the end-to-end solution was implemented on the embedded software-defined radio and validated using over-the-air evaluation.
    Counterfactual Identifiability of Bijective Causal Models. (arXiv:2302.02228v1 [stat.ML])
    We study counterfactual identifiability in causal models with bijective generation mechanisms (BGM), a class that generalizes several widely-used causal models in the literature. We establish their counterfactual identifiability for three common causal structures with unobserved confounding, and propose a practical learning method that casts learning a BGM as structured generative modeling. Learned BGMs enable efficient counterfactual estimation and can be obtained using a variety of deep conditional generative models. We evaluate our techniques in a visual task and demonstrate its application in a real-world video streaming simulation task.
    Cross-Frequency Time Series Meta-Forecasting. (arXiv:2302.02077v1 [cs.LG])
    Meta-forecasting is a newly emerging field which combines meta-learning and time series forecasting. The goal of meta-forecasting is to train over a collection of source time series and generalize to new time series one-at-a-time. Previous approaches in meta-forecasting achieve competitive performance, but with the restriction of training a separate model for each sampling frequency. In this work, we investigate meta-forecasting over different sampling frequencies, and introduce a new model, the Continuous Frequency Adapter (CFA), specifically designed to learn frequency-invariant representations. We find that CFA greatly improves performance when generalizing to unseen frequencies, providing a first step towards forecasting over larger multi-frequency datasets.
    Augmenting Rule-based DNS Censorship Detection at Scale with Machine Learning. (arXiv:2302.02031v1 [cs.LG])
    The proliferation of global censorship has led to the development of a plethora of measurement platforms to monitor and expose it. Censorship of the domain name system (DNS) is a key mechanism used across different countries. It is currently detected by applying heuristics to samples of DNS queries and responses (probes) for specific destinations. These heuristics, however, are both platform-specific and have been found to be brittle when censors change their blocking behavior, necessitating a more reliable automated process for detecting censorship. In this paper, we explore how machine learning (ML) models can (1) help streamline the detection process, (2) improve the usability of large-scale datasets for censorship detection, and (3) discover new censorship instances and blocking signatures missed by existing heuristic methods. Our study shows that supervised models, trained using expert-derived labels on instances of known anomalies and possible censorship, can learn the detection heuristics employed by different measurement platforms. More crucially, we find that unsupervised models, trained solely on uncensored instances, can identify new instances and variations of censorship missed by existing heuristics. Moreover, both methods demonstrate the capability to uncover a substantial number of new DNS blocking signatures, i.e., injected fake IP addresses overlooked by existing heuristics. These results are underpinned by an important methodological finding: comparing the outputs of models trained using the same probes but with labels arising from independent processes allows us to more reliably detect cases of censorship in the absence of ground-truth labels of censorship.
    On Representation Knowledge Distillation for Graph Neural Networks. (arXiv:2111.04964v4 [cs.LG] UPDATED)
    Knowledge distillation is a learning paradigm for boosting resource-efficient graph neural networks (GNNs) using more expressive yet cumbersome teacher models. Past work on distillation for GNNs proposed the Local Structure Preserving loss (LSP), which matches local structural relationships defined over edges across the student and teacher's node embeddings. This paper studies whether preserving the global topology of how the teacher embeds graph data can be a more effective distillation objective for GNNs, as real-world graphs often contain latent interactions and noisy edges. We propose Graph Contrastive Representation Distillation (G-CRD), which uses contrastive learning to implicitly preserve global topology by aligning the student node embeddings to those of the teacher in a shared representation space. Additionally, we introduce an expanded set of benchmarks on large-scale real-world datasets where the performance gap between teacher and student GNNs is non-negligible. Experiments across 4 datasets and 14 heterogeneous GNN architectures show that G-CRD consistently boosts the performance and robustness of lightweight GNNs, outperforming LSP (and a global structure preserving variant of LSP) as well as baselines from 2D computer vision. An analysis of the representational similarity among teacher and student embedding spaces reveals that G-CRD balances preserving local and global relationships, while structure preserving approaches are best at preserving one or the other. Our code is available at https://github.com/chaitjo/efficient-gnns
    Support Vector Regression: Risk Quadrangle Framework. (arXiv:2212.09178v3 [stat.ML] UPDATED)
    This paper investigates Support Vector Regression (SVR) in the context of the fundamental risk quadrangle paradigm. It is shown that both formulations of SVR, $\varepsilon$-SVR and $\nu$-SVR, correspond to the minimization of equivalent regular error measures (Vapnik error and superquantile (CVaR) norm, respectively) with a regularization penalty. These error measures, in turn, give rise to corresponding risk quadrangles. By constructing the fundamental risk quadrangle, which corresponds to SVR, we show that SVR is the asymptotically unbiased estimator of the average of two symmetric conditional quantiles. Furthermore, the technique used for the construction of quadrangles serves as a powerful tool in proving the equivalence between $\varepsilon$-SVR and $\nu$-SVR. Additionally, SVR is formulated as a regular deviation minimization problem with a regularization penalty by invoking Error Shaping Decomposition of Regression and the dual formulation of SVR in the risk quadrangle framework is derived.
    Fixed-kinetic Neural Hamiltonian Flows for enhanced interpretability and reduced complexity. (arXiv:2302.01955v1 [cs.LG])
    Normalizing Flows (NF) are Generative models which are particularly robust and allow for exact sampling of the learned distribution. They however require the design of an invertible mapping, whose Jacobian determinant has to be computable. Recently introduced, Neural Hamiltonian Flows (NHF) are based on Hamiltonian dynamics-based Flows, which are continuous, volume-preserving and invertible and thus make for natural candidates for robust NF architectures. In particular, their similarity to classical Mechanics could lead to easier interpretability of the learned mapping. However, despite being Physics-inspired architectures, the originally introduced NHF architecture still poses a challenge to interpretability. For this reason, in this work, we introduce a fixed kinetic energy version of the NHF model. Inspired by physics, our approach improves interpretability and requires less parameters than previously proposed architectures. We then study the robustness of the NHF architectures to the choice of hyperparameters. We analyze the impact of the number of leapfrog steps, the integration time and the number of neurons per hidden layer, as well as the choice of prior distribution, on sampling a multimodal 2D mixture. The NHF architecture is robust to these choices, especially the fixed-kinetic energy model. Finally, we adapt NHF to the context of Bayesian inference and illustrate our method on sampling the posterior distribution of two cosmological parameters knowing type Ia supernovae observations.
  • Open

    Proper Learning of Linear Dynamical Systems as a Non-Commutative Polynomial Optimisation Problem. (arXiv:2002.01444v4 [math.OC] UPDATED)
    There has been much recent progress in forecasting the next observation of a linear dynamical system (LDS), which is known as the improper learning, as well as in the estimation of its system matrices, which is known as the proper learning of LDS. We present an approach to proper learning of LDS, which in spite of the non-convexity of the problem, guarantees global convergence of numerical solutions to a least-squares estimator. We present promising computational results.  ( 2 min )
    Toward Large Kernel Models. (arXiv:2302.02605v1 [cs.LG])
    Recent studies indicate that kernel machines can often perform similarly or better than deep neural networks (DNNs) on small datasets. The interest in kernel machines has been additionally bolstered by the discovery of their equivalence to wide neural networks in certain regimes. However, a key feature of DNNs is their ability to scale the model size and training data size independently, whereas in traditional kernel machines model size is tied to data size. Because of this coupling, scaling kernel machines to large data has been computationally challenging. In this paper, we provide a way forward for constructing large-scale general kernel models, which are a generalization of kernel machines that decouples the model and data, allowing training on large datasets. Specifically, we introduce EigenPro 3.0, an algorithm based on projected dual preconditioned SGD and show scaling to model and data sizes which have not been possible with existing kernel methods.  ( 2 min )
    Sampling-Based Accuracy Testing of Posterior Estimators for General Inference. (arXiv:2302.03026v1 [stat.ML])
    Parameter inference, i.e. inferring the posterior distribution of the parameters of a statistical model given some data, is a central problem to many scientific disciplines. Posterior inference with generative models is an alternative to methods such as Markov Chain Monte Carlo, both for likelihood-based and simulation-based inference. However, assessing the accuracy of posteriors encoded in generative models is not straightforward. In this paper, we introduce `distance to random point' (DRP) coverage testing as a method to estimate coverage probabilities of generative posterior estimators. Our method differs from previously-existing coverage-based methods, which require posterior evaluations. We prove that our approach is necessary and sufficient to show that a posterior estimator is optimal. We demonstrate the method on a variety of synthetic examples, and show that DRP can be used to test the results of posterior inference analyses in high-dimensional spaces. We also show that our method can detect non-optimal inferences in cases where existing methods fail.  ( 2 min )
    Identifiability of latent-variable and structural-equation models: from linear to nonlinear. (arXiv:2302.02672v1 [stat.ML])
    An old problem in multivariate statistics is that linear Gaussian models are often unidentifiable, i.e. some parameters cannot be uniquely estimated. In factor analysis, an orthogonal rotation of the factors is unidentifiable, while in linear regression, the direction of effect cannot be identified. For such linear models, non-Gaussianity of the (latent) variables has been shown to provide identifiability. In the case of factor analysis, this leads to independent component analysis, while in the case of the direction of effect, non-Gaussian versions of structural equation modelling solve the problem. More recently, we have shown how even general nonparametric nonlinear versions of such models can be estimated. Non-Gaussianity is not enough in this case, but assuming we have time series, or that the distributions are suitably modulated by some observed auxiliary variables, the models are identifiable. This paper reviews the identifiability theory for the linear and nonlinear cases, considering both factor analytic models and structural equation models.
    Counterfactual Identifiability of Bijective Causal Models. (arXiv:2302.02228v1 [stat.ML])
    We study counterfactual identifiability in causal models with bijective generation mechanisms (BGM), a class that generalizes several widely-used causal models in the literature. We establish their counterfactual identifiability for three common causal structures with unobserved confounding, and propose a practical learning method that casts learning a BGM as structured generative modeling. Learned BGMs enable efficient counterfactual estimation and can be obtained using a variety of deep conditional generative models. We evaluate our techniques in a visual task and demonstrate its application in a real-world video streaming simulation task.
    Structural Explanations for Graph Neural Networks using HSIC. (arXiv:2302.02139v1 [cs.LG])
    Graph neural networks (GNNs) are a type of neural model that tackle graphical tasks in an end-to-end manner. Recently, GNNs have been receiving increased attention in machine learning and data mining communities because of the higher performance they achieve in various tasks, including graph classification, link prediction, and recommendation. However, the complicated dynamics of GNNs make it difficult to understand which parts of the graph features contribute more strongly to the predictions. To handle the interpretability issues, recently, various GNN explanation methods have been proposed. In this study, a flexible model agnostic explanation method is proposed to detect significant structures in graphs using the Hilbert-Schmidt independence criterion (HSIC), which captures the nonlinear dependency between two variables through kernels. More specifically, we extend the GraphLIME method for node explanation with a group lasso and a fused lasso-based node explanation method. The group and fused regularization with GraphLIME enables the interpretation of GNNs in substructure units. Then, we show that the proposed approach can be used for the explanation of sequential graph classification tasks. Through experiments, it is demonstrated that our method can identify crucial structures in a target graph in various settings.
    On Representation Knowledge Distillation for Graph Neural Networks. (arXiv:2111.04964v4 [cs.LG] UPDATED)
    Knowledge distillation is a learning paradigm for boosting resource-efficient graph neural networks (GNNs) using more expressive yet cumbersome teacher models. Past work on distillation for GNNs proposed the Local Structure Preserving loss (LSP), which matches local structural relationships defined over edges across the student and teacher's node embeddings. This paper studies whether preserving the global topology of how the teacher embeds graph data can be a more effective distillation objective for GNNs, as real-world graphs often contain latent interactions and noisy edges. We propose Graph Contrastive Representation Distillation (G-CRD), which uses contrastive learning to implicitly preserve global topology by aligning the student node embeddings to those of the teacher in a shared representation space. Additionally, we introduce an expanded set of benchmarks on large-scale real-world datasets where the performance gap between teacher and student GNNs is non-negligible. Experiments across 4 datasets and 14 heterogeneous GNN architectures show that G-CRD consistently boosts the performance and robustness of lightweight GNNs, outperforming LSP (and a global structure preserving variant of LSP) as well as baselines from 2D computer vision. An analysis of the representational similarity among teacher and student embedding spaces reveals that G-CRD balances preserving local and global relationships, while structure preserving approaches are best at preserving one or the other. Our code is available at https://github.com/chaitjo/efficient-gnns
    A Fast Bootstrap Algorithm for Causal Inference with Large Data. (arXiv:2302.02859v1 [stat.ME])
    Estimating causal effects from large experimental and observational data has become increasingly prevalent in both industry and research. The bootstrap is an intuitive and powerful technique used to construct standard errors and confidence intervals of estimators. Its application however can be prohibitively demanding in settings involving large data. In addition, modern causal inference estimators based on machine learning and optimization techniques exacerbate the computational burden of the bootstrap. The bag of little bootstraps has been proposed in non-causal settings for large data but has not yet been applied to evaluate the properties of estimators of causal effects. In this paper, we introduce a new bootstrap algorithm called causal bag of little bootstraps for causal inference with large data. The new algorithm significantly improves the computational efficiency of the traditional bootstrap while providing consistent estimates and desirable confidence interval coverage. We describe its properties, provide practical considerations, and evaluate the performance of the proposed algorithm in terms of bias, coverage of the true 95% confidence intervals, and computational time in a simulation study. We apply it in the evaluation of the effect of hormone therapy on the average time to coronary heart disease using a large observational data set from the Women's Health Initiative.
    Exploring validation metrics for offline model-based optimisation. (arXiv:2211.10747v2 [stat.ML] UPDATED)
    In offline model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of desirability through an expensive but real-world scoring process. Offline MBO tries to approximate this expensive scoring function and use that to evaluate generated designs, however evaluation is non-exact because one approximation is being evaluated with another. Instead, we ask ourselves: if we did have the real world scoring function at hand, what cheap-to-compute validation metrics would correlate best with this? Since the real-world scoring function is available for simulated MBO datasets, insights obtained from this can be transferred over to real-world offline MBO tasks where the real-world scoring function is expensive to compute. To address this, we propose a conceptual evaluation framework that is amenable to measuring extrapolation, and apply this to conditional denoising diffusion models. Empirically, we find that two validation metrics -- agreement and Frechet distance -- correlate quite well with the ground truth. When there is high variability in conditional generation, feedback is required in the form of an approximated version of the real-world scoring function. Furthermore, we find that generating high-scoring samples may require heavily weighting the generative model in favour of sample quality, potentially at the cost of sample diversity.
    Nonparametric Density Estimation under Distribution Drift. (arXiv:2302.02460v1 [cs.LG])
    We study nonparametric density estimation in non-stationary drift settings. Given a sequence of independent samples taken from a distribution that gradually changes in time, the goal is to compute the best estimate for the current distribution. We prove tight minimax risk bounds for both discrete and continuous smooth densities, where the minimum is over all possible estimates and the maximum is over all possible distributions that satisfy the drift constraints. Our technique handles a broad class of drift models, and generalizes previous results on agnostic learning under drift.
    Robust Fine-Tuning of Deep Neural Networks with Hessian-based Generalization Guarantees. (arXiv:2206.02659v4 [cs.LG] UPDATED)
    We consider transfer learning approaches that fine-tune a pretrained deep neural network on a target task. We study the generalization properties of fine-tuning to understand the problem of overfitting, which commonly occurs in practice. Previous works have shown that constraining the distance from the initialization of fine-tuning improves generalization. Using a PAC-Bayesian analysis, we observe that besides distance from initialization, Hessians affect generalization through the noise stability of deep neural networks against noise injections. Motivated by the observation, we develop Hessian distance-based generalization bounds for a wide range of fine-tuning methods. Additionally, we study the robustness of fine-tuning in the presence of noisy labels. We design an algorithm incorporating consistent losses and distance-based regularization for fine-tuning, along with a generalization error guarantee under class conditional independent noise in the training set labels. We perform a detailed empirical study of our algorithm on various noisy environments and architectures. On six image classification tasks whose training labels are generated with programmatic labeling, we find a 3.26% accuracy gain over prior fine-tuning methods. Meanwhile, the Hessian distance measure of the fine-tuned model decreases by six times more than existing approaches.
    Topology-aware Generalization of Decentralized SGD. (arXiv:2206.12680v4 [cs.LG] UPDATED)
    This paper studies the algorithmic stability and generalizability of decentralized stochastic gradient descent (D-SGD). We prove that the consensus model learned by D-SGD is $\mathcal{O}{(N^{-1}+m^{-1} +\lambda^2)}$-stable in expectation in the non-convex non-smooth setting, where $N$ is the total sample size, $m$ is the worker number, and $1+\lambda$ is the spectral gap that measures the connectivity of the communication topology. These results then deliver an $\mathcal{O}{(N^{-(1+\alpha)/2}+ m^{-(1+\alpha)/2}+\lambda^{1+\alpha} + \phi_{\mathcal{S}})}$ in-average generalization bound, which is non-vacuous even when $\lambda$ is closed to $1$, in contrast to vacuous as suggested by existing literature on the projected version of D-SGD. Our theory indicates that the generalizability of D-SGD is positively correlated with the spectral gap, and can explain why consensus control in initial training phase can ensure better generalization. Experiments of VGG-11 and ResNet-18 on CIFAR-10, CIFAR-100 and Tiny-ImageNet justify our theory. To our best knowledge, this is the first work on the topology-aware generalization of vanilla D-SGD. Code is available at https://github.com/Raiden-Zhu/Generalization-of-DSGD.
    Multiscale Graph Comparison via the Embedded Laplacian Discrepancy. (arXiv:2201.12064v2 [stat.ML] UPDATED)
    Laplacian eigenvectors capture natural community structures on graphs and are widely used in spectral clustering and manifold learning. The use of Laplacian eigenvectors as embeddings for the purpose of multiscale graph comparison has however been limited. Here we propose the Embedded Laplacian Discrepancy (ELD) as a simple and fast approach to compare graphs (of potentially different sizes) based on the similarity of the graphs' community structures. The ELD operates by representing graphs as point clouds in a common, low-dimensional space, on which a natural Wasserstein-based distance can be efficiently computed. A main challenge in comparing graphs through any eigenvector-based approaches is the potential ambiguity that could arise due to sign-flips and basis symmetries. The ELD leverages a simple symmetrization trick to bypass any sign ambiguities. For comparing graphs that do not have any ambiguities due to basis symmetries (i.e. the spectrums are simple), we show that the ELD becomes a natural pseudo-metric that enjoys nice properties such as invariance under graph isomorphism. For comparing graphs with non-simple spectrums, we propose a procedure to approximate the ELD via a simple perturbation technique to resolve any ambiguity from basis symmetries. We show that such perturbations are stable using matrix perturbation theory under mild assumptions that are straightforward to verify in practice. We demonstrate the excellent applicability of the ELD approach on both simulated and real datasets.
    Generalization Bounds with Data-dependent Fractal Dimensions. (arXiv:2302.02766v1 [stat.ML])
    Providing generalization guarantees for modern neural networks has been a crucial task in statistical learning. Recently, several studies have attempted to analyze the generalization error in such settings by using tools from fractal geometry. While these works have successfully introduced new mathematical tools to apprehend generalization, they heavily rely on a Lipschitz continuity assumption, which in general does not hold for neural networks and might make the bounds vacuous. In this work, we address this issue and prove fractal geometry-based generalization bounds without requiring any Lipschitz assumption. To achieve this goal, we build up on a classical covering argument in learning theory and introduce a data-dependent fractal dimension. Despite introducing a significant amount of technical complications, this new notion lets us control the generalization error (over either fixed or random hypothesis spaces) along with certain mutual information (MI) terms. To provide a clearer interpretation to the newly introduced MI terms, as a next step, we introduce a notion of "geometric stability" and link our bounds to the prior art. Finally, we make a rigorous connection between the proposed data-dependent dimension and topological data analysis tools, which then enables us to compute the dimension in a numerically efficient way. We support our theory with experiments conducted on various settings.
    Variational Information Pursuit for Interpretable Predictions. (arXiv:2302.02876v1 [cs.LG])
    There is a growing interest in the machine learning community in developing predictive algorithms that are "interpretable by design". Towards this end, recent work proposes to make interpretable decisions by sequentially asking interpretable queries about data until a prediction can be made with high confidence based on the answers obtained (the history). To promote short query-answer chains, a greedy procedure called Information Pursuit (IP) is used, which adaptively chooses queries in order of information gain. Generative models are employed to learn the distribution of query-answers and labels, which is in turn used to estimate the most informative query. However, learning and inference with a full generative model of the data is often intractable for complex tasks. In this work, we propose Variational Information Pursuit (V-IP), a variational characterization of IP which bypasses the need for learning generative models. V-IP is based on finding a query selection strategy and a classifier that minimizes the expected cross-entropy between true and predicted labels. We then demonstrate that the IP strategy is the optimal solution to this problem. Therefore, instead of learning generative models, we can use our optimal strategy to directly pick the most informative query given any history. We then develop a practical algorithm by defining a finite-dimensional parameterization of our strategy and classifier using deep networks and train them end-to-end using our objective. Empirically, V-IP is 10-100x faster than IP on different Vision and NLP tasks with competitive performance. Moreover, V-IP finds much shorter query chains when compared to reinforcement learning which is typically used in sequential-decision-making problems. Finally, we demonstrate the utility of V-IP on challenging tasks like medical diagnosis where the performance is far superior to the generative modelling approach.
    Optimal Transport Guided Unsupervised Learning for Enhancing low-quality Retinal Images. (arXiv:2302.02991v1 [eess.IV])
    Real-world non-mydriatic retinal fundus photography is prone to artifacts, imperfections and low-quality when certain ocular or systemic co-morbidities exist. Artifacts may result in inaccuracy or ambiguity in clinical diagnoses. In this paper, we proposed a simple but effective end-to-end framework for enhancing poor-quality retinal fundus images. Leveraging the optimal transport theory, we proposed an unpaired image-to-image translation scheme for transporting low-quality images to their high-quality counterparts. We theoretically proved that a Generative Adversarial Networks (GAN) model with a generator and discriminator is sufficient for this task. Furthermore, to mitigate the inconsistency of information between the low-quality images and their enhancements, an information consistency mechanism was proposed to maximally maintain structural consistency (optical discs, blood vessels, lesions) between the source and enhanced domains. Extensive experiments were conducted on the EyeQ dataset to demonstrate the superiority of our proposed method perceptually and quantitatively.
    Robust Empirical Risk Minimization with Tolerance. (arXiv:2210.00635v2 [cs.LG] UPDATED)
    Developing simple, sample-efficient learning algorithms for robust classification is a pressing issue in today's tech-dominated world, and current theoretical techniques requiring exponential sample complexity and complicated improper learning rules fall far from answering the need. In this work we study the fundamental paradigm of (robust) $\textit{empirical risk minimization}$ (RERM), a simple process in which the learner outputs any hypothesis minimizing its training error. RERM famously fails to robustly learn VC classes (Montasser et al., 2019a), a bound we show extends even to `nice' settings such as (bounded) halfspaces. As such, we study a recent relaxation of the robust model called $\textit{tolerant}$ robust learning (Ashtiani et al., 2022) where the output classifier is compared to the best achievable error over slightly larger perturbation sets. We show that under geometric niceness conditions, a natural tolerant variant of RERM is indeed sufficient for $\gamma$-tolerant robust learning VC classes over $\mathbb{R}^d$, and requires only $\tilde{O}\left( \frac{VC(H)d\log \frac{D}{\gamma\delta}}{\epsilon^2}\right)$ samples for robustness regions of (maximum) diameter $D$.
    Asymptotically Minimax Optimal Fixed-Budget Best Arm Identification for Expected Simple Regret Minimization. (arXiv:2302.02988v1 [cs.LG])
    We investigate fixed-budget best arm identification (BAI) for expected simple regret minimization. In each round of an adaptive experiment, a decision maker draws one of multiple treatment arms based on past observations and subsequently observes the outcomes of the chosen arm. After the experiment, the decision maker recommends a treatment arm with the highest projected outcome. We evaluate this decision in terms of the expected simple regret, a difference between the expected outcomes of the best and recommended treatment arms. Due to the inherent uncertainty, we evaluate the regret using the minimax criterion. For distributions with fixed variances (location-shift models), such as Gaussian distributions, we derive asymptotic lower bounds for the worst-case expected simple regret. Then, we show that the Random Sampling (RS)-Augmented Inverse Probability Weighting (AIPW) strategy proposed by Kato et al. (2022) is asymptotically minimax optimal in the sense that the leading factor of its worst-case expected simple regret asymptotically matches our derived worst-case lower bound. Our result indicates that, for location-shift models, the optimal RS-AIPW strategy draws treatment arms with varying probabilities based on their variances. This result contrasts with the results of Bubeck et al. (2011), which shows that drawing each treatment arm with an equal ratio is minimax optimal in a bounded outcome setting.
    OTRE: Where Optimal Transport Guided Unpaired Image-to-Image Translation Meets Regularization by Enhancing. (arXiv:2302.03003v1 [eess.IV])
    Non-mydriatic retinal color fundus photography (CFP) is widely available due to the advantage of not requiring pupillary dilation, however, is prone to poor quality due to operators, systemic imperfections, or patient-related causes. Optimal retinal image quality is mandated for accurate medical diagnoses and automated analyses. Herein, we leveraged the \emph{Optimal Transport (OT)} theory to propose an unpaired image-to-image translation scheme for mapping low-quality retinal CFPs to high-quality counterparts. Furthermore, to improve the flexibility, robustness, and applicability of our image enhancement pipeline in the clinical practice, we generalized a state-of-the-art model-based image reconstruction method, regularization by denoising, by plugging in priors learned by our OT-guided image-to-image translation network. We named it as \emph{regularization by enhancing (RE)}. We validated the integrated framework, OTRE, on three publicly available retinal image datasets by assessing the quality after enhancement and their performance on various downstream tasks, including diabetic retinopathy grading, vessel segmentation, and diabetic lesion segmentation. The experimental results demonstrated the superiority of our proposed framework over some state-of-the-art unsupervised competitors and a state-of-the-art supervised method.
    Mean-field analysis for heavy ball methods: Dropout-stability, connectivity, and global convergence. (arXiv:2210.06819v2 [cs.LG] UPDATED)
    The stochastic heavy ball method (SHB), also known as stochastic gradient descent (SGD) with Polyak's momentum, is widely used in training neural networks. However, despite the remarkable success of such algorithm in practice, its theoretical characterization remains limited. In this paper, we focus on neural networks with two and three layers and provide a rigorous understanding of the properties of the solutions found by SHB: \emph{(i)} stability after dropping out part of the neurons, \emph{(ii)} connectivity along a low-loss path, and \emph{(iii)} convergence to the global optimum. To achieve this goal, we take a mean-field view and relate the SHB dynamics to a certain partial differential equation in the limit of large network widths. This mean-field perspective has inspired a recent line of work focusing on SGD while, in contrast, our paper considers an algorithm with momentum. More specifically, after proving existence and uniqueness of the limit differential equations, we show convergence to the global optimum and give a quantitative bound between the mean-field limit and the SHB dynamics of a finite-width network. Armed with this last bound, we are able to establish the dropout-stability and connectivity of SHB solutions.
    On Best-Arm Identification with a Fixed Budget in Non-Parametric Multi-Armed Bandits. (arXiv:2210.00895v2 [cs.LG] UPDATED)
    We lay the foundations of a non-parametric theory of best-arm identification in multi-armed bandits with a fixed budget T. We consider general, possibly non-parametric, models D for distributions over the arms; an overarching example is the model D = P(0,1) of all probability distributions over [0,1]. We propose upper bounds on the average log-probability of misidentifying the optimal arm based on information-theoretic quantities that correspond to infima over Kullback-Leibler divergences between some distributions in D and a given distribution. This is made possible by a refined analysis of the successive-rejects strategy of Audibert, Bubeck, and Munos (2010). We finally provide lower bounds on the same average log-probability, also in terms of the same new information-theoretic quantities; these lower bounds are larger when the (natural) assumptions on the considered strategies are stronger. All these new upper and lower bounds generalize existing bounds based, e.g., on gaps between distributions.
    $z$-SignFedAvg: A Unified Stochastic Sign-based Compression for Federated Learning. (arXiv:2302.02589v1 [cs.LG])
    Federated Learning (FL) is a promising privacy-preserving distributed learning paradigm but suffers from high communication cost when training large-scale machine learning models. Sign-based methods, such as SignSGD \cite{bernstein2018signsgd}, have been proposed as a biased gradient compression technique for reducing the communication cost. However, sign-based algorithms could diverge under heterogeneous data, which thus motivated the development of advanced techniques, such as the error-feedback method and stochastic sign-based compression, to fix this issue. Nevertheless, these methods still suffer from slower convergence rates. Besides, none of them allows multiple local SGD updates like FedAvg \cite{mcmahan2017communication}. In this paper, we propose a novel noisy perturbation scheme with a general symmetric noise distribution for sign-based compression, which not only allows one to flexibly control the tradeoff between gradient bias and convergence performance, but also provides a unified viewpoint to existing stochastic sign-based methods. More importantly, the unified noisy perturbation scheme enables the development of the very first sign-based FedAvg algorithm ($z$-SignFedAvg) to accelerate the convergence. Theoretically, we show that $z$-SignFedAvg achieves a faster convergence rate than existing sign-based methods and, under the uniformly distributed noise, can enjoy the same convergence rate as its uncompressed counterpart. Extensive experiments are conducted to demonstrate that the $z$-SignFedAvg can achieve competitive empirical performance on real datasets and outperforms existing schemes.
    In Search of Insights, Not Magic Bullets: Towards Demystification of the Model Selection Dilemma in Heterogeneous Treatment Effect Estimation. (arXiv:2302.02923v1 [stat.ML])
    Personalized treatment effect estimates are often of interest in high-stakes applications -- thus, before deploying a model estimating such effects in practice, one needs to be sure that the best candidate from the ever-growing machine learning toolbox for this task was chosen. Unfortunately, due to the absence of counterfactual information in practice, it is usually not possible to rely on standard validation metrics for doing so, leading to a well-known model selection dilemma in the treatment effect estimation literature. While some solutions have recently been investigated, systematic understanding of the strengths and weaknesses of different model selection criteria is still lacking. In this paper, instead of attempting to declare a global `winner', we therefore empirically investigate success- and failure modes of different selection criteria. We highlight that there is a complex interplay between selection strategies, candidate estimators and the DGP used for testing, and provide interesting insights into the relative (dis)advantages of different criteria alongside desiderata for the design of further illuminating empirical studies in this context.
    The SSL Interplay: Augmentations, Inductive Bias, and Generalization. (arXiv:2302.02774v1 [stat.ML])
    Self-supervised learning (SSL) has emerged as a powerful framework to learn representations from raw data without supervision. Yet in practice, engineers face issues such as instability in tuning optimizers and collapse of representations during training. Such challenges motivate the need for a theory to shed light on the complex interplay between the choice of data augmentation, network architecture, and training algorithm. We study such an interplay with a precise analysis of generalization performance on both pretraining and downstream tasks in a theory friendly setup, and highlight several insights for SSL practitioners that arise from our theory.
    Causal Shift-Response Functions with Neural Networks: The Health Benefits of Lowering Air Quality Standards in the US. (arXiv:2302.02560v1 [cs.LG])
    Policymakers are required to evaluate the health benefits of reducing the National Ambient Air Quality Standards (NAAQS; i.e., the safety standards) for fine particulate matter PM 2.5 before implementing new policies. We formulate this objective as a shift-response function (SRF) and develop methods to analyze the problem using methods for causal inference, specifically under the stochastic interventions framework. SRFs model the average change in an outcome of interest resulting from a hypothetical shift in the observed exposure distribution. We propose a new broadly applicable doubly-robust method to learn SRFs using targeted regularization with neural networks. We evaluate our proposed method under various benchmarks specific for marginal estimates as a function of continuous exposure. Finally, we implement our estimator in the motivating application that considers the potential reduction in deaths from lowering the NAAQS from the current level of 12 $\mu g/m^3$ to levels that are recently proposed by the Environmental Protection Agency in the US (10, 9, and 8 $\mu g/m^3$).
    Uncertainty Calibration and its Application to Object Detection. (arXiv:2302.02622v1 [cs.CV])
    Image-based environment perception is an important component especially for driver assistance systems or autonomous driving. In this scope, modern neuronal networks are used to identify multiple objects as well as the according position and size information within a single frame. The performance of such an object detection model is important for the overall performance of the whole system. However, a detection model might also predict these objects under a certain degree of uncertainty. [...] In this work, we examine the semantic uncertainty (which object type?) as well as the spatial uncertainty (where is the object and how large is it?). We evaluate if the predicted uncertainties of an object detection model match with the observed error that is achieved on real-world data. In the first part of this work, we introduce the definition for confidence calibration of the semantic uncertainty in the context of object detection, instance segmentation, and semantic segmentation. We integrate additional position information in our examinations to evaluate the effect of the object's position on the semantic calibration properties. Besides measuring calibration, it is also possible to perform a post-hoc recalibration of semantic uncertainty that might have turned out to be miscalibrated. [...] The second part of this work deals with the spatial uncertainty obtained by a probabilistic detection model. [...] We review and extend common calibration methods so that it is possible to obtain parametric uncertainty distributions for the position information in a more flexible way. In the last part, we demonstrate a possible use-case for our derived calibration methods in the context of object tracking. [...] We integrate our previously proposed calibration techniques and demonstrate the usefulness of semantic and spatial uncertainty calibration in a subsequent process. [...]
    Interpolation for Robust Learning: Data Augmentation on Geodesics. (arXiv:2302.02092v1 [cs.LG])
    We propose to study and promote the robustness of a model as per its performance through the interpolation of training data distributions. Specifically, (1) we augment the data by finding the worst-case Wasserstein barycenter on the geodesic connecting subpopulation distributions of different categories. (2) We regularize the model for smoother performance on the continuous geodesic path connecting subpopulation distributions. (3) Additionally, we provide a theoretical guarantee of robustness improvement and investigate how the geodesic location and the sample size contribute, respectively. Experimental validations of the proposed strategy on four datasets, including CIFAR-100 and ImageNet, establish the efficacy of our method, e.g., our method improves the baselines' certifiable robustness on CIFAR10 up to $7.7\%$, with $16.8\%$ on empirical robustness on CIFAR-100. Our work provides a new perspective of model robustness through the lens of Wasserstein geodesic-based interpolation with a practical off-the-shelf strategy that can be combined with existing robust training methods.
    A Log-Linear Non-Parametric Online Changepoint Detection Algorithm based on Functional Pruning. (arXiv:2302.02718v1 [stat.ME])
    Online changepoint detection aims to detect anomalies and changes in real-time in high-frequency data streams, sometimes with limited available computational resources. This is an important task that is rooted in many real-world applications, including and not limited to cybersecurity, medicine and astrophysics. While fast and efficient online algorithms have been recently introduced, these rely on parametric assumptions which are often violated in practical applications. Motivated by data streams from the telecommunications sector, we build a flexible nonparametric approach to detect a change in the distribution of a sequence. Our procedure, NP-FOCuS, builds a sequential likelihood ratio test for a change in a set of points of the empirical cumulative density function of our data. This is achieved by keeping track of the number of observations above or below those points. Thanks to functional pruning ideas, NP-FOCuS has a computational cost that is log-linear in the number of observations and is suitable for high-frequency data streams. In terms of detection power, NP-FOCuS is seen to outperform current nonparametric online changepoint techniques in a variety of settings. We demonstrate the utility of the procedure on both simulated and real data.
    High-dimensional Location Estimation via Norm Concentration for Subgamma Vectors. (arXiv:2302.02497v1 [math.ST])
    In location estimation, we are given $n$ samples from a known distribution $f$ shifted by an unknown translation $\lambda$, and want to estimate $\lambda$ as precisely as possible. Asymptotically, the maximum likelihood estimate achieves the Cram\'er-Rao bound of error $\mathcal N(0, \frac{1}{n\mathcal I})$, where $\mathcal I$ is the Fisher information of $f$. However, the $n$ required for convergence depends on $f$, and may be arbitrarily large. We build on the theory using \emph{smoothed} estimators to bound the error for finite $n$ in terms of $\mathcal I_r$, the Fisher information of the $r$-smoothed distribution. As $n \to \infty$, $r \to 0$ at an explicit rate and this converges to the Cram\'er-Rao bound. We (1) improve the prior work for 1-dimensional $f$ to converge for constant failure probability in addition to high probability, and (2) extend the theory to high-dimensional distributions. In the process, we prove a new bound on the norm of a high-dimensional random variable whose 1-dimensional projections are subgamma, which may be of independent interest.
    Efficient Variational Bayes Learning of Graphical Models with Smooth Structural Changes. (arXiv:2009.07703v3 [stat.ML] UPDATED)
    Estimating time-varying graphical models are of paramount importance in various social, financial, biological, and engineering systems, since the evolution of such networks can be utilized for example to spot trends, detect anomalies, predict vulnerability, and evaluate the impact of interventions. Existing methods require extensive tuning of parameters that control the graph sparsity and temporal smoothness. Furthermore, these methods are computationally burdensome with time complexity $O(NP^3)$ for $P$ variables and $N$ time points. As a remedy, we propose a low-complexity tuning-free Bayesian approach, named BASS. Specifically, we impose temporally-dependent spike-and-slab priors on the graphs such that they are sparse and varying smoothly across time. A variational inference algorithm is then derived to learn the graph structures from the data automatically. Owning to the pseudo-likelihood and the mean-field approximation, the time complexity of BASS is only $O(NP^2)$. Additionally, by identifying the frequency-domain resemblance to the time-varying graphical models, we show that BASS can be extended to learning frequency-varying inverse spectral density matrices, and yields graphical models for multivariate stationary time series. Numerical results on both synthetic and real data show that that BASS can better recover the underlying true graphs, while being more efficient than the existing methods, especially for high-dimensional cases.
    Decision-Aware Conditional GANs for Time Series Data. (arXiv:2009.12682v4 [cs.LG] UPDATED)
    We introduce the decision-aware time-series conditional generative adversarial network (DAT-CGAN) as a method for time-series generation. The framework adopts a multi-Wasserstein loss on structured decision-related quantities, capturing the heterogeneity of decision-related data and providing new effectiveness in supporting the decision processes of end users. We improve sample efficiency through an overlapped block-sampling method, and provide a theoretical characterization of the generalization properties of DAT-CGAN. The framework is demonstrated on financial time series for a multi-time-step portfolio choice problem. We demonstrate better generative quality in regard to underlying data and different decision-related quantities than strong, GAN-based baselines.
    Strong Consistency and Rate of Convergence of Switched Least Squares System Identification for Autonomous Markov Jump Linear Systems. (arXiv:2112.10753v2 [cs.LG] UPDATED)
    In this paper, we investigate the problem of system identification for autonomous Markov jump linear systems (MJS) with complete state observations. We propose switched least squares method for identification of MJS, show that this method is strongly consistent, and derive data-dependent and data-independent rates of convergence. In particular, our data-independent rate of convergence shows that, almost surely, the system identification error is $\mathcal{O}\big(\sqrt{\log(T)/T} \big)$ where $T$ is the time horizon. These results show that switched least squares method for MJS has the same rate of convergence as least squares method for autonomous linear systems. We derive our results by imposing a general stability assumption on the model called stability in the average sense. We show that stability in the average sense is a weaker form of stability compared to the stability assumptions commonly imposed in the literature. We present numerical examples to illustrate the performance of the proposed method.
    Pre-screening breast cancer with machine learning and deep learning. (arXiv:2302.02406v1 [stat.ML])
    We suggest that deep learning can be used for pre-screening cancer by analyzing demographic and anthropometric information of patients, as well as biological markers obtained from routine blood samples and relative risks obtained from meta-analysis and international databases. We applied feature selection algorithms to a database of 116 women, including 52 healthy women and 64 women diagnosed with breast cancer, to identify the best pre-screening predictors of cancer. We utilized the best predictors to perform k-fold Monte Carlo cross-validation experiments that compare deep learning against traditional machine learning algorithms. Our results indicate that a deep learning model with an input-layer architecture that is fine-tuned using feature selection can effectively distinguish between patients with and without cancer. Additionally, compared to machine learning, deep learning has the lowest uncertainty in its predictions. These findings suggest that deep learning algorithms applied to cancer pre-screening offer a radiation-free, non-invasive, and affordable complement to screening methods based on imagery. The implementation of deep learning algorithms in cancer pre-screening offer opportunities to identify individuals who may require imaging-based screening, can encourage self-examination, and decrease the psychological externalities associated with false positives in cancer screening. The integration of deep learning algorithms for both screening and pre-screening will ultimately lead to earlier detection of malignancy, reducing the healthcare and societal burden associated to cancer treatment.
    Parallelizing Contextual Bandits. (arXiv:2105.10590v2 [stat.ML] UPDATED)
    Standard approaches to decision-making under uncertainty focus on sequential exploration of the space of decisions. However, \textit{simultaneously} proposing a batch of decisions, which leverages available resources for parallel experimentation, has the potential to rapidly accelerate exploration. We present a family of (parallel) contextual bandit algorithms applicable to problems with bounded eluder dimension whose regret is nearly identical to their perfectly sequential counterparts -- given access to the same total number of oracle queries -- up to a lower-order ``burn-in" term. We further show these algorithms can be specialized to the class of linear reward functions where we introduce and analyze several new linear bandit algorithms which explicitly introduce diversity into their action selection. Finally, we also present an empirical evaluation of these parallel algorithms in several domains, including materials discovery and biological sequence design problems, to demonstrate the utility of parallelized bandits in practical settings.
    Learning Variational Models with Unrolling and Bilevel Optimization. (arXiv:2209.12651v3 [stat.ML] UPDATED)
    In this paper we consider the problem learning of variational models in the context of supervised learning via risk minimization. Our goal is to provide a deeper understanding of the two approaches of learning of variational models via bilevel optimization and via algorithm unrolling. The former considers the variational model as a lower level optimization problem below the risk minimization problem, while the latter replaces the lower level optimization problem by an algorithm that solves said problem approximately. Both approaches are used in practice, but, unrolling is much simpler from a computational point of view. To analyze and compare the two approaches, we consider a simple toy model, and compute all risks and the respective estimators explicitly. We show that unrolling can be better than the bilevel optimization approach, but also that the performance of unrolling can depend significantly on further parameters, sometimes in unexpected ways: While the stepsize of the unrolled algorithm matters a lot, the number of unrolled iterations only matters if the number is even or odd, and these two cases are notably different.
    Reinforcement Learning in Low-Rank MDPs with Density Features. (arXiv:2302.02252v1 [cs.LG])
    MDPs with low-rank transitions -- that is, the transition matrix can be factored into the product of two matrices, left and right -- is a highly representative structure that enables tractable learning. The left matrix enables expressive function approximation for value-based learning and has been studied extensively. In this work, we instead investigate sample-efficient learning with density features, i.e., the right matrix, which induce powerful models for state-occupancy distributions. This setting not only sheds light on leveraging unsupervised learning in RL, but also enables plug-in solutions for convex RL. In the offline setting, we propose an algorithm for off-policy estimation of occupancies that can handle non-exploratory data. Using this as a subroutine, we further devise an online algorithm that constructs exploratory data distributions in a level-by-level manner. As a central technical challenge, the additive error of occupancy estimation is incompatible with the multiplicative definition of data coverage. In the absence of strong assumptions like reachability, this incompatibility easily leads to exponential error blow-up, which we overcome via novel technical tools. Our results also readily extend to the representation learning setting, when the density features are unknown and must be learned from an exponentially large candidate set.
    Sequential change detection via backward confidence sequences. (arXiv:2302.02544v1 [math.ST])
    We present a simple reduction from sequential estimation to sequential changepoint detection (SCD). In short, suppose we are interested in detecting changepoints in some parameter or functional $\theta$ of the underlying distribution. We demonstrate that if we can construct a confidence sequence (CS) for $\theta$, then we can also successfully perform SCD for $\theta$. This is accomplished by checking if two CSs -- one forwards and the other backwards -- ever fail to intersect. Since the literature on CSs has been rapidly evolving recently, the reduction provided in this paper immediately solves several old and new change detection problems. Further, our "backward CS", constructed by reversing time, is new and potentially of independent interest. We provide strong nonasymptotic guarantees on the frequency of false alarms and detection delay, and demonstrate numerical effectiveness on several problems.
    A Characteristic Function for Shapley-Value-Based\\Attribution of Anomaly Scores. (arXiv:2004.04464v2 [cs.LG] UPDATED)
    In anomaly detection, the degree of irregularity is often summarized as a real-valued anomaly score. We address the problem of attributing such anomaly scores to input features for interpreting the results of anomaly detection. We particularly investigate the use of the Shapley value for attributing anomaly scores of semi-supervised detection methods. We propose a characteristic function specifically designed for attributing anomaly scores. The idea is to approximate the absence of some features by locally minimizing the anomaly score with regard to the to-be-absent features. We examine the applicability of the proposed characteristic function and other general approaches for interpreting anomaly scores on multiple datasets and multiple anomaly detection methods. The results indicate the potential utility of the attribution methods including the proposed one.
    On Private and Robust Bandits. (arXiv:2302.02526v1 [cs.LG])
    We study private and robust multi-armed bandits (MABs), where the agent receives Huber's contaminated heavy-tailed rewards and meanwhile needs to ensure differential privacy. We first present its minimax lower bound, characterizing the information-theoretic limit of regret with respect to privacy budget, contamination level and heavy-tailedness. Then, we propose a meta-algorithm that builds on a private and robust mean estimation sub-routine \texttt{PRM} that essentially relies on reward truncation and the Laplace mechanism only. For two different heavy-tailed settings, we give specific schemes of \texttt{PRM}, which enable us to achieve nearly-optimal regret. As by-products of our main results, we also give the first minimax lower bound for private heavy-tailed MABs (i.e., without contamination). Moreover, our two proposed truncation-based \texttt{PRM} achieve the optimal trade-off between estimation accuracy, privacy and robustness. Finally, we support our theoretical results with experimental studies.
    Random Forests for time-fixed and time-dependent predictors: The DynForest R package. (arXiv:2302.02670v1 [stat.ML])
    The R package DynForest implements random forests for predicting a categorical or a (multiple causes) time-to-event outcome based on time-fixed and time-dependent predictors. Through the random forests, the time-dependent predictors can be measured with error at subject-specific times, and they can be endogeneous (i.e., impacted by the outcome process). They are modeled internally using flexible linear mixed models (thanks to lcmm package) with time-associations pre-specified by the user. DynForest computes dynamic predictions that take into account all the information from time-fixed and time-dependent predictors. DynForest also provides information about the most predictive variables using variable importance and minimal depth. Variable importance can also be computed on groups of variables. To display the results, several functions are available such as summary and plot functions. This paper aims to guide the user with a step-by-step example of the different functions for fitting random forests within DynForest.
    Individual Privacy Accounting for Differentially Private Stochastic Gradient Descent. (arXiv:2206.02617v5 [cs.LG] UPDATED)
    Differentially private stochastic gradient descent (DP-SGD) is the workhorse algorithm for recent advances in private deep learning. It provides a single privacy guarantee to all datapoints in the dataset. We propose output-specific $(\varepsilon,\delta)$-DP to characterize privacy guarantees for individual examples when releasing models trained by DP-SGD. We also design an efficient algorithm to investigate individual privacy across a number of datasets. We find that most examples enjoy stronger privacy guarantees than the worst-case bound. We further discover that the training loss and the privacy parameter of an example are well-correlated. This implies groups that are underserved in terms of model utility simultaneously experience weaker privacy guarantees. For example, on CIFAR-10, the average $\varepsilon$ of the class with the lowest test accuracy is 44.2% higher than that of the class with the highest accuracy.
    RLSbench: Domain Adaptation Under Relaxed Label Shift. (arXiv:2302.03020v1 [cs.LG])
    Despite the emergence of principled methods for domain adaptation under label shift, the sensitivity of these methods for minor shifts in the class conditional distributions remains precariously under explored. Meanwhile, popular deep domain adaptation heuristics tend to falter when faced with shifts in label proportions. While several papers attempt to adapt these heuristics to accommodate shifts in label proportions, inconsistencies in evaluation criteria, datasets, and baselines, make it hard to assess the state of the art. In this paper, we introduce RLSbench, a large-scale relaxed label shift benchmark, consisting of >500 distribution shift pairs that draw on 14 datasets across vision, tabular, and language modalities and compose them with varying label proportions. First, we evaluate 13 popular domain adaptation methods, demonstrating more widespread failures under label proportion shifts than were previously known. Next, we develop an effective two-step meta-algorithm that is compatible with most deep domain adaptation heuristics: (i) pseudo-balance the data at each epoch; and (ii) adjust the final classifier with (an estimate of) target label distribution. The meta-algorithm improves existing domain adaptation heuristics often by 2--10\% accuracy points under extreme label proportion shifts and has little (i.e., <0.5\%) effect when label proportions do not shift. We hope that these findings and the availability of RLSbench will encourage researchers to rigorously evaluate proposed methods in relaxed label shift settings. Code is publicly available at https://github.com/acmi-lab/RLSbench.
    Noise-cleaning the precision matrix of fMRI time series. (arXiv:2302.02951v1 [cond-mat.stat-mech])
    We present a comparison between various algorithms of inference of covariance and precision matrices in small datasets of real vectors, of the typical length and dimension of human brain activity time series retrieved by functional Magnetic Resonance Imaging (fMRI). Assuming a Gaussian model underlying the neural activity, the problem consists in denoising the empirically observed matrices in order to obtain a better estimator of the true precision and covariance matrices. We consider several standard noise-cleaning algorithms and compare them on two types of datasets. The first type are time series of fMRI brain activity of human subjects at rest. The second type are synthetic time series sampled from a generative Gaussian model of which we can vary the fraction of dimensions per sample q = N/T and the strength of off-diagonal correlations. The reliability of each algorithm is assessed in terms of test-set likelihood and, in the case of synthetic data, of the distance from the true precision matrix. We observe that the so called Optimal Rotationally Invariant Estimator, based on Random Matrix Theory, leads to a significantly lower distance from the true precision matrix in synthetic data, and higher test likelihood in natural fMRI data. We propose a variant of the Optimal Rotationally Invariant Estimator in which one of its parameters is optimised by cross-validation. In the severe undersampling regime (large q) typical of fMRI series, it outperforms all the other estimators. We furthermore propose a simple algorithm based on an iterative likelihood gradient ascent, providing an accurate estimation for weakly correlated datasets.
    Decentralized Differentially Private Without-Replacement Stochastic Gradient Descent. (arXiv:1809.02727v4 [cs.LG] UPDATED)
    While machine learning has achieved remarkable results in a wide variety of domains, the training of models often requires large datasets that may need to be collected from different individuals. As sensitive information may be contained in the individual's dataset, sharing training data may lead to severe privacy concerns. Therefore, there is a compelling need to develop privacy-aware machine learning methods, for which one effective approach is to leverage the generic framework of differential privacy. Considering that stochastic gradient descent (SGD) is one of the most commonly adopted methods for large-scale machine learning problems, a decentralized differentially private SGD algorithm is proposed in this work. Particularly, we focus on SGD without replacement due to its favorable structure for practical implementation. Both privacy and convergence analysis are provided for the proposed algorithm. Finally, extensive experiments are performed to demonstrate the effectiveness of the proposed method.
    Censored Quantile Regression Neural Networks for Distribution-Free Survival Analysis. (arXiv:2205.13496v4 [stat.ML] UPDATED)
    This paper considers doing quantile regression on censored data using neural networks (NNs). This adds to the survival analysis toolkit by allowing direct prediction of the target variable, along with a distribution-free characterisation of uncertainty, using a flexible function approximator. We begin by showing how an algorithm popular in linear models can be applied to NNs. However, the resulting procedure is inefficient, requiring sequential optimisation of an individual NN at each desired quantile. Our major contribution is a novel algorithm that simultaneously optimises a grid of quantiles output by a single NN. To offer theoretical insight into our algorithm, we show firstly that it can be interpreted as a form of expectation-maximisation, and secondly that it exhibits a desirable `self-correcting' property. Experimentally, the algorithm produces quantiles that are better calibrated than existing methods on 10 out of 12 real datasets.
    The Dual PC Algorithm and the Role of Gaussianity for Structure Learning of Bayesian Networks. (arXiv:2112.09036v4 [stat.ML] UPDATED)
    Learning the graphical structure of Bayesian networks is key to describing data-generating mechanisms in many complex applications but poses considerable computational challenges. Observational data can only identify the equivalence class of the directed acyclic graph underlying a Bayesian network model, and a variety of methods exist to tackle the problem. Under certain assumptions, the popular PC algorithm can consistently recover the correct equivalence class by reverse-engineering the conditional independence (CI) relationships holding in the variable distribution. The dual PC algorithm is a novel scheme to carry out the CI tests within the PC algorithm by leveraging the inverse relationship between covariance and precision matrices. By exploiting block matrix inversions we can simultaneously perform tests on partial correlations of complementary (or dual) conditioning sets. The multiple CI tests of the dual PC algorithm proceed by first considering marginal and full-order CI relationships and progressively moving to central-order ones. Simulation studies show that the dual PC algorithm outperforms the classic PC algorithm both in terms of run time and in recovering the underlying network structure, even in the presence of deviations from Gaussianity. Additionally, we show that the dual PC algorithm applies for Gaussian copula models, and demonstrate its performance in that setting.
    Side Effects of Learning from Low-dimensional Data Embedded in a Euclidean Space. (arXiv:2203.00614v5 [cs.LG] UPDATED)
    The low-dimensional manifold hypothesis posits that the data found in many applications, such as those involving natural images, lie (approximately) on low-dimensional manifolds embedded in a high-dimensional Euclidean space. In this setting, a typical neural network defines a function that takes a finite number of vectors in the embedding space as input. However, one often needs to consider evaluating the optimized network at points outside the training distribution. This paper considers the case in which the training data is distributed in a linear subspace of $\mathbb R^d$. We derive estimates on the variation of the learning function, defined by a neural network, in the direction transversal to the subspace. We study the potential regularization effects associated with the network's depth and noise in the codimension of the data manifold. We also present additional side effects in training due to the presence of noise.
    Probabilistic Contrastive Learning Recovers the Correct Aleatoric Uncertainty of Ambiguous Inputs. (arXiv:2302.02865v1 [cs.LG])
    Contrastively trained encoders have recently been proven to invert the data-generating process: they encode each input, e.g., an image, into the true latent vector that generated the image (Zimmermann et al., 2021). However, real-world observations often have inherent ambiguities. For instance, images may be blurred or only show a 2D view of a 3D object, so multiple latents could have generated them. This makes the true posterior for the latent vector probabilistic with heteroscedastic uncertainty. In this setup, we extend the common InfoNCE objective and encoders to predict latent distributions instead of points. We prove that these distributions recover the correct posteriors of the data-generating process, including its level of aleatoric uncertainty, up to a rotation of the latent space. In addition to providing calibrated uncertainty estimates, these posteriors allow the computation of credible intervals in image retrieval. They comprise images with the same latent as a given query, subject to its uncertainty.
    The Missing Indicator Method: From Low to High Dimensions. (arXiv:2211.09259v2 [cs.LG] UPDATED)
    Missing data is common in applied data science, particularly for tabular data sets found in healthcare, social sciences, and natural sciences. Most supervised learning methods only work on complete data, thus requiring preprocessing such as missing value imputation to work on incomplete data sets. However, imputation alone does not encode useful information about the missing values themselves. For data sets with informative missing patterns, the Missing Indicator Method (MIM), which adds indicator variables to indicate the missing pattern, can be used in conjunction with imputation to improve model performance. While commonly used in data science, MIM is surprisingly understudied from an empirical and especially theoretical perspective. In this paper, we show empirically and theoretically that MIM improves performance for informative missing values, and we prove that MIM does not hurt linear models asymptotically for uninformative missing values. Additionally, we find that for high-dimensional data sets with many uninformative indicators, MIM can induce model overfitting and thus test performance. To address this issue, we introduce Selective MIM (SMIM), a novel MIM extension that adds missing indicators only for features that have informative missing patterns. We show empirically that SMIM performs at least as well as MIM in general, and improves MIM for high-dimensional data. Lastly, to demonstrate the utility of MIM on real-world data science tasks, we demonstrate the effectiveness of MIM and SMIM on clinical tasks generated from the MIMIC-III database of electronic health records.
    Deep Latent State Space Models for Time-Series Generation. (arXiv:2212.12749v3 [stat.ML] UPDATED)
    Methods based on ordinary differential equations (ODEs) are widely used to build generative models of time-series. In addition to high computational overhead due to explicitly computing hidden states recurrence, existing ODE-based models fall short in learning sequence data with sharp transitions - common in many real-world systems - due to numerical challenges during optimization. In this work, we propose LS4, a generative model for sequences with latent variables evolving according to a state space ODE to increase modeling capacity. Inspired by recent deep state space models (S4), we achieve speedups by leveraging a convolutional representation of LS4 which bypasses the explicit evaluation of hidden states. We show that LS4 significantly outperforms previous continuous-time generative models in terms of marginal distribution, classification, and prediction scores on real-world datasets in the Monash Forecasting Repository, and is capable of modeling highly stochastic data with sharp temporal transitions. LS4 sets state-of-the-art for continuous-time latent generative models, with significant improvement of mean squared error and tighter variational lower bounds on irregularly-sampled datasets, while also being x100 faster than other baselines on long sequences.
    Online Learning via Offline Greedy Algorithms: Applications in Market Design and Optimization. (arXiv:2102.11050v4 [cs.LG] UPDATED)
    Motivated by online decision-making in time-varying combinatorial environments, we study the problem of transforming offline algorithms to their online counterparts. We focus on offline combinatorial problems that are amenable to a constant factor approximation using a greedy algorithm that is robust to local errors. For such problems, we provide a general framework that efficiently transforms offline robust greedy algorithms to online ones using Blackwell approachability. We show that the resulting online algorithms have $O(\sqrt{T})$ (approximate) regret under the full information setting. We further introduce a bandit extension of Blackwell approachability that we call Bandit Blackwell approachability. We leverage this notion to transform greedy robust offline algorithms into a $O(T^{2/3})$ (approximate) regret in the bandit setting. Demonstrating the flexibility of our framework, we apply our offline-to-online transformation to several problems at the intersection of revenue management, market design, and online optimization, including product ranking optimization in online platforms, reserve price optimization in auctions, and submodular maximization. We also extend our reduction to greedy-like first order methods used in continuous optimization, such as those used for maximizing continuous strong DR monotone submodular functions subject to convex constraints. We show that our transformation, when applied to these applications, leads to new regret bounds or improves the current known bounds. We complement our theoretical studies by conducting numerical simulations for two of our applications, in both of which we observe that the numerical performance of our transformations outperforms the theoretical guarantees in practical instances.
    First steps towards quantum machine learning applied to the classification of event-related potentials. (arXiv:2302.02648v1 [cs.HC])
    Low information transfer rate is a major bottleneck for brain-computer interfaces based on non-invasive electroencephalography (EEG) for clinical applications. This led to the development of more robust and accurate classifiers. In this study, we investigate the performance of quantum-enhanced support vector classifier (QSVC). Training (predicting) balanced accuracy of QSVC was 83.17 (50.25) %. This result shows that the classifier was able to learn from EEG data, but that more research is required to obtain higher predicting accuracy. This could be achieved by a better configuration of the classifier, such as increasing the number of shots.
    Adversarial Bandits with Knapsacks. (arXiv:1811.11881v10 [cs.DS] UPDATED)
    We consider Bandits with Knapsacks (henceforth, BwK), a general model for multi-armed bandits under supply/budget constraints. In particular, a bandit algorithm needs to solve a well-known knapsack problem: find an optimal packing of items into a limited-size knapsack. The BwK problem is a common generalization of numerous motivating examples, which range from dynamic pricing to repeated auctions to dynamic ad allocation to network routing and scheduling. While the prior work on BwK focused on the stochastic version, we pioneer the other extreme in which the outcomes can be chosen adversarially. This is a considerably harder problem, compared to both the stochastic version and the "classic" adversarial bandits, in that regret minimization is no longer feasible. Instead, the objective is to minimize the competitive ratio: the ratio of the benchmark reward to the algorithm's reward. We design an algorithm with competitive ratio O(log T) relative to the best fixed distribution over actions, where T is the time horizon; we also prove a matching lower bound. The key conceptual contribution is a new perspective on the stochastic version of the problem. We suggest a new algorithm for the stochastic version, which builds on the framework of regret minimization in repeated games and admits a substantially simpler analysis compared to prior work. We then analyze this algorithm for the adversarial version and use it as a subroutine to solve the latter.
    Sparse GCA and Thresholded Gradient Descent. (arXiv:2107.00371v2 [stat.ML] UPDATED)
    Generalized correlation analysis (GCA) is concerned with uncovering linear relationships across multiple datasets. It generalizes canonical correlation analysis that is designed for two datasets. We study sparse GCA when there are potentially multiple generalized correlation tuples in data and the loading matrix has a small number of nonzero rows. It includes sparse CCA and sparse PCA of correlation matrices as special cases. We first formulate sparse GCA as generalized eigenvalue problems at both population and sample levels via a careful choice of normalization constraints. Based on a Lagrangian form of the sample optimization problem, we propose a thresholded gradient descent algorithm for estimating GCA loading vectors and matrices in high dimensions. We derive tight estimation error bounds for estimators generated by the algorithm with proper initialization. We also demonstrate the prowess of the algorithm on a number of synthetic datasets.
    Support Vector Regression: Risk Quadrangle Framework. (arXiv:2212.09178v3 [stat.ML] UPDATED)
    This paper investigates Support Vector Regression (SVR) in the context of the fundamental risk quadrangle paradigm. It is shown that both formulations of SVR, $\varepsilon$-SVR and $\nu$-SVR, correspond to the minimization of equivalent regular error measures (Vapnik error and superquantile (CVaR) norm, respectively) with a regularization penalty. These error measures, in turn, give rise to corresponding risk quadrangles. By constructing the fundamental risk quadrangle, which corresponds to SVR, we show that SVR is the asymptotically unbiased estimator of the average of two symmetric conditional quantiles. Furthermore, the technique used for the construction of quadrangles serves as a powerful tool in proving the equivalence between $\varepsilon$-SVR and $\nu$-SVR. Additionally, SVR is formulated as a regular deviation minimization problem with a regularization penalty by invoking Error Shaping Decomposition of Regression and the dual formulation of SVR in the risk quadrangle framework is derived.
    Random Forest Weighted Local Fr\'echet Regression. (arXiv:2202.04912v2 [stat.ML] UPDATED)
    Statistical analysis is increasingly confronted with complex data from metric spaces. Petersen and M\"uller (2019) established a general paradigm of Fr\'echet regression with complex metric space valued responses and Euclidean predictors. However, the local approach therein involves nonparametric kernel smoothing and suffers from the curse of dimensionality. To address this issue, we in this paper propose a novel random forest weighted local Fr\'echet regression paradigm. The main mechanism of our approach relies on a locally adaptive kernel generated by random forests. Our first method utilizes these weights as the local average to solve the conditional Fr\'echet mean, while the second method performs local linear Fr\'echet regression, both significantly improving existing Fr\'echet regression methods. Based on the theory of infinite order U-processes and infinite order Mmn -estimator, we establish the consistency, rate of convergence, and asymptotic normality for our local constant estimator, which covers the current large sample theory of random forests with Euclidean responses as a special case. Numerical studies show the superiority of our methods with several commonly encountered types of responses such as distribution functions, symmetric positive-definite matrices, and sphere data. The practical merits of our proposals are also demonstrated through the application to human mortality distribution data.
    Improved Regret Analysis for Variance-Adaptive Linear Bandits and Horizon-Free Linear Mixture MDPs. (arXiv:2111.03289v4 [stat.ML] UPDATED)
    In online learning problems, exploiting low variance plays an important role in obtaining tight performance guarantees yet is challenging because variances are often not known a priori. Recently, considerable progress has been made by Zhang et al. (2021) where they obtain a variance-adaptive regret bound for linear bandits without knowledge of the variances and a horizon-free regret bound for linear mixture Markov decision processes (MDPs). In this paper, we present novel analyses that improve their regret bounds significantly. For linear bandits, we achieve $\tilde O(\min\{d\sqrt{K}, d^{1.5}\sqrt{\sum_{k=1}^K \sigma_k^2}\} + d^2)$ where $d$ is the dimension of the features, $K$ is the time horizon, and $\sigma_k^2$ is the noise variance at time step $k$, and $\tilde O$ ignores polylogarithmic dependence, which is a factor of $d^3$ improvement. For linear mixture MDPs with the assumption of maximum cumulative reward in an episode being in $[0,1]$, we achieve a horizon-free regret bound of $\tilde O(d \sqrt{K} + d^2)$ where $d$ is the number of base models and $K$ is the number of episodes. This is a factor of $d^{3.5}$ improvement in the leading term and $d^7$ in the lower order term. Our analysis critically relies on a novel peeling-based regret analysis that leverages the elliptical potential `count' lemma.
    A System for Morphology-Task Generalization via Unified Representation and Behavior Distillation. (arXiv:2211.14296v2 [cs.LG] UPDATED)
    The rise of generalist large-scale models in natural language and vision has made us expect that a massive data-driven approach could achieve broader generalization in other domains such as continuous control. In this work, we explore a method for learning a single policy that manipulates various forms of agents to solve various tasks by distilling a large amount of proficient behavioral data. In order to align input-output (IO) interface among multiple tasks and diverse agent morphologies while preserving essential 3D geometric relations, we introduce morphology-task graph, which treats observations, actions and goals/task in a unified graph representation. We also develop MxT-Bench for fast large-scale behavior generation, which supports procedural generation of diverse morphology-task combinations with a minimal blueprint and hardware-accelerated simulator. Through efficient representation and architecture selection on MxT-Bench, we find out that a morphology-task graph representation coupled with Transformer architecture improves the multi-task performances compared to other baselines including recent discrete tokenization, and provides better prior knowledge for zero-shot transfer or sample efficiency in downstream multi-task imitation learning. Our work suggests large diverse offline datasets, unified IO representation, and policy representation and architecture selection through supervised learning form a promising approach for studying and advancing morphology-task generalization.
    SE(3) diffusion model with application to protein backbone generation. (arXiv:2302.02277v1 [cs.LG])
    The design of novel protein structures remains a challenge in protein engineering for applications across biomedicine and chemistry. In this line of work, a diffusion model over rigid bodies in 3D (referred to as frames) has shown success in generating novel, functional protein backbones that have not been observed in nature. However, there exists no principled methodological framework for diffusion on SE(3), the space of orientation preserving rigid motions in R3, that operates on frames and confers the group invariance. We address these shortcomings by developing theoretical foundations of SE(3) invariant diffusion models on multiple frames followed by a novel framework, FrameDiff, for learning the SE(3) equivariant score over multiple frames. We apply FrameDiff on monomer backbone generation and find it can generate designable monomers up to 500 amino acids without relying on a pretrained protein structure prediction network that has been integral to previous methods. We find our samples are capable of generalizing beyond any known protein structure.
    Direct Uncertainty Quantification. (arXiv:2302.02420v1 [cs.LG])
    Traditional neural networks are simple to train but they produce overconfident predictions, while Bayesian neural networks provide good uncertainty quantification but optimizing them is time consuming. This paper introduces a new approach, direct uncertainty quantification (DirectUQ), that combines their advantages where the neural network directly models uncertainty in output space, and captures both aleatoric and epistemic uncertainty. DirectUQ can be derived as an alternative variational lower bound, and hence benefits from collapsed variational inference that provides improved regularizers. On the other hand, like non-probabilistic models, DirectUQ enjoys simple training and one can use Rademacher complexity to provide risk bounds for the model. Experiments show that DirectUQ and ensembles of DirectUQ provide a good tradeoff in terms of run time and uncertainty quantification, especially for out of distribution data.
    Multi-Center Federated Learning: Clients Clustering for Better Personalization. (arXiv:2005.01026v3 [cs.LG] UPDATED)
    Federated learning has received great attention for its capability to train a large-scale model in a decentralized manner without needing to access user data directly. It helps protect the users' private data from centralized collecting. Unlike distributed machine learning, federated learning aims to tackle non-IID data from heterogeneous sources in various real-world applications, such as those on smartphones. Existing federated learning approaches usually adopt a single global model to capture the shared knowledge of all users by aggregating their gradients, regardless of the discrepancy between their data distributions. However, due to the diverse nature of user behaviors, assigning users' gradients to different global models (i.e., centers) can better capture the heterogeneity of data distributions across users. Our paper proposes a novel multi-center aggregation mechanism for federated learning, which learns multiple global models from the non-IID user data and simultaneously derives the optimal matching between users and centers. We formulate the problem as a joint optimization that can be efficiently solved by a stochastic expectation maximization (EM) algorithm. Our experimental results on benchmark datasets show that our method outperforms several popular federated learning methods.
    A Permutation-free Kernel Two-Sample Test. (arXiv:2211.14908v2 [stat.ME] UPDATED)
    The kernel Maximum Mean Discrepancy~(MMD) is a popular multivariate distance metric between distributions that has found utility in two-sample testing. The usual kernel-MMD test statistic is a degenerate U-statistic under the null, and thus it has an intractable limiting distribution. Hence, to design a level-$\alpha$ test, one usually selects the rejection threshold as the $(1-\alpha)$-quantile of the permutation distribution. The resulting nonparametric test has finite-sample validity but suffers from large computational cost, since every permutation takes quadratic time. We propose the cross-MMD, a new quadratic-time MMD test statistic based on sample-splitting and studentization. We prove that under mild assumptions, the cross-MMD has a limiting standard Gaussian distribution under the null. Importantly, we also show that the resulting test is consistent against any fixed alternative, and when using the Gaussian kernel, it has minimax rate-optimal power against local alternatives. For large sample sizes, our new cross-MMD provides a significant speedup over the MMD, for only a slight loss in power.  ( 2 min )
    Improving Fair Training under Correlation Shifts. (arXiv:2302.02323v1 [cs.LG])
    Model fairness is an essential element for Trustworthy AI. While many techniques for model fairness have been proposed, most of them assume that the training and deployment data distributions are identical, which is often not true in practice. In particular, when the bias between labels and sensitive groups changes, the fairness of the trained model is directly influenced and can worsen. We make two contributions for solving this problem. First, we analytically show that existing in-processing fair algorithms have fundamental limits in accuracy and group fairness. We introduce the notion of correlation shifts, which can explicitly capture the change of the above bias. Second, we propose a novel pre-processing step that samples the input data to reduce correlation shifts and thus enables the in-processing approaches to overcome their limitations. We formulate an optimization problem for adjusting the data ratio among labels and sensitive groups to reflect the shifted correlation. A key benefit of our approach lies in decoupling the roles of pre- and in-processing approaches: correlation adjustment via pre-processing and unfairness mitigation on the processed data via in-processing. Experiments show that our framework effectively improves existing in-processing fair algorithms w.r.t. accuracy and fairness, both on synthetic and real datasets.  ( 2 min )
    Bayesian Fixed-Budget Best-Arm Identification. (arXiv:2211.08572v2 [cs.LG] UPDATED)
    Fixed-budget best-arm identification (BAI) is a bandit problem where the agent maximizes the probability of identifying the optimal arm within a fixed budget of observations. In this work, we study this problem in the Bayesian setting. We propose a Bayesian elimination algorithm and derive an upper bound on its probability of misidentifying the optimal arm. The bound reflects the quality of the prior and is the first distribution-dependent bound in this setting. We prove it using a frequentist-like argument, where we carry the prior through, and then integrate out the bandit instance at the end. We also provide the first lower bound on the probability of misidentification in a $2$-armed Bayesian bandit and show that our upper bound (almost) matches the lower bound. Our experiments show that Bayesian elimination is superior to frequentist methods and competitive with the state-of-the-art Bayesian algorithms that have no guarantees in our setting.  ( 2 min )
    U-Clip: On-Average Unbiased Stochastic Gradient Clipping. (arXiv:2302.02971v1 [cs.LG])
    U-Clip is a simple amendment to gradient clipping that can be applied to any iterative gradient optimization algorithm. Like regular clipping, U-Clip involves using gradients that are clipped to a prescribed size (e.g. with component wise or norm based clipping) but instead of discarding the clipped portion of the gradient, U-Clip maintains a buffer of these values that is added to the gradients on the next iteration (before clipping). We show that the cumulative bias of the U-Clip updates is bounded by a constant. This implies that the clipped updates are unbiased on average. Convergence follows via a lemma that guarantees convergence with updates $u_i$ as long as $\sum_{i=1}^t (u_i - g_i) = o(t)$ where $g_i$ are the gradients. Extensive experimental exploration is performed on CIFAR10 with further validation given on ImageNet.  ( 2 min )
    Adapting to Continuous Covariate Shift via Online Density Ratio Estimation. (arXiv:2302.02552v1 [cs.LG])
    Dealing with distribution shifts is one of the central challenges for modern machine learning. One fundamental situation is the \emph{covariate shift}, where the input distributions of data change from training to testing stages while the input-conditional output distribution remains unchanged. In this paper, we initiate the study of a more challenging scenario -- \emph{continuous} covariate shift -- in which the test data appear sequentially, and their distributions can shift continuously. Our goal is to adaptively train the predictor such that its prediction risk accumulated over time can be minimized. Starting with the importance-weighted learning, we show the method works effectively if the time-varying density ratios of test and train inputs can be accurately estimated. However, existing density ratio estimation methods would fail due to data scarcity at each time step. To this end, we propose an online method that can appropriately reuse historical information. Our density ratio estimation method is proven to perform well by enjoying a dynamic regret bound, which finally leads to an excess risk guarantee for the predictor. Empirical results also validate the effectiveness.  ( 2 min )
    Reinforcement Learning with History-Dependent Dynamic Contexts. (arXiv:2302.02061v1 [cs.LG])
    We introduce Dynamic Contextual Markov Decision Processes (DCMDPs), a novel reinforcement learning framework for history-dependent environments that generalizes the contextual MDP framework to handle non-Markov environments, where contexts change over time. We consider special cases of the model, with a focus on logistic DCMDPs, which break the exponential dependence on history length by leveraging aggregation functions to determine context transitions. This special structure allows us to derive an upper-confidence-bound style algorithm for which we establish regret bounds. Motivated by our theoretical results, we introduce a practical model-based algorithm for logistic DCMDPs that plans in a latent space and uses optimism over history-dependent features. We demonstrate the efficacy of our approach on a recommendation task (using MovieLens data) where user behavior dynamics evolve in response to recommendations.  ( 2 min )
    ODEWS: The Overdraft Early Warning System. (arXiv:2302.02455v1 [stat.ML])
    When a customer overdraws their account and their balance is negative they are assessed an overdraft fee. Americans pay approximately \$15 billion in unnecessary overdraft fees a year, often in \$35 increments; users of the Mint personal finance app pay approximately \$250 million in fees a year in particular. These overdraft fees are an excessive financial burden and lead to cascading overdraft fees trapping customers in financial hardship. To address this problem, we have created an ML-driven overdraft early warning system (ODEWS) that assesses a customer's risk of overdrafting within the next week using their banking and transaction data in the Mint app. At-risk customers are sent an alert so they can take steps to avoid the fee, ultimately changing their behavior and financial habits. The system deployed resulted in a \$3 million savings in overdraft fees for Mint customers compared to a control group. Moreover, the methodology outlined here can be generalized to provide ML-driven personalized financial advice for many different personal finance goals--increase credit score, build emergency savings fund, pay down debut, allocate capital for investment.  ( 2 min )
    Hierarchical Sliced Wasserstein Distance. (arXiv:2209.13570v5 [stat.ML] UPDATED)
    Sliced Wasserstein (SW) distance has been widely used in different application scenarios since it can be scaled to a large number of supports without suffering from the curse of dimensionality. The value of sliced Wasserstein distance is the average of transportation cost between one-dimensional representations (projections) of original measures that are obtained by Radon Transform (RT). Despite its efficiency in the number of supports, estimating the sliced Wasserstein requires a relatively large number of projections in high-dimensional settings. Therefore, for applications where the number of supports is relatively small compared with the dimension, e.g., several deep learning applications where the mini-batch approaches are utilized, the complexities from matrix multiplication of Radon Transform become the main computational bottleneck. To address this issue, we propose to derive projections by linearly and randomly combining a smaller number of projections which are named bottleneck projections. We explain the usage of these projections by introducing Hierarchical Radon Transform (HRT) which is constructed by applying Radon Transform variants recursively. We then formulate the approach into a new metric between measures, named Hierarchical Sliced Wasserstein (HSW) distance. By proving the injectivity of HRT, we derive the metricity of HSW. Moreover, we investigate the theoretical properties of HSW including its connection to SW variants and its computational and sample complexities. Finally, we compare the computational cost and generative quality of HSW with the conventional SW on the task of deep generative modeling using various benchmark datasets including CIFAR10, CelebA, and Tiny ImageNet.  ( 2 min )
    TAP: The Attention Patch for Cross-Modal Knowledge Transfer from Unlabeled Data. (arXiv:2302.02224v1 [cs.LG])
    This work investigates the intersection of cross modal learning and semi supervised learning, where we aim to improve the supervised learning performance of the primary modality by borrowing missing information from an unlabeled modality. We investigate this problem from a Nadaraya Watson (NW) kernel regression perspective and show that this formulation implicitly leads to a kernelized cross attention module. To this end, we propose The Attention Patch (TAP), a simple neural network plugin that allows data level knowledge transfer from the unlabeled modality. We provide numerical simulations on three real world datasets to examine each aspect of TAP and show that a TAP integration in a neural network can improve generalization performance using the unlabeled modality.  ( 2 min )
    Refined Value-Based Offline RL under Realizability and Partial Coverage. (arXiv:2302.02392v1 [cs.LG])
    In offline reinforcement learning (RL) we have no opportunity to explore so we must make assumptions that the data is sufficient to guide picking a good policy, taking the form of assuming some coverage, realizability, Bellman completeness, and/or hard margin (gap). In this work we propose value-based algorithms for offline RL with PAC guarantees under just partial coverage, specifically, coverage of just a single comparator policy, and realizability of soft (entropy-regularized) Q-function of the single policy and a related function defined as a saddle point of certain minimax optimization problem. This offers refined and generally more lax conditions for offline RL. We further show an analogous result for vanilla Q-functions under a soft margin condition. To attain these guarantees, we leverage novel minimax learning algorithms to accurately estimate soft or vanilla Q-functions with $L^2$-convergence guarantees. Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying.  ( 2 min )
    Tighter Information-Theoretic Generalization Bounds from Supersamples. (arXiv:2302.02432v1 [stat.ML])
    We present a variety of novel information-theoretic generalization bounds for learning algorithms, from the supersample setting of Steinke & Zakynthinou (2020)-the setting of the "conditional mutual information" framework. Our development exploits projecting the loss pair (obtained from a training instance and a testing instance) down to a single number and correlating loss values with a Rademacher sequence (and its shifted variants). The presented bounds include square-root bounds, fast-rate bounds, including those based on variance and sharpness, and bounds for interpolating algorithms etc. We show theoretically or empirically that these bounds are tighter than all information-theoretic bounds known to date on the same supersample setting.  ( 2 min )
    Linear-Time Gromov Wasserstein Distances using Low Rank Couplings and Costs. (arXiv:2106.01128v2 [cs.LG] UPDATED)
    The ability to align points across two related yet incomparable point clouds (e.g. living in different spaces) plays an important role in machine learning. The Gromov-Wasserstein (GW) framework provides an increasingly popular answer to such problems, by seeking a low-distortion, geometry-preserving assignment between these points. As a non-convex, quadratic generalization of optimal transport (OT), GW is NP-hard. While practitioners often resort to solving GW approximately as a nested sequence of entropy-regularized OT problems, the cubic complexity (in the number $n$ of samples) of that approach is a roadblock. We show in this work how a recent variant of the OT problem that restricts the set of admissible couplings to those having a low-rank factorization is remarkably well suited to the resolution of GW: when applied to GW, we show that this approach is not only able to compute a stationary point of the GW problem in time $O(n^2)$, but also uniquely positioned to benefit from the knowledge that the initial cost matrices are low-rank, to yield a linear time $O(n)$ GW approximation. Our approach yields similar results, yet orders of magnitude faster computation than the SoTA entropic GW approaches, on both simulated and real data.  ( 2 min )
    On Over-Squashing in Message Passing Neural Networks: The Impact of Width, Depth, and Topology. (arXiv:2302.02941v1 [cs.LG])
    Message Passing Neural Networks (MPNNs) are instances of Graph Neural Networks that leverage the graph to send messages over the edges. This inductive bias leads to a phenomenon known as over-squashing, where a node feature is insensitive to information contained at distant nodes. Despite recent methods introduced to mitigate this issue, an understanding of the causes for over-squashing and of possible solutions are lacking. In this theoretical work, we prove that: (i) Neural network width can mitigate over-squashing, but at the cost of making the whole network more sensitive; (ii) Conversely, depth cannot help mitigate over-squashing: increasing the number of layers leads to over-squashing being dominated by vanishing gradients; (iii) The graph topology plays the greatest role, since over-squashing occurs between nodes at high commute (access) time. Our analysis provides a unified framework to study different recent methods introduced to cope with over-squashing and serves as a justification for a class of methods that fall under `graph rewiring'.  ( 2 min )
    Guaranteed Tensor Recovery Fused Low-rankness and Smoothness. (arXiv:2302.02155v1 [cs.LG])
    The tensor data recovery task has thus attracted much research attention in recent years. Solving such an ill-posed problem generally requires to explore intrinsic prior structures underlying tensor data, and formulate them as certain forms of regularization terms for guiding a sound estimate of the restored tensor. Recent research have made significant progress by adopting two insightful tensor priors, i.e., global low-rankness (L) and local smoothness (S) across different tensor modes, which are always encoded as a sum of two separate regularization terms into the recovery models. However, unlike the primary theoretical developments on low-rank tensor recovery, these joint L+S models have no theoretical exact-recovery guarantees yet, making the methods lack reliability in real practice. To this crucial issue, in this work, we build a unique regularization term, which essentially encodes both L and S priors of a tensor simultaneously. Especially, by equipping this single regularizer into the recovery models, we can rigorously prove the exact recovery guarantees for two typical tensor recovery tasks, i.e., tensor completion (TC) and tensor robust principal component analysis (TRPCA). To the best of our knowledge, this should be the first exact-recovery results among all related L+S methods for tensor recovery. Significant recovery accuracy improvements over many other SOTA methods in several TC and TRPCA tasks with various kinds of visual tensor data are observed in extensive experiments. Typically, our method achieves a workable performance when the missing rate is extremely large, e.g., 99.5%, for the color image inpainting task, while all its peers totally fail in such challenging case.  ( 2 min )
    An Asymptotically Optimal Algorithm for the One-Dimensional Convex Hull Feasibility Problem. (arXiv:2302.02033v1 [stat.ML])
    This work studies the pure-exploration setting for the convex hull feasibility (CHF) problem where one aims to efficiently and accurately determine if a given point lies in the convex hull of means of a finite set of distributions. We give a complete characterization of the sample complexity of the CHF problem in the one-dimensional setting. We present the first asymptotically optimal algorithm called Thompson-CHF, whose modular design consists of a stopping rule and a sampling rule. In addition, we provide an extension of the algorithm that generalizes several important problems in the multi-armed bandit literature. Finally, we further investigate the Gaussian bandit case with unknown variances and address how the Thompson-CHF algorithm can be adjusted to be asymptotically optimal in this setting.  ( 2 min )
    Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation. (arXiv:2302.02570v1 [cs.AI])
    We consider the task of evaluating policies of algorithmic resource allocation through randomized controlled trials (RCTs). Such policies are tasked with optimizing the utilization of limited intervention resources, with the goal of maximizing the benefits derived. Evaluation of such allocation policies through RCTs proves difficult, notwithstanding the scale of the trial, because the individuals' outcomes are inextricably interlinked through resource constraints controlling the policy decisions. Our key contribution is to present a new estimator leveraging our proposed novel concept, that involves retrospective reshuffling of participants across experimental arms at the end of an RCT. We identify conditions under which such reassignments are permissible and can be leveraged to construct counterfactual trials, whose outcomes can be accurately ascertained, for free. We prove theoretically that such an estimator is more accurate than common estimators based on sample means -- we show that it returns an unbiased estimate and simultaneously reduces variance. We demonstrate the value of our approach through empirical experiments on synthetic, semi-synthetic as well as real case study data and show improved estimation accuracy across the board.  ( 2 min )
    Polynomial-time sparse measure recovery. (arXiv:2204.07879v3 [cs.LG] UPDATED)
    Many problems in computer science reduce to the recovery of an $n$-sparse measure from its (generalized) moments. Sparse measure recovery has been the research focus in super-resolution, tensor decomposition, and learning neural networks. The existing methods use either convex relaxations or overparameterization for recovery. Here, we propose recovery with non-convex optimization without overparameterization. Our algorithm is a (sub)gradient descent method optimizing a non-convex energy function studied in physics. We establish the global convergence of gradient descent on the energy function. This result enables us to solve super-resolution in $O(n^2)$ time, which significantly improves upon $O(n^3)$ time for solving convex relaxations. For a particular neural network, we prove the global convergence of subgradient descent on the population loss without overparameterization. The studied network has zero-one activations, and inputs drawn from the unit sphere.  ( 2 min )
    Offline Learning in Markov Games with General Function Approximation. (arXiv:2302.02571v1 [cs.LG])
    We study offline multi-agent reinforcement learning (RL) in Markov games, where the goal is to learn an approximate equilibrium -- such as Nash equilibrium and (Coarse) Correlated Equilibrium -- from an offline dataset pre-collected from the game. Existing works consider relatively restricted tabular or linear models and handle each equilibria separately. In this work, we provide the first framework for sample-efficient offline learning in Markov games under general function approximation, handling all 3 equilibria in a unified manner. By using Bellman-consistent pessimism, we obtain interval estimation for policies' returns, and use both the upper and the lower bounds to obtain a relaxation on the gap of a candidate policy, which becomes our optimization objective. Our results generalize prior works and provide several additional insights. Importantly, we require a data coverage condition that improves over the recently proposed "unilateral concentrability". Our condition allows selective coverage of deviation policies that optimally trade-off between their greediness (as approximate best responses) and coverage, and we show scenarios where this leads to significantly better guarantees. As a new connection, we also show how our algorithmic framework can subsume seemingly different solution concepts designed for the special case of two-player zero-sum games.  ( 2 min )
    On a continuous time model of gradient descent dynamics and instability in deep learning. (arXiv:2302.01952v1 [stat.ML])
    The recipe behind the success of deep learning has been the combination of neural networks and gradient-based optimization. Understanding the behavior of gradient descent however, and particularly its instability, has lagged behind its empirical success. To add to the theoretical tools available to study gradient descent we propose the principal flow (PF), a continuous time flow that approximates gradient descent dynamics. To our knowledge, the PF is the only continuous flow that captures the divergent and oscillatory behaviors of gradient descent, including escaping local minima and saddle points. Through its dependence on the eigendecomposition of the Hessian the PF sheds light on the recently observed edge of stability phenomena in deep learning. Using our new understanding of instability we propose a learning rate adaptation method which enables us to control the trade-off between training stability and test set evaluation performance.  ( 2 min )
    Domain Adaptation via Rebalanced Sub-domain Alignment. (arXiv:2302.02009v1 [cs.LG])
    Unsupervised domain adaptation (UDA) is a technique used to transfer knowledge from a labeled source domain to a different but related unlabeled target domain. While many UDA methods have shown success in the past, they often assume that the source and target domains must have identical class label distributions, which can limit their effectiveness in real-world scenarios. To address this limitation, we propose a novel generalization bound that reweights source classification error by aligning source and target sub-domains. We prove that our proposed generalization bound is at least as strong as existing bounds under realistic assumptions, and we empirically show that it is much stronger on real-world data. We then propose an algorithm to minimize this novel generalization bound. We demonstrate by numerical experiments that this approach improves performance in shifted class distribution scenarios compared to state-of-the-art methods.  ( 2 min )

  • Open

    [P] AI/ML Engineering Tutor
    I am looking for a AI/Machine Learning Engineering tutor. I am specifically interested in help developing and building ~2 Medium/large NLP projects, using SOTA LLM’s (GPT3, etc.). After these, I would potentially be interested in diving into RL. I am looking for a little bit of theory when needed, but predominantly hands-on coding project help (eg. real-time coding help if I get stuck). Reach out if you are interested! (Of course $ negotiated). submitted by /u/ThoseWhoAbandonViews [link] [comments]  ( 42 min )
    Is there Powerpoint Ai maker that works ? [D]
    It seems the others like tome ai, and chatbcg are completely trafficked out. Also you cant even save the powerpoints, so please is there one that i can use. I also have my own text that id like to turn into a powerpoint submitted by /u/Ancient_Vegetable_62 [link] [comments]  ( 43 min )
    [Discussion] Can an AMD Ryzen 5 3400G computer with 16GB of RAM effectively train an AI model?
    I'm exploring the possibility of using my AMD Ryzen 5 3400G computer with 16GB of RAM to train an AI model. I'm curious to know if this setup is adequate for the task and, if so, what kind of AI models would be appropriate. I'm interested in understanding any limitations and drawbacks that I may face with this setup. If you have any relevant experience or information, I would greatly appreciate your participation in this discussion. Thanks! <3 submitted by /u/erikaonline [link] [comments]  ( 43 min )
    [D] Image object detection, but for 1 dimensional data?
    I have had a lot of fun and success using YOLO and other image object detection models on 2D or 3D image data for personal projects. I am now working on some projects where I need to scan long periods of timeseries data and find specific waveforms that are variable durations. Are there techniques or models that function like YOLO that can scan large amounts of data and only highlight specific segments of interest as specific classes? If it doesn’t exist, I wonder how well the underlying CNN architecture of YOLO would translate to 1 dimensional CNN architectures. Any info is appreciated, thanks! submitted by /u/Optoplasm [link] [comments]  ( 43 min )
    [P] Best way to add a sampling step within a neural network end-to-end?
    I'm looking to combine two separate models together end-to-end, but need help understanding the best way to connect discrete parts. The first part: I trained a classifier that given an input vector (512 dimensional) is able to predict one of twenty possible labels. The second part: given an input label (from the previous classifier), embed the label and use that label to make a prediction. Both models work decently, but I'm wondering if I can make this end-to-end and get some serious gains. To do this, I'd need a way of sampling from the first softmax. Once I have a sample, I can get the embedding of the sampled class, continue as normal, and hopefully propagate the loss through everything. Are there any similar examples I can look at? Is there a term for this in the literature? submitted by /u/geomtry [link] [comments]  ( 44 min )
    [P] We've built ChatGPT for your pdf files.
    My friends and I were really excited to try out ChatGPT when it was released. We were amazed by its power and capabilities. We were thinking how great it would be if people could use ChatGPT to ask questions from their own data. So me and my friends had decided to build such a tool. We are delighted to announce that we are releasing a public beta of our tool> After you upload all your pdf files into it. It can search for relevant documents without perfect keyword match, summarize takeaways from the document specific to your question, and extract key information from the document. It can help with brainstorming and summarization. We would love if you try it out and let us know what you think about it. We look forward to hearing your feedback Link: https://askcorpora.com submitted by /u/Brian-Hose225 [link] [comments]  ( 43 min )
    [D] Can output time frame cover input time frame in machine learning?
    I recently had a disagreement with a friend and would like to hear other opinions. Say for a website, using the user actions for first week period, we want to predict total sales within 3 weeks. But one of the inputs is sales in the first week, so the output -total sales of 3 weeks- is including the sales in the first week. Is it ok to choose this output? Or should we adjust it in a way to prevent it from overlapping with the input time period and choose for ex. sales within 2 weeks after the first week for output What is the reasoning? submitted by /u/dencan06 [link] [comments]  ( 43 min )
    [N] Microsoft announces new "next-generation" LLM, will be integrated with Bing and Edge
    https://www.theverge.com/2023/2/7/23587454/microsoft-bing-edge-chatgpt-ai submitted by /u/currentscurrents [link] [comments]  ( 44 min )
    [Discussion] Is ChatGPT and/or OpenAI really the leader in the space?
    Or is it someone else (who just may or may not be as well known)? submitted by /u/wonderingandthinking [link] [comments]  ( 44 min )
    [D] Multi-class classifications when a few of the classes are not mutually-exclusive
    I am dealing with a multi-class classification problem. I know one of the main assumption of this problem is that the classes are mutually exclusive. However, I realized that in my problem, some of these classes may happen together. So my problem is not an entirely a milt-class nor a multi-label. One solution is to relax the exclusivity assumption and fit a model, however, I am not sure how realistic is that. I was wondering if there is a better way to approach this problem? Briefly, the problem is in ads domain where a user can do task A or B after seeing an ad or can do both A&B at the same time. submitted by /u/hopedallas [link] [comments]  ( 43 min )
    [D] Which is the fastest and lightweight ultra realistic TTS for real-time voice cloning?
    Hey everyone, I want to make a personal voice assistant who sounds exactly like a real person. I tried some TTS like tortoise TTS and coqui TTS, it done a good job but it takes too long time to perform. So is there any other good realistic sounding TTS which I can use with my own voice cloning training dataset? Also I'm a bit amazed by the TTS used by eleven labs, so can someone explain how can I achieve that level of real-time efficiency in a voice assistant? submitted by /u/akshaysri0001 [link] [comments]  ( 43 min )
    [D] Artificial Intelligence for Manufacturing
    Manufacturing 4.0 is undergoing a revolution with the integration of Artificial Intelligence (AI). AI is poised to revolutionize the process industry, where controlling input variables leads to an output. The current process industry, including pharmaceuticals, chemicals, and energy production, relies on human operators to turn knobs to achieve optimal output. However, this system is limited by several factors, including slow training, poor retention of large data sets, inaccurate sensors, and complex decision-making processes. ​ https://preview.redd.it/y6stc52zqsga1.png?width=734&format=png&auto=webp&s=c8f24ff9d7f7975380e6cd2fbb5021fce67c0e86 Here are some details about problems and AI solutions: 1) It takes forever to train this employee. This employee is running little mini experi…  ( 45 min )
    [N] Getty Images Claims Stable Diffusion Has Stolen 12 Million Copyrighted Images, Demands $150,000 For Each Image
    From Article: Getty Images new lawsuit claims that Stability AI, the company behind Stable Diffusion's AI image generator, stole 12 million Getty images with their captions, metadata, and copyrights "without permission" to "train its Stable Diffusion algorithm." The company has asked the court to order Stability AI to remove violating images from its website and pay $150,000 for each. However, it would be difficult to prove all the violations. Getty submitted over 7,000 images, metadata, and copyright registration, used by Stable Diffusion. submitted by /u/vadhavaniyafaijan [link] [comments]  ( 49 min )
    [D] question on first time training a model
    Basically I saw a stream the other day where someone used data from a person's YouTube channel and somehow used this data to create an AI version of them and interviewed them. It was fascinating and pretty accurate How difficult would this be to do myself? I don't even know where to start. Does anyone have any pointers? Is this a very large task that I'm underestimating or is it actually feasible? Here is the stream in question. The video and audio would be cool to have but I mean that's not necessary, even just having the text aspect would be pretty wild on its own. https://youtu.be/hjoYy5IVtfo (skip to any point, most of it is filled with the bot responding) submitted by /u/cobalt1137 [link] [comments]  ( 43 min )
    Model/paper ideas: reinforcement learning with a deterministic environment [D]
    I have a problem I need to solve that, as far as I can tell, doesn't fit very well into most of the existing RL literature. Essentially the task is to create on optimal plan over a time horizon extending a flexible number of steps into the future. The action space is both discrete and continuous - there are multiple available distinct actions, some of which need to be given continuous (but constrained) parameters. In this problem however, the state of the environment is known ahead of time for all the future time steps, and the updated state of the agent after each action can be calculated deterministically given the action and the environment state. Modelling the entire problem as a MILP is not feasible due to the size of the action and state space, and we have a very large data set for agent and environment state to play with. Does anyone have any suggestions for papers or models that might be appropriate for this scenario? submitted by /u/EmbarrassedFuel [link] [comments]  ( 44 min )
    [N] Beyond Transformers with PyNeuraLogic
    Going beyond Transformers? 🤖 In this article, I'm discussing how we can use the power of hybrid architecture, i.e., marrying deep learning with symbolic artificial intelligence, for implementing different kinds of Transformers. Including the one used in GPT-3! https://towardsdatascience.com/beyond-transformers-with-pyneuralogic-10b70cdc5e45 ​ ​ The attention computation graph visualized submitted by /u/Lukas_Zahradnik [link] [comments]  ( 42 min )
    [D] Name your favourite Github repositories for data scientists
    I thought it may be useful to gather the most popular repositories for data scientists. The goal is to read excellent code and learn from other projects. Please provide a short description of the project. submitted by /u/Illustrious-Law-2556 [link] [comments]  ( 42 min )
    [Discussion] Best practices for taking deep learning models to bare metal MCUs
    I would like to know what are some of the best practice is to convert pytorch to embedded C (bare metal micro-controllers) during A. initial phase and B. for deployment. A. Initial phase is to understand the profiling of the model performance (RAM usage and processing time) for a targetted hardware. I understand that Tensorflow lite might be the best route for initial profiling but there are restrictions. It will be great if you could tell the framework that you follow. Currently framework: 1. Pytorch -> 2. ONNX -> 3. Keras -> 4. Tensorflowlite or 5. Tensorflowlite micro B. Deployment is to run inference for production in a targetted hardware. I think hand coding in C is the best way. Please ignore optimisation techniques in the workflow for simplicity. submitted by /u/ramv0001 [link] [comments]  ( 43 min )
    [D] A ML-powered music description/tag generator? (a reverse-MusicML)?
    I know there are no useful text-to-music generators (YET), but is there at least a model where you can upload/input a recording and get a text description/hashtag list from it? Like a reverse MusicML? I have a very large personal catalog of music I am prepping for sale (approaching 500 songs), and this would be a very handy tool, especially if it came up with tags of genres/similar artists I am not aware of. submitted by /u/dreternal [link] [comments]  ( 43 min )
    [D] Papers that inject embeddings into LMs
    I am looking for papers that inject information into LMs directly using embeddings (without formatting information as text). I find it notoriously hard to search for these paper because they could come from various different domains, so I thought asking here might be a good option to reach people from many different domains. Some examples I already found are from the domain of knowledge graph augmented LMs: ERNIE https://arxiv.org/abs/1904.09223 K-BERT https://arxiv.org/abs/1909.07606 Prefix Tuning / Prompt Tuning are also somewhat similar to the idea, but they dont depend on any external information. Can you think of other papers that inject additional information into LMs via embeddings? submitted by /u/_Arsenie_Boca_ [link] [comments]  ( 44 min )
    [P] Pythae 0.1.0 is out and supports distributed training for 25 Variational Autoencoders
    📢 News 📢 Pythae 0.1.0 is now out and supports distributed training using PyTorch DDP ! Train your favorite Variational Autoencoders (VAEs) faster 🏎️ and on larger datasets, still with a few lines of code 🖥️. 👉github: https://github.com/clementchadebec/benchmark_VAE 👉pypi: https://pypi.org/project/pythae/ https://preview.redd.it/jk4ukkgarpga1.png?width=1335&format=png&auto=webp&s=07c1ab2eaad104879637ad04472935d87baa31e9 submitted by /u/cchad-8 [link] [comments]  ( 43 min )
    [P] ChatGPT without size limits: upload any pdf and apply any prompt to it
    hi all! I created a simple free tool where you can summarize and query documents of any size and estimate the cost to do so: https://www.wrotescan.com You can edit the prompts as well as automatically chunk and combine documents. There's also a cost estimator for any pdf you upload. Let me know if you want me to run some examples for you! Send me a pdf and tell me what you'd like summarized or extracted. Tips Please be sure to keep {text} in both prompts or the program will not input your document's text into the map reduce summarizer. {text} can only appear once in each prompt. It is where the text from each chunk to be summarized is input into the prompts. Create a temporary OpenAI key / org to use with this site so you do not have to provide credit card information then be sure to delete the temp key when you are done. Learnings Some interesting learnings I had while creating the tool: - Minimizing the number of steps through the AI improved summarization, so map reduce was often better than a more advanced refine workflow which passes the output through the model many more times. - LangChain is great for managing multiple step language model calls and bypassing the current limitations of ChatGPT submitted by /u/aicharades [link] [comments]  ( 50 min )
  • Open

    Im making Faceless Youtube videos with this AI software. Ultra easy and super fast.
    submitted by /u/aortigby [link] [comments]  ( 40 min )
    Bright Eye: free mobile AI app that generates art, code, text (short stories, poems, answers), analyzes photos, and more!
    Hey guys, I’m the co-founder of a tech startup focused on providing free AI services. We’re one of the first mobile multipurpose AI apps. We’ve developed a pretty cool app that offers AI services like image generation, code generation, image captioning, and more for free. We’re sort of like a Swiss Army knife of generative and analytical AI. We’ve released a new feature called AAIA (Ask AI Anything), which is capable of answering all types of questions, even requests to generate literature, story-lines, jokes, general information, etc. We’d love to have some people try it out, give us feedback, and keep in touch with us. https://apps.apple.com/us/app/bright-eye/id1593932475 submitted by /u/BrightEyeuser [link] [comments]  ( 41 min )
    Reimagining how developer-facing docs work using GPT-3
    Developers, have you ever found yourself stuck on a problem, searching through countless pages of outdated documentation, only to end up in a developer community asking for help? We know the feeling all too well. In response, we created https://www.mendable.ai/. Mendable is a GPT-3 (davinci + LangChain) powered developer support AI, designed to provide you with customized answers to your problems in real-time. We believe that developer-facing documentation should be tailored to the individual. We ingest documentation, community messages, and resources (blog posts, how-to guides, white papers, etc.) to serve as a middle-man between the entire knowledge base of a piece of software and you, resulting in a GPT like docs search. Our goal is to redefine search, making it easier for you to find the information you need, when you need it. Wasting time searching through outdated documentation is something that should be in the past. If you want to learn more about what we’re doing, feel free to visit our website and hit us up! submitted by /u/SideGuideDevs [link] [comments]  ( 41 min )
    Google REVEALS ChatGPT Rival Bard AI!
    submitted by /u/bukowski3000 [link] [comments]  ( 40 min )
    AI Dream 51 - InstantArt : Beeple inspired Synthwave
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Microsoft announces AI-powered Bing search and Edge browser
    submitted by /u/citizentim [link] [comments]  ( 40 min )
    Improved ChatGPT comes to Microsoft Bing and Edge
    submitted by /u/Number_5_alive [link] [comments]  ( 40 min )
    Created an AI database tool where you ask questions and it generates the query code. It's like a query co-pilot.
    submitted by /u/Mogen1000 [link] [comments]  ( 42 min )
    🚨 If 2022 is the year of pixels for generative AI, then 2023 is the year of sound waves
    submitted by /u/ai-lover [link] [comments]  ( 41 min )
    Introducing Bard: Google’s ChatGPT Competitor
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    Is Jasper AI Worth Your Investment ? A Comprehensive Review
    Hello everyone! Have you heard of Jasper AI, the new copywriting software that promises to revolutionize the way we create content? In today's fast-paced world, time is a precious commodity, and finding ways to streamline our work is crucial. That's where Jasper AI comes in - it claims to be a one-stop-shop for all your copywriting needs. But is it worth the hype? First things first, what exactly is Jasper AI? It's an artificial intelligence tool that uses machine learning algorithms to generate high-quality, unique and compelling content for your website, social media, email, and other marketing channels. All you have to do is input the desired tone and style, topic, and target audience, and Jasper AI will generate a complete copy for you. It's like having a team of copywriters working …  ( 43 min )
    GIF2GIF Extension In Stable Diffusion! Easy Gif Processing!
    submitted by /u/PuppetHere [link] [comments]  ( 40 min )
    AI won’t make artists redundant - thanks to information theory
    submitted by /u/pmigdal [link] [comments]  ( 42 min )
    OpenAssistant | ChatGPT's Open Alternative (We need your help!)
    submitted by /u/pentin0 [link] [comments]  ( 40 min )
    AI: THE FUTURE REPLACEMENT OF HUMANS?
    submitted by /u/nowadayswow [link] [comments]  ( 40 min )
    A new video transformation model from Runway.ml
    https://medium.com/seeds-for-the-future/the-next-step-for-generative-ai-830112890d04?sk=1d6b4c96cc6cb0a4690bcf9df0d12bcc submitted by /u/arnolds112 [link] [comments]  ( 40 min )
    skeptic ai books/audiobooks/documentaries like unabomer's manifeso on the technology?
    i want to listen or read or watch something that predicts all the bad things (maybe in the guise of good) like how unabomber wrote his manifesto for the technological revolution but for the ai revolution submitted by /u/tonyhyeok [link] [comments]  ( 41 min )
    Benedict Cumberbatch calling the Dr
    submitted by /u/Esportage [link] [comments]  ( 39 min )
    AI voice Quentin Tarantino recalls what inspired him to create Kill Bill (via Eleven Labs)
    submitted by /u/CoolkidRR [link] [comments]  ( 41 min )
    I spent half a year doing research and testing to develop an AI tool which creates the perfect long-form blog articles
    Good content is key no matter what type of website or business you run, from blogs to SaaS tools or service based companies. Not only will it help you to rank higher in Google for the relevant keywords, but it also helps to attract visitors by providing them something of value for free to convert them into your funnel with a newsletter or free trial. Usually creating this content either required a lot of time, a lot of money, or both. That is why I launched https://writeseed.com It is powered by GPT-3 to create content for you with the help of AI. You only need to provide it with a general niche or keyword and it will provide you with a selection of blog post outlines, which are then used to write a complete 1,000+ word article. You also get a free stock photo which is relevant to the topic each time. The quality of the content is so good, I often get the feedback that people are surprised this is possible at all. We achieve these results by using our own proprietary fine-tuning, as well as a special way of processing your input and the output from GPT-3. It took me half a year of research and comparing the outputs of other AI writing tools to get to this point and I am really proud of the result. Besides blog articles the platform offers over 20 templates of content, from product descriptions to Tweets, cold emails, Quora answers, questions, jokes, stories, paraphrasing or summarizing your existing content, auto-correct spelling and grammar, press releases, real-estate listings. And we continuously add new templates to cover even more use-cases. On top of that we are one of the very few, maybe the only tool, which offers you unlimited content for a flat monthly price. Some of our users are creating 200+ full-length articles per month to rank their websites or run Amazon product blogs. Please give it a try with the 7-day free trial, of course you can also create unlimited content during the free trial! submitted by /u/spacpro [link] [comments]  ( 44 min )
    Google's Bard will be more useful and interesting than ChatGPT as it has Internet access
    submitted by /u/qptbook [link] [comments]  ( 40 min )
    You.com released v2 of YouChat, adding multimedia content to their chat agent for search
    submitted by /u/quanik_314 [link] [comments]  ( 40 min )
    New AI Streamer 24/7
    ​ https://preview.redd.it/nidkmqbtsnga1.png?width=1536&format=png&auto=webp&s=27719de8800d612139e93fe177e2fa63f03f4e0e https://www.twitch.tv/ai_media ​ Help me to learn by chatting with me! submitted by /u/BlueBug02 [link] [comments]  ( 40 min )
  • Open

    How to deal with planning before the match and actually playing the match?
    I was wondering: how to deal with a game like RPGs and card games where you have both a planning phase (character building for RPGs and deck building for card games) and an execution phase (actually playing through a battle or a card duel)? The most obvious approach to me is making two different models, but their successes are highly correlated, i.e. a model can play a RPG battle and lose mostly because of the pre-selected equipments/skills or it can win despite the poor choice of character building. How to deal with this problem? Do you know any "success case" where I can see how they dealt with it? Thank you very much! submitted by /u/victorsevero [link] [comments]  ( 41 min )
    Presenting the AI Soccer World Cup! Place your bets!
    submitted by /u/xWh0am1 [link] [comments]  ( 41 min )
    Applications of policy evaluation (prediction)
    Really elementary question, but are there applications where people have shown interesting results by just solving a prediction problem (evaluate a policy) instead of solving a control problem (find the optimal policy)? It seems like control applications are more popular, but perhaps I'm missing something very obvious. Phrased differently, when would you want to do prediction without doing control? submitted by /u/wardellinthehouse [link] [comments]  ( 41 min )
    Building a Game-Playing AI with Reinforcement Learning in Python
    submitted by /u/Historical-Pen9653 [link] [comments]  ( 41 min )
    if a policy is greedy with respect to its own value function, then it is an optimal policy
    I’m doing the RL specialization from university of alberta and just ran into this question. I think that it is false, because although the policy may be greedy to its own value function, the value function might not be always optimal, you would need to iterate through it some times in order to make it optimal, or close to it. The course says that “If a policy is greedy with respect to its own value function, it follows from the policy improvement theorem and the Bellman optimality equation that it must be an optimal policy. How does that make sense if the value function is not the optimal one? Thanks in advance submitted by /u/enzodtz [link] [comments]  ( 44 min )
  • Open

    Guidance on 512 node input and 512 node output
    I have 512 input nodes which all have a float value and an associated classification (0 or 1). I believe I need 512 output layers which correspond to the classification of each input node. Can anyone give me some guidance into an approach I can use for this to get this done? I know that’s not a lot to go on but any help is appreciated, I am using tensorflow by the way. Thank you. submitted by /u/waysteman [link] [comments]  ( 41 min )
  • Open

    Share medical image research on Amazon SageMaker Studio Lab for free
    This post is co-written with Stephen Aylward, Matt McCormick, Brianna Major from Kitware and Justin Kirby from the Frederick National Laboratory for Cancer Research (FNLCR). Amazon SageMaker Studio Lab provides no-cost access to a machine learning (ML) development environment to everyone with an email address. Like the fully featured Amazon SageMaker Studio, Studio Lab allows […]  ( 8 min )
    Amazon SageMaker Automatic Model Tuning now supports three new completion criteria for hyperparameter optimization
    Amazon SageMaker has announced the support of three new completion criteria for Amazon SageMaker automatic model tuning, providing you with an additional set of levers to control the stopping criteria of the tuning job when finding the best hyperparameter configuration for your model. In this post, we discuss these new completion criteria, when to use them, and […]  ( 8 min )
  • Open

    Roses are red
    I had so much fun getting GPT-3 to generate simple one-line Valentine's Day cards last year that this year I decided to see if I could generate cards with more complicated messages. I focused on the classic "roses are red, violets are blue" rhyme, figuring that  ( 5 min )
    Bonus: more Valentine rhymes
    AI Weirdness: the strange side of machine learning  ( 2 min )
  • Open

    Google Research, 2022 & beyond: Algorithms for efficient deep learning
    Posted by Sanjiv Kumar, VP and Google Fellow, Google Research (This is Part 4 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.) The explosion in deep learning a decade ago was catapulted in part by the convergence of new algorithms and architectures, a marked increase in data, and access to greater compute. In the last 10 years, AI and ML models have become bigger and more sophisticated — they’re deeper, more complex, with more parameters, and trained on much more data, resulting in some of the most transformative outcomes in the history of machine learning. As these models increasingly find themselves deployed in production and business applications, the efficiency and costs of these models has gon…  ( 97 min )
  • Open

    DSC Weekly 7 February 2023 – Machine Learning Controversy: From No-Code to No-Math
    Announcements Machine Learning Controversy: From No-Code to No-Math One controversial topic in machine learning circles is code versus no-code. Can you be a real data scientist if you don’t code? Of course you can: You may be leveraging platforms and the code is one or two layers below the responsibilities of your job. Maybe you… Read More »DSC Weekly 7 February 2023 – Machine Learning Controversy: From No-Code to No-Math The post DSC Weekly 7 February 2023 – Machine Learning Controversy: From No-Code to No-Math appeared first on Data Science Central.  ( 21 min )
    Top 8 Digital Marketing Trends That’ll Make a Comeback in 2023
    Digital marketing has become an efficient practice for growing and established businesses to promote their products and services. More and more organizations depend on digital channels to connect with customers and generate a vast customer base.  Because of its growing popularity, different digital marketing trends will reign in 2023. If businesses can pay attention to… Read More »Top 8 Digital Marketing Trends That’ll Make a Comeback in 2023 The post Top 8 Digital Marketing Trends That’ll Make a Comeback in 2023 appeared first on Data Science Central.  ( 20 min )
    The Impact of Data Labeling 2023: Current Trends & Future Demands
    Data labeling and/or data annotation has long been a critical component of many machine learning and AI initiatives. In recent years, the demand for accurate and reliable data labeling has risen dramatically as the process becomes increasingly vital to the success of numerous projects. But what is data labeling exactly? Data Labeling 2023 – how… Read More »The Impact of Data Labeling 2023: Current Trends & Future Demands The post The Impact of Data Labeling 2023: Current Trends & Future Demands appeared first on Data Science Central.  ( 22 min )
    A Quick Guide To iOS And Android App development In 2023
    If you look around, you’ll observe we have apps for everything. They’re like food, and we consume every bit of it. If you have a business, you must consider app development along with a website. Well, you’re at the right place. The app development journey can be overwhelming as it comes with tons of responsibilities,… Read More »A Quick Guide To iOS And Android App development In 2023 The post A Quick Guide To iOS And Android App development In 2023 appeared first on Data Science Central.  ( 22 min )
    Best 9 Mobile Apps to Develop Your Data Science Skills in 2023
    Mobile Apps to Develop Your Data Science Skills -Mobile phones are the most preferred medium of accomplishing minute-to-minutest tasks on a daily basis. We don’t need to visit any particular restaurant to take away the food, we can do this by just sitting on our favorite couch at home, thanks to food ordering apps. Not… Read More »Best 9 Mobile Apps to Develop Your Data Science Skills in 2023 The post Best 9 Mobile Apps to Develop Your Data Science Skills in 2023 appeared first on Data Science Central.  ( 23 min )
    Top Benefits of IoT for SMEs
    The Internet of Things (IoT) has been all the rage in the world for quite some time now. And understandably so, especially considering the many benefits it brings to the table. It has rendered the technology a lucrative choice for small-to-mid-sized enterprises (SMEs). Why so? Well, here are some of the benefits of IoT for… Read More »Top Benefits of IoT for SMEs The post Top Benefits of IoT for SMEs appeared first on Data Science Central.  ( 19 min )
  • Open

    Vietnam’s VinBrain Deploys Healthcare AI Models to 100+ Hospitals
    Doctors rarely make diagnoses based on a single factor — they look at a mix of data types, such as a patient’s symptoms, laboratory and radiology reports, and medical history. VinBrain, a Vietnam-based health-tech startup, is ensuring that AI diagnostics can take a similarly holistic view across vital signs, blood tests, medical images and more. Read article >  ( 6 min )
  • Open

    Mediant approximation trick
    Suppose you are trying to approximate some number x and you’ve got it sandwiched between two rational numbers: a/b < x < c/d. Now you’d like a better approximation. What would you do? The obvious approach would be to take the average of a/b and c/d. That’s fine, except it could be a fair amount […] Mediant approximation trick first appeared on John D. Cook.  ( 6 min )
  • Open

    Solving a machine-learning mystery
    A new study shows how large language models like GPT-3 can learn a new task from just a few examples, without the need for any new training data.  ( 10 min )
  • Open

    Generating Novel, Designable, and Diverse Protein Structures by Equivariantly Diffusing Oriented Residue Clouds. (arXiv:2301.12485v2 [q-bio.BM] UPDATED)
    Proteins power a vast array of functional processes in living cells. The capability to create new proteins with designed structures and functions would thus enable the engineering of cellular behavior and development of protein-based therapeutics and materials. Structure-based protein design aims to find structures that are designable (can be realized by a protein sequence), novel (have dissimilar geometry from natural proteins), and diverse (span a wide range of geometries). While advances in protein structure prediction have made it possible to predict structures of novel protein sequences, the combinatorially large space of sequences and structures limits the practicality of search-based methods. Generative models provide a compelling alternative, by implicitly learning the low-dimensional structure of complex data distributions. Here, we leverage recent advances in denoising diffusion probabilistic models and equivariant neural networks to develop Genie, a generative model of protein structures that performs discrete-time diffusion using a cloud of oriented reference frames in 3D space. Through in silico evaluations, we demonstrate that Genie generates protein backbones that are more designable, novel, and diverse than existing models. This indicates that Genie is capturing key aspects of the distribution of protein structure space and facilitates protein design with high success rates. Code for generating new proteins and training new versions of Genie is available at https://github.com/aqlaboratory/genie.  ( 2 min )
    Anti-Symmetric DGN: a stable architecture for Deep Graph Networks. (arXiv:2210.09789v2 [cs.LG] UPDATED)
    Deep Graph Networks (DGNs) currently dominate the research landscape of learning from graphs, due to their efficiency and ability to implement an adaptive message-passing scheme between the nodes. However, DGNs are typically limited in their ability to propagate and preserve long-term dependencies between nodes, i.e., they suffer from the over-squashing phenomena. This reduces their effectiveness, since predictive problems may require to capture interactions at different, and possibly large, radii in order to be effectively solved. In this work, we present Anti-Symmetric Deep Graph Networks (A-DGNs), a framework for stable and non-dissipative DGN design, conceived through the lens of ordinary differential equations. We give theoretical proof that our method is stable and non-dissipative, leading to two key results: long-range information between nodes is preserved, and no gradient vanishing or explosion occurs in training. We empirically validate the proposed approach on several graph benchmarks, showing that A-DGN yields to improved performance and enables to learn effectively even when dozens of layers are used.  ( 2 min )
    Knowledge Extraction in Low-Resource Scenarios: Survey and Perspective. (arXiv:2202.08063v3 [cs.CL] CROSS LISTED)
    Knowledge Extraction (KE), aiming to extract structural information from unstructured texts, often suffers from data scarcity and emerging unseen types, i.e., low-resource scenarios. Many neural approaches to low-resource KE have been widely investigated and achieved impressive performance. In this paper, we present a literature review towards KE in low-resource scenarios, and systematically categorize existing works into three paradigms: (1) exploiting higher-resource data, (2) exploiting stronger models, and (3) exploiting data and models together. In addition, we highlight promising applications and outline some potential directions for future research. We hope that our survey can help both the academic and industrial communities to better understand this field, inspire more ideas, and boost broader applications.  ( 2 min )
    Understanding Gradient Regularization in Deep Learning: Efficient Finite-Difference Computation and Implicit Bias. (arXiv:2210.02720v2 [cs.LG] UPDATED)
    Gradient regularization (GR) is a method that penalizes the gradient norm of the training loss during training. While some studies have reported that GR can improve generalization performance, little attention has been paid to it from the algorithmic perspective, that is, the algorithms of GR that efficiently improve the performance. In this study, we first reveal that a specific finite-difference computation, composed of both gradient ascent and descent steps, reduces the computational cost of GR. Next, we show that the finite-difference computation also works better in the sense of generalization performance. We theoretically analyze a solvable model, a diagonal linear network, and clarify that GR has a desirable implicit bias to so-called rich regime and finite-difference computation strengthens this bias. Furthermore, finite-difference GR is closely related to some other algorithms based on iterative ascent and descent steps for exploring flat minima. In particular, we reveal that the flooding method can perform finite-difference GR in an implicit way. Thus, this work broadens our understanding of GR for both practice and theory.  ( 2 min )
    A*Net: A Scalable Path-based Reasoning Approach for Knowledge Graphs. (arXiv:2206.04798v2 [cs.AI] UPDATED)
    Reasoning on large-scale knowledge graphs has been long dominated by embedding methods. While path-based methods possess the inductive capacity that embeddings lack, they suffer from the scalability issue due to the exponential number of paths. Here we present A*Net, a scalable path-based method for knowledge graph reasoning. Inspired by the A* algorithm for shortest path problems, our A*Net learns a priority function to select important nodes and edges at each iteration, to reduce time and memory footprint for both training and inference. The ratio of selected nodes and edges can be specified to trade off between performance and efficiency. Experiments on both transductive and inductive knowledge graph reasoning benchmarks show that A*Net achieves competitive performance with existing state-of-the-art path-based methods, while merely visiting 10% nodes and 10% edges at each iteration. On a million-scale dataset ogbl-wikikg2, A*Net achieves competitive performance with embedding methods and converges faster. To our best knowledge, A*Net is the first path-based method for knowledge graph reasoning at such a scale.  ( 2 min )
    One-shot domain adaptation in video-based assessment of surgical skills. (arXiv:2301.00812v2 [cs.CV] UPDATED)
    Deep Learning (DL) has achieved automatic and objective assessment of surgical skills. However, DL models are data-hungry and restricted to their training domain. This prevents them from transitioning to new tasks where data is limited. Hence, domain adaptation is crucial to implement DL in real life. Here, we propose a meta-learning model, A-VBANet, that can deliver domain-agnostic surgical skill classification via one-shot learning. We develop the A-VBANet on five laparoscopic and robotic surgical simulators. Additionally, we test it on operating room (OR) videos of laparoscopic cholecystectomy. Our model successfully adapts with accuracies up to 99.5% in one-shot and 99.9% in few-shot settings for simulated tasks and 89.7% for laparoscopic cholecystectomy. For the first time, we provide a domain-agnostic procedure for video-based assessment of surgical skills. A significant implication of this approach is that it allows the use of data from surgical simulators to assess performance in the operating room.  ( 2 min )
    Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners. (arXiv:2212.04979v2 [cs.CV] UPDATED)
    This work explores an efficient approach to establish a foundational video-text model for tasks including open-vocabulary video classification, text-to-video retrieval, video captioning and video question-answering. We present VideoCoCa that reuses a pretrained image-text contrastive captioner (CoCa) model and adapt it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules (for example, cross-frame attention layer or perceiver resampler) and finetune the modified architecture on video-text data, we surprisingly find that the generative attentional pooling and contrastive attentional pooling layers in the image-text CoCa design are instantly adaptable to "flattened frame embeddings", yielding a strong zero-shot transfer baseline for many video-text tasks. Specifically, the frozen image encoder of a pretrained image-text CoCa takes each video frame as inputs and generates $N$ token embeddings per frame for totally $T$ video frames. We flatten $(N \times T)$ token embeddings as a long sequence of frozen video representation and apply CoCa's generative attentional pooling and contrastive attentional pooling on top. All model weights including pooling layers are directly loaded from an image-text CoCa pretrained model. Without any video or video-text data, VideoCoCa's zero-shot transfer baseline already achieves state-of-the-art results on zero-shot video classification on Kinetics 400/600/700, UCF101, HMDB51, and Charades, as well as zero-shot text-to-video retrieval on MSR-VTT, ActivityNet Captions and VATEX. We also explore lightweight finetuning on top of VideoCoCa, and achieve strong results on video question-answering (iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA) and video captioning (MSR-VTT, ActivityNet, VATEX, Youcook2). Our approach establishes a simple video-text baseline for future research.  ( 2 min )
    On the Opportunity of Causal Deep Generative Models: A Survey and Future Directions. (arXiv:2301.12351v2 [cs.LG] UPDATED)
    Deep generative models have gained popularity in recent years due to their ability to accurately replicate inherent empirical distributions and yield novel samples. In particular, certain advances are proposed wherein the model engenders data examples following specified attributes. Nevertheless, several challenges still exist and are to be overcome, i.e., difficulty in extrapolating out-of-sample data and insufficient learning of disentangled representations. Structural causal models (SCMs), on the other hand, encapsulate the causal factors that govern a generative process and characterize a generative model based on causal relationships, providing crucial insights for addressing the current obstacles in deep generative models. In this paper, we present a comprehensive survey of Causal deep Generative Models (CGMs), which combine SCMs and deep generative models in a way that boosts several trustworthy properties such as robustness, fairness, and interpretability. We provide an overview of the recent advances in CGMs, categorize them based on generative types, and discuss how causality is introduced into the family of deep generative models. We also explore potential avenues for future research in this field.  ( 2 min )
    Post-Selection Confidence Bounds for Prediction Performance. (arXiv:2210.13206v3 [stat.ML] UPDATED)
    In machine learning, the selection of a promising model from a potentially large number of competing models and the assessment of its generalization performance are critical tasks that need careful consideration. Typically, model selection and evaluation are strictly separated endeavors, splitting the sample at hand into a training, validation, and evaluation set, and only compute a single confidence interval for the prediction performance of the final selected model. We however propose an algorithm how to compute valid lower confidence bounds for multiple models that have been selected based on their prediction performances in the evaluation set by interpreting the selection problem as a simultaneous inference problem. We use bootstrap tilting and a maxT-type multiplicity correction. The approach is universally applicable for any combination of prediction models, any model selection strategy, and any prediction performance measure that accepts weights. We conducted various simulation experiments which show that our proposed approach yields lower confidence bounds that are at least comparably good as bounds from standard approaches, and that reliably reach the nominal coverage probability. In addition, especially when sample size is small, our proposed approach yields better performing prediction models than the default selection of only one model for evaluation does.  ( 3 min )
    When Data Geometry Meets Deep Function: Generalizing Offline Reinforcement Learning. (arXiv:2205.11027v2 [cs.LG] UPDATED)
    In offline reinforcement learning (RL), one detrimental issue to policy learning is the error accumulation of deep Q function in out-of-distribution (OOD) areas. Unfortunately, existing offline RL methods are often over-conservative, inevitably hurting generalization performance outside data distribution. In our study, one interesting observation is that deep Q functions approximate well inside the convex hull of training data. Inspired by this, we propose a new method, DOGE (Distance-sensitive Offline RL with better GEneralization). DOGE marries dataset geometry with deep function approximators in offline RL, and enables exploitation in generalizable OOD areas rather than strictly constraining policy within data distribution. Specifically, DOGE trains a state-conditioned distance function that can be readily plugged into standard actor-critic methods as a policy constraint. Simple yet elegant, our algorithm enjoys better generalization compared to state-of-the-art methods on D4RL benchmarks. Theoretical analysis demonstrates the superiority of our approach to existing methods that are solely based on data distribution or support constraints.  ( 2 min )
    S$^3$NN: Time Step Reduction of Spiking Surrogate Gradients for Training Energy Efficient Single-Step Spiking Neural Networks. (arXiv:2201.10879v2 [cs.LG] UPDATED)
    As the scales of neural networks increase, techniques that enable them to run with low computational cost and energy efficiency are required. From such demands, various efficient neural network paradigms, such as spiking neural networks (SNNs) or binary neural networks (BNNs), have been proposed. However, they have sticky drawbacks, such as degraded inference accuracy and latency. To solve these problems, we propose a single-step spiking neural network (S$^3$NN), an energy-efficient neural network with low computational cost and high precision. The proposed S$^3$NN processes the information between hidden layers by spikes as SNNs. Nevertheless, it has no temporal dimension so that there is no latency within training and inference phases as BNNs. Thus, the proposed S$^3$NN has a lower computational cost than SNNs that require time-series processing. However, S$^3$NN cannot adopt na\"{i}ve backpropagation algorithms due to the non-differentiability nature of spikes. We deduce a suitable neuron model by reducing the surrogate gradient for multi-time step SNNs to a single-time step. We experimentally demonstrated that the obtained surrogate gradient allows S$^3$NN to be trained appropriately. We also showed that the proposed S$^3$NN could achieve comparable accuracy to full-precision networks while being highly energy-efficient.  ( 2 min )
    Consistent Range Approximation for Fair Predictive Modeling. (arXiv:2212.10839v2 [cs.LG] UPDATED)
    This paper proposes a novel framework for certifying the fairness of predictive models trained on biased data. It draws from query answering for incomplete and inconsistent databases to formulate the problem of consistent range approximation (CRA) of fairness queries for a predictive model on a target population. The framework employs background knowledge of the data collection process and biased data, working with or without limited statistics about the target population, to compute a range of answers for fairness queries. Using CRA, the framework builds predictive models that are certifiably fair on the target population, regardless of the availability of external data during training. The framework's efficacy is demonstrated through evaluations on real data, showing substantial improvement over existing state-of-the-art methods.  ( 2 min )
    Nocturne: a scalable driving benchmark for bringing multi-agent learning one step closer to the real world. (arXiv:2206.09889v3 [cs.MA] UPDATED)
    We introduce Nocturne, a new 2D driving simulator for investigating multi-agent coordination under partial observability. The focus of Nocturne is to enable research into inference and theory of mind in real-world multi-agent settings without the computational overhead of computer vision and feature extraction from images. Agents in this simulator only observe an obstructed view of the scene, mimicking human visual sensing constraints. Unlike existing benchmarks that are bottlenecked by rendering human-like observations directly using a camera input, Nocturne uses efficient intersection methods to compute a vectorized set of visible features in a C++ back-end, allowing the simulator to run at over 2000 steps-per-second. Using open-source trajectory and map data, we construct a simulator to load and replay arbitrary trajectories and scenes from real-world driving data. Using this environment, we benchmark reinforcement-learning and imitation-learning agents and demonstrate that the agents are quite far from human-level coordination ability and deviate significantly from the expert trajectories.
    Statistical treatment of convolutional neural network super-resolution of inland surface wind for subgrid-scale variability quantification. (arXiv:2211.16708v2 [physics.ao-ph] UPDATED)
    Machine learning models have been employed to perform either physics-free data-driven or hybrid dynamical downscaling of climate data. Most of these implementations operate over relatively small downscaling factors because of the challenge of recovering fine-scale information from coarse data. This limits their compatibility with many global climate model outputs, often available between $\sim$50--100 km resolution, to scales of interest such as cloud resolving or urban scales. This study systematically examines the capability of convolutional neural networks (CNNs) to downscale surface wind speed data over land surface from different coarse resolutions (25 km, 48 km, and 100 km resolution) to 3 km. For each downscaling factor, we consider three CNN configurations that generate super-resolved predictions of fine-scale wind speed, which take between 1 to 3 input fields: coarse wind speed, fine-scale topography, and diurnal cycle. In addition to fine-scale wind speeds, probability density function parameters are generated, through which sample wind speeds can be generated accounting for the intrinsic stochasticity of wind speed. For generalizability assessment, CNN models are tested on regions with different topography and climate that are unseen during training. The evaluation of super-resolved predictions focuses on subgrid-scale variability and the recovery of extremes. Models with coarse wind and fine topography as inputs exhibit the best performance compared with other model configurations, operating across the same downscaling factor. Our diurnal cycle encoding results in lower out-of-sample generalizability compared with other input configurations.  ( 2 min )
    Learning Counterfactually Invariant Predictors. (arXiv:2207.09768v2 [cs.LG] UPDATED)
    Counterfactual invariance has proven an essential property for predictors that are fair, robust, and generalizable in the real world. We propose a general definition of counterfactual invariance and provide simple graphical criteria that yield a sufficient condition for a predictor to be counterfactually invariant in terms of (conditional independence in) the observational distribution. Any predictor that satisfies our criterion is provably counterfactually invariant. In order to learn such predictors, we propose a model-agnostic framework, called Counterfactual Invariance Prediction (CIP), based on a kernel-based conditional dependence measure called Hilbert-Schmidt Conditional Independence Criterion (HSCIC). Our experimental results demonstrate the effectiveness of CIP in enforcing counterfactual invariance across various types of data including tabular, high-dimensional, and real-world dataset.  ( 2 min )
    A Systematic Survey of Molecular Pre-trained Models. (arXiv:2210.16484v2 [cs.LG] UPDATED)
    Deep learning has achieved remarkable success in learning representations for molecules, which is crucial for various biochemical applications, ranging from property prediction to drug design. However, training Deep Neural Networks (DNNs) from scratch often requires abundant labeled molecules, which are expensive to acquire in the real world. To alleviate this issue, tremendous efforts have been devoted to Molecular Pre-trained Models (MPMs), where DNNs are pre-trained using large-scale unlabeled molecular databases and then fine-tuned over specific downstream tasks. Despite the prosperity, there lacks a systematic review of this fast-growing field. In this paper, we present the first survey that summarizes the current progress of MPMs. We first highlight the limitations of training molecular representation models from scratch to motivate MPM studies. Next, we systematically review recent advances on this topic from several key perspectives, including molecular descriptors, encoder architectures, pre-training strategies, and applications. We also highlight the challenges and promising avenues for future research, providing a useful resource for both machine learning and scientific communities.  ( 2 min )
    Relative Behavioral Attributes: Filling the Gap between Symbolic Goal Specification and Reward Learning from Human Preferences. (arXiv:2210.15906v3 [cs.AI] UPDATED)
    Generating complex behaviors that satisfy the preferences of non-expert users is a crucial requirement on AI agents. Interactive reward learning from trajectory comparisons is one way to allow non-expert users to convey complex objectives by expressing preferences over short clips of agent behaviors. Even though this parametric method can encode complex tacit knowledge present in the underlying tasks, it implicitly assumes that the human is unable to provide richer feedback than binary preference labels, leading to intolerably high feedback complexity and poor user experience. While providing a detailed symbolic closed-form specification of the objectives might be tempting, it is not always feasible even for an expert user. However, in most cases, humans are aware of how the agent should change its behavior along meaningful axes to fulfill their underlying purpose, even if they are not able to fully specify task objectives symbolically. Using this as motivation, we introduce the notion of Relative Behavioral Attributes, which allows the users to tweak the agent behavior through symbolic concepts (e.g., increasing the softness or speed of agents' movement). We propose two practical methods that can learn to model any kind of behavioral attributes from ordered behavior clips. We demonstrate the effectiveness of our methods on four tasks with nine different behavioral attributes, showing that once the attributes are learned, end users can produce desirable agent behaviors relatively effortlessly, by providing feedback just around ten times. This is over an order of magnitude less than that required by the popular learning-from-human-preferences baselines. The supplementary video and source code are available at: https://guansuns.github.io/pages/rba.
    Beyond Invariance: Test-Time Label-Shift Adaptation for Distributions with "Spurious" Correlations. (arXiv:2211.15646v2 [stat.ML] UPDATED)
    Spurious correlations, or correlations that change across domains where a model can be deployed, present significant challenges to real-world applications of machine learning models. However, such correlations are not always "spurious"; often, they provide valuable prior information for a prediction. Here, we present a test-time adaptation method that exploits the spurious correlation phenomenon, in contrast to recent approaches that attempt to eliminate spurious correlations through invariance. We consider situations where the prior distribution $p(y, z)$, which models the dependence between the class label $y$ and the "nuisance" factors $z$, may change across domains, but the generative model for features $p(\mathbf{x}|y, z)$ is constant. We note that this corresponds to an expanded version of the label shift assumption, where the labels now also include the nuisance factors $z$. Based on this observation, we train a classifier to predict $p(y, z|\mathbf{x})$ on the source distribution, and propose a test-time label shift correction that adapts to changes in the marginal distribution $p(y, z)$ using unlabeled samples from the target domain. We evaluate our method, which we call "Test-Time Label-Shift Adaptation" (TTLSA), on two different image datasets -- the CheXpert chest X-ray dataset and the Colored MNIST dataset -- and show a significant improvement over baseline methods. Code reproducing experiments is available at https://github.com/nalzok/test-time-label-shift .
    IR-MCL: Implicit Representation-Based Online Global Localization. (arXiv:2210.03113v2 [cs.RO] UPDATED)
    Determining the state of a mobile robot is an essential building block of robot navigation systems. In this paper, we address the problem of estimating the robots pose in an indoor environment using 2D LiDAR data and investigate how modern environment models can improve gold standard Monte-Carlo localization (MCL) systems. We propose a neural occupancy field to implicitly represent the scene using a neural network. With the pretrained network, we can synthesize 2D LiDAR scans for an arbitrary robot pose through volume rendering. Based on the implicit representation, we can obtain the similarity between a synthesized and actual scan as an observation model and integrate it into an MCL system to perform accurate localization. We evaluate our approach on self-recorded datasets and three publicly available ones. We show that we can accurately and efficiently localize a robot using our approach surpassing the localization performance of state-of-the-art methods. The experiments suggest that the presented implicit representation is able to predict more accurate 2D LiDAR scans leading to an improved observation model for our particle filter-based localization. The code of our approach will be available at: https://github.com/PRBonn/ir-mcl.
    Causal Modeling of Policy Interventions From Sequences of Treatments and Outcomes using Gaussian Processes. (arXiv:2209.04142v4 [cs.LG] UPDATED)
    A treatment policy defines when and what treatments are applied to affect some outcome of interest. Data-driven decision-making requires the ability to predict what happens if a policy is changed. Existing methods that predict how the outcome evolves under different scenarios assume that the tentative sequences of future treatments are fixed in advance, while in practice the treatments are determined stochastically by a policy and may depend for example on the efficiency of previous treatments. Therefore, the current methods are not applicable if the treatment policy is unknown or a counterfactual analysis is needed. To handle these limitations, we model the treatments and outcomes jointly in continuous time, by combining Gaussian processes and point processes. Our model enables the estimation of a treatment policy from observational sequences of treatments and outcomes, and it can predict the interventional and counterfactual progression of the outcome after an intervention on the treatment policy (in contrast with the causal effect of a single treatment). We show with real-world and semi-synthetic data on blood glucose progression that our method can answer causal queries more accurately than existing alternatives.
    Vicarious Offense and Noise Audit of Offensive Speech Classifiers. (arXiv:2301.12534v2 [cs.CL] UPDATED)
    This paper examines social web content moderation from two key perspectives: automated methods (machine moderators) and human evaluators (human moderators). We conduct a noise audit at an unprecedented scale using nine machine moderators trained on well-known offensive speech data sets evaluated on a corpus sampled from 92 million YouTube comments discussing a multitude of issues relevant to US politics. We introduce a first-of-its-kind data set of vicarious offense. We ask annotators: (1) if they find a given social media post offensive; and (2) how offensive annotators sharing different political beliefs would find the same content. Our experiments with machine moderators reveal that moderation outcomes wildly vary across different machine moderators. Our experiments with human moderators suggest that (1) political leanings considerably affect first-person offense perspective; (2) Republicans are the worst predictors of vicarious offense; (3) predicting vicarious offense for the Republicans is most challenging than predicting vicarious offense for the Independents and the Democrats; and (4) disagreement across political identity groups considerably increases when sensitive issues such as reproductive rights or gun control/rights are discussed. Both experiments suggest that offense, is indeed, highly subjective and raise important questions concerning content moderation practices.  ( 2 min )
    PDEBENCH: An Extensive Benchmark for Scientific Machine Learning. (arXiv:2210.07182v4 [cs.LG] UPDATED)
    Machine learning-based modeling of physical systems has experienced increased interest in recent years. Despite some impressive progress, there is still a lack of benchmarks for Scientific ML that are easy to use but still challenging and representative of a wide range of problems. We introduce PDEBench, a benchmark suite of time-dependent simulation tasks based on Partial Differential Equations (PDEs). PDEBench comprises both code and data to benchmark the performance of novel machine learning models against both classical numerical simulations and machine learning baselines. Our proposed set of benchmark problems contribute the following unique features: (1) A much wider range of PDEs compared to existing benchmarks, ranging from relatively common examples to more realistic and difficult problems; (2) much larger ready-to-use datasets compared to prior work, comprising multiple simulation runs across a larger number of initial and boundary conditions and PDE parameters; (3) more extensible source codes with user-friendly APIs for data generation and baseline results with popular machine learning models (FNO, U-Net, PINN, Gradient-Based Inverse Method). PDEBench allows researchers to extend the benchmark freely for their own purposes using a standardized API and to compare the performance of new models to existing baseline methods. We also propose new evaluation metrics with the aim to provide a more holistic understanding of learning methods in the context of Scientific ML. With those metrics we identify tasks which are challenging for recent ML methods and propose these tasks as future challenges for the community. The code is available at https://github.com/pdebench/PDEBench.
    SketchySGD: Reliable Stochastic Optimization via Robust Curvature Estimates. (arXiv:2211.08597v3 [math.OC] UPDATED)
    We introduce SketchySGD, a stochastic quasi-Newton method that uses sketching to approximate the curvature of the loss function. SketchySGD improves upon existing stochastic gradient methods in machine learning by using randomized low-rank approximations to the subsampled Hessian and by introducing an automated stepsize that works well across a wide range of convex machine learning problems. We show theoretically that SketchySGD with a fixed stepsize converges linearly to a small ball around the optimum. Further, in the ill-conditioned setting we show SketchySGD converges at a faster rate than SGD for least-squares problems. We validate this improvement empirically with ridge regression experiments on real data. Numerical experiments on both ridge and logistic regression problems show that SketchySGD can achieve comparable or better results to popular stochastic gradient methods with minimal hyperparameter tuning. The robustness of SketchySGD to hyperparameters is an advantage over other stochastic gradient methods, most of which require careful hyperparameter tuning (especially of the learning rate) to obtain good performance.
    RL4ReAl: Reinforcement Learning for Register Allocation. (arXiv:2204.02013v2 [cs.LG] UPDATED)
    We aim to automate decades of research and experience in register allocation, leveraging machine learning. We tackle this problem by embedding a multi-agent reinforcement learning algorithm within LLVM, training it with the state of the art techniques. We formalize the constraints that precisely define the problem for a given instruction-set architecture, while ensuring that the generated code preserves semantic correctness. We also develop a gRPC based framework providing a modular and efficient compiler interface for training and inference. Our approach is architecture independent: we show experimental results targeting Intel x86 and ARM AArch64. Our results match or out-perform the heavily tuned, production-grade register allocators of LLVM.  ( 2 min )
    SemEval 2023 Task 9: Multilingual Tweet Intimacy Analysis. (arXiv:2210.01108v2 [cs.CL] UPDATED)
    We propose MINT, a new Multilingual INTimacy analysis dataset covering 13,372 tweets in 10 languages including English, French, Spanish, Italian, Portuguese, Korean, Dutch, Chinese, Hindi, and Arabic. We benchmarked a list of popular multilingual pre-trained language models. The dataset is released along with the SemEval 2023 Task 9: Multilingual Tweet Intimacy Analysis (https://sites.google.com/umich.edu/semeval-2023-tweet-intimacy).
    New Machine Learning Techniques for Simulation-Based Inference: InferoStatic Nets, Kernel Score Estimation, and Kernel Likelihood Ratio Estimation. (arXiv:2210.01680v2 [stat.ML] UPDATED)
    We propose an intuitive, machine-learning approach to multiparameter inference, dubbed the InferoStatic Networks (ISN) method, to model the score and likelihood ratio estimators in cases when the probability density can be sampled but not computed directly. The ISN uses a backend neural network that models a scalar function called the inferostatic potential $\varphi$. In addition, we introduce new strategies, respectively called Kernel Score Estimation (KSE) and Kernel Likelihood Ratio Estimation (KLRE), to learn the score and the likelihood ratio functions from simulated data. We illustrate the new techniques with some toy examples and compare to existing approaches in the literature. We mention en passant some new loss functions that optimally incorporate latent information from simulations into the training procedure.
    Conditional Antibody Design as 3D Equivariant Graph Translation. (arXiv:2208.06073v4 [q-bio.BM] UPDATED)
    Antibody design is valuable for therapeutic usage and biological research. Existing deep-learning-based methods encounter several key issues: 1) incomplete context for Complementarity-Determining Regions (CDRs) generation; 2) incapability of capturing the entire 3D geometry of the input structure; 3) inefficient prediction of the CDR sequences in an autoregressive manner. In this paper, we propose Multi-channel Equivariant Attention Network (MEAN) to co-design 1D sequences and 3D structures of CDRs. To be specific, MEAN formulates antibody design as a conditional graph translation problem by importing extra components including the target antigen and the light chain of the antibody. Then, MEAN resorts to E(3)-equivariant message passing along with a proposed attention mechanism to better capture the geometrical correlation between different components. Finally, it outputs both the 1D sequences and 3D structure via a multi-round progressive full-shot scheme, which enjoys more efficiency and precision against previous autoregressive approaches. Our method significantly surpasses state-of-the-art models in sequence and structure modeling, antigen-binding CDR design, and binding affinity optimization. Specifically, the relative improvement to baselines is about 23% in antigen-binding CDR design and 34% for affinity optimization.
    Testing Rare Downstream Safety Violations via Upstream Adaptive Sampling of Perception Error Models. (arXiv:2209.09674v3 [cs.RO] UPDATED)
    Testing black-box perceptual-control systems in simulation faces two difficulties. Firstly, perceptual inputs in simulation lack the fidelity of real-world sensor inputs. Secondly, for a reasonably accurate perception system, encountering a rare failure trajectory may require running infeasibly many simulations. This paper combines perception error models -- surrogates for a sensor-based detection system -- with state-dependent adaptive importance sampling. This allows us to efficiently assess the rare failure probabilities for real-world perceptual control systems within simulation. Our experiments with an autonomous braking system equipped with an RGB obstacle-detector show that our method can calculate accurate failure probabilities with an inexpensive number of simulations. Further, we show how choice of safety metric can influence the process of learning proposal distributions capable of reliably sampling high-probability failures.
    Physically Consistent Learning of Conservative Lagrangian Systems with Gaussian Processes. (arXiv:2206.12272v3 [cs.LG] UPDATED)
    This paper proposes a physically consistent Gaussian Process (GP) enabling the identification of uncertain Lagrangian systems. The function space is tailored according to the energy components of the Lagrangian and the differential equation structure, analytically guaranteeing physical and mathematical properties such as energy conservation and quadratic form. The novel formulation of Cholesky decomposed matrix kernels allow the probabilistic preservation of positive definiteness. Only differential input-to-output measurements of the function map are required while Gaussian noise is permitted in torques, velocities, and accelerations. We demonstrate the effectiveness of the approach in numerical simulation.
    Socially Fair Reinforcement Learning. (arXiv:2208.12584v2 [cs.LG] UPDATED)
    We consider the problem of episodic reinforcement learning where there are multiple stakeholders with different reward functions. Our goal is to output a policy that is socially fair with respect to different reward functions. Prior works have proposed different objectives that a fair policy must optimize including minimum welfare, and generalized Gini welfare. We first take an axiomatic view of the problem, and propose four axioms that any such fair objective must satisfy. We show that the Nash social welfare is the unique objective that uniquely satisfies all four objectives, whereas prior objectives fail to satisfy all four axioms. We then consider the learning version of the problem where the underlying model i.e. Markov decision process is unknown. We consider the problem of minimizing regret with respect to the fair policies maximizing three different fair objectives -- minimum welfare, generalized Gini welfare, and Nash social welfare. Based on optimistic planning, we propose a generic learning algorithm and derive its regret bound with respect to the three different policies. For the objective of Nash social welfare, we also derive a lower bound in regret that grows exponentially with $n$, the number of agents. Finally, we show that for the objective of minimum welfare, one can improve regret by a factor of $O(H)$ for a weaker notion of regret.
    Distributionally Robust Causal Inference with Observational Data. (arXiv:2210.08326v3 [stat.ME] UPDATED)
    We consider the estimation of average treatment effects in observational studies and propose a new framework of robust causal inference with unobserved confounders. Our approach is based on distributionally robust optimization and proceeds in two steps. We first specify the maximal degree to which the distribution of unobserved potential outcomes may deviate from that of observed outcomes. We then derive sharp bounds on the average treatment effects under this assumption. Our framework encompasses the popular marginal sensitivity model as a special case, and we demonstrate how the proposed methodology can address a primary challenge of the marginal sensitivity model that it produces uninformative results when unobserved confounders substantially affect treatment and outcome. Specifically, we develop an alternative sensitivity model, called the distributional sensitivity model, under the assumption that heterogeneity of treatment effect due to unobserved variables is relatively small. Unlike the marginal sensitivity model, the distributional sensitivity model allows for potential lack of overlap and often produces informative bounds even when unobserved variables substantially affect both treatment and outcome. Finally, we show how to extend the distributional sensitivity model to difference-in-differences designs and settings with instrumental variables. Through simulation and empirical studies, we demonstrate the applicability of the proposed methodology.
    Searching for the Essence of Adversarial Perturbations. (arXiv:2205.15357v3 [cs.LG] UPDATED)
    Neural networks have demonstrated state-of-the-art performance in various machine learning fields. However, the introduction of malicious perturbations in input data, known as adversarial examples, has been shown to deceive neural network predictions. This poses potential risks for real-world applications such as autonomous driving and text identification. In order to mitigate these risks, a comprehensive understanding of the mechanisms underlying adversarial examples is essential. In this study, we demonstrate that adversarial perturbations contain human-recognizable information, which is the key conspirator responsible for a neural network's incorrect prediction, in contrast to the widely held belief that human-unidentifiable characteristics play a critical role in fooling a network. This concept of human-recognizable characteristics enables us to explain key features of adversarial perturbations, including their existence, transferability among different neural networks, and increased interpretability for adversarial training. We also uncover two unique properties of adversarial perturbations that deceive neural networks: masking and generation. Additionally, a special class, the complementary class, is identified when neural networks classify input images. The presence of human-recognizable information in adversarial perturbations allows researchers to gain insight into the working principles of neural networks and may lead to the development of techniques for detecting and defending against adversarial attacks.
    GOOD: Exploring Geometric Cues for Detecting Objects in an Open World. (arXiv:2212.11720v3 [cs.CV] UPDATED)
    We address the task of open-world class-agnostic object detection, i.e., detecting every object in an image by learning from a limited number of base object classes. State-of-the-art RGB-based models suffer from overfitting the training classes and often fail at detecting novel-looking objects. This is because RGB-based models primarily rely on appearance similarity to detect novel objects and are also prone to overfitting short-cut cues such as textures and discriminative parts. To address these shortcomings of RGB-based object detectors, we propose incorporating geometric cues such as depth and normals, predicted by general-purpose monocular estimators. Specifically, we use the geometric cues to train an object proposal network for pseudo-labeling unannotated novel objects in the training set. Our resulting Geometry-guided Open-world Object Detector (GOOD) significantly improves detection recall for novel object categories and already performs well with only a few training classes. Using a single "person" class for training on the COCO dataset, GOOD surpasses SOTA methods by 5.0% AR@100, a relative improvement of 24%.  ( 2 min )
    Realizable Learning is All You Need. (arXiv:2111.04746v3 [cs.LG] UPDATED)
    The equivalence of realizable and agnostic learnability is a fundamental phenomenon in learning theory. With variants ranging from classical settings like PAC learning and regression to recent trends such as adversarially robust learning, it's surprising that we still lack a unified theory; traditional proofs of the equivalence tend to be disparate, and rely on strong model-specific assumptions like uniform convergence and sample compression. In this work, we give the first model-independent framework explaining the equivalence of realizable and agnostic learnability: a three-line blackbox reduction that simplifies, unifies, and extends our understanding across a wide variety of settings. This includes models with no known characterization of learnability such as learning with arbitrary distributional assumptions and more general loss functions, as well as a host of other popular settings such as robust learning, partial learning, fair learning, and the statistical query model. More generally, we argue that the equivalence of realizable and agnostic learning is actually a special case of a broader phenomenon we call property generalization: any desirable property of a learning algorithm (e.g. noise tolerance, privacy, stability) that can be satisfied over finite hypothesis classes extends (possibly in some variation) to any learnable hypothesis class.  ( 2 min )
    Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement Learning with Latent Low-Rank Structure. (arXiv:2206.03569v3 [cs.LG] UPDATED)
    The practicality of reinforcement learning algorithms has been limited due to poor scaling with respect to the problem size, as the sample complexity of learning an $\epsilon$-optimal policy is $\tilde{\Omega}\left(|S||A|H^3 / \epsilon^2\right)$ over worst case instances of an MDP with state space $S$, action space $A$, and horizon $H$. We consider a class of MDPs for which the associated optimal $Q^*$ function is low rank, where the latent features are unknown. While one would hope to achieve linear sample complexity in $|S|$ and $|A|$ due to the low rank structure, we show that without imposing further assumptions beyond low rank of $Q^*$, if one is constrained to estimate the $Q$ function using only observations from a subset of entries, there is a worst case instance in which one must incur a sample complexity exponential in the horizon $H$ to learn a near optimal policy. We subsequently show that under stronger low rank structural assumptions, given access to a generative model, Low Rank Monte Carlo Policy Iteration (LR-MCPI) and Low Rank Empirical Value Iteration (LR-EVI) achieve the desired sample complexity of $\tilde{O}\left((|S|+|A|)\mathrm{poly}(d,H)/\epsilon^2\right)$ for a rank $d$ setting, which is minimax optimal with respect to the scaling of $|S|, |A|$, and $\epsilon$. In contrast to literature on linear and low-rank MDPs, we do not require a known feature mapping, our algorithm is computationally simple, and our results hold for long time horizons. Our results provide insights on the minimal low-rank structural assumptions required on the MDP with respect to the transition kernel versus the optimal action-value function.
    FiT: Parameter Efficient Few-shot Transfer Learning for Personalized and Federated Image Classification. (arXiv:2206.08671v2 [stat.ML] UPDATED)
    Modern deep learning systems are increasingly deployed in situations such as personalization and federated learning where it is necessary to support i) learning on small amounts of data, and ii) communication efficient distributed training protocols. In this work, we develop FiLM Transfer (FiT) which fulfills these requirements in the image classification setting by combining ideas from transfer learning (fixed pretrained backbones and fine-tuned FiLM adapter layers) and meta-learning (automatically configured Naive Bayes classifiers and episodic training) to yield parameter efficient models with superior classification accuracy at low-shot. The resulting parameter efficiency is key for enabling few-shot learning, inexpensive model updates for personalization, and communication efficient federated learning. We experiment with FiT on a wide range of downstream datasets and show that it achieves better classification accuracy than the leading Big Transfer (BiT) algorithm at low-shot and achieves state-of-the art accuracy on the challenging VTAB-1k benchmark, with fewer than 1% of the updateable parameters. Finally, we demonstrate the parameter efficiency and superior accuracy of FiT in distributed low-shot applications including model personalization and federated learning where model update size is an important performance metric.
    Self-Programming Artificial Intelligence Using Code-Generating Language Models. (arXiv:2205.00167v2 [cs.AI] UPDATED)
    Recent progress in large-scale language models has enabled breakthroughs in previously intractable computer programming tasks. Prior work in meta-learning and neural architecture search has led to substantial successes across various task domains, spawning myriad approaches for algorithmically optimizing the design and learning dynamics of deep learning models. At the intersection of these research areas, we implement a code-generating language model with the ability to modify its own source code. Self-programming AI algorithms have been of interest since the dawn of AI itself. Although various theoretical formulations of generalized self-programming AI have been posed, no such system has been successfully implemented to date under real-world computational constraints. Applying AI-based code generation to AI itself, we develop and experimentally validate the first practical implementation of a self-programming AI system. We empirically show that a self-programming AI implemented using a code generation model can successfully modify its own source code to improve performance and program sub-models to perform auxiliary tasks. Our model can self-modify various properties including model architecture, computational capacity, and learning dynamics.  ( 2 min )
    PyGlove: Efficiently Exchanging ML Ideas as Code. (arXiv:2302.01918v1 [cs.LG])
    The increasing complexity and scale of machine learning (ML) has led to the need for more efficient collaboration among multiple teams. For example, when a research team invents a new architecture like "ResNet," it is desirable for multiple engineering teams to adopt it. However, the effort required for each team to study and understand the invention does not scale well with the number of teams or inventions. In this paper, we present an extension of our PyGlove library to easily and scalably share ML ideas. PyGlove represents ideas as symbolic rule-based patches, enabling researchers to write down the rules for models they have not seen. For example, an inventor can write rules that will "add skip-connections." This permits a network effect among teams: at once, any team can issue patches to all other teams. Such a network effect allows users to quickly surmount the cost of adopting PyGlove by writing less code quicker, providing a benefit that scales with time. We describe the new paradigm of organizing ML through symbolic patches and compare it to existing approaches. We also perform a case study of a large codebase where PyGlove led to an 80% reduction in the number of lines of code.  ( 2 min )
    Fast Feature Selection with Fairness Constraints. (arXiv:2202.13718v2 [cs.LG] UPDATED)
    We study the fundamental problem of selecting optimal features for model construction. This problem is computationally challenging on large datasets, even with the use of greedy algorithm variants. To address this challenge, we extend the adaptive query model, recently proposed for the greedy forward selection for submodular functions, to the faster paradigm of Orthogonal Matching Pursuit for non-submodular functions. The proposed algorithm achieves exponentially fast parallel run time in the adaptive query model, scaling much better than prior work. Furthermore, our extension allows the use of downward-closed constraints, which can be used to encode certain fairness criteria into the feature selection process. We prove strong approximation guarantees for the algorithm based on standard assumptions. These guarantees are applicable to many parametric models, including Generalized Linear Models. Finally, we demonstrate empirically that the proposed algorithm competes favorably with state-of-the-art techniques for feature selection, on real-world and synthetic datasets.  ( 2 min )
    Fast Bayesian Optimization of Needle-in-a-Haystack Problems using Zooming Memory-Based Initialization (ZoMBI). (arXiv:2208.13771v2 [cs.LG] UPDATED)
    Needle-in-a-Haystack problems exist across a wide range of applications including rare disease prediction, ecological resource management, fraud detection, and material property optimization. A Needle-in-a-Haystack problem arises when there is an extreme imbalance of optimum conditions relative to the size of the dataset. For example, only $0.82\%$ out of $146$k total materials in the open-access Materials Project database have a negative Poisson's ratio. However, current state-of-the-art optimization algorithms are not designed with the capabilities to find solutions to these challenging multidimensional Needle-in-a-Haystack problems, resulting in slow convergence to a global optimum or pigeonholing into a local minimum. In this paper, we present a Zooming Memory-Based Initialization algorithm, entitled ZoMBI. ZoMBI actively extracts knowledge from the previously best-performing evaluated experiments to iteratively zoom in the sampling search bounds towards the global optimum "needle" and then prunes the memory of low-performing historical experiments to accelerate compute times by reducing the algorithm time complexity from $O(n^3)$ to $O(\phi^3)$ for $\phi$ forward experiments per activation, which trends to a constant $O(1)$ over several activations. Additionally, ZoMBI implements two custom adaptive acquisition functions to further guide the sampling of new experiments toward the global optimum. We validate the algorithm's optimization performance on three real-world datasets exhibiting Needle-in-a-Haystack and further stress-test the algorithm's performance on an additional 174 analytical datasets. The ZoMBI algorithm demonstrates compute time speed-ups of 400x compared to traditional Bayesian optimization as well as efficiently discovering optima in under 100 experiments that are up to 3x more highly optimized than those discovered by similar methods MiP-EGO, TuRBO, and HEBO.
    Accelerometry-based classification of circulatory states during out-of-hospital cardiac arrest. (arXiv:2205.06540v2 [eess.SP] UPDATED)
    Objective: Exploit accelerometry data for an automatic, reliable, and prompt detection of spontaneous circulation during cardiac arrest, as this is both vital for patient survival and practically challenging. Methods: We developed a machine learning algorithm to automatically predict the circulatory state during cardiopulmonary resuscitation from 4-second-long snippets of accelerometry and electrocardiogram (ECG) data from pauses of chest compressions of real-world defibrillator records. The algorithm was trained based on 422 cases from the German Resuscitation Registry, for which ground truth labels were created by a manual annotation of physicians. It uses a kernelized Support Vector Machine classifier based on 49 features, which partially reflect the correlation between accelerometry and electrocardiogram data. Results: Evaluating 50 different test-training data splits, the proposed algorithm exhibits a balanced accuracy of 81.2%, a sensitivity of 80.6%, and a specificity of 81.8%, whereas using only ECG leads to a balanced accuracy of 76.5%, a sensitivity of 80.2%, and a specificity of 72.8%. Conclusion: The first method employing accelerometry for pulse/no-pulse decision yields a significant increase in performance compared to single ECG-signal usage. Significance: This shows that accelerometry provides relevant information for pulse/no-pulse decisions. In application, such an algorithm may be used to simplify retrospective annotation for quality management and, moreover, to support clinicians to assess circulatory state during cardiac arrest treatment.
    The Solvability of Interpretability Evaluation Metrics. (arXiv:2205.08696v2 [cs.LG] UPDATED)
    Feature attribution methods are popular for explaining neural network predictions, and they are often evaluated on metrics such as comprehensiveness and sufficiency. In this paper, we highlight an intriguing property of these metrics: their solvability. Concretely, we can define the problem of optimizing an explanation for a metric, which can be solved by beam search. This observation leads to the obvious yet unaddressed question: why do we use explainers (e.g., LIME) not based on solving the target metric, if the metric value represents explanation quality? We present a series of investigations showing strong performance of this beam search explainer and discuss its broader implication: a definition-evaluation duality of interpretability concepts. We implement the explainer and release the Python solvex package for models of text, image and tabular domains.  ( 2 min )
    LEAF: Navigating Concept Drift in Cellular Networks. (arXiv:2109.03011v5 [cs.NI] UPDATED)
    Operational networks commonly rely on machine learning models for many tasks, including detecting anomalies, inferring application performance, and forecasting demand. Yet, model accuracy can degrade due to concept drift, whereby the relationship between the features and the target to be predicted changes. Mitigating concept drift is an essential part of operationalizing machine learning models in general, but is of particular importance in networking's highly dynamic deployment environments. In this paper, we first characterize concept drift in a large cellular network for a major metropolitan area in the United States. We find that concept drift occurs across many important key performance indicators (KPIs), independently of the model, training set size, and time interval -- thus necessitating practical approaches to detect, explain, and mitigate it. We then show that frequent model retraining with newly available data is not sufficient to mitigate concept drift, and can even degrade model accuracy further. Finally, we develop a new methodology for concept drift mitigation, Local Error Approximation of Features (LEAF). LEAF works by detecting drift; explaining the features and time intervals that contribute the most to drift; and mitigates it using forgetting and over-sampling. We evaluate LEAF against industry-standard mitigation approaches (notably, periodic retraining) with more than four years of cellular KPI data. Our initial tests with a major cellular provider in the US show that LEAF consistently outperforms periodic and triggered retraining on complex, real-world data while reducing costly retraining operations.  ( 2 min )
    Online Verification of Deep Neural Networks under Domain Shift or Network Updates. (arXiv:2106.12732v2 [cs.LG] UPDATED)
    Although neural networks are widely used, it remains challenging to formally verify the safety and robustness of neural networks in real-world applications. Existing methods are designed to verify the network before deployment, which are limited to relatively simple specifications and fixed networks. These methods are not ready to be applied to real-world problems with complex and/or dynamically changing specifications and networks. To effectively handle such problems, verification needs to be performed online when these changes take place. However, it is still challenging to run existing verification algorithms online. Our key insight is that we can leverage the temporal dependencies of these changes to accelerate the verification process. This paper establishes a novel framework for scalable online verification to solve real-world verification problems with dynamically changing specifications and/or networks. We propose three types of acceleration algorithms: Branch Management to reduce repetitive computation, Perturbation Tolerance to tolerate changes, and Incremental Computation to reuse previous results. Experiment results show that our algorithms achieve up to $100\times$ acceleration, and thus show a promising way to extend neural network verification to real-world applications.  ( 2 min )
    Discovering Policies with DOMiNO: Diversity Optimization Maintaining Near Optimality. (arXiv:2205.13521v2 [cs.AI] UPDATED)
    Finding different solutions to the same problem is a key aspect of intelligence associated with creativity and adaptation to novel situations. In reinforcement learning, a set of diverse policies can be useful for exploration, transfer, hierarchy, and robustness. We propose DOMiNO, a method for Diversity Optimization Maintaining Near Optimality. We formalize the problem as a Constrained Markov Decision Process where the objective is to find diverse policies, measured by the distance between the state occupancies of the policies in the set, while remaining near-optimal with respect to the extrinsic reward. We demonstrate that the method can discover diverse and meaningful behaviors in various domains, such as different locomotion patterns in the DeepMind Control Suite. We perform extensive analysis of our approach, compare it with other multi-objective baselines, demonstrate that we can control both the quality and the diversity of the set via interpretable hyperparameters, and show that the discovered set is robust to perturbations.  ( 2 min )
    Towards Optimal Branching of Linear and Semidefinite Relaxations for Neural Network Robustness Certification. (arXiv:2101.09306v2 [cs.LG] UPDATED)
    In this paper, we study certifying the robustness of ReLU neural networks against adversarial input perturbations. To diminish the relaxation error suffered by the popular linear programming (LP) and semidefinite programming (SDP) certification methods, we take a branch-and-bound approach to propose partitioning the input uncertainty set and solving the relaxations on each part separately. We show that this approach reduces relaxation error, and that the error is eliminated entirely upon performing an LP relaxation with a partition intelligently designed to exploit the nature of the ReLU activations. To scale this approach to large networks, we consider using a coarser partition whereby the number of parts in the partition is reduced. We prove that computing such a coarse partition that directly minimizes the LP relaxation error is NP-hard. By instead minimizing the worst-case LP relaxation error, we develop a closed-form branching scheme. We extend the analysis to the SDP, where the feasible set geometry is exploited to design a branching scheme that minimizes the worst-case SDP relaxation error. Experiments on MNIST, CIFAR-10, and Wisconsin breast cancer diagnosis classifiers demonstrate significant increases in the percentages of test samples certified. By independently increasing the input size and the number of layers, we empirically illustrate under which regimes the branched LP and branched SDP are best applied.  ( 2 min )
    Aligning Robot and Human Representations. (arXiv:2302.01928v1 [cs.RO])
    To act in the world, robots rely on a representation of salient task aspects: for example, to carry a cup of coffee, a robot must consider movement efficiency and cup orientation in its behaviour. However, if we want robots to act for and with people, their representations must not be just functional but also reflective of what humans care about, i.e. their representations must be aligned with humans'. In this survey, we pose that current reward and imitation learning approaches suffer from representation misalignment, where the robot's learned representation does not capture the human's representation. We suggest that because humans will be the ultimate evaluator of robot performance in the world, it is critical that we explicitly focus our efforts on aligning learned task representations with humans, in addition to learning the downstream task. We advocate that current representation learning approaches in robotics should be studied from the perspective of how well they accomplish the objective of representation alignment. To do so, we mathematically define the problem, identify its key desiderata, and situate current robot learning methods within this formalism. We conclude the survey by suggesting future directions for exploring open challenges.  ( 2 min )
    Enhancing Once-For-All: A Study on Parallel Blocks, Skip Connections and Early Exits. (arXiv:2302.01888v1 [cs.LG])
    The use of Neural Architecture Search (NAS) techniques to automate the design of neural networks has become increasingly popular in recent years. The proliferation of devices with different hardware characteristics using such neural networks, as well as the need to reduce the power consumption for their search, has led to the realisation of Once-For-All (OFA), an eco-friendly algorithm characterised by the ability to generate easily adaptable models through a single learning process. In order to improve this paradigm and develop high-performance yet eco-friendly NAS techniques, this paper presents OFAv2, the extension of OFA aimed at improving its performance while maintaining the same ecological advantage. The algorithm is improved from an architectural point of view by including early exits, parallel blocks and dense skip connections. The training process is extended by two new phases called Elastic Level and Elastic Height. A new Knowledge Distillation technique is presented to handle multi-output networks, and finally a new strategy for dynamic teacher network selection is proposed. These modifications allow OFAv2 to improve its accuracy performance on the Tiny ImageNet dataset by up to 12.07% compared to the original version of OFA, while maintaining the algorithm flexibility and advantages.  ( 2 min )
    Unsupervised hierarchical clustering using the learning dynamics of RBMs. (arXiv:2302.01851v1 [cs.LG])
    Datasets in the real world are often complex and to some degree hierarchical, with groups and sub-groups of data sharing common characteristics at different levels of abstraction. Understanding and uncovering the hidden structure of these datasets is an important task that has many practical applications. To address this challenge, we present a new and general method for building relational data trees by exploiting the learning dynamics of the Restricted Boltzmann Machine (RBM). Our method is based on the mean-field approach, derived from the Plefka expansion, and developed in the context of disordered systems. It is designed to be easily interpretable. We tested our method in an artificially created hierarchical dataset and on three different real-world datasets (images of digits, mutations in the human genome, and a homologous family of proteins). The method is able to automatically identify the hierarchical structure of the data. This could be useful in the study of homologous protein sequences, where the relationships between proteins are critical for understanding their function and evolution.  ( 2 min )
    Data Representativity for Machine Learning and AI Systems. (arXiv:2203.04706v2 [stat.ML] UPDATED)
    Data representativity is crucial when drawing inference from data through machine learning models. Scholars have increased focus on unraveling the bias and fairness in models, also in relation to inherent biases in the input data. However, limited work exists on the representativity of samples (datasets) for appropriate inference in AI systems. This paper reviews definitions and notions of a representative sample and surveys their use in scientific AI literature. We introduce three measurable concepts to help focus the notions and evaluate different data samples. Furthermore, we demonstrate that the contrast between a representative sample in the sense of coverage of the input space, versus a representative sample mimicking the distribution of the target population is of particular relevance when building AI systems. Through empirical demonstrations on US Census data, we evaluate the opposing inherent qualities of these concepts. Finally, we propose a framework of questions for creating and documenting data with data representativity in mind, as an addition to existing dataset documentation templates.  ( 2 min )
    A Case Study for Compliance as Code with Graphs and Language Models: Public release of the Regulatory Knowledge Graph. (arXiv:2302.01842v1 [cs.AI])
    The paper presents a study on using language models to automate the construction of executable Knowledge Graph (KG) for compliance. The paper focuses on Abu Dhabi Global Market regulations and taxonomy, involves manual tagging a portion of the regulations, training BERT-based models, which are then applied to the rest of the corpus. Coreference resolution and syntax analysis were used to parse the relationships between the tagged entities and to form KG stored in a Neo4j database. The paper states that the use of machine learning models released by regulators to automate the interpretation of rules is a vital step towards compliance automation, demonstrates the concept querying with Cypher, and states that the produced sub-graphs combined with Graph Neural Networks (GNN) will achieve expandability in judgment automation systems. The graph is open sourced on GitHub to provide structured data for future advancements in the field.  ( 2 min )
    Online Ad Allocation with Predictions. (arXiv:2302.01827v1 [cs.LG])
    Display Ads and the generalized assignment problem are two well-studied online packing problems with important applications in ad allocation and other areas. In both problems, ad impressions arrive online and have to be allocated immediately to budget-constrained advertisers. Worst-case algorithms that achieve the ideal competitive ratio are known, but might act overly conservative given the predictable and usually tame nature of real-world input. Given this discrepancy, we develop an algorithm for both problems that incorporate machine-learned predictions and can thus improve the performance beyond the worst-case. Our algorithm is based on the work of Feldman et al. (2009) and similar in nature to Mahdian et al. (2007) who were the first to develop a learning-augmented algorithm for the related, but more structured Ad Words problem. We use a novel analysis to show that our algorithm is able to capitalize on a good prediction, while being robust against poor predictions. We experimentally evaluate our algorithm on synthetic and real-world data on a wide range of predictions. Our algorithm is consistently outperforming the worst-case algorithm without predictions.  ( 2 min )
    From Robustness to Privacy and Back. (arXiv:2302.01855v1 [cs.LG])
    We study the relationship between two desiderata of algorithms in statistical inference and machine learning: differential privacy and robustness to adversarial data corruptions. Their conceptual similarity was first observed by Dwork and Lei (STOC 2009), who observed that private algorithms satisfy robustness, and gave a general method for converting robust algorithms to private ones. However, all general methods for transforming robust algorithms into private ones lead to suboptimal error rates. Our work gives the first black-box transformation that converts any adversarially robust algorithm into one that satisfies pure differential privacy. Moreover, we show that for any low-dimensional estimation task, applying our transformation to an optimal robust estimator results in an optimal private estimator. Thus, we conclude that for any low-dimensional task, the optimal error rate for $\varepsilon$-differentially private estimators is essentially the same as the optimal error rate for estimators that are robust to adversarially corrupting $1/\varepsilon$ training samples. We apply our transformation to obtain new optimal private estimators for several high-dimensional tasks, including Gaussian (sparse) linear regression and PCA. Finally, we present an extension of our transformation that leads to approximate differentially private algorithms whose error does not depend on the range of the output space, which is impossible under pure differential privacy.  ( 2 min )
    AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners. (arXiv:2302.01877v1 [cs.LG])
    Diffusion models have demonstrated their powerful generative capability in many tasks, with great potential to serve as a paradigm for offline reinforcement learning. However, the quality of the diffusion model is limited by the insufficient diversity of training data, which hinders the performance of planning and the generalizability to new tasks. This paper introduces AdaptDiffuser, an evolutionary planning method with diffusion that can self-evolve to improve the diffusion model hence a better planner, not only for seen tasks but can also adapt to unseen tasks. AdaptDiffuser enables the generation of rich synthetic expert data for goal-conditioned tasks using guidance from reward gradients. It then selects high-quality data via a discriminator to finetune the diffusion model, which improves the generalization ability to unseen tasks. Empirical experiments on two benchmark environments and two carefully designed unseen tasks in KUKA industrial robot arm and Maze2D environments demonstrate the effectiveness of AdaptDiffuser. For example, AdaptDiffuser not only outperforms the previous art Diffuser by 20.8% on Maze2D and 7.5% on MuJoCo locomotion, but also adapts better to new tasks, e.g., KUKA pick-and-place, by 27.9% without requiring additional expert data.  ( 2 min )
    PINN Training using Biobjective Optimization: The Trade-off between Data Loss and Residual Loss. (arXiv:2302.01810v1 [cs.LG])
    Physics informed neural networks (PINNs) have proven to be an efficient tool to represent problems for which measured data are available and for which the dynamics in the data are expected to follow some physical laws. In this paper, we suggest a multiobjective perspective on the training of PINNs by treating the data loss and the residual loss as two individual objective functions in a truly biobjective optimization approach. As a showcase example, we consider COVID-19 predictions in Germany and built an extended susceptibles-infected-recovered (SIR) model with additionally considered leaky-vaccinated and hospitalized populations (SVIHR model) to model the transition rates and to predict future infections. SIR-type models are expressed by systems of ordinary differential equations (ODEs). We investigate the suitability of the generated PINN for COVID-19 predictions and compare the resulting predicted curves with those obtained by applying the method of non-standard finite differences to the system of ODEs and initial data. The approach is applicable to various systems of ODEs that define dynamical regimes. Those regimes do not need to be SIR-type models, and the corresponding underlying data sets do not have to be associated with COVID-19.  ( 2 min )
    Analyzing the impact of climate change on critical infrastructure from the scientific literature: A weakly supervised NLP approach. (arXiv:2302.01887v1 [cs.LG])
    Natural language processing (NLP) is a promising approach for analyzing large volumes of climate-change and infrastructure-related scientific literature. However, best-in-practice NLP techniques require large collections of relevant documents (corpus). Furthermore, NLP techniques using machine learning and deep learning techniques require labels grouping the articles based on user-defined criteria for a significant subset of a corpus in order to train the supervised model. Even labeling a few hundred documents with human subject-matter experts is a time-consuming process. To expedite this process, we developed a weak supervision-based NLP approach that leverages semantic similarity between categories and documents to (i) establish a topic-specific corpus by subsetting a large-scale open-access corpus and (ii) generate category labels for the topic-specific corpus. In comparison with a months-long process of subject-matter expert labeling, we assign category labels to the whole corpus using weak supervision and supervised learning in about 13 hours. The labeled climate and NCF corpus enable targeted, efficient identification of documents discussing a topic (or combination of topics) of interest and identification of various effects of climate change on critical infrastructure, improving the usability of scientific literature and ultimately supporting enhanced policy and decision making. To demonstrate this capability, we conduct topic modeling on pairs of climate hazards and NCFs to discover trending topics at the intersection of these categories. This method is useful for analysts and decision-makers to quickly grasp the relevant topics and most important documents linked to the topic.  ( 2 min )
    Target specific peptide design using latent space approximate trajectory collector. (arXiv:2302.01435v1 [cs.CE])
    Despite the prevalence and many successes of deep learning applications in de novo molecular design, the problem of peptide generation targeting specific proteins remains unsolved. A main barrier for this is the scarcity of the high-quality training data. To tackle the issue, we propose a novel machine learning based peptide design architecture, called Latent Space Approximate Trajectory Collector (LSATC). It consists of a series of samplers on an optimization trajectory on a highly non-convex energy landscape that approximates the distributions of peptides with desired properties in a latent space. The process involves little human intervention and can be implemented in an end-to-end manner. We demonstrate the model by the design of peptide extensions targeting Beta-catenin, a key nuclear effector protein involved in canonical Wnt signalling. When compared with a random sampler, LSATC can sample peptides with $36\%$ lower binding scores in a $16$ times smaller interquartile range (IQR) and $284\%$ less hydrophobicity with a $1.4$ times smaller IQR. LSATC also largely outperforms other common generative models. Finally, we utilized a clustering algorithm to select 4 peptides from the 100 LSATC designed peptides for experimental validation. The result confirms that all the four peptides extended by LSATC show improved Beta-catenin binding by at least $20.0\%$, and two of the peptides show a $3$ fold increase in binding affinity as compared to the base peptide.
    Machine Learning Extreme Acoustic Non-reciprocity in a Linear Waveguide with Multiple Nonlinear Asymmetric Gates. (arXiv:2302.01746v1 [eess.AS])
    This work is a study of acoustic non-reciprocity exhibited by a passive one-dimensional linear waveguide incorporating two local strongly nonlinear, asymmetric gates. Two local nonlinear gates break the symmetry and linearity of the waveguide, yielding strong global non-reciprocal acoustics, in the way that extremely different acoustical responses occur depending on the side of application of harmonic excitation. To the authors' best knowledge that the present two-gated waveguide is capable of extremely high acoustic non-reciprocity, at a much higher level to what is reported by active or passive devices in the current literature; moreover, this extreme performance combines with acceptable levels of transmissibility in the desired direction of wave propagation. Machine learning is utilized for predictive design of this gated waveguide in terms of the measures of transmissibility and non-reciprocity, with the aim of reducing the required computational time for high-dimensional parameter space analysis. The study sheds new light into the physics of these media and considers the advantages and limitations of using neural networks to analyze this type of physical problems. In the predicted desirable parameter space for intense non-reciprocity, the maximum transmissibility reaches as much as 40%, and the transmitted energy from upstream to downstream varies up to nine orders of magnitude, depending on the direction of wave transmission. The machine learning tools along with the numerical methods of this work can inform predictive designs of practical non-reciprocal waveguides and acoustic metamaterials that incorporate local nonlinear gates. The current paper shows that combinations of nonlinear gates can lead to extremely high non-reciprocity while maintaining desired levels of transmissibility.  ( 2 min )
    AIROGS: Artificial Intelligence for RObust Glaucoma Screening Challenge. (arXiv:2302.01738v1 [eess.IV])
    The early detection of glaucoma is essential in preventing visual impairment. Artificial intelligence (AI) can be used to analyze color fundus photographs (CFPs) in a cost-effective manner, making glaucoma screening more accessible. While AI models for glaucoma screening from CFPs have shown promising results in laboratory settings, their performance decreases significantly in real-world scenarios due to the presence of out-of-distribution and low-quality images. To address this issue, we propose the Artificial Intelligence for Robust Glaucoma Screening (AIROGS) challenge. This challenge includes a large dataset of around 113,000 images from about 60,000 patients and 500 different screening centers, and encourages the development of algorithms that are robust to ungradable and unexpected input data. We evaluated solutions from 14 teams in this paper, and found that the best teams performed similarly to a set of 20 expert ophthalmologists and optometrists. The highest-scoring team achieved an area under the receiver operating characteristic curve of 0.99 (95% CI: 0.98-0.99) for detecting ungradable images on-the-fly. Additionally, many of the algorithms showed robust performance when tested on three other publicly available datasets. These results demonstrate the feasibility of robust AI-enabled glaucoma screening.  ( 2 min )
    Transformers in Action Recognition: A Review on Temporal Modeling. (arXiv:2302.01921v1 [cs.CV])
    In vision-based action recognition, spatio-temporal features from different modalities are used for recognizing activities. Temporal modeling is a long challenge of action recognition. However, there are limited methods such as pre-computed motion features, three-dimensional (3D) filters, and recurrent neural networks (RNN) for modeling motion information in deep-based approaches. Recently, transformers success in modeling long-range dependencies in natural language processing (NLP) tasks has gotten great attention from other domains; including speech, image, and video, to rely entirely on self-attention without using sequence-aligned RNNs or convolutions. Although the application of transformers to action recognition is relatively new, the amount of research proposed on this topic within the last few years is astounding. This paper especially reviews recent progress in deep learning methods for modeling temporal variations. It focuses on action recognition methods that use transformers for temporal modeling, discussing their main features, used modalities, and identifying opportunities and challenges for future research.
    Leveraging weak complementary labels to improve semantic segmentation of hepatocellular carcinoma and cholangiocarcinoma in H&E-stained slides. (arXiv:2302.01813v1 [cs.CV])
    In this paper, we present a deep learning segmentation approach to classify and quantify the two most prevalent primary liver cancers - hepatocellular carcinoma and intrahepatic cholangiocarcinoma - from hematoxylin and eosin (H&E) stained whole slide images. While semantic segmentation of medical images typically requires costly pixel-level annotations by domain experts, there often exists additional information which is routinely obtained in clinical diagnostics but rarely utilized for model training. We propose to leverage such weak information from patient diagnoses by deriving complementary labels that indicate to which class a sample cannot belong to. To integrate these labels, we formulate a complementary loss for segmentation. Motivated by the medical application, we demonstrate for general segmentation tasks that including additional patches with solely weak complementary labels during model training can significantly improve the predictive performance and robustness of a model. On the task of diagnostic differentiation between hepatocellular carcinoma and intrahepatic cholangiocarcinoma, we achieve a balanced accuracy of 0.91 (CI 95%: 0.86 - 0.95) at case level for 165 hold-out patients. Furthermore, we also show that leveraging complementary labels improves the robustness of segmentation and increases performance at case level.  ( 2 min )
    Coinductive guide to inductive transformer heads. (arXiv:2302.01834v1 [cs.LG])
    We argue that all building blocks of transformer models can be expressed with a single concept: combinatorial Hopf algebra. Transformer learning emerges as a result of the subtle interplay between the algebraic and coalgebraic operations of the combinatorial Hopf algebra. Viewed through this lens, the transformer model becomes a linear time-invariant system where the attention mechanism computes a generalized convolution transform and the residual stream serves as a unit impulse. Attention-only transformers then learn by enforcing an invariant between these two paths. We call this invariant Hopf coherence. Due to this, with a degree of poetic license, one could call combinatorial Hopf algebras "tensors with a built-in loss function gradient". This loss function gradient occurs within the single layers and no backward pass is needed. This is in contrast to automatic differentiation which happens across the whole graph and needs a explicit backward pass. This property is the result of the fact that combinatorial Hopf algebras have the surprising property of calculating eigenvalues by repeated squaring.  ( 2 min )
    Learning finite difference methods for reaction-diffusion type equations with FCNN. (arXiv:2201.01854v2 [cs.LG] UPDATED)
    In recent years, Physics-informed neural networks (PINNs) have been widely used to solve partial differential equations alongside numerical methods because PINNs can be trained without observations and deal with continuous-time problems directly. In contrast, optimizing the parameters of such models is difficult, and individual training sessions must be performed to predict the evolutions of each different initial condition. To alleviate the first problem, observed data can be injected directly into the loss function part. To solve the second problem, a network architecture can be built as a framework to learn a finite difference method. In view of the two motivations, we propose Five-point stencil CNNs (FCNNs) containing a five-point stencil kernel and a trainable approximation function for reaction-diffusion type equations including the heat, Fisher's, Allen-Cahn, and other reaction-diffusion equations with trigonometric function terms. We show that FCNNs can learn finite difference schemes using few data and achieve the low relative errors of diverse reaction-diffusion evolutions with unseen initial conditions. Furthermore, we demonstrate that FCNNs can still be trained well even with using noisy data.
    From slides (through tiles) to pixels: an explainability framework for weakly supervised models in pre-clinical pathology. (arXiv:2302.01653v1 [cs.CV])
    In pre-clinical pathology, there is a paradox between the abundance of raw data (whole slide images from many organs of many individual animals) and the lack of pixel-level slide annotations done by pathologists. Due to time constraints and requirements from regulatory authorities, diagnoses are instead stored as slide labels. Weakly supervised training is designed to take advantage of those data, and the trained models can be used by pathologists to rank slides by their probability of containing a given lesion of interest. In this work, we propose a novel contextualized eXplainable AI (XAI) framework and its application to deep learning models trained on Whole Slide Images (WSIs) in Digital Pathology. Specifically, we apply our methods to a multi-instance-learning (MIL) model, which is trained solely on slide-level labels, without the need for pixel-level annotations. We validate quantitatively our methods by quantifying the agreements of our explanations' heatmaps with pathologists' annotations, as well as with predictions from a segmentation model trained on such annotations. We demonstrate the stability of the explanations with respect to input shifts, and the fidelity with respect to increased model performance. We quantitatively evaluate the correlation between available pixel-wise annotations and explainability heatmaps. We show that the explanations on important tiles of the whole slide correlate with tissue changes between healthy regions and lesions, but do not exactly behave like a human annotator. This result is coherent with the model training strategy.  ( 2 min )
    Fixing by Mixing: A Recipe for Optimal Byzantine ML under Heterogeneity. (arXiv:2302.01772v1 [cs.LG])
    Byzantine machine learning (ML) aims to ensure the resilience of distributed learning algorithms to misbehaving (or Byzantine) machines. Although this problem received significant attention, prior works often assume the data held by the machines to be homogeneous, which is seldom true in practical settings. Data heterogeneity makes Byzantine ML considerably more challenging, since a Byzantine machine can hardly be distinguished from a non-Byzantine outlier. A few solutions have been proposed to tackle this issue, but these provide suboptimal probabilistic guarantees and fare poorly in practice. This paper closes the theoretical gap, achieving optimality and inducing good empirical results. In fact, we show how to automatically adapt existing solutions for (homogeneous) Byzantine ML to the heterogeneous setting through a powerful mechanism, we call nearest neighbor mixing (NNM), which boosts any standard robust distributed gradient descent variant to yield optimal Byzantine resilience under heterogeneity. We obtain similar guarantees (in expectation) by plugging NNM in the distributed stochastic heavy ball method, a practical substitute to distributed gradient descent. We obtain empirical results that significantly outperform state-of-the-art Byzantine ML solutions.  ( 2 min )
    Avalanche: A PyTorch Library for Deep Continual Learning. (arXiv:2302.01766v1 [cs.LG])
    Continual learning is the problem of learning from a nonstationary stream of data, a fundamental issue for sustainable and efficient training of deep neural networks over time. Unfortunately, deep learning libraries only provide primitives for offline training, assuming that model's architecture and data are fixed. Avalanche is an open source library maintained by the ContinualAI non-profit organization that extends PyTorch by providing first-class support for dynamic architectures, streams of datasets, and incremental training and evaluation methods. Avalanche provides a large set of predefined benchmarks and training algorithms and it is easy to extend and modular while supporting a wide range of continual learning scenarios. Documentation is available at \url{https://avalanche.continualai.org}.  ( 2 min )
    Certified Robustness of Learning-based Static Malware Detectors. (arXiv:2302.01757v1 [cs.CR])
    Certified defenses are a recent development in adversarial machine learning (ML), which aim to rigorously guarantee the robustness of ML models to adversarial perturbations. A large body of work studies certified defenses in computer vision, where $\ell_p$ norm-bounded evasion attacks are adopted as a tractable threat model. However, this threat model has known limitations in vision, and is not applicable to other domains -- e.g., where inputs may be discrete or subject to complex constraints. Motivated by this gap, we study certified defenses for malware detection, a domain where attacks against ML-based systems are a real and current threat. We consider static malware detection systems that operate on byte-level data. Our certified defense is based on the approach of randomized smoothing which we adapt by: (1) replacing the standard Gaussian randomization scheme with a novel deletion randomization scheme that operates on bytes or chunks of an executable; and (2) deriving a certificate that measures robustness to evasion attacks in terms of generalized edit distance. To assess the size of robustness certificates that are achievable while maintaining high accuracy, we conduct experiments on malware datasets using a popular convolutional malware detection model, MalConv. We are able to accurately classify 91% of the inputs while being certifiably robust to any adversarial perturbations of edit distance 128 bytes or less. By comparison, an existing certification of up to 128 bytes of substitutions (without insertions or deletions) achieves an accuracy of 78%. In addition, given that robustness certificates are conservative, we evaluate practical robustness to several recently published evasion attacks and, in some cases, find robustness beyond certified guarantees.  ( 2 min )
    Mind the Gap: Offline Policy Optimization for Imperfect Rewards. (arXiv:2302.01667v1 [cs.LG])
    Reward function is essential in reinforcement learning (RL), serving as the guiding signal to incentivize agents to solve given tasks, however, is also notoriously difficult to design. In many cases, only imperfect rewards are available, which inflicts substantial performance loss for RL agents. In this study, we propose a unified offline policy optimization approach, \textit{RGM (Reward Gap Minimization)}, which can smartly handle diverse types of imperfect rewards. RGM is formulated as a bi-level optimization problem: the upper layer optimizes a reward correction term that performs visitation distribution matching w.r.t. some expert data; the lower layer solves a pessimistic RL problem with the corrected rewards. By exploiting the duality of the lower layer, we derive a tractable algorithm that enables sampled-based learning without any online interactions. Comprehensive experiments demonstrate that RGM achieves superior performance to existing methods under diverse settings of imperfect rewards. Further, RGM can effectively correct wrong or inconsistent rewards against expert preference and retrieve useful information from biased rewards.  ( 2 min )
    BackdoorBox: A Python Toolbox for Backdoor Learning. (arXiv:2302.01762v1 [cs.CR])
    Third-party resources ($e.g.$, samples, backbones, and pre-trained models) are usually involved in the training of deep neural networks (DNNs), which brings backdoor attacks as a new training-phase threat. In general, backdoor attackers intend to implant hidden backdoor in DNNs, so that the attacked DNNs behave normally on benign samples whereas their predictions will be maliciously changed to a pre-defined target label if hidden backdoors are activated by attacker-specified trigger patterns. To facilitate the research and development of more secure training schemes and defenses, we design an open-sourced Python toolbox that implements representative and advanced backdoor attacks and defenses under a unified and flexible framework. Our toolbox has four important and promising characteristics, including consistency, simplicity, flexibility, and co-development. It allows researchers and developers to easily implement and compare different methods on benchmark or their local datasets. This Python toolbox, namely \texttt{BackdoorBox}, is available at \url{https://github.com/THUYimingLi/BackdoorBox}.  ( 2 min )
    Creating Probabilistic Forecasts from Arbitrary Deterministic Forecasts using Conditional Invertible Neural Networks. (arXiv:2302.01800v1 [cs.LG])
    In various applications, probabilistic forecasts are required to quantify the inherent uncertainty associated with the forecast. However, numerous modern forecasting methods are still designed to create deterministic forecasts. Transforming these deterministic forecasts into probabilistic forecasts is often challenging and based on numerous assumptions that may not hold in real-world situations. Therefore, the present article proposes a novel approach for creating probabilistic forecasts from arbitrary deterministic forecasts. In order to implement this approach, we use a conditional Invertible Neural Network (cINN). More specifically, we apply a cINN to learn the underlying distribution of the data and then combine the uncertainty from this distribution with an arbitrary deterministic forecast to generate accurate probabilistic forecasts. Our approach enables the simple creation of probabilistic forecasts without complicated statistical loss functions or further assumptions. Besides showing the mathematical validity of our approach, we empirically show that our approach noticeably outperforms traditional methods for including uncertainty in deterministic forecasts and generally outperforms state-of-the-art probabilistic forecasting benchmarks.  ( 2 min )
    Leveraging a Probabilistic PCA Model to Understand the Multivariate Statistical Network Monitoring Framework for Network Security Anomaly Detection. (arXiv:2302.01759v1 [stat.ML])
    Network anomaly detection is a very relevant research area nowadays, especially due to its multiple applications in the field of network security. The boost of new models based on variational autoencoders and generative adversarial networks has motivated a reevaluation of traditional techniques for anomaly detection. It is, however, essential to be able to understand these new models from the perspective of the experience attained from years of evaluating network security data for anomaly detection. In this paper, we revisit anomaly detection techniques based on PCA from a probabilistic generative model point of view, and contribute a mathematical model that relates them. Specifically, we start with the probabilistic PCA model and explain its connection to the Multivariate Statistical Network Monitoring (MSNM) framework. MSNM was recently successfully proposed as a means of incorporating industrial process anomaly detection experience into the field of networking. We have evaluated the mathematical model using two different datasets. The first, a synthetic dataset created to better understand the analysis proposed, and the second, UGR'16, is a specifically designed real-traffic dataset for network security anomaly detection. We have drawn conclusions that we consider to be useful when applying generative models to network security detection.
    DEUP: Direct Epistemic Uncertainty Prediction. (arXiv:2102.08501v4 [cs.LG] UPDATED)
    Epistemic Uncertainty is a measure of the lack of knowledge of a learner which diminishes with more evidence. While existing work focuses on using the variance of the Bayesian posterior due to parameter uncertainty as a measure of epistemic uncertainty, we argue that this does not capture the part of lack of knowledge induced by model misspecification. We discuss how the excess risk, which is the gap between the generalization error of a predictor and the Bayes predictor, is a sound measure of epistemic uncertainty which captures the effect of model misspecification. We thus propose a principled framework for directly estimating the excess risk by learning a secondary predictor for the generalization error and subtracting an estimate of aleatoric uncertainty, i.e., intrinsic unpredictability. We discuss the merits of this novel measure of epistemic uncertainty, and highlight how it differs from variance-based measures of epistemic uncertainty and addresses its major pitfall. Our framework, Direct Epistemic Uncertainty Prediction (DEUP) is particularly interesting in interactive learning environments, where the learner is allowed to acquire novel examples in each round. Through a wide set of experiments, we illustrate how existing methods in sequential model optimization can be improved with epistemic uncertainty estimates from DEUP, and how DEUP can be used to drive exploration in reinforcement learning. We also evaluate the quality of uncertainty estimates from DEUP for probabilistic image classification and predicting synergies of drug combinations.
    Learning a Fourier Transform for Linear Relative Positional Encodings in Transformers. (arXiv:2302.01925v1 [cs.LG])
    We propose a new class of linear Transformers called FourierLearner-Transformers (FLTs), which incorporate a wide range of relative positional encoding mechanisms (RPEs). These include regular RPE techniques applied for nongeometric data, as well as novel RPEs operating on the sequences of tokens embedded in higher-dimensional Euclidean spaces (e.g. point clouds). FLTs construct the optimal RPE mechanism implicitly by learning its spectral representation. As opposed to other architectures combining efficient low-rank linear attention with RPEs, FLTs remain practical in terms of their memory usage and do not require additional assumptions about the structure of the RPE-mask. FLTs allow also for applying certain structural inductive bias techniques to specify masking strategies, e.g. they provide a way to learn the so-called local RPEs introduced in this paper and providing accuracy gains as compared with several other linear Transformers for language modeling. We also thoroughly tested FLTs on other data modalities and tasks, such as: image classification and 3D molecular modeling. For 3D-data FLTs are, to the best of our knowledge, the first Transformers architectures providing RPE-enhanced linear attention.
    Show me your NFT and I tell you how it will perform: Multimodal representation learning for NFT selling price prediction. (arXiv:2302.01676v1 [cs.LG])
    Non-Fungible Tokens (NFTs) represent deeds of ownership, based on blockchain technologies and smart contracts, of unique crypto assets on digital art forms (e.g., artworks or collectibles). In the spotlight after skyrocketing in 2021, NFTs have attracted the attention of crypto enthusiasts and investors intent on placing promising investments in this profitable market. However, the NFT financial performance prediction has not been widely explored to date. In this work, we address the above problem based on the hypothesis that NFT images and their textual descriptions are essential proxies to predict the NFT selling prices. To this purpose, we propose MERLIN, a novel multimodal deep learning framework designed to train Transformer-based language and visual models, along with graph neural network models, on collections of NFTs' images and texts. A key aspect in MERLIN is its independence on financial features, as it exploits only the primary data a user interested in NFT trading would like to deal with, i.e., NFT images and textual descriptions. By learning dense representations of such data, a price-category classification task is performed by MERLIN models, which can also be tuned according to user preferences in the inference phase to mimic different risk-return investment profiles. Experimental evaluation on a publicly available dataset has shown that MERLIN models achieve significant performances according to several financial assessment criteria, fostering profitable investments, and also beating baseline machine-learning classifiers based on financial features.  ( 2 min )
    Using Explainability to Inform Statistical Downscaling Based on Deep Learning Beyond Standard Validation Approaches. (arXiv:2302.01771v1 [stat.ML])
    Deep learning (DL) has emerged as a promising tool to downscale climate projections at regional-to-local scales from large-scale atmospheric fields following the perfect-prognosis (PP) approach. Given their complexity, it is crucial to properly evaluate these methods, especially when applied to changing climatic conditions where the ability to extrapolate/generalise is key. In this work, we intercompare several DL models extracted from the literature for the same challenging use-case (downscaling temperature in the CORDEX North America domain) and expand standard evaluation methods building on eXplainable artifical intelligence (XAI) techniques. We show how these techniques can be used to unravel the internal behaviour of these models, providing new evaluation dimensions and aiding in their diagnostic and design. These results show the usefulness of incorporating XAI techniques into statistical downscaling evaluation frameworks, especially when working with large regions and/or under climate change conditions.
    Stochastic Policy Gradient Methods: Improved Sample Complexity for Fisher-non-degenerate Policies. (arXiv:2302.01734v1 [cs.LG])
    Recently, the impressive empirical success of policy gradient (PG) methods has catalyzed the development of their theoretical foundations. Despite the huge efforts directed at the design of efficient stochastic PG-type algorithms, the understanding of their convergence to a globally optimal policy is still limited. In this work, we develop improved global convergence guarantees for a general class of Fisher-non-degenerate parameterized policies which allows to address the case of continuous state action spaces. First, we propose a Normalized Policy Gradient method with Implicit Gradient Transport (N-PG-IGT) and derive a $\tilde{\mathcal{O}}(\varepsilon^{-2.5})$ sample complexity of this method for finding a global $\varepsilon$-optimal policy. Improving over the previously known $\tilde{\mathcal{O}}(\varepsilon^{-3})$ complexity, this algorithm does not require the use of importance sampling or second-order information and samples only one trajectory per iteration. Second, we further improve this complexity to $\tilde{ \mathcal{\mathcal{O}} }(\varepsilon^{-2})$ by considering a Hessian-Aided Recursive Policy Gradient ((N)-HARPG) algorithm enhanced with a correction based on a Hessian-vector product. Interestingly, both algorithms are $(i)$ simple and easy to implement: single-loop, do not require large batches of trajectories and sample at most two trajectories per iteration; $(ii)$ computationally and memory efficient: they do not require expensive subroutines at each iteration and can be implemented with memory linear in the dimension of parameters.
    Reinforcing User Retention in a Billion Scale Short Video Recommender System. (arXiv:2302.01724v1 [cs.LG])
    Recently, short video platforms have achieved rapid user growth by recommending interesting content to users. The objective of the recommendation is to optimize user retention, thereby driving the growth of DAU (Daily Active Users). Retention is a long-term feedback after multiple interactions of users and the system, and it is hard to decompose retention reward to each item or a list of items. Thus traditional point-wise and list-wise models are not able to optimize retention. In this paper, we choose reinforcement learning methods to optimize the retention as they are designed to maximize the long-term performance. We formulate the problem as an infinite-horizon request-based Markov Decision Process, and our objective is to minimize the accumulated time interval of multiple sessions, which is equal to improving the app open frequency and user retention. However, current reinforcement learning algorithms can not be directly applied in this setting due to uncertainty, bias, and long delay time incurred by the properties of user retention. We propose a novel method, dubbed RLUR, to address the aforementioned challenges. Both offline and live experiments show that RLUR can significantly improve user retention. RLUR has been fully launched in Kuaishou app for a long time, and achieves consistent performance improvement on user retention and DAU.  ( 2 min )
    Distributional constrained reinforcement learning for supply chain optimization. (arXiv:2302.01727v1 [cs.LG])
    This work studies reinforcement learning (RL) in the context of multi-period supply chains subject to constraints, e.g., on production and inventory. We introduce Distributional Constrained Policy Optimization (DCPO), a novel approach for reliable constraint satisfaction in RL. Our approach is based on Constrained Policy Optimization (CPO), which is subject to approximation errors that in practice lead it to converge to infeasible policies. We address this issue by incorporating aspects of distributional RL into DCPO. Specifically, we represent the return and cost value functions using neural networks that output discrete distributions, and we reshape costs based on the associated confidence. Using a supply chain case study, we show that DCPO improves the rate at which the RL policy converges and ensures reliable constraint satisfaction by the end of training. The proposed method also improves predictability, greatly reducing the variance of returns between runs, respectively; this result is significant in the context of policy gradient methods, which intrinsically introduce significant variance during training.  ( 2 min )
    Interpretations of Domain Adaptations via Layer Variational Analysis. (arXiv:2302.01798v1 [cs.LG])
    Transfer learning is known to perform efficiently in many applications empirically, yet limited literature reports the mechanism behind the scene. This study establishes both formal derivations and heuristic analysis to formulate the theory of transfer learning in deep learning. Our framework utilizing layer variational analysis proves that the success of transfer learning can be guaranteed with corresponding data conditions. Moreover, our theoretical calculation yields intuitive interpretations towards the knowledge transfer process. Subsequently, an alternative method for network-based transfer learning is derived. The method shows an increase in efficiency and accuracy for domain adaptation. It is particularly advantageous when new domain data is sufficiently sparse during adaptation. Numerical experiments over diverse tasks validated our theory and verified that our analytic expression achieved better performance in domain adaptation than the gradient descent method.  ( 2 min )
    Command Line Interface Risk Modeling. (arXiv:2302.01749v1 [cs.CR])
    Protecting sensitive data is an essential part of security in cloud computing. However, only specific privileged individuals have access to view or interact with this data; therefore, it is unscalable to depend on these individuals also to maintain the software. A solution to this is to allow non-privileged individuals access to maintain these systems but mask sensitive information from egressing. To this end, we have created a machine-learning model to predict and redact fields with sensitive data. This work concentrates on Azure PowerShell, showing how it applies to other command-line interfaces and APIs. Using the F5-score as a weighted metric, we demonstrate different transformation techniques to map this problem from an unknown field to the well-researched area of natural language processing.  ( 2 min )
    Motion ID: Human Authentication Approach. (arXiv:2302.01751v1 [cs.CR])
    We introduce a novel approach to user authentication called Motion ID. The method employs motion sensing provided by inertial measurement units (IMUs), using it to verify the persons identity via short time series of IMU data captured by the mobile device. The paper presents two labeled datasets with unlock events: the first features IMU measurements, provided by six users who continuously collected data on six different smartphones for a period of 12 weeks. The second one contains 50 hours of IMU data for one specific motion pattern, provided by 101 users. Moreover, we present a two-stage user authentication process that employs motion pattern identification and user verification and is based on data preprocessing and machine learning. The Results section details the assessment of the method proposed, comparing it with existing biometric authentication methods and the Android biometric standard. The method has demonstrated high accuracy, indicating that it could be successfully used in combination with existing methods. Furthermore, the method exhibits significant promise as a standalone solution. We provide the datasets to the scholarly community and share our project code.  ( 2 min )
    Leveraging Contaminated Datasets to Learn Clean-Data Distribution with Purified Generative Adversarial Networks. (arXiv:2302.01722v1 [cs.LG])
    Generative adversarial networks (GANs) are known for their strong abilities on capturing the underlying distribution of training instances. Since the seminal work of GAN, many variants of GAN have been proposed. However, existing GANs are almost established on the assumption that the training dataset is clean. But in many real-world applications, this may not hold, that is, the training dataset may be contaminated by a proportion of undesired instances. When training on such datasets, existing GANs will learn a mixture distribution of desired and contaminated instances, rather than the desired distribution of desired data only (target distribution). To learn the target distribution from contaminated datasets, two purified generative adversarial networks (PuriGAN) are developed, in which the discriminators are augmented with the capability to distinguish between target and contaminated instances by leveraging an extra dataset solely composed of contamination instances. We prove that under some mild conditions, the proposed PuriGANs are guaranteed to converge to the distribution of desired instances. Experimental results on several datasets demonstrate that the proposed PuriGANs are able to generate much better images from the desired distribution than comparable baselines when trained on contaminated datasets. In addition, we also demonstrate the usefulness of PuriGAN on downstream applications by applying it to the tasks of semi-supervised anomaly detection on contaminated datasets and PU-learning. Experimental results show that PuriGAN is able to deliver the best performance over comparable baselines on both tasks.  ( 2 min )
    Better Training of GFlowNets with Local Credit and Incomplete Trajectories. (arXiv:2302.01687v1 [cs.LG])
    Generative Flow Networks or GFlowNets are related to Monte-Carlo Markov chain methods (as they sample from a distribution specified by an energy function), reinforcement learning (as they learn a policy to sample composed objects through a sequence of steps), generative models (as they learn to represent and sample from a distribution) and amortized variational methods (as they can be used to learn to approximate and sample from an otherwise intractable posterior, given a prior and a likelihood). They are trained to generate an object $x$ through a sequence of steps with probability proportional to some reward function $R(x)$ (or $\exp(-\mathcal{E}(x))$ with $\mathcal{E}(x)$ denoting the energy function), given at the end of the generative trajectory. Like for other RL settings where the reward is only given at the end, the efficiency of training and credit assignment may suffer when those trajectories are longer. With previous GFlowNet work, no learning was possible from incomplete trajectories (lacking a terminal state and the computation of the associated reward). In this paper, we consider the case where the energy function can be applied not just to terminal states but also to intermediate states. This is for example achieved when the energy function is additive, with terms available along the trajectory. We show how to reparameterize the GFlowNet state flow function to take advantage of the partial reward already accrued at each state. This enables a training objective that can be applied to update parameters even with incomplete trajectories. Even when complete trajectories are available, being able to obtain more localized credit and gradients is found to speed up training convergence, as demonstrated across many simulations.  ( 2 min )
    Learning End-to-End Channel Coding with Diffusion Models. (arXiv:2302.01714v1 [cs.IT])
    It is a known problem that deep-learning-based end-to-end (E2E) channel coding systems depend on a known and differentiable channel model, due to the learning process and based on the gradient-descent optimization methods. This places the challenge to approximate or generate the channel or its derivative from samples generated by pilot signaling in real-world scenarios. Currently, there are two prevalent methods to solve this problem. One is to generate the channel via a generative adversarial network (GAN), and the other is to, in essence, approximate the gradient via reinforcement learning methods. Other methods include using score-based methods, variational autoencoders, or mutual-information-based methods. In this paper, we focus on generative models and, in particular, on a new promising method called diffusion models, which have shown a higher quality of generation in image-based tasks. We will show that diffusion models can be used in wireless E2E scenarios and that they work as good as Wasserstein GANs while having a more stable training procedure and a better generalization ability in testing.  ( 2 min )
    Where and How to Improve Graph-based Spatio-temporal Predictors. (arXiv:2302.01701v1 [stat.ML])
    This paper introduces a novel residual correlation analysis, called AZ-analysis, to assess the optimality of spatio-temporal predictive models. The proposed AZ-analysis constitutes a valuable asset for discovering and highlighting those space-time regions where the model can be improved with respect to performance. The AZ-analysis operates under very mild assumptions and is based on a spatio-temporal graph that encodes serial and functional dependencies in the data; asymptotically distribution-free summary statistics identify existing residual correlation in space and time regions, hence localizing time frames and/or communities of sensors, where the predictor can be improved.  ( 2 min )
    Rethinking Semi-Supervised Medical Image Segmentation: A Variance-Reduction Perspective. (arXiv:2302.01735v1 [cs.CV])
    For medical image segmentation, contrastive learning is the dominant practice to improve the quality of visual representations by contrasting semantically similar and dissimilar pairs of samples. This is enabled by the observation that without accessing ground truth label, negative examples with truly dissimilar anatomical features, if sampled, can significantly improve the performance. In reality, however, these samples may come from similar anatomical features and the models may struggle to distinguish the minority tail-class samples, making the tail classes more prone to misclassification, both of which typically lead to model collapse. In this paper, we propose ARCO, a semi-supervised contrastive learning (CL) framework with stratified group sampling theory in medical image segmentation. In particular, we first propose building ARCO through the concept of variance-reduced estimation, and show that certain variance-reduction techniques are particularly beneficial in medical image segmentation tasks with extremely limited labels. Furthermore, we theoretically prove these sampling techniques are universal in variance reduction. Finally, we experimentally validate our approaches on three benchmark datasets with different label settings, and our methods consistently outperform state-of-the-art semi- and fully-supervised methods. Additionally, we augment the CL frameworks with these sampling techniques and demonstrate significant gains over previous methods. We believe our work is an important step towards semi-supervised medical image segmentation by quantifying the limitation of current self-supervision objectives for accomplishing medical image analysis tasks.  ( 2 min )
    A Systematic Evaluation of Backdoor Trigger Characteristics in Image Classification. (arXiv:2302.01740v1 [cs.CV])
    Deep learning achieves outstanding results in many machine learning tasks. Nevertheless, it is vulnerable to backdoor attacks that modify the training set to embed a secret functionality in the trained model. The modified training samples have a secret property, i.e., a trigger. At inference time, the secret functionality is activated when the input contains the trigger, while the model functions correctly in other cases. While there are many known backdoor attacks (and defenses), deploying a stealthy attack is still far from trivial. Successfully creating backdoor triggers heavily depends on numerous parameters. Unfortunately, research has not yet determined which parameters contribute most to the attack performance. This paper systematically analyzes the most relevant parameters for the backdoor attacks, i.e., trigger size, position, color, and poisoning rate. Using transfer learning, which is very common in computer vision, we evaluate the attack on numerous state-of-the-art models (ResNet, VGG, AlexNet, and GoogLeNet) and datasets (MNIST, CIFAR10, and TinyImageNet). Our attacks cover the majority of backdoor settings in research, providing concrete directions for future works. Our code is publicly available to facilitate the reproducibility of our results.  ( 2 min )
    Improving the Timing Resolution of Positron Emission Tomography Detectors using Boosted Learning -- A Residual Physics Approach. (arXiv:2302.01681v1 [cs.LG])
    Artificial intelligence is finding its way into medical imaging, usually focusing on image reconstruction or enhancing analytical reconstructed images. However, optimizations along the complete processing chain, from detecting signals to computing data, enable significant improvements. Thus, we present an approach toward detector optimization using boosted learning by exploiting the concept of residual physics. In our work, we improve the coincidence time resolution (CTR) of positron emission tomography (PET) detectors. PET enables imaging of metabolic processes by detecting {\gamma}-photons with scintillation detectors. Current research exploits light-sharing detectors, where the scintillation light is distributed over and digitized by an array of readout channels. While these detectors demonstrate excellent performance parameters, e.g., regarding spatial resolution, extracting precise timing information for time-of-flight (TOF) becomes more challenging due to deteriorating effects called time skews. Conventional correction methods mainly rely on analytical formulations, theoretically capable of covering all time skew effects, e.g., caused by signal runtimes or physical effects. However, additional effects are involved for light-sharing detectors, so finding suitable analytical formulations can become arbitrarily complicated. The residual physics-based strategy uses gradient tree boosting (GTB) and a physics-informed data generation mimicking an actual imaging process by shifting a radiation source. We used clinically relevant detectors with a height of 19 mm, coupled to digital photosensor arrays. All trained models improved the CTR significantly. Using the best model, we achieved CTRs down to 198 ps (185 ps) for energies ranging from 300 keV to 700 keV (450 keV to 550 keV).  ( 2 min )
    Structure-informed Language Models Are Protein Designers. (arXiv:2302.01649v1 [cs.LG])
    This paper demonstrates that language models are strong structure-based protein designers. We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs), that have learned massive sequential evolutionary knowledge from the universe of natural protein sequences, to acquire an immediate capability to design preferable protein sequences for given folds. We conduct a structural surgery on pLMs, where a lightweight structural adapter is implanted into pLMs and endows it with structural awareness. During inference, iterative refinement is performed to effectively optimize the generated protein sequences. Experiments show that our approach outperforms the state-of-the-art methods by a large margin, leading to up to 4% to 12% accuracy gains in sequence recovery (e.g., 55.65% and 56.63% on CATH 4.2 and 4.3 single-chain benchmarks, and >60% when designing protein complexes). We provide extensive and in-depth analyses, which verify that LM-Design can (1) indeed leverage both structural and sequential knowledge to accurately handle structurally non-deterministic regions, (2) benefit from scaling data and model size, and (3) generalize to other proteins (e.g., antibodies and de novo proteins)  ( 2 min )
    Two-Stage Constrained Actor-Critic fo Short Video Recommendation. (arXiv:2302.01680v1 [cs.LG])
    The wide popularity of short videos on social media poses new opportunities and challenges to optimize recommender systems on the video-sharing platforms. Users sequentially interact with the system and provide complex and multi-faceted responses, including watch time and various types of interactions with multiple videos. One the one hand, the platforms aims at optimizing the users' cumulative watch time (main goal) in long term, which can be effectively optimized by Reinforcement Learning. On the other hand, the platforms also needs to satisfy the constraint of accommodating the responses of multiple user interactions (auxiliary goals) such like, follow, share etc. In this paper, we formulate the problem of short video recommendation as a Constrained Markov Decision Process (CMDP). We find that traditional constrained reinforcement learning algorithms can not work well in this setting. We propose a novel two-stage constrained actor-critic method: At stage one, we learn individual policies to optimize each auxiliary signal. At stage two, we learn a policy to (i) optimize the main signal and (ii) stay close to policies learned at the first stage, which effectively guarantees the performance of this main policy on the auxiliaries. Through extensive offline evaluations, we demonstrate effectiveness of our method over alternatives in both optimizing the main goal as well as balancing the others. We further show the advantage of our method in live experiments of short video recommendations, where it significantly outperforms other baselines in terms of both watch time and interactions. Our approach has been fully launched in the production system to optimize user experiences on the platform.  ( 2 min )
    GTV: Generating Tabular Data via Vertical Federated Learning. (arXiv:2302.01706v1 [cs.LG])
    Generative Adversarial Networks (GANs) have achieved state-of-the-art results in tabular data synthesis, under the presumption of direct accessible training data. Vertical Federated Learning (VFL) is a paradigm which allows to distributedly train machine learning model with clients possessing unique features pertaining to the same individuals, where the tabular data learning is the primary use case. However, it is unknown if tabular GANs can be learned in VFL. Demand for secure data transfer among clients and GAN during training and data synthesizing poses extra challenge. Conditional vector for tabular GANs is a valuable tool to control specific features of generated data. But it contains sensitive information from real data - risking privacy guarantees. In this paper, we propose GTV, a VFL framework for tabular GANs, whose key components are generator, discriminator and the conditional vector. GTV proposes an unique distributed training architecture for generator and discriminator to access training data in a privacy-preserving manner. To accommodate conditional vector into training without privacy leakage, GTV designs a mechanism training-with-shuffling to ensure that no party can reconstruct training data with conditional vector. We evaluate the effectiveness of GTV in terms of synthetic data quality, and overall training scalability. Results show that GTV can consistently generate high-fidelity synthetic tabular data of comparable quality to that generated by centralized GAN algorithm. The difference on machine learning utility can be as low as to 2.7%, even under extremely imbalanced data distributions across clients and different number of clients.  ( 2 min )
    Revisiting Personalized Federated Learning: Robustness Against Backdoor Attacks. (arXiv:2302.01677v1 [cs.LG])
    In this work, besides improving prediction accuracy, we study whether personalization could bring robustness benefits to backdoor attacks. We conduct the first study of backdoor attacks in the pFL framework, testing 4 widely used backdoor attacks against 6 pFL methods on benchmark datasets FEMNIST and CIFAR-10, a total of 600 experiments. The study shows that pFL methods with partial model-sharing can significantly boost robustness against backdoor attacks. In contrast, pFL methods with full model-sharing do not show robustness. To analyze the reasons for varying robustness performances, we provide comprehensive ablation studies on different pFL methods. Based on our findings, we further propose a lightweight defense method, Simple-Tuning, which empirically improves defense performance against backdoor attacks. We believe that our work could provide both guidance for pFL application in terms of its robustness and offer valuable insights to design more robust FL methods in the future.  ( 2 min )
    Private, fair and accurate: Training large-scale, privacy-preserving AI models in radiology. (arXiv:2302.01622v1 [eess.IV])
    Artificial intelligence (AI) models are increasingly used in the medical domain. However, as medical data is highly sensitive, special precautions to ensure the protection of said data are required. The gold standard for privacy preservation is the introduction of differential privacy (DP) to model training. However, prior work has shown that DP has negative implications on model accuracy and fairness. Therefore, the purpose of this study is to demonstrate that the privacy-preserving training of AI models for chest radiograph diagnosis is possible with high accuracy and fairness compared to non-private training. N=193,311 high quality clinical chest radiographs were retrospectively collected and manually labeled by experienced radiologists, who assigned one or more of the following diagnoses: cardiomegaly, congestion, pleural effusion, pneumonic infiltration and atelectasis, to each side (where applicable). The non-private AI models were compared with privacy-preserving (DP) models with respect to privacy-utility trade-offs (measured as area under the receiver-operator-characteristic curve (AUROC)), and privacy-fairness trade-offs (measured as Pearson-R or Statistical Parity Difference). The non-private AI model achieved an average AUROC score of 0.90 over all labels, whereas the DP AI model with a privacy budget of epsilon=7.89 resulted in an AUROC of 0.87, i.e., a mere 2.6% performance decrease compared to non-private training. The privacy-preserving training of diagnostic AI models can achieve high performance with a small penalty on model accuracy and does not amplify discrimination against age, sex or co-morbidity. We thus encourage practitioners to integrate state-of-the-art privacy-preserving techniques into medical AI model development.  ( 2 min )
    A Feature Selection Method for Driver Stress Detection Using Heart Rate Variability and Breathing Rate. (arXiv:2302.01602v1 [cs.LG])
    Driver stress is a major cause of car accidents and death worldwide. Furthermore, persistent stress is a health problem, contributing to hypertension and other diseases of the cardiovascular system. Stress has a measurable impact on heart and breathing rates and stress levels can be inferred from such measurements. Galvanic skin response is a common test to measure the perspiration caused by both physiological and psychological stress, as well as extreme emotions. In this paper, galvanic skin response is used to estimate the ground truth stress levels. A feature selection technique based on the minimal redundancy-maximal relevance method is then applied to multiple heart rate variability and breathing rate metrics to identify a novel and optimal combination for use in detecting stress. The support vector machine algorithm with a radial basis function kernel was used along with these features to reliably predict stress. The proposed method has achieved a high level of accuracy on the target dataset.  ( 2 min )
    Convergence Analysis of Split Learning on Non-IID Data. (arXiv:2302.01633v1 [cs.LG])
    Split Learning (SL) is one promising variant of Federated Learning (FL), where the AI model is split and trained at the clients and the server collaboratively. By offloading the computation-intensive portions to the server, SL enables efficient model training on resource-constrained clients. Despite its booming applications, SL still lacks rigorous convergence analysis on non-IID data, which is critical for hyperparameter selection. In this paper, we first prove that SL exhibits an $\mathcal{O}(1/\sqrt{R})$ convergence rate for non-convex objectives on non-IID data, where $R$ is the number of total training rounds. The derived convergence results can facilitate understanding the effect of some crucial factors in SL (e.g., data heterogeneity and synchronization interval). Furthermore, comparing with the convergence result of FL, we show that the guarantee of SL is worse than FL in terms of training rounds on non-IID data. The experimental results verify our theory. More findings on the comparison between FL and SL in cross-device settings are also reported.  ( 2 min )
    Beyond the Universal Law of Robustness: Sharper Laws for Random Features and Neural Tangent Kernels. (arXiv:2302.01629v1 [stat.ML])
    Machine learning models are vulnerable to adversarial perturbations, and a thought-provoking paper by Bubeck and Sellke has analyzed this phenomenon through the lens of over-parameterization: interpolating smoothly the data requires significantly more parameters than simply memorizing it. However, this "universal" law provides only a necessary condition for robustness, and it is unable to discriminate between models. In this paper, we address these gaps by focusing on empirical risk minimization in two prototypical settings, namely, random features and the neural tangent kernel (NTK). We prove that, for random features, the model is not robust for any degree of over-parameterization, even when the necessary condition coming from the universal law of robustness is satisfied. In contrast, for even activations, the NTK model meets the universal lower bound, and it is robust as soon as the necessary condition on over-parameterization is fulfilled. This also addresses a conjecture in prior work by Bubeck, Li and Nagaraj. Our analysis decouples the effect of the kernel of the model from an "interaction matrix", which describes the interaction with the test data and captures the effect of the activation. Our theoretical results are corroborated by numerical evidence on both synthetic and standard datasets (MNIST, CIFAR-10).  ( 2 min )
    SCCAM: Supervised Contrastive Convolutional Attention Mechanism for Ante-hoc Interpretable Fault Diagnosis with Limited Fault Samples. (arXiv:2302.01599v1 [cs.LG])
    In real industrial processes, fault diagnosis methods are required to learn from limited fault samples since the procedures are mainly under normal conditions and the faults rarely occur. Although attention mechanisms have become popular in the field of fault diagnosis, the existing attention-based methods are still unsatisfying for the above practical applications. First, pure attention-based architectures like transformers need a large number of fault samples to offset the lack of inductive biases thus performing poorly under limited fault samples. Moreover, the poor fault classification dilemma further leads to the failure of the existing attention-based methods to identify the root causes. To address the aforementioned issues, we innovatively propose a supervised contrastive convolutional attention mechanism (SCCAM) with ante-hoc interpretability, which solves the root cause analysis problem under limited fault samples for the first time. The proposed SCCAM method is tested on a continuous stirred tank heater and the Tennessee Eastman industrial process benchmark. Three common fault diagnosis scenarios are covered, including a balanced scenario for additional verification and two scenarios with limited fault samples (i.e., imbalanced scenario and long-tail scenario). The comprehensive results demonstrate that the proposed SCCAM method can achieve better performance compared with the state-of-the-art methods on fault classification and root cause analysis.  ( 2 min )
    A Novel Fuzzy Bi-Clustering Algorithm with AFS for Identification of Co-Regulated Genes. (arXiv:2302.01596v1 [cs.LG])
    The identification of co-regulated genes and their transcription-factor binding sites (TFBS) are the key steps toward understanding transcription regulation. In addition to effective laboratory assays, various bi-clustering algorithms for detection of the co-expressed genes have been developed. Bi-clustering methods are used to discover subgroups of genes with similar expression patterns under to-be-identified subsets of experimental conditions when applied to gene expression data. By building two fuzzy partition matrices of the gene expression data with the Axiomatic Fuzzy Set (AFS) theory, this paper proposes a novel fuzzy bi-clustering algorithm for identification of co-regulated genes. Specifically, the gene expression data is transformed into two fuzzy partition matrices via sub-preference relations theory of AFS at first. One of the matrices is considering the genes as the universe and the conditions as the concept, the other one is considering the genes as the concept and the conditions as the universe. The identification of the co-regulated genes (bi-clusters) is carried out on the two partition matrices at the same time. Then, a novel fuzzy-based similarity criterion is defined based on the partition matrixes, and a cyclic optimization algorithm is designed to discover the significant bi-clusters at expression level. The above procedures guarantee that the generated bi-clusters have more significant expression values than that of extracted by the traditional bi-clustering methods. Finally, the performance of the proposed method is evaluated with the performance of the three well-known bi-clustering algorithms on publicly available real microarray datasets. The experimental results are in agreement with the theoretical analysis and show that the proposed algorithm can effectively detect the co-regulated genes without any prior knowledge of the gene expression data.  ( 2 min )
    Learning to Decouple Complex Systems. (arXiv:2302.01581v1 [cs.LG])
    A complex system with cluttered observations may be a coupled mixture of multiple simple sub-systems corresponding to latent entities. Such sub-systems may hold distinct dynamics in the continuous-time domain; therein, complicated interactions between sub-systems also evolve over time. This setting is fairly common in the real world but has been less considered. In this paper, we propose a sequential learning approach under this setting by decoupling a complex system for handling irregularly sampled and cluttered sequential observations. Such decoupling brings about not only subsystems describing the dynamics of each latent entity but also a meta-system capturing the interaction between entities over time. Specifically, we argue that the meta-system evolving within a simplex is governed by projected differential equations (ProjDEs). We further analyze and provide neural-friendly projection operators in the context of Bregman divergence. Experimental results on synthetic and real-world datasets show the advantages of our approach when facing complex and cluttered sequential data compared to the state-of-the-art.  ( 2 min )
    Deep Reinforcement Learning for Cyber System Defense under Dynamic Adversarial Uncertainties. (arXiv:2302.01595v1 [cs.LG])
    Development of autonomous cyber system defense strategies and action recommendations in the real-world is challenging, and includes characterizing system state uncertainties and attack-defense dynamics. We propose a data-driven deep reinforcement learning (DRL) framework to learn proactive, context-aware, defense countermeasures that dynamically adapt to evolving adversarial behaviors while minimizing loss of cyber system operations. A dynamic defense optimization problem is formulated with multiple protective postures against different types of adversaries with varying levels of skill and persistence. A custom simulation environment was developed and experiments were devised to systematically evaluate the performance of four model-free DRL algorithms against realistic, multi-stage attack sequences. Our results suggest the efficacy of DRL algorithms for proactive cyber defense under multi-stage attack profiles and system uncertainties.  ( 2 min )
    Blockwise Self-Supervised Learning at Scale. (arXiv:2302.01647v1 [cs.CV])
    Current state-of-the-art deep networks are all powered by backpropagation. In this paper, we explore alternatives to full backpropagation in the form of blockwise learning rules, leveraging the latest developments in self-supervised learning. We show that a blockwise pretraining procedure consisting of training independently the 4 main blocks of layers of a ResNet-50 with Barlow Twins' loss function at each block performs almost as well as end-to-end backpropagation on ImageNet: a linear probe trained on top of our blockwise pretrained model obtains a top-1 classification accuracy of 70.48%, only 1.1% below the accuracy of an end-to-end pretrained network (71.57% accuracy). We perform extensive experiments to understand the impact of different components within our method and explore a variety of adaptations of self-supervised learning to the blockwise paradigm, building an exhaustive understanding of the critical avenues for scaling local learning rules to large networks, with implications ranging from hardware design to neuroscience.  ( 2 min )
    An Operational Perspective to Fairness Interventions: Where and How to Intervene. (arXiv:2302.01574v1 [cs.LG])
    As AI-based decision systems proliferate, their successful operationalization requires balancing multiple desiderata: predictive performance, disparity across groups, safeguarding sensitive group attributes (e.g., race), and engineering cost. We present a holistic framework for evaluating and contextualizing fairness interventions with respect to the above desiderata. The two key points of practical consideration are where (pre-, in-, post-processing) and how (in what way the sensitive group data is used) the intervention is introduced. We demonstrate our framework using a thorough benchmarking study on predictive parity; we study close to 400 methodological variations across two major model types (XGBoost vs. Neural Net) and ten datasets. Methodological insights derived from our empirical study inform the practical design of ML workflow with fairness as a central concern. We find predictive parity is difficult to achieve without using group data, and despite requiring group data during model training (but not inference), distributionally robust methods provide significant Pareto improvement. Moreover, a plain XGBoost model often Pareto-dominates neural networks with fairness interventions, highlighting the importance of model inductive bias.  ( 2 min )
    ResMem: Learn what you can and memorize the rest. (arXiv:2302.01576v1 [cs.LG])
    The impressive generalization performance of modern neural networks is attributed in part to their ability to implicitly memorize complex training patterns. Inspired by this, we explore a novel mechanism to improve model generalization via explicit memorization. Specifically, we propose the residual-memorization (ResMem) algorithm, a new method that augments an existing prediction model (e.g. a neural network) by fitting the model's residuals with a $k$-nearest neighbor based regressor. The final prediction is then the sum of the original model and the fitted residual regressor. By construction, ResMem can explicitly memorize the training labels. Empirically, we show that ResMem consistently improves the test set generalization of the original prediction model across various standard vision and natural language processing benchmarks. Theoretically, we formulate a stylized linear regression problem and rigorously show that ResMem results in a more favorable test risk over the base predictor.  ( 2 min )
    Uniform tensor clustering by jointly exploring sample affinities of various orders. (arXiv:2302.01569v1 [cs.LG])
    Conventional clustering methods based on pairwise affinity usually suffer from the concentration effect while processing huge dimensional features yet low sample sizes data, resulting in inaccuracy to encode the sample proximity and suboptimal performance in clustering. To address this issue, we propose a unified tensor clustering method (UTC) that characterizes sample proximity using multiple samples' affinity, thereby supplementing rich spatial sample distributions to boost clustering. Specifically, we find that the triadic tensor affinity can be constructed via the Khari-Rao product of two affinity matrices. Furthermore, our early work shows that the fourth-order tensor affinity is defined by the Kronecker product. Therefore, we utilize arithmetical products, Khatri-Rao and Kronecker products, to mathematically integrate different orders of affinity into a unified tensor clustering framework. Thus, the UTC jointly learns a joint low-dimensional embedding to combine various orders. Finally, a numerical scheme is designed to solve the problem. Experiments on synthetic datasets and real-world datasets demonstrate that 1) the usage of high-order tensor affinity could provide a supplementary characterization of sample proximity to the popular affinity matrix; 2) the proposed method of UTC is affirmed to enhance clustering by exploiting different order affinities when processing high-dimensional data.  ( 2 min )
    DynaMIX: Resource Optimization for DNN-Based Real-Time Applications on a Multi-Tasking System. (arXiv:2302.01568v1 [cs.LG])
    As deep neural networks (DNNs) prove their importance and feasibility, more and more DNN-based apps, such as detection and classification of objects, have been developed and deployed on autonomous vehicles (AVs). To meet their growing expectations and requirements, AVs should "optimize" use of their limited onboard computing resources for multiple concurrent in-vehicle apps while satisfying their timing requirements (especially for safety). That is, real-time AV apps should share the limited on-board resources with other concurrent apps without missing their deadlines dictated by the frame rate of a camera that generates and provides input images to the apps. However, most, if not all, of existing DNN solutions focus on enhancing the concurrency of their specific hardware without dynamically optimizing/modifying the DNN apps' resource requirements, subject to the number of running apps, owing to their high computational cost. To mitigate this limitation, we propose DynaMIX (Dynamic MIXed-precision model construction), which optimizes the resource requirement of concurrent apps and aims to maximize execution accuracy. To realize a real-time resource optimization, we formulate an optimization problem using app performance profiles to consider both the accuracy and worst-case latency of each app. We also propose dynamic model reconfiguration by lazy loading only the selected layers at runtime to reduce the overhead of loading the entire model. DynaMIX is evaluated in terms of constraint satisfaction and inference accuracy for a multi-tasking system and compared against state-of-the-art solutions, demonstrating its effectiveness and feasibility under various environmental/operating conditions.  ( 2 min )
    Multiplier Bootstrap-based Exploration. (arXiv:2302.01543v1 [cs.LG])
    Despite the great interest in the bandit problem, designing efficient algorithms for complex models remains challenging, as there is typically no analytical way to quantify uncertainty. In this paper, we propose Multiplier Bootstrap-based Exploration (MBE), a novel exploration strategy that is applicable to any reward model amenable to weighted loss minimization. We prove both instance-dependent and instance-independent rate-optimal regret bounds for MBE in sub-Gaussian multi-armed bandits. With extensive simulation and real data experiments, we show the generality and adaptivity of MBE.  ( 2 min )
    Example-Based Explainable AI and its Application for Remote Sensing Image Classification. (arXiv:2302.01526v1 [cs.AI])
    We present a method of explainable artificial intelligence (XAI), "What I Know (WIK)", to provide additional information to verify the reliability of a deep learning model by showing an example of an instance in a training dataset that is similar to the input data to be inferred and demonstrate it in a remote sensing image classification task. One of the expected roles of XAI methods is verifying whether inferences of a trained machine learning model are valid for an application, and it is an important factor that what datasets are used for training the model as well as the model architecture. Our data-centric approach can help determine whether the training dataset is sufficient for each inference by checking the selected example data. If the selected example looks similar to the input data, we can confirm that the model was not trained on a dataset with a feature distribution far from the feature of the input data. With this method, the criteria for selecting an example are not merely data similarity with the input data but also data similarity in the context of the model task. Using a remote sensing image dataset from the Sentinel-2 satellite, the concept was successfully demonstrated with reasonably selected examples. This method can be applied to various machine-learning tasks, including classification and regression.  ( 2 min )
    DCM: Deep energy method based on the principle of minimum complementary energy. (arXiv:2302.01538v1 [cs.LG])
    The principle of minimum potential and complementary energy are the most important variational principles in solid mechanics. The deep energy method (DEM), which has received much attention, is based on the principle of minimum potential energy and lacks the important form of minimum complementary energy. Thus, we propose the deep energy method based on the principle of minimum complementary energy (DCM). The output function of DCM is the stress function that naturally satisfies the equilibrium equation. We extend the proposed DCM algorithm (DCM-P), adding the terms that naturally satisfy the biharmonic equation in the Airy stress function. We combine operator learning with physical equations and propose a deep complementary energy operator method (DCM-O), including branch net, trunk net, basis net, and particular net. DCM-O first combines existing high-fidelity numerical results to train DCM-O through data. Then the complementary energy is used to train the branch net and trunk net in DCM-O. To analyze DCM performance, we present the numerical result of the most common stress functions, the Prandtl and Airy stress function. The proposed method DCM is used to model the representative mechanical problems with the different types of boundary conditions. We compare DCM with the existing PINNs and DEM algorithms. The result shows the advantage of the proposed DCM is suitable for dealing with problems of dominated displacement boundary conditions, which is reflected in theory and our numerical experiments. DCM-P and DCM-O improve the accuracy of DCM and the speed of calculation convergence. DCM is an essential supplementary energy form of the deep energy method. We believe that operator learning based on the energy method can balance data and physical equations well, giving computational mechanics broad research prospects.  ( 2 min )
    Robust Camera Pose Refinement for Multi-Resolution Hash Encoding. (arXiv:2302.01571v1 [cs.CV])
    Multi-resolution hash encoding has recently been proposed to reduce the computational cost of neural renderings, such as NeRF. This method requires accurate camera poses for the neural renderings of given scenes. However, contrary to previous methods jointly optimizing camera poses and 3D scenes, the naive gradient-based camera pose refinement method using multi-resolution hash encoding severely deteriorates performance. We propose a joint optimization algorithm to calibrate the camera pose and learn a geometric representation using efficient multi-resolution hash encoding. Showing that the oscillating gradient flows of hash encoding interfere with the registration of camera poses, our method addresses the issue by utilizing smooth interpolation weighting to stabilize the gradient oscillation for the ray samplings across hash grids. Moreover, the curriculum training procedure helps to learn the level-wise hash encoding, further increasing the pose refinement. Experiments on the novel-view synthesis datasets validate that our learning frameworks achieve state-of-the-art performance and rapid convergence of neural rendering, even when initial camera poses are unknown.  ( 2 min )
    Machine Learning for UAV Propeller Fault Detection based on a Hybrid Data Generation Model. (arXiv:2302.01556v1 [cs.LG])
    This paper describes the development of an on-board data-driven system that can monitor and localize the fault in a quadrotor unmanned aerial vehicle (UAV) and at the same time, evaluate the degree of damage of the fault under real scenarios. To achieve offline training data generation, a hybrid approach is proposed for the development of a virtual data-generative model using a combination of data-driven models as well as well-established dynamic models that describe the kinematics of the UAV. To effectively represent the drop in performance of a faulty propeller, a variation of the deep neural network, a LSTM network is proposed. With the RPM of the propeller as input and based on the fault condition of the propeller, the proposed propeller model estimates the resultant torque and thrust. Then, flight datasets of the UAV under various fault scenarios are generated via simulation using the developed data-generative model. Lastly, a fault classifier using a CNN model is proposed to identify as well as evaluate the degree of damage to the damaged propeller. The scope of this paper focuses on the identification of faulty propellers and classification of the fault level for quadrotor UAVs using RPM as well as flight data. Doing so allows for early minor fault detection to prevent serious faults from occurring if the fault is left unrepaired. To further validate the workability of this approach outside of simulation, a real-flight test is conducted indoors. The real flight data is collected and a simulation to real sim-real test is conducted. Due to the imperfections in the build of our experimental UAV, a slight calibration approach to our simulation model is further proposed and the experimental results obtained show that our trained model can identify the location of propeller fault as well as the degree/type of damage. Currently, the diagnosis accuracy on the testing set is over 80%.  ( 3 min )
    Deep Reinforcement Learning for Online Error Detection in Cyber-Physical Systems. (arXiv:2302.01567v1 [cs.LG])
    Reliability is one of the major design criteria in Cyber-Physical Systems (CPSs). This is because of the existence of some critical applications in CPSs and their failure is catastrophic. Therefore, employing strong error detection and correction mechanisms in CPSs is inevitable. CPSs are composed of a variety of units, including sensors, networks, and microcontrollers. Each of these units is probable to be in a faulty state at any time and the occurred fault can result in erroneous output. The fault may cause the units of CPS to malfunction and eventually crash. Traditional fault-tolerant approaches include redundancy time, hardware, information, and/or software. However, these approaches impose significant overheads besides their low error coverage, which limits their applicability. In addition, the interval between error occurrence and detection is too long in these approaches. In this paper, based on Deep Reinforcement Learning (DRL), a new error detection approach is proposed that not only detects errors with high accuracy but also can perform error detection at the moment due to very low inference time. The proposed approach can categorize different types of errors from normal data and predict whether the system will fail. The evaluation results illustrate that the proposed approach has improved more than 2x in terms of accuracy and more than 5x in terms of inference time compared to other approaches.  ( 2 min )
    Ordered GNN: Ordering Message Passing to Deal with Heterophily and Over-smoothing. (arXiv:2302.01524v1 [cs.LG])
    Most graph neural networks follow the message passing mechanism. However, it faces the over-smoothing problem when multiple times of message passing is applied to a graph, causing indistinguishable node representations and prevents the model to effectively learn dependencies between farther-away nodes. On the other hand, features of neighboring nodes with different labels are likely to be falsely mixed, resulting in the heterophily problem. In this work, we propose to order the messages passing into the node representation, with specific blocks of neurons targeted for message passing within specific hops. This is achieved by aligning the hierarchy of the rooted-tree of a central node with the ordered neurons in its node representation. Experimental results on an extensive set of datasets show that our model can simultaneously achieve the state-of-the-art in both homophily and heterophily settings, without any targeted design. Moreover, its performance maintains pretty well while the model becomes really deep, effectively preventing the over-smoothing problem. Finally, visualizing the gating vectors shows that our model learns to behave differently between homophily and heterophily settings, providing an explainable graph neural model.  ( 2 min )
    Causal Inference Based Single-branch Ensemble Trees For Uplift Modeling. (arXiv:2302.01563v1 [cs.LG])
    In this manuscript (ms), we propose causal inference based single-branch ensemble trees for uplift modeling, namely CIET. Different from standard classification methods for predictive probability modeling, CIET aims to achieve the change in the predictive probability of outcome caused by an action or a treatment. According to our CIET, two partition criteria are specifically designed to maximize the difference in outcome distribution between the treatment and control groups. Next, a novel single-branch tree is built by taking a top-down node partition approach, and the remaining samples are censored since they are not covered by the upper node partition logic. Repeating the tree-building process on the censored data, single-branch ensemble trees with a set of inference rules are thus formed. Moreover, CIET is experimentally demonstrated to outperform previous approaches for uplift modeling in terms of both area under uplift curve (AUUC) and Qini coefficient significantly. At present, CIET has already been applied to online personal loans in a national financial holdings group in China. CIET will also be of use to analysts applying machine learning techniques to causal inference in broader business domains such as web advertising, medicine and economics.  ( 2 min )
    Group Fairness in Non-monotone Submodular Maximization. (arXiv:2302.01546v1 [cs.LG])
    Maximizing a submodular function has a wide range of applications in machine learning and data mining. One such application is data summarization whose goal is to select a small set of representative and diverse data items from a large dataset. However, data items might have sensitive attributes such as race or gender, in this setting, it is important to design \emph{fairness-aware} algorithms to mitigate potential algorithmic bias that may cause over- or under- representation of particular groups. Motivated by that, we propose and study the classic non-monotone submodular maximization problem subject to novel group fairness constraints. Our goal is to select a set of items that maximizes a non-monotone submodular function, while ensuring that the number of selected items from each group is proportionate to its size, to the extent specified by the decision maker. We develop the first constant-factor approximation algorithms for this problem. We also extend the basic model to incorporate an additional global size constraint on the total number of selected items.  ( 2 min )
    Optimality of Thompson Sampling with Noninformative Priors for Pareto Bandits. (arXiv:2302.01544v1 [cs.LG])
    In the stochastic multi-armed bandit problem, a randomized probability matching policy called Thompson sampling (TS) has shown excellent performance in various reward models. In addition to the empirical performance, TS has been shown to achieve asymptotic problem-dependent lower bounds in several models. However, its optimality has been mainly addressed under light-tailed or one-parameter models that belong to exponential families. In this paper, we consider the optimality of TS for the Pareto model that has a heavy tail and is parameterized by two unknown parameters. Specifically, we discuss the optimality of TS with probability matching priors that include the Jeffreys prior and the reference priors. We first prove that TS with certain probability matching priors can achieve the optimal regret bound. Then, we show the suboptimality of TS with other priors, including the Jeffreys and the reference priors. Nevertheless, we find that TS with the Jeffreys and reference priors can achieve the asymptotic lower bound if one uses a truncation procedure. These results suggest carefully choosing noninformative priors to avoid suboptimality and show the effectiveness of truncation procedures in TS-based policies.  ( 2 min )
    Vertical Federated Learning: Taxonomies, Threats, and Prospects. (arXiv:2302.01550v1 [cs.LG])
    Federated learning (FL) is the most popular distributed machine learning technique. FL allows machine-learning models to be trained without acquiring raw data to a single point for processing. Instead, local models are trained with local data; the models are then shared and combined. This approach preserves data privacy as locally trained models are shared instead of the raw data themselves. Broadly, FL can be divided into horizontal federated learning (HFL) and vertical federated learning (VFL). For the former, different parties hold different samples over the same set of features; for the latter, different parties hold different feature data belonging to the same set of samples. In a number of practical scenarios, VFL is more relevant than HFL as different companies (e.g., bank and retailer) hold different features (e.g., credit history and shopping history) for the same set of customers. Although VFL is an emerging area of research, it is not well-established compared to HFL. Besides, VFL-related studies are dispersed, and their connections are not intuitive. Thus, this survey aims to bring these VFL-related studies to one place. Firstly, we classify existing VFL structures and algorithms. Secondly, we present the threats from security and privacy perspectives to VFL. Thirdly, for the benefit of future researchers, we discussed the challenges and prospects of VFL in detail.  ( 2 min )
    Revisiting Intermediate Layer Distillation for Compressing Language Models: An Overfitting Perspective. (arXiv:2302.01530v1 [cs.CL])
    Knowledge distillation (KD) is a highly promising method for mitigating the computational problems of pre-trained language models (PLMs). Among various KD approaches, Intermediate Layer Distillation (ILD) has been a de facto standard KD method with its performance efficacy in the NLP field. In this paper, we find that existing ILD methods are prone to overfitting to training datasets, although these methods transfer more information than the original KD. Next, we present the simple observations to mitigate the overfitting of ILD: distilling only the last Transformer layer and conducting ILD on supplementary tasks. Based on our two findings, we propose a simple yet effective consistency-regularized ILD (CR-ILD), which prevents the student model from overfitting the training dataset. Substantial experiments on distilling BERT on the GLUE benchmark and several synthetic datasets demonstrate that our proposed ILD method outperforms other KD techniques. Our code is available at https://github.com/jongwooko/CR-ILD.  ( 2 min )
    Using natural language processing and structured medical data to phenotype patients hospitalized due to COVID-19. (arXiv:2302.01536v1 [cs.CL])
    To identify patients who are hospitalized because of COVID-19 as opposed to those who were admitted for other indications, we compared the performance of different computable phenotype definitions for COVID-19 hospitalizations that use different types of data from the electronic health records (EHR), including structured EHR data elements, provider notes, or a combination of both data types. And conduct a retrospective data analysis utilizing chart review-based validation. Participants are 586 hospitalized individuals who tested positive for SARS-CoV-2 during January 2022. We used natural language processing to incorporate data from provider notes and LASSO regression and Random Forests to fit classification algorithms that incorporated structured EHR data elements, provider notes, or a combination of structured data and provider notes. Results: Based on a chart review, 38% of 586 patients were determined to be hospitalized for reasons other than COVID-19 despite having tested positive for SARS-CoV-2. A classification algorithm that used provider notes had significantly better discrimination than one that used structured EHR data elements (AUROC: 0.894 vs 0.841, p < 0.001), and performed similarly to a model that combined provider notes with structured data elements (AUROC: 0.894 vs 0.893). Assessments of hospital outcome metrics significantly differed based on whether the population included all hospitalized patients who tested positive for SARS-CoV-2 versus those who were determined to have been hospitalized due to COVID-19. This work demonstrates the utility of natural language processing approaches to derive information related to patient hospitalizations in cases where there may be multiple conditions that could serve as the primary indication for hospitalization.  ( 3 min )
    Multi-channel Autobidding with Budget and ROI Constraints. (arXiv:2302.01523v1 [cs.GT])
    In digital online advertising, advertisers procure ad impressions simultaneously on multiple platforms, or so-called channels, such as Google Ads, Meta Ads Manager, etc., each of which consists of numerous ad auctions. We study how an advertiser maximizes total conversion (e.g. ad clicks) while satisfying aggregate return-on-investment (ROI) and budget constraints across all channels. In practice, an advertiser does not have control over, and thus cannot globally optimize, which individual ad auctions she participates in for each channel, and instead authorizes a channel to procure impressions on her behalf: the advertiser can only utilize two levers on each channel, namely setting a per-channel budget and per-channel target ROI. In this work, we first analyze the effectiveness of each of these levers for solving the advertiser's global multi-channel problem. We show that when an advertiser only optimizes over per-channel ROIs, her total conversion can be arbitrarily worse than what she could have obtained in the global problem. Further, we show that the advertiser can achieve the global optimal conversion when she only optimizes over per-channel budgets. In light of this finding, under a bandit feedback setting that mimics real-world scenarios where advertisers have limited information on ad auctions in each channels and how channels procure ads, we present an efficient learning algorithm that produces per-channel budgets whose resulting conversion approximates that of the global optimal problem. Finally, we argue that all our results hold for both single-item and multi-item auctions from which channels procure impressions on advertisers' behalf.  ( 2 min )
    Pseudonorm Approachability and Applications to Regret Minimization. (arXiv:2302.01517v1 [cs.LG])
    Blackwell's celebrated approachability theory provides a general framework for a variety of learning problems, including regret minimization. However, Blackwell's proof and implicit algorithm measure approachability using the $\ell_2$ (Euclidean) distance. We argue that in many applications such as regret minimization, it is more useful to study approachability under other distance metrics, most commonly the $\ell_\infty$-metric. But, the time and space complexity of the algorithms designed for $\ell_\infty$-approachability depend on the dimension of the space of the vectorial payoffs, which is often prohibitively large. Thus, we present a framework for converting high-dimensional $\ell_\infty$-approachability problems to low-dimensional pseudonorm approachability problems, thereby resolving such issues. We first show that the $\ell_\infty$-distance between the average payoff and the approachability set can be equivalently defined as a pseudodistance between a lower-dimensional average vector payoff and a new convex set we define. Next, we develop an algorithmic theory of pseudonorm approachability, analogous to previous work on approachability for $\ell_2$ and other norms, showing that it can be achieved via online linear optimization (OLO) over a convex set given by the Fenchel dual of the unit pseudonorm ball. We then use that to show, modulo mild normalization assumptions, that there exists an $\ell_\infty$-approachability algorithm whose convergence is independent of the dimension of the original vectorial payoff. We further show that that algorithm admits a polynomial-time complexity, assuming that the original $\ell_\infty$-distance can be computed efficiently. We also give an $\ell_\infty$-approachability algorithm whose convergence is logarithmic in that dimension using an FTRL algorithm with a maximum-entropy regularizer.  ( 2 min )
    Improving Recommendation Relevance by simulating User Interest. (arXiv:2302.01522v1 [math.NA])
    Most if not all on-line item-to-item recommendation systems rely on estimation of a distance like measure (rank) of similarity between items. For on-line recommendation systems, time sensitivity of this similarity measure is extremely important. We observe that recommendation "recency" can be straightforwardly and transparently maintained by iterative reduction of ranks of inactive items. The paper briefly summarizes algorithmic developments based on this self-explanatory observation. The basic idea behind this work is patented in a context of online recommendation systems.  ( 2 min )
    Randomized Gaussian Process Upper Confidence Bound with Tight Bayesian Regret Bounds. (arXiv:2302.01511v1 [cs.LG])
    Gaussian process upper confidence bound (GP-UCB) is a theoretically promising approach for black-box optimization; however, the confidence parameter $\beta$ is considerably large in the theorem and chosen heuristically in practice. Then, randomized GP-UCB (RGP-UCB) uses a randomized confidence parameter, which follows the Gamma distribution, to mitigate the impact of manually specifying $\beta$. This study first generalizes the regret analysis of RGP-UCB to a wider class of distributions, including the Gamma distribution. Furthermore, we propose improved RGP-UCB (IRGP-UCB) based on a two-parameter exponential distribution, which achieves tight Bayesian regret bounds. IRGP-UCB does not require an increase in the confidence parameter in terms of the number of iterations, which avoids over-exploration in the later iterations. Finally, we demonstrate the effectiveness of IRGP-UCB through extensive experiments.  ( 2 min )
    A Lipschitz Bandits Approach for Continuous Hyperparameter Optimization. (arXiv:2302.01539v1 [cs.LG])
    One of the most critical problems in machine learning is HyperParameter Optimization (HPO), since choice of hyperparameters has a significant impact on final model performance. Although there are many HPO algorithms, they either have no theoretical guarantees or require strong assumptions. To this end, we introduce BLiE -- a Lipschitz-bandit-based algorithm for HPO that only assumes Lipschitz continuity of the objective function. BLiE exploits the landscape of the objective function to adaptively search over the hyperparameter space. Theoretically, we show that $(i)$ BLiE finds an $\epsilon$-optimal hyperparameter with $O \left( \frac{1}{\epsilon} \right)^{d_z + \beta}$ total budgets, where $d_z$ and $\beta$ are problem intrinsic; $(ii)$ BLiE is highly parallelizable. Empirically, we demonstrate that BLiE outperforms the state-of-the-art HPO algorithms on benchmark tasks. We also apply BLiE to search for noise schedule of diffusion models. Comparison with the default schedule shows that BLiE schedule greatly improves the sampling speed.  ( 2 min )
    Support Recovery in Sparse PCA with Non-Random Missing Data. (arXiv:2302.01535v1 [stat.ML])
    We analyze a practical algorithm for sparse PCA on incomplete and noisy data under a general non-random sampling scheme. The algorithm is based on a semidefinite relaxation of the $\ell_1$-regularized PCA problem. We provide theoretical justification that under certain conditions, we can recover the support of the sparse leading eigenvector with high probability by obtaining a unique solution. The conditions involve the spectral gap between the largest and second-largest eigenvalues of the true data matrix, the magnitude of the noise, and the structural properties of the observed entries. The concepts of algebraic connectivity and irregularity are used to describe the structural properties of the observed entries. We empirically justify our theorem with synthetic and real data analysis. We also show that our algorithm outperforms several other sparse PCA approaches especially when the observed entries have good structural properties. As a by-product of our analysis, we provide two theorems to handle a deterministic sampling scheme, which can be applied to other matrix-related problems.  ( 2 min )
    ANTM: An Aligned Neural Topic Model for Exploring Evolving Topics. (arXiv:2302.01501v1 [cs.IR])
    As the amount of text data generated by humans and machines increases, the necessity of understanding large corpora and finding a way to extract insights from them is becoming more crucial than ever. Dynamic topic models are effective methods that primarily focus on studying the evolution of topics present in a collection of documents. These models are widely used for understanding trends, exploring public opinion in social networks, or tracking research progress and discoveries in scientific archives. Since topics are defined as clusters of semantically similar documents, it is necessary to observe the changes in the content or themes of these clusters in order to understand how topics evolve as new knowledge is discovered over time. In this paper, we introduce the Aligned Neural Topic Model (ANTM), a dynamic neural topic model that uses document embeddings to compute clusters of semantically similar documents at different periods and to align document clusters to represent their evolution. This alignment procedure preserves the temporal similarity of document clusters over time and captures the semantic change of words characterized by their context within different periods. Experiments on four different datasets show that ANTM outperforms probabilistic dynamic topic models (e.g. DTM, DETM) and significantly improves topic coherence and diversity over other existing dynamic neural topic models (e.g. BERTopic).  ( 2 min )
    LSA-PINN: Linear Boundary Connectivity Loss for Solving PDEs on Complex Geometry. (arXiv:2302.01518v1 [cs.LG])
    We present a novel loss formulation for efficient learning of complex dynamics from governing physics, typically described by partial differential equations (PDEs), using physics-informed neural networks (PINNs). In our experiments, existing versions of PINNs are seen to learn poorly in many problems, especially for complex geometries, as it becomes increasingly difficult to establish appropriate sampling strategy at the near boundary region. Overly dense sampling can adversely impede training convergence if the local gradient behaviors are too complex to be adequately modelled by PINNs. On the other hand, if the samples are too sparse, existing PINNs tend to overfit the near boundary region, leading to incorrect solution. To prevent such issues, we propose a new Boundary Connectivity (BCXN) loss function which provides linear local structure approximation (LSA) to the gradient behaviors at the boundary for PINN. Our BCXN-loss implicitly imposes local structure during training, thus facilitating fast physics-informed learning across entire problem domains with order of magnitude sparser training samples. This LSA-PINN method shows a few orders of magnitude smaller errors than existing methods in terms of the standard L2-norm metric, while using dramatically fewer training samples and iterations. Our proposed LSA-PINN does not pose any requirement on the differentiable property of the networks, and we demonstrate its benefits and ease of implementation on both multi-layer perceptron and convolutional neural network versions as commonly used in current PINN literature.  ( 2 min )
    Xtal2DoS: Attention-based Crystal to Sequence Learning for Density of States Prediction. (arXiv:2302.01486v1 [cs.LG])
    Modern machine learning techniques have been extensively applied to materials science, especially for property prediction tasks. A majority of these methods address scalar property predictions, while more challenging spectral properties remain less emphasized. We formulate a crystal-to-sequence learning task and propose a novel attention-based learning method, Xtal2DoS, which decodes the sequential representation of the material density of states (DoS) properties by incorporating the learned atomic embeddings through attention networks. Experiments show Xtal2DoS is faster than the existing models, and consistently outperforms other state-of-the-art methods on four metrics for two fundamental spectral properties, phonon and electronic DoS.  ( 2 min )
    User-centric Heterogeneous-action Deep Reinforcement Learning for Virtual Reality in the Metaverse over Wireless Networks. (arXiv:2302.01471v1 [cs.NI])
    The Metaverse is emerging as maturing technologies are empowering the different facets. Virtual Reality (VR) technologies serve as the backbone of the virtual universe within the Metaverse to offer a highly immersive user experience. As mobility is emphasized in the Metaverse context, VR devices reduce their weights at the sacrifice of local computation abilities. In this paper, for a system consisting of a Metaverse server and multiple VR users, we consider two cases of (i) the server generating frames and transmitting them to users, and (ii) users generating frames locally and thus consuming device energy. Moreover, in our multi-user VR scenario for the Metaverse, users have different characteristics and demands for Frames Per Second (FPS). Then the channel access arrangement (including the decisions on frame generation location), and transmission powers for the downlink communications from the server to the users are jointly optimized to improve the utilities of users. This joint optimization is addressed by deep reinforcement learning (DRL) with heterogeneous actions. Our proposed user-centric DRL algorithm is called User-centric Critic with Heterogenous Actors (UCHA). Extensive experiments demonstrate that our UCHA algorithm leads to remarkable results under various requirements and constraints.  ( 2 min )
    Clustered Embedding Learning for Recommender Systems. (arXiv:2302.01478v1 [cs.AI])
    In recent years, recommender systems have advanced rapidly, where embedding learning for users and items plays a critical role. A standard method learns a unique embedding vector for each user and item. However, such a method has two important limitations in real-world applications: 1) it is hard to learn embeddings that generalize well for users and items with rare interactions on their own; and 2) it may incur unbearably high memory costs when the number of users and items scales up. Existing approaches either can only address one of the limitations or have flawed overall performances. In this paper, we propose Clustered Embedding Learning (CEL) as an integrated solution to these two problems. CEL is a plug-and-play embedding learning framework that can be combined with any differentiable feature interaction model. It is capable of achieving improved performance, especially for cold users and items, with reduced memory cost. CEL enables automatic and dynamic clustering of users and items in a top-down fashion, where clustered entities jointly learn a shared embedding. The accelerated version of CEL has an optimal time complexity, which supports efficient online updates. Theoretically, we prove the identifiability and the existence of a unique optimal number of clusters for CEL in the context of nonnegative matrix factorization. Empirically, we validate the effectiveness of CEL on three public datasets and one business dataset, showing its consistently superior performance against current state-of-the-art methods. In particular, when incorporating CEL into the business model, it brings an improvement of $+0.6\%$ in AUC, which translates into a significant revenue gain; meanwhile, the size of the embedding table gets $2650$ times smaller.  ( 2 min )
    Learning to Optimize for Reinforcement Learning. (arXiv:2302.01470v1 [cs.LG])
    In recent years, by leveraging more data, computation, and diverse tasks, learned optimizers have achieved remarkable success in supervised learning optimization, outperforming classical hand-designed optimizers. However, in practice, these learned optimizers fail to generalize to reinforcement learning tasks due to unstable and complex loss landscapes. Moreover, neither hand-designed optimizers nor learned optimizers have been specifically designed to address the unique optimization properties in reinforcement learning. In this work, we take a data-driven approach to learn to optimize for reinforcement learning using meta-learning. We introduce a novel optimizer structure that significantly improves the training efficiency of learned optimizers, making it possible to learn an optimizer for reinforcement learning from scratch. Although trained in toy tasks, our learned optimizer demonstrates its generalization ability to unseen complex tasks. Finally, we design a set of small gridworlds to train the first general-purpose optimizer for reinforcement learning.  ( 2 min )
    Towards Practical Preferential Bayesian Optimization with Skew Gaussian Processes. (arXiv:2302.01513v1 [cs.LG])
    We study preferential Bayesian optimization (BO) where reliable feedback is limited to pairwise comparison called duels. An important challenge in preferential BO, which uses the preferential Gaussian process (GP) model to represent flexible preference structure, is that the posterior distribution is a computationally intractable skew GP. The most widely used approach for preferential BO is Gaussian approximation, which ignores the skewness of the true posterior. Alternatively, Markov chain Monte Carlo (MCMC) based preferential BO is also proposed. In this work, we first verify the accuracy of Gaussian approximation, from which we reveal the critical problem that the predictive probability of duels can be inaccurate. This observation motivates us to improve the MCMC-based estimation for skew GP, for which we show the practical efficiency of Gibbs sampling and derive the low variance MC estimator. However, the computational time of MCMC can still be a bottleneck in practice. Towards building a more practical preferential BO, we develop a new method that achieves both high computational efficiency and low sample complexity, and then demonstrate its effectiveness through extensive numerical experiments.  ( 2 min )
    Perfect Is the Enemy of Test Oracle. (arXiv:2302.01488v1 [cs.SE])
    Automation of test oracles is one of the most challenging facets of software testing, but remains comparatively less addressed compared to automated test input generation. Test oracles rely on a ground-truth that can distinguish between the correct and buggy behavior to determine whether a test fails (detects a bug) or passes. What makes the oracle problem challenging and undecidable is the assumption that the ground-truth should know the exact expected, correct, or buggy behavior. However, we argue that one can still build an accurate oracle without knowing the exact correct or buggy behavior, but how these two might differ. This paper presents SEER, a learning-based approach that in the absence of test assertions or other types of oracle, can determine whether a unit test passes or fails on a given method under test (MUT). To build the ground-truth, SEER jointly embeds unit tests and the implementation of MUTs into a unified vector space, in such a way that the neural representation of tests are similar to that of MUTs they pass on them, but dissimilar to MUTs they fail on them. The classifier built on top of this vector representation serves as the oracle to generate "fail" labels, when test inputs detect a bug in MUT or "pass" labels, otherwise. Our extensive experiments on applying SEER to more than 5K unit tests from a diverse set of open-source Java projects show that the produced oracle is (1) effective in predicting the fail or pass labels, achieving an overall accuracy, precision, recall, and F1 measure of 93%, 86%, 94%, and 90%, (2) generalizable, predicting the labels for the unit test of projects that were not in training or validation set with negligible performance drop, and (3) efficient, detecting the existence of bugs in only 6.5 milliseconds on average.  ( 3 min )
    LazyGNN: Large-Scale Graph Neural Networks via Lazy Propagation. (arXiv:2302.01503v1 [cs.LG])
    Recent works have demonstrated the benefits of capturing long-distance dependency in graphs by deeper graph neural networks (GNNs). But deeper GNNs suffer from the long-lasting scalability challenge due to the neighborhood explosion problem in large-scale graphs. In this work, we propose to capture long-distance dependency in graphs by shallower models instead of deeper models, which leads to a much more efficient model, LazyGNN, for graph representation learning. Moreover, we demonstrate that LazyGNN is compatible with existing scalable approaches (such as sampling methods) for further accelerations through the development of mini-batch LazyGNN. Comprehensive experiments demonstrate its superior prediction performance and scalability on large-scale benchmarks. LazyGNN also achieves state-of-art performance on the OGB leaderboard.  ( 2 min )
    Defensive ML: Defending Architectural Side-channels with Adversarial Obfuscation. (arXiv:2302.01474v1 [cs.CR])
    Side-channel attacks that use machine learning (ML) for signal analysis have become prominent threats to computer security, as ML models easily find patterns in signals. To address this problem, this paper explores using Adversarial Machine Learning (AML) methods as a defense at the computer architecture layer to obfuscate side channels. We call this approach Defensive ML, and the generator to obfuscate signals, defender. Defensive ML is a workflow to design, implement, train, and deploy defenders for different environments. First, we design a defender architecture given the physical characteristics and hardware constraints of the side-channel. Next, we use our DefenderGAN structure to train the defender. Finally, we apply defensive ML to thwart two side-channel attacks: one based on memory contention and the other on application power. The former uses a hardware defender with ns-level response time that attains a high level of security with half the performance impact of a traditional scheme; the latter uses a software defender with ms-level response time that provides better security than a traditional scheme with only 70% of its power overhead.  ( 2 min )
    Gradient Estimation for Unseen Domain Risk Minimization with Pre-Trained Models. (arXiv:2302.01497v1 [cs.LG])
    Domain generalization aims to build generalized models that perform well on unseen domains when only source domains are available for model optimization. Recent studies have demonstrated that large-scale pre-trained models could play an important role in domain generalization by providing their generalization power. However, large-scale pre-trained models are not fully equipped with target task-specific knowledge due to a discrepancy between the pre-training objective and the target task. Although the task-specific knowledge could be learned from source domains by fine-tuning, this hurts the generalization power of the pre-trained models because of gradient bias toward the source domains. To address this issue, we propose a new domain generalization method that estimates unobservable gradients that reduce potential risks in unseen domains, using a large-scale pre-trained model. Our proposed method allows the pre-trained model to learn task-specific knowledge further while preserving its generalization ability with the estimated gradients. Experimental results show that our proposed method outperforms baseline methods on DomainBed, a standard benchmark in domain generalization. We also provide extensive analyses to demonstrate that the estimated unobserved gradients relieve the gradient bias, and the pre-trained model learns the task-specific knowledge without sacrificing its generalization power.  ( 2 min )
    SPADE: Self-supervised Pretraining for Acoustic DisEntanglement. (arXiv:2302.01483v1 [cs.LG])
    Self-supervised representation learning approaches have grown in popularity due to the ability to train models on large amounts of unlabeled data and have demonstrated success in diverse fields such as natural language processing, computer vision, and speech. Previous self-supervised work in the speech domain has disentangled multiple attributes of speech such as linguistic content, speaker identity, and rhythm. In this work, we introduce a self-supervised approach to disentangle room acoustics from speech and use the acoustic representation on the downstream task of device arbitration. Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce, indicating that our pretraining scheme learns to encode room acoustic information while remaining invariant to other attributes of the speech signal.  ( 2 min )
    Efficient Domain Adaptation for Speech Foundation Models. (arXiv:2302.01496v1 [cs.CL])
    Foundation models (FMs), that are trained on broad data at scale and are adaptable to a wide range of downstream tasks, have brought large interest in the research community. Benefiting from the diverse data sources such as different modalities, languages and application domains, foundation models have demonstrated strong generalization and knowledge transfer capabilities. In this paper, we present a pioneering study towards building an efficient solution for FM-based speech recognition systems. We adopt the recently developed self-supervised BEST-RQ for pretraining, and propose the joint finetuning with both source and unsupervised target domain data using JUST Hydra. The FM encoder adapter and decoder are then finetuned to the target domain with a small amount of supervised in-domain data. On a large-scale YouTube and Voice Search task, our method is shown to be both data and model parameter efficient. It achieves the same quality with only 21.6M supervised in-domain data and 130.8M finetuned parameters, compared to the 731.1M model trained from scratch on additional 300M supervised in-domain data.  ( 2 min )
    Performance Bounds for Policy-Based Average Reward Reinforcement Learning Algorithms. (arXiv:2302.01450v1 [cs.LG])
    Many policy-based reinforcement learning (RL) algorithms can be viewed as instantiations of approximate policy iteration (PI), i.e., where policy improvement and policy evaluation are both performed approximately. In applications where the average reward objective is the meaningful performance metric, often discounted reward formulations are used with the discount factor being close to 1, which is equivalent to making the expected horizon very large. However, the corresponding theoretical bounds for error performance scale with the square of the horizon. Thus, even after dividing the total reward by the length of the horizon, the corresponding performance bounds for average reward problems go to infinity. Therefore, an open problem has been to obtain meaningful performance bounds for approximate PI and RL algorithms for the average-reward setting. In this paper, we solve this open problem by obtaining the first non-trivial error bounds for average-reward MDPs which go to zero in the limit where when policy evaluation and policy improvement errors go to zero.  ( 2 min )
    Spiking Synaptic Penalty: Appropriate Penalty Term for Energy-Efficient Spiking Neural Networks. (arXiv:2302.01500v1 [cs.LG])
    Spiking neural networks (SNNs) are energy-efficient neural networks because of their spiking nature. However, as the spike firing rate of SNNs increases, the energy consumption does as well, and thus, the advantage of SNNs diminishes. Here, we tackle this problem by introducing a novel penalty term for the spiking activity into the objective function in the training phase. Our method is designed so as to optimize the energy consumption metric directly without modifying the network architecture. Therefore, the proposed method can reduce the energy consumption more than other methods while maintaining the accuracy. We conducted experiments for image classification tasks, and the results indicate the effectiveness of the proposed method, which mitigates the dilemma of the energy--accuracy trade-off.  ( 2 min )
    Convergence of Gradient Descent with Linearly Correlated Noise and Applications to Differentially Private Learning. (arXiv:2302.01463v1 [cs.LG])
    We study stochastic optimization with linearly correlated noise. Our study is motivated by recent methods for optimization with differential privacy (DP), such as DP-FTRL, which inject noise via matrix factorization mechanisms. We propose an optimization problem that distils key facets of these DP methods and that involves perturbing gradients by linearly correlated noise. We derive improved convergence rates for gradient descent in this framework for convex and non-convex loss functions. Our theoretical analysis is novel and might be of independent interest. We use these convergence rates to develop new, effective matrix factorizations for differentially private optimization, and highlight the benefits of these factorizations theoretically and empirically.  ( 2 min )
    Commonsense-Aware Prompting for Controllable Empathetic Dialogue Generation. (arXiv:2302.01441v1 [cs.CL])
    Improving the emotional awareness of pre-trained language models is an emerging important problem for dialogue generation tasks. Although prior studies have introduced methods to improve empathetic dialogue generation, few have discussed how to incorporate commonsense knowledge into pre-trained language models for controllable dialogue generation. In this study, we propose a novel framework that improves empathetic dialogue generation using pre-trained language models by 1) incorporating commonsense knowledge through prompt verbalization, and 2) controlling dialogue generation using a strategy-driven future discriminator. We conducted experiments to reveal that both the incorporation of social commonsense knowledge and enforcement of control over generation help to improve generation performance. Finally, we discuss the implications of our study for future research.  ( 2 min )
    Generalized Uncertainty of Deep Neural Networks: Taxonomy and Applications. (arXiv:2302.01440v1 [cs.LG])
    Deep neural networks have seen enormous success in various real-world applications. Beyond their predictions as point estimates, increasing attention has been focused on quantifying the uncertainty of their predictions. In this review, we show that the uncertainty of deep neural networks is not only important in a sense of interpretability and transparency, but also crucial in further advancing their performance, particularly in learning systems seeking robustness and efficiency. We will generalize the definition of the uncertainty of deep neural networks to any number or vector that is associated with an input or an input-label pair, and catalog existing methods on ``mining'' such uncertainty from a deep model. We will include those methods from the classic field of uncertainty quantification as well as those methods that are specific to deep neural networks. We then show a wide spectrum of applications of such generalized uncertainty in realistic learning tasks including robust learning such as noisy learning, adversarially robust learning; data-efficient learning such as semi-supervised and weakly-supervised learning; and model-efficient learning such as model compression and knowledge distillation.  ( 2 min )
    Mixed Precision Post Training Quantization of Neural Networks with Sensitivity Guided Search. (arXiv:2302.01382v1 [cs.LG])
    Serving large-scale machine learning (ML) models efficiently and with low latency has become challenging owing to increasing model size and complexity. Quantizing models can simultaneously reduce memory and compute requirements, facilitating their widespread access. However, for large models not all layers are equally amenable to the same numerical precision and aggressive quantization can lead to unacceptable loss in model accuracy. One approach to prevent this accuracy degradation is mixed-precision quantization, which allows different tensors to be quantized to varying levels of numerical precision, leveraging the capabilities of modern hardware. Such mixed-precision quantiztaion can more effectively allocate numerical precision to different tensors `as needed' to preserve model accuracy while reducing footprint and compute latency. In this paper, we propose a method to efficiently determine quantization configurations of different tensors in ML models using post-training mixed precision quantization. We analyze three sensitivity metrics and evaluate them for guiding configuration search of two algorithms. We evaluate our method for computer vision and natural language processing and demonstrate latency reductions of up to 27.59% and 34.31% compared to the baseline 16-bit floating point model while guaranteeing no more than 1% accuracy degradation.  ( 2 min )
    A Reduction-based Framework for Sequential Decision Making with Delayed Feedback. (arXiv:2302.01477v1 [cs.LG])
    We study stochastic delayed feedback in general multi-agent sequential decision making, which includes bandits, single-agent Markov decision processes (MDPs), and Markov games (MGs). We propose a novel reduction-based framework, which turns any multi-batched algorithm for sequential decision making with instantaneous feedback into a sample-efficient algorithm that can handle stochastic delays in sequential decision making. By plugging different multi-batched algorithms into our framework, we provide several examples demonstrating that our framework not only matches or improves existing results for bandits, tabular MDPs, and tabular MGs, but also provides the first line of studies on delays in sequential decision making with function approximation. In summary, we provide a complete set of sharp results for multi-agent sequential decision making with delayed feedback.  ( 2 min )
    Out of Context: Investigating the Bias and Fairness Concerns of "Artificial Intelligence as a Service". (arXiv:2302.01448v1 [cs.LG])
    "AI as a Service" (AIaaS) is a rapidly growing market, offering various plug-and-play AI services and tools. AIaaS enables its customers (users) - who may lack the expertise, data, and/or resources to develop their own systems - to easily build and integrate AI capabilities into their applications. Yet, it is known that AI systems can encapsulate biases and inequalities that can have societal impact. This paper argues that the context-sensitive nature of fairness is often incompatible with AIaaS' 'one-size-fits-all' approach, leading to issues and tensions. Specifically, we review and systematise the AIaaS space by proposing a taxonomy of AI services based on the levels of autonomy afforded to the user. We then critically examine the different categories of AIaaS, outlining how these services can lead to biases or be otherwise harmful in the context of end-user applications. In doing so, we seek to draw research attention to the challenges of this emerging area.  ( 2 min )
    Continual Learning with Scaled Gradient Projection. (arXiv:2302.01386v1 [cs.LG])
    In neural networks, continual learning results in gradient interference among sequential tasks, leading to catastrophic forgetting of old tasks while learning new ones. This issue is addressed in recent methods by storing the important gradient spaces for old tasks and updating the model orthogonally during new tasks. However, such restrictive orthogonal gradient updates hamper the learning capability of the new tasks resulting in sub-optimal performance. To improve new learning while minimizing forgetting, in this paper we propose a Scaled Gradient Projection (SGP) method, where we combine the orthogonal gradient projections with scaled gradient steps along the important gradient spaces for the past tasks. The degree of gradient scaling along these spaces depends on the importance of the bases spanning them. We propose an efficient method for computing and accumulating importance of these bases using the singular value decomposition of the input representations for each task. We conduct extensive experiments ranging from continual image classification to reinforcement learning tasks and report better performance with less training overhead than the state-of-the-art approaches.  ( 2 min )
    Effective Robustness against Natural Distribution Shifts for Models with Different Training Data. (arXiv:2302.01381v1 [cs.LG])
    ``Effective robustness'' measures the extra out-of-distribution (OOD) robustness beyond what can be predicted from the in-distribution (ID) performance. Existing effective robustness evaluations typically use a single test set such as ImageNet to evaluate ID accuracy. This becomes problematic when evaluating models trained on different data distributions, e.g., comparing models trained on ImageNet vs. zero-shot language-image pre-trained models trained on LAION. In this paper, we propose a new effective robustness evaluation metric to compare the effective robustness of models trained on different data distributions. To do this we control for the accuracy on multiple ID test sets that cover the training distributions for all the evaluated models. Our new evaluation metric provides a better estimate of the effectiveness robustness and explains the surprising effective robustness gains of zero-shot CLIP-like models exhibited when considering only one ID dataset, while the gains diminish under our evaluation.  ( 2 min )
    Hyper-parameter Tuning for Fair Classification without Sensitive Attribute Access. (arXiv:2302.01385v1 [cs.LG])
    Fair machine learning methods seek to train models that balance model performance across demographic subgroups defined over sensitive attributes like race and gender. Although sensitive attributes are typically assumed to be known during training, they may not be available in practice due to privacy and other logistical concerns. Recent work has sought to train fair models without sensitive attributes on training data. However, these methods need extensive hyper-parameter tuning to achieve good results, and hence assume that sensitive attributes are known on validation data. However, this assumption too might not be practical. Here, we propose Antigone, a framework to train fair classifiers without access to sensitive attributes on either training or validation data. Instead, we generate pseudo sensitive attributes on the validation data by training a biased classifier and using the classifier's incorrectly (correctly) labeled examples as proxies for minority (majority) groups. Since fairness metrics like demographic parity, equal opportunity and subgroup accuracy can be estimated to within a proportionality constant even with noisy sensitive attribute information, we show theoretically and empirically that these proxy labels can be used to maximize fairness under average accuracy constraints. Key to our results is a principled approach to select the hyper-parameters of the biased classifier in a completely unsupervised fashion (meaning without access to ground truth sensitive attributes) that minimizes the gap between fairness estimated using noisy versus ground-truth sensitive labels.  ( 2 min )
    Neural Insights for Digital Marketing Content Design. (arXiv:2302.01416v1 [cs.LG])
    In digital marketing, experimenting with new website content is one of the key levers to improve customer engagement. However, creating successful marketing content is a manual and time-consuming process that lacks clear guiding principles. This paper seeks to close the loop between content creation and online experimentation by offering marketers AI-driven actionable insights based on historical data to improve their creative process. We present a neural-network-based system that scores and extracts insights from a marketing content design, namely, a multimodal neural network predicts the attractiveness of marketing contents, and a post-hoc attribution method generates actionable insights for marketers to improve their content in specific marketing locations. Our insights not only point out the advantages and drawbacks of a given current content, but also provide design recommendations based on historical data. We show that our scoring model and insights work well both quantitatively and qualitatively.  ( 2 min )
    Provably Bounding Neural Network Preimages. (arXiv:2302.01404v1 [cs.LG])
    Most work on the formal verification of neural networks has focused on bounding forward images of neural networks, i.e., the set of outputs of a neural network that correspond to a given set of inputs (for example, bounded perturbations of a nominal input). However, many use cases of neural network verification require solving the inverse problem, i.e, over-approximating the set of inputs that lead to certain outputs. In this work, we present the first efficient bound propagation algorithm, INVPROP, for verifying properties over the preimage of a linearly constrained output set of a neural network, which can be combined with branch-and-bound to achieve completeness. Our efficient algorithm allows multiple passes of intermediate bound refinements, which are crucial for tight inverse verification because the bounds of an intermediate layer depend on relaxations both before and after this layer. We demonstrate our algorithm on applications related to quantifying safe control regions for a dynamical system and detecting out-of-distribution inputs to a neural network. Our results show that in certain settings, we can find over-approximations that are over 2500 times tighter than prior work while being 2.5 times faster on the same hardware.  ( 2 min )
    A Convolutional-based Model for Early Prediction of Alzheimer's based on the Dementia Stage in the MRI Brain Images. (arXiv:2302.01417v1 [cs.LG])
    Alzheimer's disease is a degenerative brain disease. Being the primary cause of Dementia in adults and progressively destroys brain memory. Though Alzheimer's disease does not have a cure currently, diagnosing it at an earlier stage will help reduce the severity of the disease. Thus, early diagnosis of Alzheimer's could help to reduce or stop the disease from progressing. In this paper, we proposed a deep convolutional neural network-based model for learning model using to determine the stage of Dementia in adults based on the Magnetic Resonance Imaging (MRI) images to detect the early onset of Alzheimer's.  ( 2 min )
    Hyperbolic Contrastive Learning. (arXiv:2302.01409v1 [cs.CV])
    Learning good image representations that are beneficial to downstream tasks is a challenging task in computer vision. As such, a wide variety of self-supervised learning approaches have been proposed. Among them, contrastive learning has shown competitive performance on several benchmark datasets. The embeddings of contrastive learning are arranged on a hypersphere that results in using the inner (dot) product as a distance measurement in Euclidean space. However, the underlying structure of many scientific fields like social networks, brain imaging, and computer graphics data exhibit highly non-Euclidean latent geometry. We propose a novel contrastive learning framework to learn semantic relationships in the hyperbolic space. Hyperbolic space is a continuous version of trees that naturally owns the ability to model hierarchical structures and is thus beneficial for efficient contrastive representation learning. We also extend the proposed Hyperbolic Contrastive Learning (HCL) to the supervised domain and studied the adversarial robustness of HCL. The comprehensive experiments show that our proposed method achieves better results on self-supervised pretraining, supervised classification, and higher robust accuracy than baseline methods.  ( 2 min )
    Dataset Distillation Fixes Dataset Reconstruction Attacks. (arXiv:2302.01428v1 [cs.LG])
    Modern deep learning requires large volumes of data, which could contain sensitive or private information which cannot be leaked. Recent work has shown for homogeneous neural networks a large portion of this training data could be reconstructed with only access to the trained network parameters. While the attack was shown to work empirically, there exists little formal understanding of its effectiveness regime, and ways to defend against it. In this work, we first build a stronger version of the dataset reconstruction attack and show how it can provably recover its entire training set in the infinite width regime. We then empirically study the characteristics of this attack on two-layer networks and reveal that its success heavily depends on deviations from the frozen infinite-width Neural Tangent Kernel limit. More importantly, we formally show for the first time that dataset reconstruction attacks are a variation of dataset distillation. This key theoretical result on the unification of dataset reconstruction and distillation not only sheds more light on the characteristics of the attack but enables us to design defense mechanisms against them via distillation algorithms.  ( 2 min )
    On the Robustness of Randomized Ensembles to Adversarial Perturbations. (arXiv:2302.01375v1 [cs.LG])
    Randomized ensemble classifiers (RECs), where one classifier is randomly selected during inference, have emerged as an attractive alternative to traditional ensembling methods for realizing adversarially robust classifiers with limited compute requirements. However, recent works have shown that existing methods for constructing RECs are more vulnerable than initially claimed, casting major doubts on their efficacy and prompting fundamental questions such as: "When are RECs useful?", "What are their limits?", and "How do we train them?". In this work, we first demystify RECs as we derive fundamental results regarding their theoretical limits, necessary and sufficient conditions for them to be useful, and more. Leveraging this new understanding, we propose a new boosting algorithm (BARRE) for training robust RECs, and empirically demonstrate its effectiveness at defending against strong $\ell_\infty$ norm-bounded adversaries across various network architectures and datasets.  ( 2 min )
    Fast, Differentiable and Sparse Top-k: a Convex Analysis Perspective. (arXiv:2302.01425v1 [cs.LG])
    The top-k operator returns a k-sparse vector, where the non-zero values correspond to the k largest values of the input. Unfortunately, because it is a discontinuous function, it is difficult to incorporate in neural networks trained end-to-end with backpropagation. Recent works have considered differentiable relaxations, based either on regularization or perturbation techniques. However, to date, no approach is fully differentiable and sparse. In this paper, we propose new differentiable and sparse top-k operators. We view the top-k operator as a linear program over the permutahedron, the convex hull of permutations. We then introduce a p-norm regularization term to smooth out the operator, and show that its computation can be reduced to isotonic optimization. Our framework is significantly more general than the existing one and allows for example to express top-k operators that select values in magnitude. On the algorithmic side, in addition to pool adjacent violator (PAV) algorithms, we propose a new GPU/TPU-friendly Dykstra algorithm to solve isotonic optimization problems. We successfully use our operators to prune weights in neural networks, to fine-tune vision transformers, and as a router in sparse mixture of experts.  ( 2 min )
    Accelerating Policy Gradient by Estimating Value Function from Prior Computation in Deep Reinforcement Learning. (arXiv:2302.01399v1 [cs.LG])
    This paper investigates the use of prior computation to estimate the value function to improve sample efficiency in on-policy policy gradient methods in reinforcement learning. Our approach is to estimate the value function from prior computations, such as from the Q-network learned in DQN or the value function trained for different but related environments. In particular, we learn a new value function for the target task while combining it with a value estimate from the prior computation. Finally, the resulting value function is used as a baseline in the policy gradient method. This use of a baseline has the theoretical property of reducing variance in gradient computation and thus improving sample efficiency. The experiments show the successful use of prior value estimates in various settings and improved sample efficiency in several tasks.  ( 2 min )
    Personalized Understanding of Blood Glucose Dynamics via Mobile Sensor Data. (arXiv:2302.01400v1 [cs.HC])
    Continuous Blood Glucose (CGM) monitors have revolutionized the ability of diabetics to manage their blood glucose, and paved the way for artificial pancreas systems. In this paper we augment CGM data with sensor input collected by a smart phone and use it to provide analytical tools for patients and clinicians. We collected GPS data, activity classifications, and blood glucose data with a custom iOS application over a 9 month period from a single free-living type-1 diabetic patient. This data set is novel in terms of it's size, the inclusion of GPS data, and the fact that it was collected non-intrusively from a free-living patient. We describe a method to measure the occurrence of lifestyle \textit{events} based on GPS and activity data, and show that they can capture instances of food consumption and are therefore correlated to changes in blood glucose. Finally, we incorporate these event representations into our system to create useful visualizations and notifications to aid patients in managing their diabetes.  ( 2 min )
    Learning with Exposure Constraints in Recommendation Systems. (arXiv:2302.01377v1 [cs.LG])
    Recommendation systems are dynamic economic systems that balance the needs of multiple stakeholders. A recent line of work studies incentives from the content providers' point of view. Content providers, e.g., vloggers and bloggers, contribute fresh content and rely on user engagement to create revenue and finance their operations. In this work, we propose a contextual multi-armed bandit setting to model the dependency of content providers on exposure. In our model, the system receives a user context in every round and has to select one of the arms. Every arm is a content provider who must receive a minimum number of pulls every fixed time period (e.g., a month) to remain viable in later rounds; otherwise, the arm departs and is no longer available. The system aims to maximize the users' (content consumers) welfare. To that end, it should learn which arms are vital and ensure they remain viable by subsidizing arm pulls if needed. We develop algorithms with sub-linear regret, as well as a lower bound that demonstrates that our algorithms are optimal up to logarithmic factors.  ( 2 min )
    Neural Network Architecture for Database Augmentation Using Shared Features. (arXiv:2302.01374v1 [cs.LG])
    The popularity of learning from data with machine learning and neural networks has lead to the creation of many new datasets for almost every problem domain. However, even within a single domain, these datasets are often collected with disparate features, sampled from different sub-populations, and recorded at different time points. Even with the plethora of individual datasets, large data science projects can be difficult as it is often not trivial to merge these smaller datasets. Inherent challenges in some domains such as medicine also makes it very difficult to create large single source datasets or multi-source datasets with identical features. Instead of trying to merge these non-matching datasets directly, we propose a neural network architecture that can provide data augmentation using features common between these datasets. Our results show that this style of data augmentation can work for both image and tabular data.  ( 2 min )
    Hypothesis Testing and Machine Learning: Interpreting Variable Effects in Deep Artificial Neural Networks using Cohen's f2. (arXiv:2302.01407v1 [stat.ME])
    Deep artificial neural networks show high predictive performance in many fields, but they do not afford statistical inferences and their black-box operations are too complicated for humans to comprehend. Because positing that a relationship exists is often more important than prediction in scientific experiments and research models, machine learning is far less frequently used than inferential statistics. Additionally, statistics calls for improving the test of theory by showing the magnitude of the phenomena being studied. This article extends current XAI methods and develops a model agnostic hypothesis testing framework for machine learning. First, Fisher's variable permutation algorithm is tweaked to compute an effect size measure equivalent to Cohen's f2 for OLS regression models. Second, the Mann-Kendall test of monotonicity and the Theil-Sen estimator is applied to Apley's accumulated local effect plots to specify a variable's direction of influence and statistical significance. The usefulness of this approach is demonstrated on an artificial data set and a social survey with a Python sandbox implementation.  ( 2 min )
    Augmented Learning of Heterogeneous Treatment Effects via Gradient Boosting Trees. (arXiv:2302.01367v1 [stat.ML])
    Heterogeneous treatment effects (HTE) based on patients' genetic or clinical factors are of significant interest to precision medicine. Simultaneously modeling HTE and corresponding main effects for randomized clinical trials with high-dimensional predictive markers is challenging. Motivated by the modified covariates approach, we propose a two-stage statistical learning procedure for estimating HTE with optimal efficiency augmentation, generalizing to arbitrary interaction model and exploiting powerful extreme gradient boosting trees (XGBoost). Target estimands for HTE are defined in the scale of mean difference for quantitative outcomes, or risk ratio for binary outcomes, which are the minimizers of specialized loss functions. The first stage is to estimate the main-effect equivalency of the baseline markers on the outcome, which is then used as an augmentation term in the second stage estimation for HTE. The proposed two-stage procedure is robust to model mis-specification of main effects and improves efficiency for estimating HTE through nonparametric function estimation, e.g., XGBoost. A permutation test is proposed for global assessment of evidence for HTE. An analysis of a genetic study in Prostate Cancer Prevention Trial led by the SWOG Cancer Research Network, is conducted to showcase the properties and the utilities of the two-stage method.  ( 2 min )
  • Open

    Post-Selection Confidence Bounds for Prediction Performance. (arXiv:2210.13206v3 [stat.ML] UPDATED)
    In machine learning, the selection of a promising model from a potentially large number of competing models and the assessment of its generalization performance are critical tasks that need careful consideration. Typically, model selection and evaluation are strictly separated endeavors, splitting the sample at hand into a training, validation, and evaluation set, and only compute a single confidence interval for the prediction performance of the final selected model. We however propose an algorithm how to compute valid lower confidence bounds for multiple models that have been selected based on their prediction performances in the evaluation set by interpreting the selection problem as a simultaneous inference problem. We use bootstrap tilting and a maxT-type multiplicity correction. The approach is universally applicable for any combination of prediction models, any model selection strategy, and any prediction performance measure that accepts weights. We conducted various simulation experiments which show that our proposed approach yields lower confidence bounds that are at least comparably good as bounds from standard approaches, and that reliably reach the nominal coverage probability. In addition, especially when sample size is small, our proposed approach yields better performing prediction models than the default selection of only one model for evaluation does.  ( 3 min )
    Realizable Learning is All You Need. (arXiv:2111.04746v3 [cs.LG] UPDATED)
    The equivalence of realizable and agnostic learnability is a fundamental phenomenon in learning theory. With variants ranging from classical settings like PAC learning and regression to recent trends such as adversarially robust learning, it's surprising that we still lack a unified theory; traditional proofs of the equivalence tend to be disparate, and rely on strong model-specific assumptions like uniform convergence and sample compression. In this work, we give the first model-independent framework explaining the equivalence of realizable and agnostic learnability: a three-line blackbox reduction that simplifies, unifies, and extends our understanding across a wide variety of settings. This includes models with no known characterization of learnability such as learning with arbitrary distributional assumptions and more general loss functions, as well as a host of other popular settings such as robust learning, partial learning, fair learning, and the statistical query model. More generally, we argue that the equivalence of realizable and agnostic learning is actually a special case of a broader phenomenon we call property generalization: any desirable property of a learning algorithm (e.g. noise tolerance, privacy, stability) that can be satisfied over finite hypothesis classes extends (possibly in some variation) to any learnable hypothesis class.  ( 2 min )
    Understanding Gradient Regularization in Deep Learning: Efficient Finite-Difference Computation and Implicit Bias. (arXiv:2210.02720v2 [cs.LG] UPDATED)
    Gradient regularization (GR) is a method that penalizes the gradient norm of the training loss during training. While some studies have reported that GR can improve generalization performance, little attention has been paid to it from the algorithmic perspective, that is, the algorithms of GR that efficiently improve the performance. In this study, we first reveal that a specific finite-difference computation, composed of both gradient ascent and descent steps, reduces the computational cost of GR. Next, we show that the finite-difference computation also works better in the sense of generalization performance. We theoretically analyze a solvable model, a diagonal linear network, and clarify that GR has a desirable implicit bias to so-called rich regime and finite-difference computation strengthens this bias. Furthermore, finite-difference GR is closely related to some other algorithms based on iterative ascent and descent steps for exploring flat minima. In particular, we reveal that the flooding method can perform finite-difference GR in an implicit way. Thus, this work broadens our understanding of GR for both practice and theory.  ( 2 min )
    New Machine Learning Techniques for Simulation-Based Inference: InferoStatic Nets, Kernel Score Estimation, and Kernel Likelihood Ratio Estimation. (arXiv:2210.01680v2 [stat.ML] UPDATED)
    We propose an intuitive, machine-learning approach to multiparameter inference, dubbed the InferoStatic Networks (ISN) method, to model the score and likelihood ratio estimators in cases when the probability density can be sampled but not computed directly. The ISN uses a backend neural network that models a scalar function called the inferostatic potential $\varphi$. In addition, we introduce new strategies, respectively called Kernel Score Estimation (KSE) and Kernel Likelihood Ratio Estimation (KLRE), to learn the score and the likelihood ratio functions from simulated data. We illustrate the new techniques with some toy examples and compare to existing approaches in the literature. We mention en passant some new loss functions that optimally incorporate latent information from simulations into the training procedure.  ( 2 min )
    Beyond Invariance: Test-Time Label-Shift Adaptation for Distributions with "Spurious" Correlations. (arXiv:2211.15646v2 [stat.ML] UPDATED)
    Spurious correlations, or correlations that change across domains where a model can be deployed, present significant challenges to real-world applications of machine learning models. However, such correlations are not always "spurious"; often, they provide valuable prior information for a prediction. Here, we present a test-time adaptation method that exploits the spurious correlation phenomenon, in contrast to recent approaches that attempt to eliminate spurious correlations through invariance. We consider situations where the prior distribution $p(y, z)$, which models the dependence between the class label $y$ and the "nuisance" factors $z$, may change across domains, but the generative model for features $p(\mathbf{x}|y, z)$ is constant. We note that this corresponds to an expanded version of the label shift assumption, where the labels now also include the nuisance factors $z$. Based on this observation, we train a classifier to predict $p(y, z|\mathbf{x})$ on the source distribution, and propose a test-time label shift correction that adapts to changes in the marginal distribution $p(y, z)$ using unlabeled samples from the target domain. We evaluate our method, which we call "Test-Time Label-Shift Adaptation" (TTLSA), on two different image datasets -- the CheXpert chest X-ray dataset and the Colored MNIST dataset -- and show a significant improvement over baseline methods. Code reproducing experiments is available at https://github.com/nalzok/test-time-label-shift .  ( 2 min )
    Statistical treatment of convolutional neural network super-resolution of inland surface wind for subgrid-scale variability quantification. (arXiv:2211.16708v2 [physics.ao-ph] UPDATED)
    Machine learning models have been employed to perform either physics-free data-driven or hybrid dynamical downscaling of climate data. Most of these implementations operate over relatively small downscaling factors because of the challenge of recovering fine-scale information from coarse data. This limits their compatibility with many global climate model outputs, often available between $\sim$50--100 km resolution, to scales of interest such as cloud resolving or urban scales. This study systematically examines the capability of convolutional neural networks (CNNs) to downscale surface wind speed data over land surface from different coarse resolutions (25 km, 48 km, and 100 km resolution) to 3 km. For each downscaling factor, we consider three CNN configurations that generate super-resolved predictions of fine-scale wind speed, which take between 1 to 3 input fields: coarse wind speed, fine-scale topography, and diurnal cycle. In addition to fine-scale wind speeds, probability density function parameters are generated, through which sample wind speeds can be generated accounting for the intrinsic stochasticity of wind speed. For generalizability assessment, CNN models are tested on regions with different topography and climate that are unseen during training. The evaluation of super-resolved predictions focuses on subgrid-scale variability and the recovery of extremes. Models with coarse wind and fine topography as inputs exhibit the best performance compared with other model configurations, operating across the same downscaling factor. Our diurnal cycle encoding results in lower out-of-sample generalizability compared with other input configurations.  ( 2 min )
    Consistent Range Approximation for Fair Predictive Modeling. (arXiv:2212.10839v2 [cs.LG] UPDATED)
    This paper proposes a novel framework for certifying the fairness of predictive models trained on biased data. It draws from query answering for incomplete and inconsistent databases to formulate the problem of consistent range approximation (CRA) of fairness queries for a predictive model on a target population. The framework employs background knowledge of the data collection process and biased data, working with or without limited statistics about the target population, to compute a range of answers for fairness queries. Using CRA, the framework builds predictive models that are certifiably fair on the target population, regardless of the availability of external data during training. The framework's efficacy is demonstrated through evaluations on real data, showing substantial improvement over existing state-of-the-art methods.  ( 2 min )
    FiT: Parameter Efficient Few-shot Transfer Learning for Personalized and Federated Image Classification. (arXiv:2206.08671v2 [stat.ML] UPDATED)
    Modern deep learning systems are increasingly deployed in situations such as personalization and federated learning where it is necessary to support i) learning on small amounts of data, and ii) communication efficient distributed training protocols. In this work, we develop FiLM Transfer (FiT) which fulfills these requirements in the image classification setting by combining ideas from transfer learning (fixed pretrained backbones and fine-tuned FiLM adapter layers) and meta-learning (automatically configured Naive Bayes classifiers and episodic training) to yield parameter efficient models with superior classification accuracy at low-shot. The resulting parameter efficiency is key for enabling few-shot learning, inexpensive model updates for personalization, and communication efficient federated learning. We experiment with FiT on a wide range of downstream datasets and show that it achieves better classification accuracy than the leading Big Transfer (BiT) algorithm at low-shot and achieves state-of-the art accuracy on the challenging VTAB-1k benchmark, with fewer than 1% of the updateable parameters. Finally, we demonstrate the parameter efficiency and superior accuracy of FiT in distributed low-shot applications including model personalization and federated learning where model update size is an important performance metric.  ( 2 min )
    Distributionally Robust Causal Inference with Observational Data. (arXiv:2210.08326v3 [stat.ME] UPDATED)
    We consider the estimation of average treatment effects in observational studies and propose a new framework of robust causal inference with unobserved confounders. Our approach is based on distributionally robust optimization and proceeds in two steps. We first specify the maximal degree to which the distribution of unobserved potential outcomes may deviate from that of observed outcomes. We then derive sharp bounds on the average treatment effects under this assumption. Our framework encompasses the popular marginal sensitivity model as a special case, and we demonstrate how the proposed methodology can address a primary challenge of the marginal sensitivity model that it produces uninformative results when unobserved confounders substantially affect treatment and outcome. Specifically, we develop an alternative sensitivity model, called the distributional sensitivity model, under the assumption that heterogeneity of treatment effect due to unobserved variables is relatively small. Unlike the marginal sensitivity model, the distributional sensitivity model allows for potential lack of overlap and often produces informative bounds even when unobserved variables substantially affect both treatment and outcome. Finally, we show how to extend the distributional sensitivity model to difference-in-differences designs and settings with instrumental variables. Through simulation and empirical studies, we demonstrate the applicability of the proposed methodology.  ( 2 min )
    Learning Counterfactually Invariant Predictors. (arXiv:2207.09768v2 [cs.LG] UPDATED)
    Counterfactual invariance has proven an essential property for predictors that are fair, robust, and generalizable in the real world. We propose a general definition of counterfactual invariance and provide simple graphical criteria that yield a sufficient condition for a predictor to be counterfactually invariant in terms of (conditional independence in) the observational distribution. Any predictor that satisfies our criterion is provably counterfactually invariant. In order to learn such predictors, we propose a model-agnostic framework, called Counterfactual Invariance Prediction (CIP), based on a kernel-based conditional dependence measure called Hilbert-Schmidt Conditional Independence Criterion (HSCIC). Our experimental results demonstrate the effectiveness of CIP in enforcing counterfactual invariance across various types of data including tabular, high-dimensional, and real-world dataset.  ( 2 min )
    DEUP: Direct Epistemic Uncertainty Prediction. (arXiv:2102.08501v4 [cs.LG] UPDATED)
    Epistemic Uncertainty is a measure of the lack of knowledge of a learner which diminishes with more evidence. While existing work focuses on using the variance of the Bayesian posterior due to parameter uncertainty as a measure of epistemic uncertainty, we argue that this does not capture the part of lack of knowledge induced by model misspecification. We discuss how the excess risk, which is the gap between the generalization error of a predictor and the Bayes predictor, is a sound measure of epistemic uncertainty which captures the effect of model misspecification. We thus propose a principled framework for directly estimating the excess risk by learning a secondary predictor for the generalization error and subtracting an estimate of aleatoric uncertainty, i.e., intrinsic unpredictability. We discuss the merits of this novel measure of epistemic uncertainty, and highlight how it differs from variance-based measures of epistemic uncertainty and addresses its major pitfall. Our framework, Direct Epistemic Uncertainty Prediction (DEUP) is particularly interesting in interactive learning environments, where the learner is allowed to acquire novel examples in each round. Through a wide set of experiments, we illustrate how existing methods in sequential model optimization can be improved with epistemic uncertainty estimates from DEUP, and how DEUP can be used to drive exploration in reinforcement learning. We also evaluate the quality of uncertainty estimates from DEUP for probabilistic image classification and predicting synergies of drug combinations.  ( 2 min )
    Data Representativity for Machine Learning and AI Systems. (arXiv:2203.04706v2 [stat.ML] UPDATED)
    Data representativity is crucial when drawing inference from data through machine learning models. Scholars have increased focus on unraveling the bias and fairness in models, also in relation to inherent biases in the input data. However, limited work exists on the representativity of samples (datasets) for appropriate inference in AI systems. This paper reviews definitions and notions of a representative sample and surveys their use in scientific AI literature. We introduce three measurable concepts to help focus the notions and evaluate different data samples. Furthermore, we demonstrate that the contrast between a representative sample in the sense of coverage of the input space, versus a representative sample mimicking the distribution of the target population is of particular relevance when building AI systems. Through empirical demonstrations on US Census data, we evaluate the opposing inherent qualities of these concepts. Finally, we propose a framework of questions for creating and documenting data with data representativity in mind, as an addition to existing dataset documentation templates.  ( 2 min )
    From Robustness to Privacy and Back. (arXiv:2302.01855v1 [cs.LG])
    We study the relationship between two desiderata of algorithms in statistical inference and machine learning: differential privacy and robustness to adversarial data corruptions. Their conceptual similarity was first observed by Dwork and Lei (STOC 2009), who observed that private algorithms satisfy robustness, and gave a general method for converting robust algorithms to private ones. However, all general methods for transforming robust algorithms into private ones lead to suboptimal error rates. Our work gives the first black-box transformation that converts any adversarially robust algorithm into one that satisfies pure differential privacy. Moreover, we show that for any low-dimensional estimation task, applying our transformation to an optimal robust estimator results in an optimal private estimator. Thus, we conclude that for any low-dimensional task, the optimal error rate for $\varepsilon$-differentially private estimators is essentially the same as the optimal error rate for estimators that are robust to adversarially corrupting $1/\varepsilon$ training samples. We apply our transformation to obtain new optimal private estimators for several high-dimensional tasks, including Gaussian (sparse) linear regression and PCA. Finally, we present an extension of our transformation that leads to approximate differentially private algorithms whose error does not depend on the range of the output space, which is impossible under pure differential privacy.  ( 2 min )
    Sample Complexity of Probability Divergences under Group Symmetry. (arXiv:2302.01915v1 [math.ST])
    We rigorously quantify the improvement in the sample complexity of variational divergence estimations for group-invariant distributions. In the cases of the Wasserstein-1 metric and the Lipschitz-regularized $\alpha$-divergences, the reduction of sample complexity is proportional to an ambient-dimension-dependent power of the group size. For the maximum mean discrepancy (MMD), the improvement of sample complexity is more nuanced, as it depends on not only the group size but also the choice of kernel. Numerical simulations verify our theories.  ( 2 min )
    Certified Robustness of Learning-based Static Malware Detectors. (arXiv:2302.01757v1 [cs.CR])
    Certified defenses are a recent development in adversarial machine learning (ML), which aim to rigorously guarantee the robustness of ML models to adversarial perturbations. A large body of work studies certified defenses in computer vision, where $\ell_p$ norm-bounded evasion attacks are adopted as a tractable threat model. However, this threat model has known limitations in vision, and is not applicable to other domains -- e.g., where inputs may be discrete or subject to complex constraints. Motivated by this gap, we study certified defenses for malware detection, a domain where attacks against ML-based systems are a real and current threat. We consider static malware detection systems that operate on byte-level data. Our certified defense is based on the approach of randomized smoothing which we adapt by: (1) replacing the standard Gaussian randomization scheme with a novel deletion randomization scheme that operates on bytes or chunks of an executable; and (2) deriving a certificate that measures robustness to evasion attacks in terms of generalized edit distance. To assess the size of robustness certificates that are achievable while maintaining high accuracy, we conduct experiments on malware datasets using a popular convolutional malware detection model, MalConv. We are able to accurately classify 91% of the inputs while being certifiably robust to any adversarial perturbations of edit distance 128 bytes or less. By comparison, an existing certification of up to 128 bytes of substitutions (without insertions or deletions) achieves an accuracy of 78%. In addition, given that robustness certificates are conservative, we evaluate practical robustness to several recently published evasion attacks and, in some cases, find robustness beyond certified guarantees.  ( 2 min )
    Using Explainability to Inform Statistical Downscaling Based on Deep Learning Beyond Standard Validation Approaches. (arXiv:2302.01771v1 [stat.ML])
    Deep learning (DL) has emerged as a promising tool to downscale climate projections at regional-to-local scales from large-scale atmospheric fields following the perfect-prognosis (PP) approach. Given their complexity, it is crucial to properly evaluate these methods, especially when applied to changing climatic conditions where the ability to extrapolate/generalise is key. In this work, we intercompare several DL models extracted from the literature for the same challenging use-case (downscaling temperature in the CORDEX North America domain) and expand standard evaluation methods building on eXplainable artifical intelligence (XAI) techniques. We show how these techniques can be used to unravel the internal behaviour of these models, providing new evaluation dimensions and aiding in their diagnostic and design. These results show the usefulness of incorporating XAI techniques into statistical downscaling evaluation frameworks, especially when working with large regions and/or under climate change conditions.  ( 2 min )
    Leveraging a Probabilistic PCA Model to Understand the Multivariate Statistical Network Monitoring Framework for Network Security Anomaly Detection. (arXiv:2302.01759v1 [stat.ML])
    Network anomaly detection is a very relevant research area nowadays, especially due to its multiple applications in the field of network security. The boost of new models based on variational autoencoders and generative adversarial networks has motivated a reevaluation of traditional techniques for anomaly detection. It is, however, essential to be able to understand these new models from the perspective of the experience attained from years of evaluating network security data for anomaly detection. In this paper, we revisit anomaly detection techniques based on PCA from a probabilistic generative model point of view, and contribute a mathematical model that relates them. Specifically, we start with the probabilistic PCA model and explain its connection to the Multivariate Statistical Network Monitoring (MSNM) framework. MSNM was recently successfully proposed as a means of incorporating industrial process anomaly detection experience into the field of networking. We have evaluated the mathematical model using two different datasets. The first, a synthetic dataset created to better understand the analysis proposed, and the second, UGR'16, is a specifically designed real-traffic dataset for network security anomaly detection. We have drawn conclusions that we consider to be useful when applying generative models to network security detection.  ( 2 min )
    ResMem: Learn what you can and memorize the rest. (arXiv:2302.01576v1 [cs.LG])
    The impressive generalization performance of modern neural networks is attributed in part to their ability to implicitly memorize complex training patterns. Inspired by this, we explore a novel mechanism to improve model generalization via explicit memorization. Specifically, we propose the residual-memorization (ResMem) algorithm, a new method that augments an existing prediction model (e.g. a neural network) by fitting the model's residuals with a $k$-nearest neighbor based regressor. The final prediction is then the sum of the original model and the fitted residual regressor. By construction, ResMem can explicitly memorize the training labels. Empirically, we show that ResMem consistently improves the test set generalization of the original prediction model across various standard vision and natural language processing benchmarks. Theoretically, we formulate a stylized linear regression problem and rigorously show that ResMem results in a more favorable test risk over the base predictor.  ( 2 min )
    Optimality of Thompson Sampling with Noninformative Priors for Pareto Bandits. (arXiv:2302.01544v1 [cs.LG])
    In the stochastic multi-armed bandit problem, a randomized probability matching policy called Thompson sampling (TS) has shown excellent performance in various reward models. In addition to the empirical performance, TS has been shown to achieve asymptotic problem-dependent lower bounds in several models. However, its optimality has been mainly addressed under light-tailed or one-parameter models that belong to exponential families. In this paper, we consider the optimality of TS for the Pareto model that has a heavy tail and is parameterized by two unknown parameters. Specifically, we discuss the optimality of TS with probability matching priors that include the Jeffreys prior and the reference priors. We first prove that TS with certain probability matching priors can achieve the optimal regret bound. Then, we show the suboptimality of TS with other priors, including the Jeffreys and the reference priors. Nevertheless, we find that TS with the Jeffreys and reference priors can achieve the asymptotic lower bound if one uses a truncation procedure. These results suggest carefully choosing noninformative priors to avoid suboptimality and show the effectiveness of truncation procedures in TS-based policies.  ( 2 min )
    Beyond the Universal Law of Robustness: Sharper Laws for Random Features and Neural Tangent Kernels. (arXiv:2302.01629v1 [stat.ML])
    Machine learning models are vulnerable to adversarial perturbations, and a thought-provoking paper by Bubeck and Sellke has analyzed this phenomenon through the lens of over-parameterization: interpolating smoothly the data requires significantly more parameters than simply memorizing it. However, this "universal" law provides only a necessary condition for robustness, and it is unable to discriminate between models. In this paper, we address these gaps by focusing on empirical risk minimization in two prototypical settings, namely, random features and the neural tangent kernel (NTK). We prove that, for random features, the model is not robust for any degree of over-parameterization, even when the necessary condition coming from the universal law of robustness is satisfied. In contrast, for even activations, the NTK model meets the universal lower bound, and it is robust as soon as the necessary condition on over-parameterization is fulfilled. This also addresses a conjecture in prior work by Bubeck, Li and Nagaraj. Our analysis decouples the effect of the kernel of the model from an "interaction matrix", which describes the interaction with the test data and captures the effect of the activation. Our theoretical results are corroborated by numerical evidence on both synthetic and standard datasets (MNIST, CIFAR-10).  ( 2 min )
    Using natural language processing and structured medical data to phenotype patients hospitalized due to COVID-19. (arXiv:2302.01536v1 [cs.CL])
    To identify patients who are hospitalized because of COVID-19 as opposed to those who were admitted for other indications, we compared the performance of different computable phenotype definitions for COVID-19 hospitalizations that use different types of data from the electronic health records (EHR), including structured EHR data elements, provider notes, or a combination of both data types. And conduct a retrospective data analysis utilizing chart review-based validation. Participants are 586 hospitalized individuals who tested positive for SARS-CoV-2 during January 2022. We used natural language processing to incorporate data from provider notes and LASSO regression and Random Forests to fit classification algorithms that incorporated structured EHR data elements, provider notes, or a combination of structured data and provider notes. Results: Based on a chart review, 38% of 586 patients were determined to be hospitalized for reasons other than COVID-19 despite having tested positive for SARS-CoV-2. A classification algorithm that used provider notes had significantly better discrimination than one that used structured EHR data elements (AUROC: 0.894 vs 0.841, p < 0.001), and performed similarly to a model that combined provider notes with structured data elements (AUROC: 0.894 vs 0.893). Assessments of hospital outcome metrics significantly differed based on whether the population included all hospitalized patients who tested positive for SARS-CoV-2 versus those who were determined to have been hospitalized due to COVID-19. This work demonstrates the utility of natural language processing approaches to derive information related to patient hospitalizations in cases where there may be multiple conditions that could serve as the primary indication for hospitalization.  ( 3 min )
    Where and How to Improve Graph-based Spatio-temporal Predictors. (arXiv:2302.01701v1 [stat.ML])
    This paper introduces a novel residual correlation analysis, called AZ-analysis, to assess the optimality of spatio-temporal predictive models. The proposed AZ-analysis constitutes a valuable asset for discovering and highlighting those space-time regions where the model can be improved with respect to performance. The AZ-analysis operates under very mild assumptions and is based on a spatio-temporal graph that encodes serial and functional dependencies in the data; asymptotically distribution-free summary statistics identify existing residual correlation in space and time regions, hence localizing time frames and/or communities of sensors, where the predictor can be improved.  ( 2 min )
    A Lipschitz Bandits Approach for Continuous Hyperparameter Optimization. (arXiv:2302.01539v1 [cs.LG])
    One of the most critical problems in machine learning is HyperParameter Optimization (HPO), since choice of hyperparameters has a significant impact on final model performance. Although there are many HPO algorithms, they either have no theoretical guarantees or require strong assumptions. To this end, we introduce BLiE -- a Lipschitz-bandit-based algorithm for HPO that only assumes Lipschitz continuity of the objective function. BLiE exploits the landscape of the objective function to adaptively search over the hyperparameter space. Theoretically, we show that $(i)$ BLiE finds an $\epsilon$-optimal hyperparameter with $O \left( \frac{1}{\epsilon} \right)^{d_z + \beta}$ total budgets, where $d_z$ and $\beta$ are problem intrinsic; $(ii)$ BLiE is highly parallelizable. Empirically, we demonstrate that BLiE outperforms the state-of-the-art HPO algorithms on benchmark tasks. We also apply BLiE to search for noise schedule of diffusion models. Comparison with the default schedule shows that BLiE schedule greatly improves the sampling speed.  ( 2 min )
    Support Recovery in Sparse PCA with Non-Random Missing Data. (arXiv:2302.01535v1 [stat.ML])
    We analyze a practical algorithm for sparse PCA on incomplete and noisy data under a general non-random sampling scheme. The algorithm is based on a semidefinite relaxation of the $\ell_1$-regularized PCA problem. We provide theoretical justification that under certain conditions, we can recover the support of the sparse leading eigenvector with high probability by obtaining a unique solution. The conditions involve the spectral gap between the largest and second-largest eigenvalues of the true data matrix, the magnitude of the noise, and the structural properties of the observed entries. The concepts of algebraic connectivity and irregularity are used to describe the structural properties of the observed entries. We empirically justify our theorem with synthetic and real data analysis. We also show that our algorithm outperforms several other sparse PCA approaches especially when the observed entries have good structural properties. As a by-product of our analysis, we provide two theorems to handle a deterministic sampling scheme, which can be applied to other matrix-related problems.  ( 2 min )
    Failure-informed adaptive sampling for PINNs, Part II: combining with re-sampling and subset simulation. (arXiv:2302.01529v1 [math.NA])
    This is the second part of our series works on failure-informed adaptive sampling for physic-informed neural networks (FI-PINNs). In our previous work \cite{gao2022failure}, we have presented an adaptive sampling framework by using the failure probability as the posterior error indicator, where the truncated Gaussian model has been adopted for estimating the indicator. In this work, we present two novel extensions to FI-PINNs. The first extension consist in combining with a re-sampling technique, so that the new algorithm can maintain a constant training size. This is achieved through a cosine-annealing, which gradually transforms the sampling of collocation points from uniform to adaptive via training progress. The second extension is to present the subset simulation algorithm as the posterior model (instead of the truncated Gaussian model) for estimating the error indicator, which can more effectively estimate the failure probability and generate new effective training points in the failure region. We investigate the performance of the new approach using several challenging problems, and numerical experiments demonstrate a significant improvement over the original algorithm.  ( 2 min )
    Dataset Distillation Fixes Dataset Reconstruction Attacks. (arXiv:2302.01428v1 [cs.LG])
    Modern deep learning requires large volumes of data, which could contain sensitive or private information which cannot be leaked. Recent work has shown for homogeneous neural networks a large portion of this training data could be reconstructed with only access to the trained network parameters. While the attack was shown to work empirically, there exists little formal understanding of its effectiveness regime, and ways to defend against it. In this work, we first build a stronger version of the dataset reconstruction attack and show how it can provably recover its entire training set in the infinite width regime. We then empirically study the characteristics of this attack on two-layer networks and reveal that its success heavily depends on deviations from the frozen infinite-width Neural Tangent Kernel limit. More importantly, we formally show for the first time that dataset reconstruction attacks are a variation of dataset distillation. This key theoretical result on the unification of dataset reconstruction and distillation not only sheds more light on the characteristics of the attack but enables us to design defense mechanisms against them via distillation algorithms.  ( 2 min )
    Fast, Differentiable and Sparse Top-k: a Convex Analysis Perspective. (arXiv:2302.01425v1 [cs.LG])
    The top-k operator returns a k-sparse vector, where the non-zero values correspond to the k largest values of the input. Unfortunately, because it is a discontinuous function, it is difficult to incorporate in neural networks trained end-to-end with backpropagation. Recent works have considered differentiable relaxations, based either on regularization or perturbation techniques. However, to date, no approach is fully differentiable and sparse. In this paper, we propose new differentiable and sparse top-k operators. We view the top-k operator as a linear program over the permutahedron, the convex hull of permutations. We then introduce a p-norm regularization term to smooth out the operator, and show that its computation can be reduced to isotonic optimization. Our framework is significantly more general than the existing one and allows for example to express top-k operators that select values in magnitude. On the algorithmic side, in addition to pool adjacent violator (PAV) algorithms, we propose a new GPU/TPU-friendly Dykstra algorithm to solve isotonic optimization problems. We successfully use our operators to prune weights in neural networks, to fine-tune vision transformers, and as a router in sparse mixture of experts.  ( 2 min )
    Augmented Learning of Heterogeneous Treatment Effects via Gradient Boosting Trees. (arXiv:2302.01367v1 [stat.ML])
    Heterogeneous treatment effects (HTE) based on patients' genetic or clinical factors are of significant interest to precision medicine. Simultaneously modeling HTE and corresponding main effects for randomized clinical trials with high-dimensional predictive markers is challenging. Motivated by the modified covariates approach, we propose a two-stage statistical learning procedure for estimating HTE with optimal efficiency augmentation, generalizing to arbitrary interaction model and exploiting powerful extreme gradient boosting trees (XGBoost). Target estimands for HTE are defined in the scale of mean difference for quantitative outcomes, or risk ratio for binary outcomes, which are the minimizers of specialized loss functions. The first stage is to estimate the main-effect equivalency of the baseline markers on the outcome, which is then used as an augmentation term in the second stage estimation for HTE. The proposed two-stage procedure is robust to model mis-specification of main effects and improves efficiency for estimating HTE through nonparametric function estimation, e.g., XGBoost. A permutation test is proposed for global assessment of evidence for HTE. An analysis of a genetic study in Prostate Cancer Prevention Trial led by the SWOG Cancer Research Network, is conducted to showcase the properties and the utilities of the two-stage method.  ( 2 min )
    Hypothesis Testing and Machine Learning: Interpreting Variable Effects in Deep Artificial Neural Networks using Cohen's f2. (arXiv:2302.01407v1 [stat.ME])
    Deep artificial neural networks show high predictive performance in many fields, but they do not afford statistical inferences and their black-box operations are too complicated for humans to comprehend. Because positing that a relationship exists is often more important than prediction in scientific experiments and research models, machine learning is far less frequently used than inferential statistics. Additionally, statistics calls for improving the test of theory by showing the magnitude of the phenomena being studied. This article extends current XAI methods and develops a model agnostic hypothesis testing framework for machine learning. First, Fisher's variable permutation algorithm is tweaked to compute an effect size measure equivalent to Cohen's f2 for OLS regression models. Second, the Mann-Kendall test of monotonicity and the Theil-Sen estimator is applied to Apley's accumulated local effect plots to specify a variable's direction of influence and statistical significance. The usefulness of this approach is demonstrated on an artificial data set and a social survey with a Python sandbox implementation.  ( 2 min )

  • Open

    12 highlights from Google's BARD announcement
    I went through the entire blog post from Google and pulled out some quotes and highlights: ​ 1) “we re-oriented the company around AI six years ago” Right off the bat, “Pich-AI” lets it be known that Google is now an AI company. Partially true? Yes, of course. Would that phrase be coming out of his mouth at this point if not for the release and success of ChatGPT? No. 2) their mission: “organize the world’s information and make it universally accessible and useful” There’s a book called The Innovator’s Dilemma: When New Technologies Cause Great Firms to Fail. I'm certainly not here to say that Google is going to fail, but the re-stating of the mission makes it clear that they view AI (and Bard) as a way to improve, supplement, and perhaps protect their search business. This is …  ( 45 min )
    Weekly China AI News: Baidu's Language Model Behind Rumored ChatGPT Search; Tencent-Backed Robot Startup Files for Hong Kong IPO; Xpeng Targets "Full Autonomy" in 2023
    submitted by /u/trcytony [link] [comments]  ( 40 min )
    The Blair Witch Project come to life with Midjourney as an 80's Horror Film
    submitted by /u/barrese87 [link] [comments]  ( 40 min )
    Harry Potter come to life with Midjourney as an 80s Love Film
    submitted by /u/barrese87 [link] [comments]  ( 40 min )
    [Project] I used a new ML algo called "AnimeSR" to restore the Cowboy Bebop movie and up rez it to full 4K. Here's a link to the end result - honestly think it looks amazing! (Video and Model link in post)
    submitted by /u/VR_Angel [link] [comments]  ( 41 min )
    Over the weekend I used a new ML algo called "AnimeSR" to restore the Cowboy Bebop movie and up rez it to full 4K. Here's a link to the end result - honestly think it looks amazing! Enjoy!
    submitted by /u/VR_Angel [link] [comments]  ( 40 min )
    Built a Telegram AI tutor bot + updates
    Hey! Since first posting here we've got 800+ users taking almost 1,000 courses! In short, this is how the bot works: 1️⃣ Send the captain a topic - usually, one or a few words is enough 2️⃣ Get a mini-course divided into 5 chapters 3️⃣ Receive your content packed into a beautiful magazine-style sharable link Would love to know what you think! http://edwardbot.com/ submitted by /u/Itaydr [link] [comments]  ( 41 min )
    AI Dream 144 - DARK CLOUD AI FUSION - MINDBLOW MONDAY
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Google launches ChatGPT competitor "Bard" and more
    submitted by /u/Peaking_AI [link] [comments]  ( 40 min )
    Change My Mind: You either believe consciousness is biomechanistic and therefore replicable, or you believe consciousness comes from some supernatural source and therefore not replicable. If it's the former, then you should believe AI is/can be sentient.
    I'll just try and add some detail in case the title doesn't make sense for some. The two major options I see: You believe consciousness arises from a very complicated system of biology > cells > molecules > atoms > elements. Ultimately a mechanistic view that, metaphorically, life is complicated machinery. If not the former, then it seems the only other major option is that life is somehow endowed some supernatural or divine force that humans could never synthetically replicate. This is of course generally considered a spiritual/religious belief. So if you believe in item 1, doesn't this mean you pretty much have to believe that AI is or can have consciousness? submitted by /u/sidianmsjones [link] [comments]  ( 44 min )
    OpenAI Have Started a Search Engine Revolution.
    submitted by /u/shauryadevil [link] [comments]  ( 40 min )
    AI Seinfeld show banned on Twitch for transphobic comments
    submitted by /u/Zirius_Sadfaces [link] [comments]  ( 40 min )
    Gen-1: The Next Step Forward for Generative AI - Use words and images to generate new videos out of existing ones
    submitted by /u/magenta_placenta [link] [comments]  ( 40 min )
    Spotify's Founder Has Developed An AI-Powered Body Health Scanner
    submitted by /u/Flaky_Preparation_50 [link] [comments]  ( 40 min )
    runway announces GEN-1 — video to video generative AI
    submitted by /u/AR_MR_XR [link] [comments]  ( 40 min )
    Integrate OpenAI with .NET Core and Angular15
    submitted by /u/TheDotnetoffice [link] [comments]  ( 40 min )
    Google invests 500 million dollars in the next rival of ChatGPT
    submitted by /u/nikesh96 [link] [comments]  ( 41 min )
    Seinfeld AI makes transgender joke and gets banned on twitch
    AI Seinfeld Transphobic rant - YouTube submitted by /u/Status_Signal_4083 [link] [comments]  ( 42 min )
    Hi-ResNet: High resolution image classifier. (448, 896, 1792 sq.px.)
    submitted by /u/johnGettings [link] [comments]  ( 41 min )
    Spotify’s Founder Has Developed An AI-powered Body Health Scanner
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 41 min )
    ChatFAI: Chat with your favorite characters (updates and a challenge)
    ​ Characters as shown on https://chatfai.com/characters/ Hi everyone! I have recently made some exciting changes to my ChatFAI web app. The public characters library is now live - it's now easy to share and install public characters. Added a regenerate reply option. Created a new plan without any daily limit. I have gotten a lot of help and support from this community. The feedback and support from you all are really helpful and that is how I am improving ChatFAI (based on the feedback and suggestions). So, here I am again. What do you think about the latest updates? Is it going in the right direction? Another challenge I have not resolved yet is finding B2B use cases for ChatFAI. Thank you for your help and support - it's greatly appreciated! submitted by /u/usamaejazch [link] [comments]  ( 42 min )
    Streamline Ticket Triage and Reduce Customer Churn with AI
    submitted by /u/DarronFeldstein [link] [comments]  ( 40 min )
    I Made a Text Bot Powered by ChatGPT, DALLE 2, and Wolfram Alpha
    submitted by /u/ImplodingCoding [link] [comments]  ( 44 min )
  • Open

    MC vs TD(0) on windy gridworld
    Hi, I'm relatively new to the field of RL but with strong experience in DL, I'm currently studying the Sutton & Barto book, and more specifically TD(0) method. At some point the discuss a "windy" gridworld example, in that we have the normal gridworld setup, but there is an upward wind on some tiles that may move the agent upwards as many squares as indicated in the bottom of each column. Here is a schematic of the setup: Windy Gridworld example from Sutton Barto book If you perform an action that would lead to the agent falling of the grid, the agent remains at the state it was. E.G. if the agent is in the top-left square and performs the action `up`, they will remain at the top-left square. They essentially use this to present SARSA, but then they mention the following for MC methods: Note that Monte Carlo methods cannot easily be used here because termination is not guaranteed for all policies. If a policy was ever found that caused the agent to stay in the same state, then the next episode would never end. This strikes me quite as odd. Does this mean that I cannot (or that it's not safe to) use MC methods in any problem where it's possible to do a transition from a state that will result in the agent remaining in the same state? The first thing that comes to mind is the normal gridworld example (i.e. w/out any wind), does this restriction that they mention here mean that MC is not safe to use? submitted by /u/dep0 [link] [comments]  ( 42 min )
    [Discussion] League of Legends Reinforcement Learning Library - Interest
    Hello everyone. I am considering making a reinforcement learning library for the most recent version of League of Legends based on discussions on an existing library here. Would there be any interest in an RL library for League of Legends? The interface would work like the following: ```python import tlol.gym as leaguegym env = leaguegym.make() while True: obs = env.reset() done = False while not done: #Here we sample a random action. If you have an agent, you would get an action from it here. action = env.action_space.sample() next_obs, reward, done, gameinfo = env.step(action) obs = next_obs ``` If you look at my previous posts, I have already created an RL environment for League of Legends v4.20 where other people have also taken the project and successfully trained agents which learn adversarially against each other here. I've also released many gameplay datasets for League of Legends during Season 12 here also for supervised learning and RL. At the moment tlol-py contains an interface for ML models to play League of Legends but I'm considering creating a purpose built library for RL for League of Legends. Would there be interest in a project like this? submitted by /u/Ok-Alps-7918 [link] [comments]  ( 42 min )
    Why the sim2real problem in robotic manipulation?
    Hi all, assuming the task is opening the door with a robot, as far as I understand the sim2real problem happens as the robot behaves differently in the real world as the physics in the simulator (where the agent is trained) are not 100% identical in the real world. From my understanding the sim2real problem occurs if we let the agent also handle this controller part. But why cant we just extract the trajectory of the manipulator that the agent generates to open the door and executes it with the controller from the real world? Am I missing something here? submitted by /u/Fun-Moose-3841 [link] [comments]  ( 43 min )
    Question on return values of the .step() method in a multi-agent environment
    Is it possible that, in a multi-agent environment, the return values for reward, terminated, and truncated of a .step(actions) method call have different values for each agent? If so, is there an example environment? submitted by /u/Toni-SM [link] [comments]  ( 41 min )
    I have implemented an RL agent for trading EUR/USD and I don't know what to do next...
    So, after months of learning about RL and doing toy implentations, I have coded a DQN, with experience buffer and dual nets. The network design is like the most average thing you can come across in ML scene. A simple deep feed forward with Relu and Linear as activation functions. I have also coded a simplified version of the Forex market for my agent to train in. It has bid ask prices, leverage, call margin, and buy/sell/not-in-the-market positions. The whole given state to the model is nothing fancy. It is merely the historical, model's balance and a few binary indictors about the environment. Since I'm cripplingly poor, I don't have any specialized hardware for training the model. After burning like 100 hours into the free version of Google collab with three different learning rates I…  ( 46 min )
    Why is my PPO algorithm not learning a simple environment?
    I have made a Stack Overflow post here. I will highly appreciate all your help on this. Thank you! submitted by /u/Academic-Rent7800 [link] [comments]  ( 42 min )
    Does it make sense to use RL for trading?
    I have seen some blog posts and papers about using RL for financial trading. I have to be hones, I didn't read that stuff in details. However the main idea seems kind of clear: you can model the market as a MDP where the state space encodes the relevant features of the market and your current portfolio, the possible actions are what/how much to sell/buy and the reward function should express the value change of your portfolio. However I am a bit puzzled. Clearly your actions as a trader do not really affect the market that much meaning that the transition function (probability distribution for the next state) does not depend on your action (excluding the changes in your portfolio). Why would RL provide any advantages over more classical approaches? I might miss something, but maybe formalizing the trading problem as a (contextual) multi-armed bandit seems more reasonable to me. submitted by /u/AdministrativeBank48 [link] [comments]  ( 48 min )
  • Open

    [Discussion] League of Legends Reinforcement Learning Library - Interest
    Hello everyone. I am considering making a reinforcement learning library for the most recent version of League of Legends based on discussions on an existing library here. Would there be any interest in an RL library for League of Legends? The interface would work like the following: ```python import tlol.gym as leaguegym env = leaguegym.make() while True: obs = env.reset() done = False while not done: #Here we sample a random action. If you have an agent, you would get an action from it here. action = env.action_space.sample() next_obs, reward, done, gameinfo = env.step(action) obs = next_obs ``` If you look at my previous posts, I have already created an RL environment for League of Legends v4.20 where other people have also taken the project and successfully trained agents which learn adversarially against each other here. I've also released many gameplay datasets for League of Legends during Season 12 here also for supervised learning and RL. At the moment tlol-py contains an interface for ML models to play League of Legends but I'm considering creating a purpose built library for RL for League of Legends. Would there be interest in a project like this? submitted by /u/Ok-Alps-7918 [link] [comments]  ( 43 min )
    [Project] Need Suggestions Improving the Model evaluation scores.
    Hi, I'm working on a project, where we're to classify a user into High or Low Income. The dataset contains 9000+ features and the number of observations/rows are, 30000 representing household. The features include the media consumption habits of people. Hourly, Weekly, Monthly and Yearly for different TV channels. So far I have tried SVC, Random Forest and Logistic Regression. I used an ensemble of these three. However, I haven't been able to get past 63% accuracy. I tried PCA, however, the results range b/w 61-63% accuracy, recall and precision. I do wanna add that the data is already scaled between 0-1 and most of the columns are sparse (0 values for many rows). Honestly, I have tried pretty much everything, but can't seem to raise the evaluation metrics. Can someone direct me to the right path on what I can do to improve the scores? submitted by /u/Toko_yami [link] [comments]  ( 43 min )
    [P] I have implemented an RL agent for trading EUR/USD and I don't know what to do next...
    So, after months of learning about RL and doing toy implentations, I have coded a DQN, with experience buffer and dual nets. The network design is like the most average thing you can come across in ML scene. A simple deep feed forward with Relu and Linear as activation functions. I have also coded a simplified version of the Forex market for my agent to train in. It has bid ask prices, leverage, call margin, and buy/sell/not-in-the-market positions. The whole given state to the model is nothing fancy. It is merely the historical, model's balance and a few binary indictors about the environment. Since I'm cripplingly poor, I don't have any specialized hardware for training the model. After burning like 100 hours into the free version of Google collab with three different learning rates I…  ( 45 min )
    [Project] I used a new ML algo called "AnimeSR" to restore the Cowboy Bebop movie and up rez it to full 4K. Here's a link to the end result - honestly think it looks amazing! (Video and Model link in post)
    It took me about 46 hours to run this on my 3080 at home. The original files was from the Blu-ray release that was unfortunately pretty poorly done in my opinion. This version really gives it new life I think. Here's a link to the video result to see for yourself: https://vimeo.com/796411232 And a link to the model I used! https://github.com/TencentARC/AnimeSR submitted by /u/VR_Angel [link] [comments]  ( 43 min )
    [P] Looking for string generation GAN
    I have had minimal luck finding documentation on creating or using a premade string generator. It can't be a text generator really... because I am building it for translation from one language to another. I want to teach the generator to produce guesses on what the best translation for a single word would be based on underlying language semantics. I don't need it to be accurate necessarily, just a point-of-reference for observing language and phonetic mechanics. submitted by /u/lullaby876 [link] [comments]  ( 42 min )
    [D] What techniques can I use to tell if a problem is likely enough to be solved by ML so as to justify compiling the dataset?
    I have a problem that if I solve it with ML, I'll make money, with an outside chance of it being a lot of money. Compiling a dataset will take significant work. Are there any techniques that I can apply to let me know if this is going to be worth it? Perhaps there are certain hallmarks that a problem would have if it is likely to be solvable with available data? Maybe something I can do with a small initial dataset? Thanks. submitted by /u/SnuggleWuggleSleep [link] [comments]  ( 43 min )
    [N] Google: An Important Next Step On Our AI Journey
    https://blog.google/technology/ai/bard-google-ai-search-updates/ submitted by /u/EducationalCicada [link] [comments]  ( 50 min )
    [N] Getty Images sues AI art generator Stable Diffusion in the US for copyright infringement
    From the article: Getty Images has filed a lawsuit in the US against Stability AI, creators of open-source AI art generator Stable Diffusion, escalating its legal battle against the firm. The stock photography company is accusing Stability AI of “brazen infringement of Getty Images’ intellectual property on a staggering scale.” It claims that Stability AI copied more than 12 million images from its database “without permission ... or compensation ... as part of its efforts to build a competing business,” and that the startup has infringed on both the company’s copyright and trademark protections. This is different from the UK-based news from weeks ago. submitted by /u/Wiskkey [link] [comments]  ( 44 min )
    Does the high dimensionality of AI systems that model the real world tell us something about the abstract space of ideas? [D]
    Physical world we live in has 4 dimensions, string theory posits like up to 10. It seems like in order to successfully model the abstract space of ideas which relates things in the physical world to each other and describes them, machine learning needs thousands of dimensions. Also to the extent that ML algos/matrices can be made sparse, that seems to me to tell us something about the density of the mapping between abstract space and physical space... anyone know any papers w/this line of thinking? It also seems a bit unintuitive to me because it seems like geometrically space gets exponentially more complicated as you add dimensions but ML scales linearly or better in many cases with matrix dimensionality. submitted by /u/Frumpagumpus [link] [comments]  ( 44 min )
    [P] Forecasting methods in Time Series
    Hi all! For the longest time, I was having issues understanding how to use time series to do forecasting. Over the last few weeks, I have been writing a series of posts to guide anyone through the process! I am also in the process of writing a detailed practical guide with step-by-step instructions. ​ Right now I have 6 articles on the topic: * Introduction to ARIMA models (https://mlpills.dev/time-series/introduction-to-arima-models/) * Parameters selection in ARIMA models (https://mlpills.dev/time-series/parameters-selection-in-arima-models/) * Seasonal ARIMA (https://mlpills.dev/time-series/seasonal-arima/) * ARCH / GARCH models for Time Series (https://mlpills.dev/time-series/arch-garch-models-for-time-series/) * ARIMA-GARCH models (https://mlpills.dev/time-series/arima-garch-models/) * And today's -> Forecasting in Time Series (https://mlpills.dev/time-series/forecasting-in-time-series/) Let me know if there are any topics that you would like me to cover in the future! submitted by /u/daansan-ml [link] [comments]  ( 43 min )
    Which strategies,framework and applications tools can be implement to automatically monitor the health of the machine learning model? [D]
    Machine LearningModels when deployed in the production environment, model degradation can arise where their output will change if the relationship between the incoming serving data and the predicted target drifts apart. Please can someone briefly elaborate on what strategies, frameworks and application tools can be implemented to automatically monitor the health of the model and alert the Data Scientist of any decay in data quality, data drift, and model quality? submitted by /u/astronaut1971 [link] [comments]  ( 42 min )
    [P] I made image clustering and captioning tools
    I made an image captioning and clustering tools for computer vision and diffusion projects. You can run almost everything automatically and with a simple CLI command. All contributions are welcome. https://github.com/cobanov/image-clustering https://github.com/cobanov/image-captioning submitted by /u/metover [link] [comments]  ( 42 min )
    High-speed cameras and deep learning [Research]
    I haven’t been able to find research on deep learning using high-speed cameras that capture images at frame rates higher than 250fps. I wonder if they are rather useless for image/video processing or do any of you have any ideas about potential applications. submitted by /u/A15L [link] [comments]  ( 44 min )
    [R] Research trends in Graph Neural Networks (GNN)
    Deep connections discovered between Graph Diffusion Networks and Partial Differential Equations modelling heat transfer. https://towardsdatascience.com/graph-neural-networks-as-neural-diffusion-pdes-8571b8c0c774 https://arxiv.org/abs/2106.10934 Strange connections uncovered between GNNs and Structural Causal Models. https://arxiv.org/abs/2109.04173 https://www.youtube.com/watch?v=XC-Bfg3dO0I GNNs used to enhance the factualness of LLMs by providing embeddings from Knowledge Graphs (KEs). https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00360/98089 GNNs used to categorize objects from only their 3D mesh. https://arxiv.org/pdf/2106.15778.pdf Prediction of intuitive physics among physical objects. https://proceedings.neurips.cc/paper/2016/hash/3147da8ab4a0437c15ef51a5cc7f2dc4-Abstract.html Zero-shot generalization in robot Task Planning. https://arxiv.org/abs/2102.13177 https://www.youtube.com/watch?v=POxaTDAj7aY submitted by /u/moschles [link] [comments]  ( 42 min )
    [R] Creating a Large Language Model of a Philosopher
    Paper : https://arxiv.org/abs/2302.01339 Abstract : Can large language models be trained to produce philosophical texts that are difficult to distinguish from texts produced by human philosophers? To address this question, we fine-tuned OpenAI's GPT-3 with the works of philosopher Daniel C. Dennett as additional training data. To explore the Dennett model, we asked the real Dennett ten philosophical questions and then posed the same questions to the language model, collecting four responses for each question without cherry-picking. We recruited 425 participants to distinguish Dennett's answer from the four machine-generated answers. Experts on Dennett's work (N = 25) succeeded 51% of the time, above the chance rate of 20% but short of our hypothesized rate of 80% correct. For two of the ten questions, the language model produced at least one answer that experts selected more frequently than Dennett's own answer. Philosophy blog readers (N = 302) performed similarly to the experts, while ordinary research participants (N = 98) were near chance distinguishing GPT-3's responses from those of an "actual human philosopher". submitted by /u/starstruckmon [link] [comments]  ( 44 min )
    [P] I Made a Text Bot Powered by ChatGPT, DALLE 2, and Wolfram Alpha
    submitted by /u/ImplodingCoding [link] [comments]  ( 43 min )
    [R] deep learning and session-specific rapid recalibration for dynamic hand gesture recognition from EMG
    submitted by /u/t0ns0fph0t0ns [link] [comments]  ( 44 min )
    [D] AtheneWins just showcased an AI streamer bot, Does anyone know how he did this?
    submitted by /u/imagoons [link] [comments]  ( 42 min )
    [D] RNN and S4 etc
    Hello what's the state of modern RNNs, why does S4 not use nonlinearity on the state vector? What happened to unitary RNN or independent RNN (which sounds like exponential moving average)? submitted by /u/windoze [link] [comments]  ( 42 min )
  • Open

    Automating the math for decision-making under uncertainty
    A new tool brings the benefits of AI programming to a much broader class of problems.  ( 8 min )
  • Open

    Create powerful self-service experiences with Amazon Lex on Talkdesk CX Cloud contact center
    This blog post is co-written with Bruno Mateus, Jonathan Diedrich and Crispim Tribuna at Talkdesk. Contact centers are using artificial intelligence (AI) and natural language processing (NLP) technologies to build a personalized customer experience and deliver effective self-service support through conversational bots. This is the first of a two-part series dedicated to the integration of […]  ( 8 min )
    Image classification model selection using Amazon SageMaker JumpStart
    Researchers continue to develop new model architectures for common machine learning (ML) tasks. One such task is image classification, where images are accepted as input and the model attempts to classify the image as a whole with object label outputs. With many models available today that perform this image classification task, an ML practitioner may […]  ( 11 min )
  • Open

    Is there any value in having non uniform activation functions in the hidden layer?
    I wrote my first neural network implementation following a tutorial. I was amazed that it was less than 140 lines of code. A thought occurred to me and I brought it to chatgpt first, but I'd like some human input from people who know better. I'm wondering what the pros and cons would be for having a network where the hidden layer chooses an activation function when it is initialized that may be different per node. Chatgpt was polite in its responses but it seems very clear that this is a bad idea for most common goals with a neural network. I'm making little creatures that just kind of exist, adding more inputs and outputs as I think of them. So I wondered recently \1. Would there be any benefit from 1 node using a sigmoid and another node using relu? \2. I know it would really screw up the network, but what would happen if I also had a small chance for the activation function to change in a node when creating the next generation? It's baby's first simulation where the best nodes survive to make a new generation. Since I'm using a sort of evolution model I'm not worried about "bad" training results or overly complicated brains, the 2 points chatgpt kept stressing no matter how I asked it. Does anyone have some thoughts they can share on this? It's possible that the question is hard to answer because it's so stupid / pointless for what a neural network would be used for. I get that impression from stack overflow answers to similar questions, hah. I don't have a goal, I just want to watch the little critters exist and struggle. submitted by /u/Tomnnn [link] [comments]  ( 44 min )
    Create a fake person
    Hi. I don’t know if it is a good place to ask this question but i have to make a project for my uni and i am a bit confused. The topic is „creating fake avatar”. I need to create a fake person based on other people images and make it „alive”. By alive I mean that i can create many pictures of this person in different situations. I know i need to use GAN but I just can’t get my head around on how to do it. I mean. First i need a neural network to create a fake person. But how to use this fake person to create different scenarios? Thank you for any help in advance. submitted by /u/Acrobatic_Ad6507 [link] [comments]  ( 42 min )
  • Open

    It’s No Big Deal, but ChatGPT Changes Everything – Part III
    “I’ll tell you the problem with the scientific power that you’re using here: it didn’t require any discipline to attain it. You read what others had done and you took the next step. You didn’t earn the knowledge for yourselves, so you don’t take any responsibility for it. You stood on the shoulders of geniuses… Read More »It’s No Big Deal, but ChatGPT Changes Everything – Part III The post It’s No Big Deal, but ChatGPT Changes Everything – Part III appeared first on Data Science Central.  ( 24 min )
    Guide to Best Flutter State Management Libraries for 2023
    Flutter is a free, open-source mobile user interface (UI) framework that Google developed in 2017. It allows users to create native mobile applications using a single codebase. Using a single codebase and one programming language, users can develop apps for two different platforms, Android and iOS. It’s considered the most effective cross-platform framework available. Flutter… Read More »Guide to Best Flutter State Management Libraries for 2023 The post Guide to Best Flutter State Management Libraries for 2023 appeared first on Data Science Central.  ( 21 min )
    Ensuring Data Security in Realtime Operating System (RTOS) Devices
    Just a few days ago, January 28, we celebrated Data Protection Day, an international event aimed at promoting data privacy and security. In line with the goal of raising awareness about data protection, it would be a good time to discuss data security with Realtime Operating System. This unconventional operating system is widely used, so… Read More »Ensuring Data Security in Realtime Operating System (RTOS) Devices The post Ensuring Data Security in Realtime Operating System (RTOS) Devices appeared first on Data Science Central.  ( 21 min )
  • Open

    AI Joins Hunt for ET: Study Finds 8 Potential Alien Signals
    A University of Toronto undergrad among an international team of researchers unleashing deep learning in the search for extraterrestrial civilizations.  ( 6 min )

  • Open

    [D] Overview of of Chatbot Research?
    Is there a good overview of the state of chatbot research? I'm wondering if the ChatGPT approach of big LLM + RLHF is now considered the only way forward? How about alternatives like BlenderBot3? And what are the best open source chatbots right now? Or if you can't create your own ChatGPT, how does using a GPT3 sized model + prompt engineering compare to smaller models with supervised fine tuning on a conversation dataset? submitted by /u/renbid [link] [comments]  ( 42 min )
    [D] Large language models (LLM) as priority / conflict resolver for embodied AI or in general
    I wanted to discuss the possibilities to use LLM in generating answer based on the context and resolving conflict. Some recent work leveraging LLM in robotics planning, like Language Models as Zero-Shot Planner use LLM to generate plans for robot. What are your views in terms of LLM which leverage the background knowledge and visual clues together to generate correct next action by robots or embodied systems. As a human we decide actions based on resolving priority or conflict based on rules/ concepts , can LLM takes these rules /concept explicitly in decision making to generate new set of actions? Example: while chopping the veggies by robots, if hand comes in between then robot will stop the chopping process of veggies. As chopping task and human hand presence are in conflict and humans hand safety is of higher priority than cutting. How such small-small kind of knowledge be encoded in these robotics system which makes them more safer and trustworthy in general. As LLM requires larges corpus of knowledge/data. submitted by /u/projekt_treadstone [link] [comments]  ( 43 min )
    [D] Is there a database of English language tokens, including all dictionary words and common word segments?
    I find it odd that I have to regenerate this from my input set each time. It should be something we can just start with pre-created. submitted by /u/MrOfficialCandy [link] [comments]  ( 42 min )
    [D] Is English the optimal language to train NLP models on?
    While the greatest amount of training content is available for English at the moment, it seems unlikely to me that it's an efficient language to train AI. A more optimal language would reduce training time and model size. It might, for example, be much more efficient to train AI on Chinese, Korean, or Japanese due to a reduce grammatical token-set when constructing sentences/ideas. But taking the idea further, I wonder if we should be using a human language at all. Perhaps it's more efficient to use something altogether new in order to both communicate with AI more exactingly and also to reduce model size/training. What do y'all think? submitted by /u/MrOfficialCandy [link] [comments]  ( 44 min )
    [R] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 42 min )
    [P] Interactive Map of NeurIPS Proceedings 1987-2022
    submitted by /u/NomicAI [link] [comments]  ( 42 min )
    [P] I made CoPilot for writing LaTeX (in Overleaf) - what do you think?
    submitted by /u/alistairmcleay [link] [comments]  ( 42 min )
    [P] NeuralFit now allows evolution of recurrent neural networks
    Hi all! Some time ago I made a post about NeuralFit (https://github.com/neural-fit/neuralfit), which allows you to evolve neural networks in 🐍 Python with just a few lines of code. Good news: it is now possible to evolve recurrent models on timeseries, which is useful for doing stock market predictions for example 📈. There is currently a simple example, but more examples will be added soon! In addition: the free limitations have been increased, so you can use NeuralFit for most hobby projects without needing a license. Just like last time, feedback is immensely appreciated! submitted by /u/wagenaartje [link] [comments]  ( 43 min )
    [R] Can anyone direct me to academic sources arguing that Big Tech using AI for targeted Social Media ads is a good thing for actual users?
    Been struggling to find sources relating to this, it’s mostly just tech websites or blogs I keep coming across. I’m struggling to find any academic papers arguing for specifically the use of user data to create targeted ads. submitted by /u/lara_lara24 [link] [comments]  ( 43 min )
    [D] How Machine Learning is Transforming Cybersecurity
    submitted by /u/DenofBlerds [link] [comments]  ( 42 min )
    [R] [D] PADL: Language-Directed Physics-Based Character Control by NVIDIA
    submitted by /u/WarmFormal9881 [link] [comments]  ( 42 min )
    [R] [D] The New XOR Problem
    submitted by /u/shawntan [link] [comments]  ( 45 min )
    [P] I made a browser extension that uses ChatGPT to answer every StackOverflow question
    submitted by /u/jsonathan [link] [comments]  ( 46 min )
    Are PhDs in statistics useful for ML research? [D]
    How much do research labs or research jobs in ML hire statisticians vs computer scientists or mathematicians? submitted by /u/AdFew4357 [link] [comments]  ( 44 min )
    Why not use Stable Diffusion’s VAE to get textual embeddings? [D]
    submitted by /u/sudo_fuck_you [link] [comments]  ( 42 min )
    [D] Does the M2 Max 30-core GPU have any advantage over M2 Pro 19-core GPU in Machine Learning Tasks?
    submitted by /u/dona6603 [link] [comments]  ( 42 min )
    [P] tradeslyPro - AI Roboadvisor
    submitted by /u/mrtkp9993 [link] [comments]  ( 42 min )
    [D] List of Large Language Models to play with.
    Hello! I'm trying to understand what available LLMs one can "relatively easily" play with. My goal is to understand the landscape since I haven't worked in this field before. I'm trying to run them "from the largest to the smallest". By "relatively easy", I mean doesn't require to setup a GPU cluster or costs more than $20:) Here are some examples I have found so far: ChatGPT (obviously) - 175B params OpenAI api to access GPT-3s (from ada (0.5B) to davinci (175B)). Also CodeX Bloom (176B) - text window on that page seems to work reliably, you just need to keep pressing "generate" OPT-175B (Facebook LLM), the hosting works surprisingly fast, but slower than ChatGPT Several models on HuggingFace that I made to run with Colab Pro subscription: GPT-NeoX 20B, Flan-t5-xxl 11B, Xlm-roberta-xxl 10.7B, GPT-j 6B. I spent about $20 total on running the models below. None of the Hugging face API interfaces/spaces didn't work for me :(. Here is an example notebook I made for NeoX. Does anyone know more models that are easily accessible? P.S. Some large models I couldn't figure out (yet) how to run easily: Galactica-120b 120B Opt-30b 30B submitted by /u/sinavski [link] [comments]  ( 44 min )
    [N] "I got access to Google LaMDA, the Chatbot that was so realistic that one Google engineer thought it was conscious. First impressions"
    Tweet thread: https://twitter.com/WholeMarsBlog/status/1622139178439036928 First impressions: this sucks ass I can only ask about dogs and a few different types of prompts Does anyone else have experiences to share with this nerfed LaMDA beta google released? submitted by /u/That_Violinist_18 [link] [comments]  ( 44 min )
    Objects Color Matching against a Reference Standard (ColorCODEX)? [D]
    Im trying to build and train a Machine Learning model that autonomously performs color matching between the target gemstone and the Reference Standard color chart. A digital photo image of the target gemstone is first captured in a controlled environment in terms of illumination and background. This digital image is further pre-processed and fed into an algorithm that recognizes and match its color distribution to the closest color in the Reference Standard color chart. Numerous Reference Standards exist but I will use the ColorCODEX (this link ColorCODEX) So I would like to know which Machine Learning Model to use in this case to ensure high matching accuracy and like what performance metric can I use to measure matching accuracy and the color space for the color model. And at the end what image pre-processing needs to be done? I found this article (https://www.atlantis-press.com/proceedings/icosat-17/25895985)with backpropagation NN but not sure if it the best choice. Any other option? submitted by /u/astronaut1971 [link] [comments]  ( 44 min )
    [R] AudioLDM: Text-to-Audio Generation with Latent Diffusion Models
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 42 min )
    [D] GNN Is node information required ?
    [D] Hey, It's kind of a simple one but just putting it out for opinions: when passing a graph through graph neural networks to obtain vectors for all the nodes. Is the info in the node required because all we care is about the position of certain node in context to the whole graph and that's how gnn outputs the vectors of each node. Sorry if that was messy. submitted by /u/ab_11nav [link] [comments]  ( 42 min )
    What text to speech does this guy use? [R]
    https://youtu.be/ktdUeqzzhiA what text to speech does he use? he's been popping up on my yt feed lately and i can see he has different voices in his videos and most of them sound robotic, what do you think it's being used here? submitted by /u/candidhorse4 [link] [comments]  ( 42 min )
  • Open

    Fullmetal Alchemist as an 80's Dark Fantasy movie (Chapters in Descripti...
    submitted by /u/EIDANart [link] [comments]  ( 40 min )
    AI text generator for news
    Hey all! For an art installation I want to be able to generate war news in specific countries and have a specific word used in it. For instance, a 2-3 sentence news about the war in Yemen with the word "love" in it. Any AI platforms you can suggest? submitted by /u/jarjar_bigh [link] [comments]  ( 41 min )
    GPT3-Assisted Google search, document/video/audio/website/youtube video indexer and composer conveniently built into Discord!
    submitted by /u/yikeshardware [link] [comments]  ( 42 min )
    AI Ethics, Deep Fakes, and the Dark Side of the Algorithm
    submitted by /u/IndependenceFun4627 [link] [comments]  ( 40 min )
    AI Dream 156 - This EPIC AI Video might break Youtube
    submitted by /u/LordPewPew777 [link] [comments]  ( 41 min )
    This web can make voices only with AI
    https://voice.ai/r/AfpKl. This web can create very good quality AI voices. They have majorly famous people voices. For using this voices you only need to write something and then the AI will say it with the voice you choose. submitted by /u/Marcosio9083 [link] [comments]  ( 41 min )
    Most advanced AI for clothing try-on?
    There have been already some ai developers teamed up with big firms to create AI for virtual clothing try-on even few years ago, but considering lastest models can generate images of very realistic human (including garments on them) from nothing I suppose there must be practical ones that can generate better quality try-on image given a series of photos of a model and clothes to put on. Do you know any such models? submitted by /u/PMATHbreaker [link] [comments]  ( 41 min )
    Is Microsoft's Azure OpenAI service generally 'cost effective' for a small number of specialist users in a small business?
    Imagine a type of user that is a senior manager, they do not have an analysis background. What i'd like is for them to be able to tell a dashboard "what happened to our customer reviews last week?" and a text summary is generated and parameters to charts sent out, and charts generated I'm exploring the aspect of cost here I read Azure's OpenAI marketing and see: > Azure OpenAI Service is enabling customers across industries from health care to financial services to manufacturing to quickly perform an array of tasks. Innovations include generating unique content for customers, summarizing and classifying customer feedback, and extracting text from medical records to streamline billing. The most common uses have been writing assistance, translating natural language to code and gaining data insights through search, entity extraction, sentiment and classification. Great But the pricing looks expensive for Base Series Fine-tuned, Curie: > Inferencing per 1,000 tokens | $0.002 > Hosting per hour | $0.24 and so i'd need to think about some caching intermediary stage to prevent too many calls to the Azure OpenAI API So is using Azure here for an ai based solution cost effective for a small business? submitted by /u/Work_Owl [link] [comments]  ( 42 min )
    Universities are acting like a fish out of the water due to AI
    submitted by /u/foundersblock [link] [comments]  ( 40 min )
    Starting a career in AI
    Starting a career in AI Hello folks, I’m a Product Manager with 10 years of experience in B2C mobile apps. I love mobile apps and I’m really passionate about user experience. However, I lately started using various AI products and I’m really impressed. Since I believe it the future I was thinking if it’s possible to start learning the basics and even make a career change in AI development (as soon as I learn what fields exist). I have studied computer science and I’m familiar with software development. Any idea/recommendations where to start from and what a possible career path would be? Thanks submitted by /u/EquivalentMongoose95 [link] [comments]  ( 42 min )
    Technology Readiness Levels (TRL) in AI development
    How can we move from an idea to production in AI? Does the technology readiness levels (TRL) help? If you want to get some answers please read this article in medium: https://medium.com/towards-artificial-intelligence/technology-readiness-levels-trl-in-ai-development-c6ed1190fbd6 All the ideas are more than welcome! submitted by /u/Nice-Tomorrow2926 [link] [comments]  ( 40 min )
    Request: Voice cloning app. (Website or app)
    I am looking for a free voice cloning app or website (Like elevenlabs, but I would need a quota bypass for that) and I can't find any. Any suggestions? submitted by /u/1sydxyz [link] [comments]  ( 40 min )
    I have a question for you? I am in need of specific AI tool I am not aware if exists yet.
    I am searching for an image and I need to find where the image comes from and it can be anywhere on the web or as a part of video as whole so is there an AI you can give a name , the said picture, photo, sentence or a specific data to find the exact photo in a random youtube video, the same name on some old site or inclusion of A photo I have on another site where it may have originated. An Ai that will scan the entire youtube , watch every vid and then list all the instances where the image was used. submitted by /u/Plajomzn [link] [comments]  ( 42 min )
    ChatGPT Becomes Fastest Growing App, Beating TikTok In Popularity
    submitted by /u/liquidocelotYT [link] [comments]  ( 42 min )
    The Fast and the Furious come to life with Midjourney as an 80's Film
    submitted by /u/barrese87 [link] [comments]  ( 40 min )
    Ai that let’s you Play As Neo in the matrix, in an open ended movie experience
    submitted by /u/techmanj [link] [comments]  ( 40 min )
    Why Overfitting and Underfitting Happen
    Hi guys, I have made a video on YouTube here where I explain why underfitting and overfitting happen in machine learning models by looking at the fundamental theory behind bias variance trade-off. I hope it may be of use to some of you out there. As always, feedback is more than welcomed! :) submitted by /u/Personal-Trainer-541 [link] [comments]  ( 41 min )
    How will AI affect Day-to-day business?
    https://buddingmanager.com/2023/01/26/https-buddingmanager-com-artificialintelligenceinbusiness/ submitted by /u/OriginalRecklessPark [link] [comments]  ( 40 min )
    🌎 Make your best prediction: HOW will AI systems change the world in the coming 10 years? What will be different 10 years later, because of AI systems like ChatGPT, Midjourney, Codex, Whisper and others?
    submitted by /u/DrMelbourne [link] [comments]  ( 43 min )
    I created a stream where AI bots watch movies and deliver a running commentary
    Hi all, For my weekend project I figured I would build an AI driven spiritual successor to Mystery Science Theater 3000... Stop on by and watch the AI characters watch movies and make comments! Today they are watching "The House on Haunted Hill" and "Plan 9 From Outer Space." There's still a lot to do but I'm excited to play around with this more and see how it plays out and would love some feedback! https://twitch.tv/MysteryAItheater submitted by /u/caseigl [link] [comments]  ( 42 min )
    AI That Can Put Objects from Different Pictures Together?
    I'm trying to put together pictures of my best friend (he unfortunately passed away) and myself together. We don't have any pictures of us together since our childhood /teen days. I would like to put our individually taken pics together using AI. Can someone recommend any tools? submitted by /u/mustufa2020 [link] [comments]  ( 41 min )
    Amazing "Jailbreak" Bypasses ChatGPT's Ethics Safeguards
    submitted by /u/Mental_Character7367 [link] [comments]  ( 54 min )
    Breaking: Google Invests in AnthropicAI and Claude with $300 Million Round for 10 Percent of the A.I. Lab valued at $5 Billion
    submitted by /u/BackgroundResult [link] [comments]  ( 41 min )
  • Open

    Getty Images v. Stability AI – Lawsuit filing
    submitted by /u/nickb [link] [comments]  ( 40 min )
  • Open

    How to teach the agent to arrive at the goal by creating a search pattern
    Hi all, assuming the goal is to reach a ball on the table. The reward function used for this task is often: d= norm( gripper_position - ball_position ) , which will solve the problem. However, how can one teach the agent not to "directly" go to the ball, but creating a search pattern, for example, "scratching the surface with the gripper until you find the ball"? submitted by /u/Fun-Moose-3841 [link] [comments]  ( 43 min )
    Autonomous Driving Off-Road | Swaayatt Robots | Dense Fog
    submitted by /u/shani_786 [link] [comments]  ( 41 min )
  • Open

    Improved Analysis of Score-based Generative Modeling: User-Friendly Bounds under Minimal Smoothness Assumptions. (arXiv:2211.01916v2 [cs.LG] UPDATED)
    We give an improved theoretical analysis of score-based generative modeling. Under a score estimate with small $L^2$ error (averaged across timesteps), we provide efficient convergence guarantees for any data distribution with second-order moment, by either employing early stopping or assuming smoothness condition on the score function of the data distribution. Our result does not rely on any log-concavity or functional inequality assumption and has a logarithmic dependence on the smoothness. In particular, we show that under only a finite second moment condition, approximating the following in reverse KL divergence in $\epsilon$-accuracy can be done in $\tilde O\left(\frac{d \log (1/\delta)}{\epsilon}\right)$ steps: 1) the variance-$\delta$ Gaussian perturbation of any data distribution; 2) data distributions with $1/\delta$-smooth score functions. Our analysis also provides a quantitative comparison between different discrete approximations and may guide the choice of discretization points in practice.
    Learning PDE Solution Operator for Continuous Modeling of Time-Series. (arXiv:2302.00854v1 [cs.LG])
    Learning underlying dynamics from data is important and challenging in many real-world scenarios. Incorporating differential equations (DEs) to design continuous networks has drawn much attention recently, however, most prior works make specific assumptions on the type of DEs, making the model specialized for particular problems. This work presents a partial differential equation (PDE) based framework which improves the dynamics modeling capability. Building upon the recent Fourier neural operator, we propose a neural operator that can handle time continuously without requiring iterative operations or specific grids of temporal discretization. A theoretical result demonstrating its universality is provided. We also uncover an intrinsic property of neural operators that improves data efficiency and model generalization by ensuring stability. Our model achieves superior accuracy in dealing with time-dependent PDEs compared to existing models. Furthermore, several numerical pieces of evidence validate that our method better represents a wide range of dynamics and outperforms state-of-the-art DE-based models in real-time-series applications. Our framework opens up a new way for a continuous representation of neural networks that can be readily adopted for real-world applications.
    Is Model Ensemble Necessary? Model-based RL via a Single Model with Lipschitz Regularized Value Function. (arXiv:2302.01244v1 [cs.LG])
    Probabilistic dynamics model ensemble is widely used in existing model-based reinforcement learning methods as it outperforms a single dynamics model in both asymptotic performance and sample efficiency. In this paper, we provide both practical and theoretical insights on the empirical success of the probabilistic dynamics model ensemble through the lens of Lipschitz continuity. We find that, for a value function, the stronger the Lipschitz condition is, the smaller the gap between the true dynamics- and learned dynamics-induced Bellman operators is, thus enabling the converged value function to be closer to the optimal value function. Hence, we hypothesize that the key functionality of the probabilistic dynamics model ensemble is to regularize the Lipschitz condition of the value function using generated samples. To test this hypothesis, we devise two practical robust training mechanisms through computing the adversarial noise and regularizing the value network's spectral norm to directly regularize the Lipschitz condition of the value functions. Empirical results show that combined with our mechanisms, model-based RL algorithms with a single dynamics model outperform those with an ensemble of probabilistic dynamics models. These findings not only support the theoretical insight, but also provide a practical solution for developing computationally efficient model-based RL algorithms.
    Dynamic Recognition of Speakers for Consent Management by Contrastive Embedding Replay. (arXiv:2205.08459v2 [cs.SD] UPDATED)
    Voice assistants overhear conversations and a consent management mechanism is required. Consent management can be implemented using speaker recognition. Users that do not give consent enrol their voice and all their further recordings are discarded. Building speaker recognition-based consent management is challenging as dynamic registration, removal, and re-registration of speakers must be efficiently handled. This work proposes a consent management system addressing the aforementioned challenges. A contrastive based training is applied to learn the underlying speaker equivariance inductive bias. The contrastive features for buckets of speakers are trained a few steps into each iteration and act as replay buffers. These features are progressively selected using a multi-strided random sampler for classification. Moreover, new methods for dynamic registration using a portion of old utterances, removal, and re-registration of speakers are proposed. The results verify memory efficiency and dynamic capabilities of the proposed methods and outperform the existing approach from the literature.
    Live in the Moment: Learning Dynamics Model Adapted to Evolving Policy. (arXiv:2207.12141v2 [cs.LG] UPDATED)
    Model-based reinforcement learning (RL) often achieves higher sample efficiency in practice than model-free RL by learning a dynamics model to generate samples for policy learning. Previous works learn a dynamics model that fits under the empirical state-action visitation distribution for all historical policies, i.e., the sample replay buffer. However, in this paper, we observe that fitting the dynamics model under the distribution for \emph{all historical policies} does not necessarily benefit model prediction for the \emph{current policy} since the policy in use is constantly evolving over time. The evolving policy during training will cause state-action visitation distribution shifts. We theoretically analyze how this distribution shift over historical policies affects the model learning and model rollouts. We then propose a novel dynamics model learning method, named \textit{Policy-adapted Dynamics Model Learning (PDML)}. PDML dynamically adjusts the historical policy mixture distribution to ensure the learned model can continually adapt to the state-action visitation distribution of the evolving policy. Experiments on a range of continuous control environments in MuJoCo show that PDML achieves significant improvement in sample efficiency and higher asymptotic performance combined with the state-of-the-art model-based RL methods.
    Modelling the long-term fairness dynamics of data-driven targeted help on job seekers. (arXiv:2208.08881v2 [cs.LG] UPDATED)
    The use of data-driven decision support by public agencies is becoming more widespread and already influences the allocation of public resources. This raises ethical concerns, as it has adversely affected minorities and historically discriminated groups. In this paper, we use an approach that combines statistics and data-driven approaches with dynamical modeling to assess long-term fairness effects of labor market interventions. Specifically, we develop and use a model to investigate the impact of decisions caused by a public employment authority that selectively supports job-seekers through targeted help. The selection of who receives what help is based on a data-driven intervention model that estimates an individual's chances of finding a job in a timely manner and rests upon data that describes a population in which skills relevant to the labor market are unevenly distributed between two groups (e.g., males and females). The intervention model has incomplete access to the individual's actual skills and can augment this with knowledge of the individual's group affiliation, thus using a protected attribute to increase predictive accuracy. We assess this intervention model's dynamics -- especially fairness-related issues and trade-offs between different fairness goals -- over time and compare it to an intervention model that does not use group affiliation as a predictive feature. We conclude that in order to quantify the trade-off correctly and to assess the long-term fairness effects of such a system in the real-world, careful modeling of the surrounding labor market is indispensable.
    Do Kernel and Neural Embeddings Help in Training and Generalization?. (arXiv:1905.05095v3 [cs.LG] UPDATED)
    Recent results on optimization and generalization properties of neural networks showed that in a simple two-layer network, the alignment of the labels to the eigenvectors of the corresponding Gram matrix determines the convergence of the optimization during training. Such analyses also provide upper bounds on the generalization error. We experimentally investigate the implications of these results to deeper networks via embeddings. We regard the layers preceding the final hidden layer as producing different representations of the input data which are then fed to the two-layer model. We show that these representations improve both optimization and generalization. In particular, we investigate three kernel representations when fed to the final hidden layer: the Gaussian kernel and its approximation by random Fourier features, kernels designed to imitate representations produced by neural networks and finally an optimal kernel designed to align the data with target labels. The approximated representations induced by these kernels are fed to the neural network and the optimization and generalization properties of the final model are evaluated and compared.
    Learning-To-Ensemble by Contextual Rank Aggregation in E-Commerce. (arXiv:2107.08598v3 [cs.LG] UPDATED)
    Ensemble models in E-commerce combine predictions from multiple sub-models for ranking and revenue improvement. Industrial ensemble models are typically deep neural networks, following the supervised learning paradigm to infer conversion rate given inputs from sub-models. However, this process has the following two problems. Firstly, the point-wise scoring approach disregards the relationships between items and leads to homogeneous displayed results, while diversified display benefits user experience and revenue. Secondly, the learning paradigm focuses on the ranking metrics and does not directly optimize the revenue. In our work, we propose a new Learning-To-Ensemble (LTE) framework RAEGO, which replaces the ensemble model with a contextual Rank Aggregator (RA) and explores the best weights of sub-models by the Evaluator-Generator Optimization (EGO). To achieve the best online performance, we propose a new rank aggregation algorithm TournamentGreedy as a refinement of classic rank aggregators, which also produces the best average weighted Kendall Tau Distance (KTD) amongst all the considered algorithms with quadratic time complexity. Under the assumption that the best output list should be Pareto Optimal on the KTD metric for sub-models, we show that our RA algorithm has higher efficiency and coverage in exploring the optimal weights. Combined with the idea of Bayesian Optimization and gradient descent, we solve the online contextual Black-Box Optimization task that finds the optimal weights for sub-models given a chosen RA model. RA-EGO has been deployed in our online system and has improved the revenue significantly.
    Neural Design for Genetic Perturbation Experiments. (arXiv:2207.12805v2 [q-bio.QM] UPDATED)
    The problem of how to genetically modify cells in order to maximize a certain cellular phenotype has taken center stage in drug development over the last few years (with, for example, genetically edited CAR-T, CAR-NK, and CAR-NKT cells entering cancer clinical trials). Exhausting the search space for all possible genetic edits (perturbations) or combinations thereof is infeasible due to cost and experimental limitations. This work provides a theoretically sound framework for iteratively exploring the space of perturbations in pooled batches in order to maximize a target phenotype under an experimental budget. Inspired by this application domain, we study the problem of batch query bandit optimization and introduce the Optimistic Arm Elimination ($\mathrm{OAE}$) principle designed to find an almost optimal arm under different functional relationships between the queries (arms) and the outputs (rewards). We analyze the convergence properties of $\mathrm{OAE}$ by relating it to the Eluder dimension of the algorithm's function class and validate that $\mathrm{OAE}$ outperforms other strategies in finding optimal actions in experiments on simulated problems, public datasets well-studied in bandit contexts, and in genetic perturbation datasets when the regression model is a deep neural network. OAE also outperforms the benchmark algorithms in 3 of 4 datasets in the GeneDisco experimental planning challenge.
    Summarization Programs: Interpretable Abstractive Summarization with Neural Modular Trees. (arXiv:2209.10492v2 [cs.CL] UPDATED)
    Current abstractive summarization models either suffer from a lack of clear interpretability or provide incomplete rationales by only highlighting parts of the source document. To this end, we propose the Summarization Program (SP), an interpretable modular framework consisting of an (ordered) list of binary trees, each encoding the step-by-step generative process of an abstractive summary sentence from the source document. A Summarization Program contains one root node per summary sentence, and a distinct tree connects each summary sentence (root node) to the document sentences (leaf nodes) from which it is derived, with the connecting nodes containing intermediate generated sentences. Edges represent different modular operations involved in summarization such as sentence fusion, compression, and paraphrasing. We first propose an efficient best-first search method over neural modules, SP-Search that identifies SPs for human summaries by directly optimizing for ROUGE scores. Next, using these programs as automatic supervision, we propose seq2seq models that generate Summarization Programs, which are then executed to obtain final summaries. We demonstrate that SP-Search effectively represents the generative process behind human summaries using modules that are typically faithful to their intended behavior. We also conduct a simulation study to show that Summarization Programs improve the interpretability of summarization models by allowing humans to better simulate model reasoning. Summarization Programs constitute a promising step toward interpretable and modular abstractive summarization, a complex task previously addressed primarily through blackbox end-to-end neural systems. Supporting code available at https://github.com/swarnaHub/SummarizationPrograms
    Towards Understanding and Mitigating Dimensional Collapse in Heterogeneous Federated Learning. (arXiv:2210.00226v2 [cs.LG] UPDATED)
    Federated learning aims to train models collaboratively across different clients without the sharing of data for privacy considerations. However, one major challenge for this learning paradigm is the {\em data heterogeneity} problem, which refers to the discrepancies between the local data distributions among various clients. To tackle this problem, we first study how data heterogeneity affects the representations of the globally aggregated models. Interestingly, we find that heterogeneous data results in the global model suffering from severe {\em dimensional collapse}, in which representations tend to reside in a lower-dimensional space instead of the ambient space. Moreover, we observe a similar phenomenon on models locally trained on each client and deduce that the dimensional collapse on the global model is inherited from local models. In addition, we theoretically analyze the gradient flow dynamics to shed light on how data heterogeneity result in dimensional collapse for local models. To remedy this problem caused by the data heterogeneity, we propose {\sc FedDecorr}, a novel method that can effectively mitigate dimensional collapse in federated learning. Specifically, {\sc FedDecorr} applies a regularization term during local training that encourages different dimensions of representations to be uncorrelated. {\sc FedDecorr}, which is implementation-friendly and computationally-efficient, yields consistent improvements over baselines on standard benchmark datasets. Code: https://github.com/Yujun-Shi/FedCLS.
    An Instrumental Variable Approach to Confounded Off-Policy Evaluation. (arXiv:2212.14468v2 [stat.ML] UPDATED)
    Off-policy evaluation (OPE) is a method for estimating the return of a target policy using some pre-collected observational data generated by a potentially different behavior policy. In some cases, there may be unmeasured variables that can confound the action-reward or action-next-state relationships, rendering many existing OPE approaches ineffective. This paper develops an instrumental variable (IV)-based method for consistent OPE in confounded Markov decision processes (MDPs). Similar to single-stage decision making, we show that IV enables us to correctly identify the target policy's value in infinite horizon settings as well. Furthermore, we propose an efficient and robust value estimator and illustrate its effectiveness through extensive simulations and analysis of real data from a world-leading short-video platform.
    Analysis of Knowledge Transfer in Kernel Regime. (arXiv:2003.13438v3 [cs.LG] UPDATED)
    Knowledge transfer is shown to be a very successful technique for training neural classifiers: together with the ground truth data, it uses the "privileged information" (PI) obtained by a "teacher" network to train a "student" network. It has been observed that classifiers learn much faster and more reliably via knowledge transfer. However, there has been little or no theoretical analysis of this phenomenon. To bridge this gap, we propose to approach the problem of knowledge transfer by regularizing the fit between the teacher and the student with PI provided by the teacher. Using tools from dynamical systems theory, we show that when the student is an extremely wide two layer network, we can analyze it in the kernel regime and show that it is able to interpolate between PI and the given data. This characterization sheds new light on the relation between the training error and capacity of the student relative to the teacher. Another contribution of the paper is a quantitative statement on the convergence of student network. We prove that the teacher reduces the number of required iterations for a student to learn, and consequently improves the generalization power of the student. We give corresponding experimental analysis that validates the theoretical results and yield additional insights.
    A Light Recipe to Train Robust Vision Transformers. (arXiv:2209.07399v2 [cs.CV] UPDATED)
    In this paper, we ask whether Vision Transformers (ViTs) can serve as an underlying architecture for improving the adversarial robustness of machine learning models against evasion attacks. While earlier works have focused on improving Convolutional Neural Networks, we show that also ViTs are highly suitable for adversarial training to achieve competitive performance. We achieve this objective using a custom adversarial training recipe, discovered using rigorous ablation studies on a subset of the ImageNet dataset. The canonical training recipe for ViTs recommends strong data augmentation, in part to compensate for the lack of vision inductive bias of attention modules, when compared to convolutions. We show that this recipe achieves suboptimal performance when used for adversarial training. In contrast, we find that omitting all heavy data augmentation, and adding some additional bag-of-tricks ($\varepsilon$-warmup and larger weight decay), significantly boosts the performance of robust ViTs. We show that our recipe generalizes to different classes of ViT architectures and large-scale models on full ImageNet-1k. Additionally, investigating the reasons for the robustness of our models, we show that it is easier to generate strong attacks during training when using our recipe and that this leads to better robustness at test time. Finally, we further study one consequence of adversarial training by proposing a way to quantify the semantic nature of adversarial perturbations and highlight its correlation with the robustness of the model. Overall, we recommend that the community should avoid translating the canonical training recipes in ViTs to robust training and rethink common training choices in the context of adversarial training.
    "Why did the Model Fail?": Attributing Model Performance Changes to Distribution Shifts. (arXiv:2210.10769v2 [cs.LG] UPDATED)
    Performance of machine learning models may differ between training and deployment for many reasons. For instance, model performance can change between environments due to changes in data quality, observing a different population than the one in training, or changes in the relationship between labels and features. These changes result in distribution shifts across environments. Attributing model performance changes to specific shifts is critical for identifying sources of model failures, and for taking mitigating actions that ensure robust models. In this work, we introduce the problem of attributing performance differences between environments to distribution shifts in the underlying data generating mechanisms. We formulate the problem as a cooperative game where the players are distributions. We define the value of a set of distributions to be the change in model performance when only this set of distributions has changed between environments, and derive an importance weighting method for computing the value of an arbitrary set of distributions. The contribution of each distribution to the total performance change is then quantified as its Shapley value. We demonstrate the correctness and utility of our method on synthetic, semi-synthetic, and real-world case studies, showing its effectiveness in attributing performance changes to a wide range of distribution shifts.
    Safe Optimization of an Industrial Refrigeration Process Using an Adaptive and Explorative Framework. (arXiv:2211.13019v2 [math.OC] UPDATED)
    Many industrial applications rely on real-time optimization to improve key performance indicators. In the case of unknown process characteristics, real-time optimization becomes challenging, particularly for the satisfaction of safety constraints. In this paper, we demonstrate the application of an adaptive and explorative real-time optimization framework to an industrial refrigeration process, where we learn the process characteristics through changes in process control targets and through exploration to satisfy safety constraints. We quantify the uncertainty in unknown compressor characteristics of the refrigeration plant by using Gaussian processes and incorporate this uncertainty into the objective function of the real-time optimization problem as a weighted cost term. We adaptively control the weight of this term to drive exploration. The results of our simulation experiments indicate the proposed approach can help to increase the energy efficiency of the considered refrigeration process, closely approximating the performance of a solution that has complete information about the compressor performance characteristics.
    High-Probability Bounds for Stochastic Optimization and Variational Inequalities: the Case of Unbounded Variance. (arXiv:2302.00999v1 [math.OC])
    During recent years the interest of optimization and machine learning communities in high-probability convergence of stochastic optimization methods has been growing. One of the main reasons for this is that high-probability complexity bounds are more accurate and less studied than in-expectation ones. However, SOTA high-probability non-asymptotic convergence results are derived under strong assumptions such as the boundedness of the gradient noise variance or of the objective's gradient itself. In this paper, we propose several algorithms with high-probability convergence results under less restrictive assumptions. In particular, we derive new high-probability convergence results under the assumption that the gradient/operator noise has bounded central $\alpha$-th moment for $\alpha \in (1,2]$ in the following setups: (i) smooth non-convex / Polyak-Lojasiewicz / convex / strongly convex / quasi-strongly convex minimization problems, (ii) Lipschitz / star-cocoercive and monotone / quasi-strongly monotone variational inequalities. These results justify the usage of the considered methods for solving problems that do not fit standard functional classes studied in stochastic optimization.
    Large-scale Stochastic Optimization of NDCG Surrogates for Deep Learning with Provable Convergence. (arXiv:2202.12183v5 [cs.LG] UPDATED)
    NDCG, namely Normalized Discounted Cumulative Gain, is a widely used ranking metric in information retrieval and machine learning. However, efficient and provable stochastic methods for maximizing NDCG are still lacking, especially for deep models. In this paper, we propose a principled approach to optimize NDCG and its top-$K$ variant. First, we formulate a novel compositional optimization problem for optimizing the NDCG surrogate, and a novel bilevel compositional optimization problem for optimizing the top-$K$ NDCG surrogate. Then, we develop efficient stochastic algorithms with provable convergence guarantees for the non-convex objectives. Different from existing NDCG optimization methods, the per-iteration complexity of our algorithms scales with the mini-batch size instead of the number of total items. To improve the effectiveness for deep learning, we further propose practical strategies by using initial warm-up and stop gradient operator. Experimental results on multiple datasets demonstrate that our methods outperform prior ranking approaches in terms of NDCG. To the best of our knowledge, this is the first time that stochastic algorithms are proposed to optimize NDCG with a provable convergence guarantee. Our proposed methods are implemented in the LibAUC library at https://libauc.org/.
    Multi-agent Reinforcement Learning with Graph Q-Networks for Antenna Tuning. (arXiv:2302.01199v1 [cs.NI])
    Future generations of mobile networks are expected to contain more and more antennas with growing complexity and more parameters. Optimizing these parameters is necessary for ensuring the good performance of the network. The scale of mobile networks makes it challenging to optimize antenna parameters using manual intervention or hand-engineered strategies. Reinforcement learning is a promising technique to address this challenge but existing methods often use local optimizations to scale to large network deployments. We propose a new multi-agent reinforcement learning algorithm to optimize mobile network configurations globally. By using a value decomposition approach, our algorithm can be trained from a global reward function instead of relying on an ad-hoc decomposition of the network performance across the different cells. The algorithm uses a graph neural network architecture which generalizes to different network topologies and learns coordination behaviors. We empirically demonstrate the performance of the algorithm on an antenna tilt tuning problem and a joint tilt and power control problem in a simulated environment.
    Robust Estimation under the Wasserstein Distance. (arXiv:2302.01237v1 [stat.ML])
    We study the problem of robust distribution estimation under the Wasserstein metric, a popular discrepancy measure between probability distributions rooted in optimal transport (OT) theory. We introduce a new outlier-robust Wasserstein distance $\mathsf{W}_p^\varepsilon$ which allows for $\varepsilon$ outlier mass to be removed from its input distributions, and show that minimum distance estimation under $\mathsf{W}_p^\varepsilon$ achieves minimax optimal robust estimation risk. Our analysis is rooted in several new results for partial OT, including an approximate triangle inequality, which may be of independent interest. To address computational tractability, we derive a dual formulation for $\mathsf{W}_p^\varepsilon$ that adds a simple penalty term to the classic Kantorovich dual objective. As such, $\mathsf{W}_p^\varepsilon$ can be implemented via an elementary modification to standard, duality-based OT solvers. Our results are extended to sliced OT, where distributions are projected onto low-dimensional subspaces, and applications to homogeneity and independence testing are explored. We illustrate the virtues of our framework via applications to generative modeling with contaminated datasets.
    From Traditional Adaptive Data Caching to Adaptive Context Caching: A Survey. (arXiv:2211.11259v2 [cs.HC] UPDATED)
    Context data is in demand more than ever with the rapid increase in the development of many context-aware Internet of Things applications. Research in context and context-awareness is being conducted to broaden its applicability in light of many practical and technical challenges. One of the challenges is improving performance when responding to large number of context queries. Context Management Platforms that infer and deliver context to applications measure this problem using Quality of Service (QoS) parameters. Although caching is a proven way to improve QoS, transiency of context and features such as variability, heterogeneity of context queries pose an additional real-time cost management problem. This paper presents a critical survey of state-of-the-art in adaptive data caching with the objective of developing a body of knowledge in cost- and performance-efficient adaptive caching strategies. We comprehensively survey a large number of research publications and evaluate, compare, and contrast different techniques, policies, approaches, and schemes in adaptive caching. Our critical analysis is motivated by the focus on adaptively caching context as a core research problem. A formal definition for adaptive context caching is then proposed, followed by identified features and requirements of a well-designed, objective optimal adaptive context caching strategy.
    Uncertainty in Fairness Assessment: Maintaining Stable Conclusions Despite Fluctuations. (arXiv:2302.01079v1 [cs.LG])
    Several recent works encourage the use of a Bayesian framework when assessing performance and fairness metrics of a classification algorithm in a supervised setting. We propose the Uncertainty Matters (UM) framework that generalizes a Beta-Binomial approach to derive the posterior distribution of any criteria combination, allowing stable performance assessment in a bias-aware setting.We suggest modeling the confusion matrix of each demographic group using a Multinomial distribution updated through a Bayesian procedure. We extend UM to be applicable under the popular K-fold cross-validation procedure. Experiments highlight the benefits of UM over classical evaluation frameworks regarding informativeness and stability.
    A Survey on Efficient Training of Transformers. (arXiv:2302.01107v1 [cs.LG])
    Recent advances in Transformers have come with a huge requirement on computing resources, highlighting the importance of developing efficient training techniques to make Transformer training faster, at lower cost, and to higher accuracy by the efficient use of computation and memory resources. This survey provides the first systematic overview of the efficient training of Transformers, covering the recent progress in acceleration arithmetic and hardware, with a focus on the former. We analyze and compare methods that save computation and memory costs for intermediate tensors during training, together with techniques on hardware/algorithm co-design. We finally discuss challenges and promising areas for future research.
    Optimal Stopping via Randomized Neural Networks. (arXiv:2104.13669v3 [stat.ML] UPDATED)
    This paper presents new machine learning approaches to approximate the solutions of optimal stopping problems. The key idea of these methods is to use neural networks, where the parameters of the hidden layers are generated randomly and only the last layer is trained, in order to approximate the continuation value. Our approaches are applicable to high dimensional problems where the existing approaches become increasingly impractical. In addition, since our approaches can be optimized using simple linear regression, they are easy to implement and theoretical guarantees are provided. Our randomized reinforcement learning approach and randomized recurrent neural network approach outperform the state-of-the-art and other relevant machine learning approaches in Markovian and non-Markovian examples, respectively. In particular, we test our approaches on Black-Scholes, Heston, rough Heston and fractional Brownian motion. Moreover, we show that they can also be used to efficiently compute Greeks of American options.
    Epistemic Neural Networks. (arXiv:2107.08924v7 [cs.LG] UPDATED)
    Intelligence relies on an agent's knowledge of what it does not know. This capability can be assessed based on the quality of joint predictions of labels across multiple inputs. In principle, ensemble-based approaches produce effective joint predictions, but the computational costs of training large ensembles can become prohibitive. We introduce the epinet: an architecture that can supplement any conventional neural network, including large pretrained models, and can be trained with modest incremental computation to estimate uncertainty. With an epinet, conventional neural networks outperform very large ensembles, consisting of hundreds or more particles, with orders of magnitude less computation. The epinet does not fit the traditional framework of Bayesian neural networks. To accommodate development of approaches beyond BNNs, such as the epinet, we introduce the epistemic neural network (ENN) as an interface for models that produce joint predictions.
    LogLG: Weakly Supervised Log Anomaly Detection via Log-Event Graph Construction. (arXiv:2208.10833v4 [cs.SE] UPDATED)
    Fully supervised log anomaly detection methods suffer the heavy burden of annotating massive unlabeled log data. Recently, many semi-supervised methods have been proposed to reduce annotation costs with the help of parsed templates. However, these methods consider each keyword independently, which disregards the correlation between keywords and the contextual relationships among log sequences. In this paper, we propose a novel weakly supervised log anomaly detection framework, named LogLG, to explore the semantic connections among keywords from sequences. Specifically, we design an end-to-end iterative process, where the keywords of unlabeled logs are first extracted to construct a log-event graph. Then, we build a subgraph annotator to generate pseudo labels for unlabeled log sequences. To ameliorate the annotation quality, we adopt a self-supervised task to pre-train a subgraph annotator. After that, a detection model is trained with the generated pseudo labels. Conditioned on the classification results, we re-extract the keywords from the log sequences and update the log-event graph for the next iteration. Experiments on five benchmarks validate the effectiveness of LogLG for detecting anomalies on unlabeled log data and demonstrate that LogLG, as the state-of-the-art weakly supervised method, achieves significant performance improvements compared to existing methods.
    OntoED: Low-resource Event Detection with Ontology Embedding. (arXiv:2105.10922v4 [cs.IR] CROSS LISTED)
    Event Detection (ED) aims to identify event trigger words from a given text and classify it into an event type. Most of current methods to ED rely heavily on training instances, and almost ignore the correlation of event types. Hence, they tend to suffer from data scarcity and fail to handle new unseen event types. To address these problems, we formulate ED as a process of event ontology population: linking event instances to pre-defined event types in event ontology, and propose a novel ED framework entitled OntoED with ontology embedding. We enrich event ontology with linkages among event types, and further induce more event-event correlations. Based on the event ontology, OntoED can leverage and propagate correlation knowledge, particularly from data-rich to data-poor event types. Furthermore, OntoED can be applied to new unseen event types, by establishing linkages to existing ones. Experiments indicate that OntoED is more predominant and robust than previous approaches to ED, especially in data-scarce scenarios.
    A general Markov decision process formalism for action-state entropy-regularized reward maximization. (arXiv:2302.01098v1 [cs.LG])
    Previous work has separately addressed different forms of action, state and action-state entropy regularization, pure exploration and space occupation. These problems have become extremely relevant for regularization, generalization, speeding up learning and providing robust solutions at unprecedented levels. However, solutions of those problems are hectic, ranging from convex and non-convex optimization, and unconstrained optimization to constrained optimization. Here we provide a general dual function formalism that transforms the constrained optimization problem into an unconstrained convex one for any mixture of action and state entropies. The cases with pure action entropy and pure state entropy are understood as limits of the mixture.
    Learning Globally Smooth Functions on Manifolds. (arXiv:2210.00301v3 [cs.LG] UPDATED)
    Smoothness and low dimensional structures play central roles in improving generalization and stability in learning and statistics. This work combines techniques from semi-infinite constrained learning and manifold regularization to learn representations that are globally smooth on a manifold. To do so, it shows that under typical conditions the problem of learning a Lipschitz continuous function on a manifold is equivalent to a dynamically weighted manifold regularization problem. This observation leads to a practical algorithm based on a weighted Laplacian penalty whose weights are adapted using stochastic gradient techniques. It is shown that under mild conditions, this method estimates the Lipschitz constant of the solution, learning a globally smooth solution as a byproduct. Experiments on real world data illustrate the advantages of the proposed method relative to existing alternatives.
    Boundary-Aware Uncertainty for Feature Attribution Explainers. (arXiv:2210.02419v3 [cs.LG] UPDATED)
    Post-hoc explanation methods have become a critical tool for understanding black-box classifiers in high-stakes applications, precipitating a need for reliable explanations. Nevertheless, recent works have shown that many existing methods can be inconsistent or lack robustness. In addition, high-performing classifiers are often highly nonlinear and can exhibit complex behavior around the decision boundary, leading to brittle or misleading local explanations. Therefore there is an impending need to quantify the uncertainty of such explanation methods in order to understand when explanations are trustworthy. In this work, we propose a novel geodesic-based kernel which captures the complexity of the target black-box decision boundary. We show theoretically that the proposed kernel similarity increases with the complexity of the decision boundary. In addition, we introduce the Gaussian Process Explanation UnCertainty (GPEC) framework, which generates a unified uncertainty estimate combining decision boundary-aware uncertainty with existing explanation uncertainty methods. The proposed framework is highly flexible; it can be used with any black-box classifier and feature attribution method. Empirical results on multiple tabular and image datasets show that the GPEC uncertainty estimate improves understanding of explanations as compared to existing methods.
    Encouraging Intra-Class Diversity Through a Reverse Contrastive Loss for Better Single-Source Domain Generalization. (arXiv:2106.07916v2 [cs.CV] CROSS LISTED)
    Traditional deep learning algorithms often fail to generalize when they are tested outside of the domain of the training data. The issue can be mitigated by using unlabeled data from the target domain at training time, but because data distributions can change dynamically in real-life applications once a learned model is deployed, it is critical to create networks robust to unknown and unforeseen domain shifts. In this paper we focus on one of the reasons behind the inability of neural networks to be so: deep networks focus only on the most obvious, potentially spurious, clues to make their predictions and are blind to useful but slightly less efficient or more complex patterns. This behaviour has been identified and several methods partially addressed the issue. To investigate their effectiveness and limits, we first design a publicly available MNIST-based benchmark to precisely measure the ability of an algorithm to find the ''hidden'' patterns. Then, we evaluate state-of-the-art algorithms through our benchmark and show that the issue is largely unsolved. Finally, we propose a partially reversed contrastive loss to encourage intra-class diversity and find less strongly correlated patterns, whose efficiency is demonstrated by our experiments.
    An Exponentially Increasing Step-size for Parameter Estimation in Statistical Models. (arXiv:2205.07999v2 [stat.ML] UPDATED)
    Using gradient descent (GD) with fixed or decaying step-size is a standard practice in unconstrained optimization problems. However, when the loss function is only locally convex, such a step-size schedule artificially slows GD down as it cannot explore the flat curvature of the loss function. To overcome that issue, we propose to exponentially increase the step-size of the GD algorithm. Under homogeneous assumptions on the loss function, we demonstrate that the iterates of the proposed \emph{exponential step size gradient descent} (EGD) algorithm converge linearly to the optimal solution. Leveraging that optimization insight, we then consider using the EGD algorithm for solving parameter estimation under both regular and non-regular statistical models whose loss function becomes locally convex when the sample size goes to infinity. We demonstrate that the EGD iterates reach the final statistical radius within the true parameter after a logarithmic number of iterations, which is in stark contrast to a \emph{polynomial} number of iterations of the GD algorithm in non-regular statistical models. Therefore, the total computational complexity of the EGD algorithm is \emph{optimal} and exponentially cheaper than that of the GD for solving parameter estimation in non-regular statistical models while being comparable to that of the GD in regular statistical settings. To the best of our knowledge, it resolves a long-standing gap between statistical and algorithmic computational complexities of parameter estimation in non-regular statistical models. Finally, we provide targeted applications of the general theory to several classes of statistical models, including generalized linear models with polynomial link functions and location Gaussian mixture models.
    On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias. (arXiv:2205.09072v2 [cs.LG] UPDATED)
    We study the dynamics and implicit bias of gradient flow (GF) on univariate ReLU neural networks with a single hidden layer in a binary classification setting. We show that when the labels are determined by the sign of a target network with $r$ neurons, with high probability over the initialization of the network and the sampling of the dataset, GF converges in direction (suitably defined) to a network achieving perfect training accuracy and having at most $\mathcal{O}(r)$ linear regions, implying a generalization bound. Unlike many other results in the literature, under an additional assumption on the distribution of the data, our result holds even for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
    Mini-Batch Learning Strategies for modeling long term temporal dependencies: A study in environmental applications. (arXiv:2210.08347v2 [cs.LG] UPDATED)
    In many environmental applications, recurrent neural networks (RNNs) are often used to model physical variables with long temporal dependencies. However, due to mini-batch training, temporal relationships between training segments within the batch (intra-batch) as well as between batches (inter-batch) are not considered, which can lead to limited performance. Stateful RNNs aim to address this issue by passing hidden states between batches. Since Stateful RNNs ignore intra-batch temporal dependency, there exists a trade-off between training stability and capturing temporal dependency. In this paper, we provide a quantitative comparison of different Stateful RNN modeling strategies, and propose two strategies to enforce both intra- and inter-batch temporal dependency. First, we extend Stateful RNNs by defining a batch as a temporally ordered set of training segments, which enables intra-batch sharing of temporal information. While this approach significantly improves the performance, it leads to much larger training times due to highly sequential training. To address this issue, we further propose a new strategy which augments a training segment with an initial value of the target variable from the timestep right before the starting of the training segment. In other words, we provide an initial value of the target variable as additional input so that the network can focus on learning changes relative to that initial value. By using this strategy, samples can be passed in any order (mini-batch training) which significantly reduces the training time while maintaining the performance. In demonstrating our approach in hydrological modeling, we observe that the most significant gains in predictive accuracy occur when these methods are applied to state variables whose values change more slowly, such as soil water and snowpack, rather than continuously moving flux variables such as streamflow.
    An Advantage Using Feature Selection with a Quantum Annealer. (arXiv:2211.09756v3 [quant-ph] UPDATED)
    Feature selection is a technique in statistical prediction modeling that identifies features in a record with a strong statistical connection to the target variable. Excluding features with a weak statistical connection to the target variable in training not only drops the dimension of the data, which decreases the time complexity of the algorithm, it also decreases noise within the data which assists in avoiding overfitting. In all, feature selection assists in training a robust statistical model that performs well and is stable. Given the lack of scalability in classical computation, current techniques only consider the predictive power of the feature and not redundancy between the features themselves. Recent advancements in feature selection that leverages quantum annealing (QA) gives a scalable technique that aims to maximize the predictive power of the features while minimizing redundancy. As a consequence, it is expected that this algorithm would assist in the bias/variance trade-off yielding better features for training a statistical model. This paper tests this intuition against classical methods by utilizing open-source data sets and evaluate the efficacy of each trained statistical model well-known prediction algorithms. The numerical results display an advantage utilizing the features selected from the algorithm that leveraged QA.
    Optimization-Based Separations for Neural Networks. (arXiv:2112.02393v3 [cs.LG] UPDATED)
    Depth separation results propose a possible theoretical explanation for the benefits of deep neural networks over shallower architectures, establishing that the former possess superior approximation capabilities. However, there are no known results in which the deeper architecture leverages this advantage into a provable optimization guarantee. We prove that when the data are generated by a distribution with radial symmetry which satisfies some mild assumptions, gradient descent can efficiently learn ball indicator functions using a depth 2 neural network with two layers of sigmoidal activations, and where the hidden layer is held fixed throughout training. By building on and refining existing techniques for approximation lower bounds of neural networks with a single layer of non-linearities, we show that there are $d$-dimensional radial distributions on the data such that ball indicators cannot be learned efficiently by any algorithm to accuracy better than $\Omega(d^{-4})$, nor by a standard gradient descent implementation to accuracy better than a constant. These results establish what is to the best of our knowledge, the first optimization-based separations where the approximation benefits of the stronger architecture provably manifest in practice. Our proof technique introduces new tools and ideas that may be of independent interest in the theoretical study of both the approximation and optimization of neural networks.
    Neural Estimation of the Rate-Distortion Function With Applications to Operational Source Coding. (arXiv:2204.01612v2 [cs.IT] UPDATED)
    A fundamental question in designing lossy data compression schemes is how well one can do in comparison with the rate-distortion function, which describes the known theoretical limits of lossy compression. Motivated by the empirical success of deep neural network (DNN) compressors on large, real-world data, we investigate methods to estimate the rate-distortion function on such data, which would allow comparison of DNN compressors with optimality. While one could use the empirical distribution of the data and apply the Blahut-Arimoto algorithm, this approach presents several computational challenges and inaccuracies when the datasets are large and high-dimensional, such as the case of modern image datasets. Instead, we re-formulate the rate-distortion objective, and solve the resulting functional optimization problem using neural networks. We apply the resulting rate-distortion estimator, called NERD, on popular image datasets, and provide evidence that NERD can accurately estimate the rate-distortion function. Using our estimate, we show that the rate-distortion achievable by DNN compressors are within several bits of the rate-distortion function for real-world datasets. Additionally, NERD provides access to the rate-distortion achieving channel, as well as samples from its output marginal. Therefore, using recent results in reverse channel coding, we describe how NERD can be used to construct an operational one-shot lossy compression scheme with guarantees on the achievable rate and distortion. Experimental results demonstrate competitive performance with DNN compressors.
    Information-theoretic limitations of data-based price discrimination. (arXiv:2204.12723v2 [cs.GT] UPDATED)
    This paper studies third-degree price discrimination (3PD) based on a random sample of valuation and covariate data, where the covariate is continuous, and the distribution of the data is unknown to the seller. The main results of this paper are twofold. The first set of results is pricing strategy independent and reveals the fundamental information-theoretic limitation of any data-based pricing strategy in revenue generation for two cases: 3PD and uniform pricing. The second set of results proposes the $K$-markets empirical revenue maximization (ERM) strategy and shows that the $K$-markets ERM and the uniform ERM strategies achieve the optimal rate of convergence in revenue to that generated by their respective true-distribution 3PD and uniform pricing optima. Our theoretical and numerical results suggest that the uniform (i.e., $1$-market) ERM strategy generates a larger revenue than the $K$-markets ERM strategy when the sample size is small enough, and vice versa.
    Revisiting Simple Regret: Fast Rates for Returning a Good Arm. (arXiv:2210.16913v2 [cs.LG] UPDATED)
    Simple regret is a natural and parameter-free performance criterion for pure exploration in multi-armed bandits yet is less popular than the probability of missing the best arm or an $\epsilon$-good arm, perhaps due to lack of easy ways to characterize it. In this paper, we make significant progress on minimizing simple regret in both data-rich ($T\ge n$) and data-poor regime ($T \le n$) where $n$ is the number of arms, and $T$ is the number of samples. At its heart is our improved instance-dependent analysis of the well-known Sequential Halving (SH) algorithm, where we bound the probability of returning an arm whose mean reward is not within $\epsilon$ from the best (i.e., not $\epsilon$-good) for \textit{any} choice of $\epsilon>0$, although $\epsilon$ is not an input to SH. Our bound not only leads to an optimal worst-case simple regret bound of $\sqrt{n/T}$ up to logarithmic factors but also essentially matches the instance-dependent lower bound for returning an $\epsilon$-good arm reported by Katz-Samuels and Jamieson (2020). For the more challenging data-poor regime, we propose Bracketing SH (BSH) that enjoys the same improvement even without sampling each arm at least once. Our empirical study shows that BSH outperforms existing methods on real-world tasks.
    Correlated Initialization for Correlated Data. (arXiv:2003.04422v2 [cs.LG] UPDATED)
    Spatial data exhibits the property that nearby points are correlated. This also holds for learnt representations across layers, but not for commonly used weight initialization methods. Our theoretical analysis quantifies the learning behavior of weights of a single spatial filter. It is thus in contrast to a large body of work that discusses statistical properties of weights. It shows that uncorrelated initialization (i) might lead to poor convergence behavior and (ii) training of (some) parameters is likely subject to slow convergence. Empirical analysis shows that these findings for a single spatial filter extend to networks with many spatial filters. The impact of (correlated) initialization depends strongly on learning rates and l2-regularization.
    The KFIoU Loss for Rotated Object Detection. (arXiv:2201.12558v5 [cs.CV] UPDATED)
    Differing from the well-developed horizontal object detection area whereby the computing-friendly IoU based loss is readily adopted and well fits with the detection metrics. In contrast, rotation detectors often involve a more complicated loss based on SkewIoU which is unfriendly to gradient-based training. In this paper, we propose an effective approximate SkewIoU loss based on Gaussian modeing and Kalman filter, which mainly consists of two items. The first term is a scale-insensitive center point loss, which is used to quickly get the center points between bounding boxes closer to assist the second term. In the distance-independent second term, Kalman filter is adopted to inherently mimic the mechanism of SkewIoU by its definition, and show its alignment with the SkewIoU loss at trend-level within a certain distance (i.e. within 9 pixels). This is in contrast to recent Gaussian modeling based rotation detectors e.g. GWD loss and KLD loss that involve a human-specified distribution distance metric which require additional hyperparameter tuning that vary across datasets and detectors. The resulting new loss called KFIoU loss is easier to implement and works better compared with exact SkewIoU loss, thanks to its full differentiability and ability to handle the non-overlapping cases. We further extend our technique to the 3-D case which also suffers from the same issues as 2-D detection. Extensive results on various public datasets (2-D/3-D, aerial/text/face images) with different base detectors show the effectiveness of our approach.
    Unravelling the Performance of Physics-informed Graph Neural Networks for Dynamical Systems. (arXiv:2211.05520v2 [cs.LG] UPDATED)
    Recently, graph neural networks have been gaining a lot of attention to simulate dynamical systems due to their inductive nature leading to zero-shot generalizability. Similarly, physics-informed inductive biases in deep-learning frameworks have been shown to give superior performance in learning the dynamics of physical systems. There is a growing volume of literature that attempts to combine these two approaches. Here, we evaluate the performance of thirteen different graph neural networks, namely, Hamiltonian and Lagrangian graph neural networks, graph neural ODE, and their variants with explicit constraints and different architectures. We briefly explain the theoretical formulation highlighting the similarities and differences in the inductive biases and graph architecture of these systems. We evaluate these models on spring, pendulum, gravitational, and 3D deformable solid systems to compare the performance in terms of rollout error, conserved quantities such as energy and momentum, and generalizability to unseen system sizes. Our study demonstrates that GNNs with additional inductive biases, such as explicit constraints and decoupling of kinetic and potential energies, exhibit significantly enhanced performance. Further, all the physics-informed GNNs exhibit zero-shot generalizability to system sizes an order of magnitude larger than the training system, thus providing a promising route to simulate large-scale realistic systems.
    Meta-Learning with Dynamic-Memory-Based Prototypical Network for Few-Shot Event Detection. (arXiv:1910.11621v2 [cs.CL] CROSS LISTED)
    Event detection (ED), a sub-task of event extraction, involves identifying triggers and categorizing event mentions. Existing methods primarily rely upon supervised learning and require large-scale labeled event datasets which are unfortunately not readily available in many real-life applications. In this paper, we consider and reformulate the ED task with limited labeled data as a Few-Shot Learning problem. We propose a Dynamic-Memory-Based Prototypical Network (DMB-PN), which exploits Dynamic Memory Network (DMN) to not only learn better prototypes for event types, but also produce more robust sentence encodings for event mentions. Differing from vanilla prototypical networks simply computing event prototypes by averaging, which only consume event mentions once, our model is more robust and is capable of distilling contextual information from event mentions for multiple times due to the multi-hop mechanism of DMNs. The experiments show that DMB-PN not only deals with sample scarcity better than a series of baseline models but also performs more robustly when the variety of event types is relatively large and the instance quantity is extremely small.
    Augmentation Component Analysis: Modeling Similarity via the Augmentation Overlaps. (arXiv:2206.00471v2 [cs.LG] UPDATED)
    Self-supervised learning aims to learn a embedding space where semantically similar samples are close. Contrastive learning methods pull views of samples together and push different samples away, which utilizes semantic invariance of augmentation but ignores the relationship between samples. To better exploit the power of augmentation, we observe that semantically similar samples are more likely to have similar augmented views. Therefore, we can take the augmented views as a special description of a sample. In this paper, we model such a description as the augmentation distribution and we call it augmentation feature. The similarity in augmentation feature reflects how much the views of two samples overlap and is related to their semantical similarity. Without computational burdens to explicitly estimate values of the augmentation feature, we propose Augmentation Component Analysis (ACA) with a contrastive-like loss to learn principal components and an on-the-fly projection loss to embed data. ACA equals an efficient dimension reduction by PCA and extracts low-dimensional embeddings, theoretically preserving the similarity of augmentation distribution between samples. Empirical results show our method can achieve competitive results against various traditional contrastive learning methods on different benchmarks.
    Provably Doubly Accelerated Federated Learning: The First Theoretically Successful Combination of Local Training and Communication Compression. (arXiv:2210.13277v3 [cs.LG] UPDATED)
    In federated learning, a large number of users are involved in a global learning task, in a collaborative way. They alternate local computations and two-way communication with a distant orchestrating server. Communication, which can be slow and costly, is the main bottleneck in this setting. To reduce the communication load and therefore accelerate distributed gradient descent, two strategies are popular: 1) communicate less frequently; that is, perform several iterations of local computations between the communication rounds; and 2) communicate compressed information instead of full-dimensional vectors. We propose the first algorithm for distributed optimization and federated learning, which harnesses these two strategies jointly and converges linearly to an exact solution in the strongly convex setting, with a doubly accelerated rate: our algorithm benefits from the two acceleration mechanisms provided by local training and compression, namely a better dependency on the condition number of the functions and on the dimension of the model, respectively.
    $IC^3$: Image Captioning by Committee Consensus. (arXiv:2302.01328v1 [cs.CV])
    If you ask a human to describe an image, they might do so in a thousand different ways. Traditionally, image captioning models are trained to approximate the reference distribution of image captions, however, doing so encourages captions that are viewpoint-impoverished. Such captions often focus on only a subset of the possible details, while ignoring potentially useful information in the scene. In this work, we introduce a simple, yet novel, method: "Image Captioning by Committee Consensus" ($IC^3$), designed to generate a single caption that captures high-level details from several viewpoints. Notably, humans rate captions produced by $IC^3$ at least as helpful as baseline SOTA models more than two thirds of the time, and $IC^3$ captions can improve the performance of SOTA automated recall systems by up to 84%, indicating significant material improvements over existing SOTA approaches for visual description. Our code is publicly available at https://github.com/DavidMChan/caption-by-committee
    Lower Bounds for Learning in Revealing POMDPs. (arXiv:2302.01333v1 [cs.LG])
    This paper studies the fundamental limits of reinforcement learning (RL) in the challenging \emph{partially observable} setting. While it is well-established that learning in Partially Observable Markov Decision Processes (POMDPs) requires exponentially many samples in the worst case, a surge of recent work shows that polynomial sample complexities are achievable under the \emph{revealing condition} -- A natural condition that requires the observables to reveal some information about the unobserved latent states. However, the fundamental limits for learning in revealing POMDPs are much less understood, with existing lower bounds being rather preliminary and having substantial gaps from the current best upper bounds. We establish strong PAC and regret lower bounds for learning in revealing POMDPs. Our lower bounds scale polynomially in all relevant problem parameters in a multiplicative fashion, and achieve significantly smaller gaps against the current best upper bounds, providing a solid starting point for future studies. In particular, for \emph{multi-step} revealing POMDPs, we show that (1) the latent state-space dependence is at least $\Omega(S^{1.5})$ in the PAC sample complexity, which is notably harder than the $\widetilde{\Theta}(S)$ scaling for fully-observable MDPs; (2) Any polynomial sublinear regret is at least $\Omega(T^{2/3})$, suggesting its fundamental difference from the \emph{single-step} case where $\widetilde{O}(\sqrt{T})$ regret is achievable. Technically, our hard instance construction adapts techniques in \emph{distribution testing}, which is new to the RL literature and may be of independent interest.
    Dual PatchNorm. (arXiv:2302.01327v1 [cs.CV])
    We propose Dual PatchNorm: two Layer Normalization layers (LayerNorms), before and after the patch embedding layer in Vision Transformers. We demonstrate that Dual PatchNorm outperforms the result of exhaustive search for alternative LayerNorm placement strategies in the Transformer block itself. In our experiments, incorporating this trivial modification, often leads to improved accuracy over well-tuned Vision Transformers and never hurts.
    Bayesian Metric Learning for Uncertainty Quantification in Image Retrieval. (arXiv:2302.01332v1 [cs.LG])
    We propose the first Bayesian encoder for metric learning. Rather than relying on neural amortization as done in prior works, we learn a distribution over the network weights with the Laplace Approximation. We actualize this by first proving that the contrastive loss is a valid log-posterior. We then propose three methods that ensure a positive definite Hessian. Lastly, we present a novel decomposition of the Generalized Gauss-Newton approximation. Empirically, we show that our Laplacian Metric Learner (LAM) estimates well-calibrated uncertainties, reliably detects out-of-distribution examples, and yields state-of-the-art predictive performance.
    The Value of Out-of-Distribution Data. (arXiv:2208.10967v3 [cs.LG] UPDATED)
    We expect the generalization error to improve with more samples from a similar task, and to deteriorate with more samples from an out-of-distribution (OOD) task. In this work, we show a counter-intuitive phenomenon: the generalization error of a task can be a non-monotonic function of the number of OOD samples. As the number of OOD samples increases, the generalization error on the target task improves before deteriorating beyond a threshold. In other words, there is value in training on small amounts of OOD data. We use Fisher's Linear Discriminant on synthetic datasets and deep networks on computer vision benchmarks such as MNIST, CIFAR-10, CINIC-10, PACS and DomainNet to demonstrate and analyze this phenomenon. In the idealistic setting where we know which samples are OOD, we show that these non-monotonic trends can be exploited using an appropriately weighted objective of the target and OOD empirical risk. While its practical utility is limited, this does suggest that if we can detect OOD samples, then there may be ways to benefit from them. When we do not know which samples are OOD, we show how a number of go-to strategies such as data-augmentation, hyper-parameter optimization, and pre-training are not enough to ensure that the target generalization error does not deteriorate with the number of OOD samples in the dataset.
    Computational Discovery of Microstructured Composites with Optimal Strength-Toughness Trade-Offs. (arXiv:2302.01078v1 [cond-mat.mtrl-sci])
    The conflict between strength and toughness is a fundamental problem in engineering materials design. However, systematic discovery of microstructured composites with optimal strength-toughness trade-offs has never been demonstrated due to the discrepancies between simulation and reality and the lack of data-efficient exploration of the entire Pareto front. Here, we report a widely applicable pipeline harnessing physical experiments, numerical simulations, and artificial neural networks to efficiently discover microstructured designs that are simultaneously tough and strong. Using a physics-based simulator with moderate complexity, our strategy runs a data-driven proposal-validation workflow in a nested-loop fashion to bridge the gap between simulation and reality in high sample efficiency. Without any prescribed expert knowledge of materials design, our approach automatically identifies existing toughness enhancement mechanisms that were traditionally discovered through trial-and-error or biomimicry. We provide a blueprint for the computational discovery of optimal designs, which inverts traditional scientific approaches, and is applicable to a wide range of research problems beyond composites, including polymer chemistry, fluid dynamics, meteorology, and robotics.
    Error estimates for physics informed neural networks approximating the Navier-Stokes equations. (arXiv:2203.09346v2 [math.NA] UPDATED)
    We prove rigorous bounds on the errors resulting from the approximation of the incompressible Navier-Stokes equations with (extended) physics informed neural networks. We show that the underlying PDE residual can be made arbitrarily small for tanh neural networks with two hidden layers. Moreover, the total error can be estimated in terms of the training error, network size and number of quadrature points. The theory is illustrated with numerical experiments.
    Predicting Molecule-Target Interaction by Learning Biomedical Network and Molecule Representations. (arXiv:2302.00981v1 [cs.LG])
    The study of molecule-target interaction is quite important for drug discovery in terms of target identification, pathway study, drug-drug interaction, etc. Most existing methodologies utilize either biomedical network information or molecule structural features to predict potential interaction link. However, the biomedical network information based methods usually suffer from cold start problem, while structure based methods often give limited performance due to the structure/interaction assumption and data quality. To address these issues, we propose a pseudo-siamese Graph Neural Network method, namely MTINet+, which learns both biomedical network topological and molecule structural/chemical information as representations to predict potential interaction of given molecule and target pair. In MTINet+, 1-hop subgraphs of given molecule and target pair are extracted from known interaction of biomedical network as topological information, meanwhile the molecule structural and chemical attributes are processed as molecule information. MTINet+ learns these two types of information as embedding features for predicting the pair link. In the experiments of different molecule-target interaction tasks, MTINet+ significantly outperforms over the state-of-the-art baselines. In addition, in our designed network sparsity experiments , MTINet+ shows strong robustness against different sparse biomedical networks.
    Bayesian Optimization of Multiple Objectives with Different Latencies. (arXiv:2302.01310v1 [stat.ML])
    Multi-objective Bayesian optimization aims to find the Pareto front of optimal trade-offs between a set of expensive objectives while collecting as few samples as possible. In some cases, it is possible to evaluate the objectives separately, and a different latency or evaluation cost can be associated with each objective. This presents an opportunity to learn the Pareto front faster by evaluating the cheaper objectives more frequently. We propose a scalarization based knowledge gradient acquisition function which accounts for the different evaluation costs of the objectives. We prove consistency of the algorithm and show empirically that it significantly outperforms a benchmark algorithm which always evaluates both objectives.
    Efficient Privacy-Preserving Stochastic Nonconvex Optimization. (arXiv:1910.13659v3 [cs.LG] UPDATED)
    While many solutions for privacy-preserving convex empirical risk minimization (ERM) have been developed, privacy-preserving nonconvex ERM remains a challenge. We study nonconvex ERM, which takes the form of minimizing a finite-sum of nonconvex loss functions over a training set. We propose a new differentially private stochastic gradient descent algorithm for nonconvex ERM that achieves strong privacy guarantees efficiently, and provide a tight analysis of its privacy and utility guarantees, as well as its gradient complexity. Our algorithm reduces gradient complexity while improves the best previous utility guarantee given by Wang et al. (NeurIPS 2017). Our experiments on benchmark nonconvex ERM problems demonstrate superior performance in terms of both training cost and utility gains compared with previous differentially private methods using the same privacy budgets.
    Best Possible Q-Learning. (arXiv:2302.01188v1 [cs.LG])
    Fully decentralized learning, where the global information, i.e., the actions of other agents, is inaccessible, is a fundamental challenge in cooperative multi-agent reinforcement learning. However, the convergence and optimality of most decentralized algorithms are not theoretically guaranteed, since the transition probabilities are non-stationary as all agents are updating policies simultaneously. To tackle this challenge, we propose best possible operator, a novel decentralized operator, and prove that the policies of agents will converge to the optimal joint policy if each agent independently updates its individual state-action value by the operator. Further, to make the update more efficient and practical, we simplify the operator and prove that the convergence and optimality still hold with the simplified one. By instantiating the simplified operator, the derived fully decentralized algorithm, best possible Q-learning (BQL), does not suffer from non-stationarity. Empirically, we show that BQL achieves remarkable improvement over baselines in a variety of cooperative multi-agent tasks.
    Factor Fields: A Unified Framework for Neural Fields and Beyond. (arXiv:2302.01226v1 [cs.CV])
    We present Factor Fields, a novel framework for modeling and representing signals. Factor Fields decomposes a signal into a product of factors, each of which is represented by a neural or regular field representation operating on a coordinate transformed input signal. We show that this decomposition yields a unified framework that generalizes several recent signal representations including NeRF, PlenOxels, EG3D, Instant-NGP, and TensoRF. Moreover, the framework allows for the creation of powerful new signal representations, such as the Coefficient-Basis Factorization (CoBaFa) which we propose in this paper. As evidenced by our experiments, CoBaFa leads to improvements over previous fast reconstruction methods in terms of the three critical goals in neural signal representation: approximation quality, compactness and efficiency. Experimentally, we demonstrate that our representation achieves better image approximation quality on 2D image regression tasks, higher geometric quality when reconstructing 3D signed distance fields and higher compactness for radiance field reconstruction tasks compared to previous fast reconstruction methods. Besides, our CoBaFa representation enables generalization by sharing the basis across signals during training, enabling generalization tasks such as image regression with sparse observations and few-shot radiance field reconstruction.
    Diagrammatization: Rationalizing with diagrammatic AI explanations for abductive reasoning on hypotheses. (arXiv:2302.01241v1 [cs.AI])
    Many visualizations have been developed for explainable AI (XAI), but they often require further reasoning by users to interpret. We argue that XAI should support abductive reasoning - inference to the best explanation - with diagrammatic reasoning to convey hypothesis generation and evaluation. Inspired by Peircean diagrammatic reasoning and the 5-step abduction process, we propose Diagrammatization, an approach to provide diagrammatic, abductive explanations based on domain hypotheses. We implemented DiagramNet for a clinical application to predict diagnoses from heart auscultation, and explain with shape-based murmur diagrams. In modeling studies, we found that DiagramNet not only provides faithful murmur shape explanations, but also has better prediction performance than baseline models. We further demonstrate the usefulness of diagrammatic explanations in a qualitative user study with medical students, showing that clinically-relevant, diagrammatic explanations are preferred over technical saliency map explanations. This work contributes insights into providing domain-conventional abductive explanations for user-centric XAI.
    Geometric Deep Learning for Autonomous Driving: Unlocking the Power of Graph Neural Networks With CommonRoad-Geometric. (arXiv:2302.01259v1 [cs.LG])
    Heterogeneous graphs offer powerful data representations for traffic, given their ability to model the complex interaction effects among a varying number of traffic participants and the underlying road infrastructure. With the recent advent of graph neural networks (GNNs) as the accompanying deep learning framework, the graph structure can be efficiently leveraged for various machine learning applications such as trajectory prediction. As a first of its kind, our proposed Python framework offers an easy-to-use and fully customizable data processing pipeline to extract standardized graph datasets from traffic scenarios. Providing a platform for GNN-based autonomous driving research, it improves comparability between approaches and allows researchers to focus on model implementation instead of dataset curation.
    Neuro Symbolic Continual Learning: Knowledge, Reasoning Shortcuts and Concept Rehearsal. (arXiv:2302.01242v1 [cs.LG])
    We introduce Neuro-Symbolic Continual Learning, where a model has to solve a sequence of neuro-symbolic tasks, that is, it has to map sub-symbolic inputs to high-level concepts and compute predictions by reasoning consistently with prior knowledge. Our key observation is that neuro-symbolic tasks, although different, often share concepts whose semantics remains stable over time. Traditional approaches fall short: existing continual strategies ignore knowledge altogether, while stock neuro-symbolic architectures suffer from catastrophic forgetting. We show that leveraging prior knowledge by combining neuro-symbolic architectures with continual strategies does help avoid catastrophic forgetting, but also that doing so can yield models affected by reasoning shortcuts. These undermine the semantics of the acquired concepts, even when detailed prior knowledge is provided upfront and inference is exact, and in turn continual performance. To overcome these issues, we introduce COOL, a COncept-level cOntinual Learning strategy tailored for neuro-symbolic continual problems that acquires high-quality concepts and remembers them over time. Our experiments on three novel benchmarks highlights how COOL attains sustained high performance on neuro-symbolic continual learning tasks in which other strategies fail.
    Federated Analytics: A survey. (arXiv:2302.01326v1 [cs.LG])
    Federated analytics (FA) is a privacy-preserving framework for computing data analytics over multiple remote parties (e.g., mobile devices) or silo-ed institutional entities (e.g., hospitals, banks) without sharing the data among parties. Motivated by the practical use cases of federated analytics, we follow a systematic discussion on federated analytics in this article. In particular, we discuss the unique characteristics of federated analytics and how it differs from federated learning. We also explore a wide range of FA queries and discuss various existing solutions and potential use case applications for different FA queries.
    On the Efficacy of Differentially Private Few-shot Image Classification. (arXiv:2302.01190v1 [stat.ML])
    There has been significant recent progress in training differentially private (DP) models which achieve accuracy that approaches the best non-private models. These DP models are typically pretrained on large public datasets and then fine-tuned on downstream datasets that are (i) relatively large, and (ii) similar in distribution to the pretraining data. However, in many applications including personalization, it is crucial to perform well in the few-shot setting, as obtaining large amounts of labeled data may be problematic; and on images from a wide variety of domains for use in various specialist settings. To understand under which conditions few-shot DP can be effective, we perform an exhaustive set of experiments that reveals how the accuracy and vulnerability to attack of few-shot DP image classification models are affected as the number of shots per class, privacy level, model architecture, dataset, and subset of learnable parameters in the model vary. We show that to achieve DP accuracy on par with non-private models, the shots per class must be increased as the privacy level increases by as much as 32$\times$ for CIFAR-100 at $\epsilon=1$. We also find that few-shot non-private models are highly susceptible to membership inference attacks. DP provides clear mitigation against the attacks, but a small $\epsilon$ is required to effectively prevent them. Finally, we evaluate DP federated learning systems and establish state-of-the-art performance on the challenging FLAIR federated learning benchmark.
    Normalizing Flow Ensembles for Rich Aleatoric and Epistemic Uncertainty Modeling. (arXiv:2302.01312v1 [cs.LG])
    In this work, we demonstrate how to reliably estimate epistemic uncertainty while maintaining the flexibility needed to capture complicated aleatoric distributions. To this end, we propose an ensemble of Normalizing Flows (NF), which are state-of-the-art in modeling aleatoric uncertainty. The ensembles are created via sets of fixed dropout masks, making them less expensive than creating separate NF models. We demonstrate how to leverage the unique structure of NFs, base distributions, to estimate aleatoric uncertainty without relying on samples, provide a comprehensive set of baselines, and derive unbiased estimates for differential entropy. The methods were applied to a variety of experiments, commonly used to benchmark aleatoric and epistemic uncertainty estimation: 1D sinusoidal data, 2D windy grid-world ($\it{Wet Chicken}$), $\it{Pendulum}$, and $\it{Hopper}$. In these experiments, we setup an active learning framework and evaluate each model's capability at measuring aleatoric and epistemic uncertainty. The results show the advantages of using NF ensembles in capturing complicated aleatoric while maintaining accurate epistemic uncertainty estimates.
    UW-CVGAN: UnderWater Image Enhancement with Capsules Vectors Quantization. (arXiv:2302.01144v1 [cs.CV])
    The degradation in the underwater images is due to wavelength-dependent light attenuation, scattering, and to the diversity of the water types in which they are captured. Deep neural networks take a step in this field, providing autonomous models able to achieve the enhancement of underwater images. We introduce Underwater Capsules Vectors GAN UWCVGAN based on the discrete features quantization paradigm from VQGAN for this task. The proposed UWCVGAN combines an encoding network, which compresses the image into its latent representation, with a decoding network, able to reconstruct the enhancement of the image from the only latent representation. In contrast with VQGAN, UWCVGAN achieves feature quantization by exploiting the clusterization ability of capsule layer, making the model completely trainable and easier to manage. The model obtains enhanced underwater images with high quality and fine details. Moreover, the trained encoder is independent of the decoder giving the possibility to be embedded onto the collector as compressing algorithm to reduce the memory space required for the images, of factor $3\times$. \myUWCVGAN{ }is validated with quantitative and qualitative analysis on benchmark datasets, and we present metrics results compared with the state of the art.
    A comparative study of statistical and machine learning models on near-real-time daily emissions prediction. (arXiv:2302.01152v1 [cs.AI])
    The rapid ascent in carbon dioxide emissions is a major cause of global warming and climate change, which pose a huge threat to human survival and impose far-reaching influence on the global ecosystem. Therefore, it is very necessary to effectively control carbon dioxide emissions by accurately predicting and analyzing the change trend timely, so as to provide a reference for carbon dioxide emissions mitigation measures. This paper is aiming to select a suitable model to predict the near-real-time daily emissions based on univariate daily time-series data from January 1st, 2020 to September 30st, 2022 of all sectors (Power, Industry, Ground Transport, Residential, Domestic Aviation, International Aviation) in China. We proposed six prediction models, which including three statistical models: Grey prediction (GM(1,1)), autoregressive integrated moving average (ARIMA) and seasonal autoregressive integrated moving average with exogenous factors (SARIMAX); three machine learning models: artificial neural network (ANN), random forest (RF) and long short term memory (LSTM). To evaluate the performance of these models, five criteria: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE) and Coefficient of Determination () are imported and discussed in detail. In the results, three machine learning models perform better than that three statistical models, in which LSTM model performs the best on five criteria values for daily emissions prediction with the 3.5179e-04 MSE value, 0.0187 RMSE value, 0.0140 MAE value, 14.8291% MAPE value and 0.9844 value.
    Temporal fusion transformer using variational mode decomposition for wind power forecasting. (arXiv:2302.01222v1 [cs.LG])
    The power output of a wind turbine depends on a variety of factors, including wind speed at different heights, wind direction, temperature and turbine properties. Wind speed and direction, in particular, have complex cycles and fluctuate dramatically, leading to large uncertainties in wind power output. This study uses variational mode decomposition (VMD) to decompose the wind power series and Temporal fusion transformer (TFT) to forecast wind power for the next 1h, 3h and 6h. The experimental results show that VMD outperforms other decomposition algorithms and the TFT model outperforms other decomposition models.
    Timewarp: Transferable Acceleration of Molecular Dynamics by Learning Time-Coarsened Dynamics. (arXiv:2302.01170v1 [stat.ML])
    Molecular dynamics (MD) simulation is a widely used technique to simulate molecular systems, most commonly at the all-atom resolution where the equations of motion are integrated with timesteps on the order of femtoseconds ($1\textrm{fs}=10^{-15}\textrm{s}$). MD is often used to compute equilibrium properties, which requires sampling from an equilibrium distribution such as the Boltzmann distribution. However, many important processes, such as binding and folding, occur over timescales of milliseconds or beyond, and cannot be efficiently sampled with conventional MD. Furthermore, new MD simulations need to be performed from scratch for each molecular system studied. We present Timewarp, an enhanced sampling method which uses a normalising flow as a proposal distribution in a Markov chain Monte Carlo method targeting the Boltzmann distribution. The flow is trained offline on MD trajectories and learns to make large steps in time, simulating the molecular dynamics of $10^{5} - 10^{6}\:\textrm{fs}$. Crucially, Timewarp is transferable between molecular systems: once trained, we show that it generalises to unseen small peptides (2-4 amino acids), exploring their metastable states and providing wall-clock acceleration when sampling compared to standard MD. Our method constitutes an important step towards developing general, transferable algorithms for accelerating MD.
    Dual Propagation: Accelerating Contrastive Hebbian Learning with Dyadic Neurons. (arXiv:2302.01228v1 [cs.LG])
    Activity difference based learning algorithms-such as contrastive Hebbian learning and equilibrium propagation-have been proposed as biologically plausible alternatives to error back-propagation. However, on traditional digital chips these algorithms suffer from having to solve a costly inference problem twice, making these approaches more than two orders of magnitude slower than back-propagation. In the analog realm equilibrium propagation may be promising for fast and energy efficient learning, but states still need to be inferred and stored twice. Inspired by lifted neural networks and compartmental neuron models we propose a simple energy based compartmental neuron model, termed dual propagation, in which each neuron is a dyad with two intrinsic states. At inference time these intrinsic states encode the error/activity duality through their difference and their mean respectively. The advantage of this method is that only a single inference phase is needed and that inference can be solved in layerwise closed-form. Experimentally we show on common computer vision datasets, including Imagenet32x32, that dual propagation performs equivalently to back-propagation both in terms of accuracy and runtime.
    Double Permutation Equivariance for Knowledge Graph Completion. (arXiv:2302.01313v1 [cs.LG])
    This work provides a formalization of Knowledge Graphs (KGs) as a new class of graphs that we denote doubly exchangeable attributed graphs, where node and pairwise (joint 2-node) representations must be equivariant to permutations of both node ids and edge (& node) attributes (relations & node features). Double-permutation equivariant KG representations open a new research direction in KGs. We show that this equivariance imposes a structural representation of relations that allows neural networks to perform complex logical reasoning tasks in KGs. Finally, we introduce a general blueprint for such equivariant representations and test a simple GNN-based double-permutation equivariant neural architecture that achieve 100% Hits@10 test accuracy in both the WN18RRv1 and NELL995v1 inductive KG completion tasks, and can accurately perform logical reasoning tasks that no existing methods can perform, to the best of our knowledge.
    Fed-GLOSS-DP: Federated, Global Learning using Synthetic Sets with Record Level Differential Privacy. (arXiv:2302.01068v1 [cs.LG])
    This work proposes Fed-GLOSS-DP, a novel approach to privacy-preserving learning that uses synthetic data to train federated models. In our approach, the server recovers an approximation of the global loss landscape in a local neighborhood based on synthetic samples received from the clients. In contrast to previous, point-wise, gradient-based, linear approximation (such as FedAvg), our formulation enables a type of global optimization that is particularly beneficial in non-IID federated settings. We also present how it rigorously complements record-level differential privacy. Extensive results show that our novel formulation gives rise to considerable improvements in terms of convergence speed and communication costs. We argue that our new approach to federated learning can provide a potential path toward reconciling privacy and accountability by sending differentially private, synthetic data instead of gradient updates. The source code will be released upon publication.
    Practical Bandits: An Industry Perspective. (arXiv:2302.01223v1 [cs.LG])
    The bandit paradigm provides a unified modeling framework for problems that require decision-making under uncertainty. Because many business metrics can be viewed as rewards (a.k.a. utilities) that result from actions, bandit algorithms have seen a large and growing interest from industrial applications, such as search, recommendation and advertising. Indeed, with the bandit lens comes the promise of direct optimisation for the metrics we care about. Nevertheless, the road to successfully applying bandits in production is not an easy one. Even when the action space and rewards are well-defined, practitioners still need to make decisions regarding multi-arm or contextual approaches, on- or off-policy setups, delayed or immediate feedback, myopic or long-term optimisation, etc. To make matters worse, industrial platforms typically give rise to large action spaces in which existing approaches tend to break down. The research literature on these topics is broad and vast, but this can overwhelm practitioners, whose primary aim is to solve practical problems, and therefore need to decide on a specific instantiation or approach for each project. This tutorial will take a step towards filling that gap between the theory and practice of bandits. Our goal is to present a unified overview of the field and its existing terminology, concepts and algorithms -- with a focus on problems relevant to industry. We hope our industrial perspective will help future practitioners who wish to leverage the bandit paradigm for their application.
    Interventional and Counterfactual Inference with Diffusion Models. (arXiv:2302.00860v1 [stat.ML])
    We consider the problem of answering observational, interventional, and counterfactual queries in a causally sufficient setting where only observational data and the causal graph are available. Utilizing the recent developments in diffusion models, we introduce diffusion-based causal models (DCM) to learn causal mechanisms, that generate unique latent encodings to allow for direct sampling under interventions as well as abduction for counterfactuals. We utilize DCM to model structural equations, seeing that diffusion models serve as a natural candidate here since they encode each node to a latent representation, a proxy for the exogenous noise, and offer flexible and accurate modeling to provide reliable causal statements and estimates. Our empirical evaluations demonstrate significant improvements over existing state-of-the-art methods for answering causal queries. Our theoretical results provide a methodology for analyzing the counterfactual error for general encoder/decoder models which could be of independent interest.
    Reliable Prediction Intervals with Directly Optimized Inductive Conformal Regression for Deep Learning. (arXiv:2302.00872v1 [cs.LG])
    By generating prediction intervals (PIs) to quantify the uncertainty of each prediction in deep learning regression, the risk of wrong predictions can be effectively controlled. High-quality PIs need to be as narrow as possible, whilst covering a preset proportion of real labels. At present, many approaches to improve the quality of PIs can effectively reduce the width of PIs, but they do not ensure that enough real labels are captured. Inductive Conformal Predictor (ICP) is an algorithm that can generate effective PIs which is theoretically guaranteed to cover a preset proportion of data. However, typically ICP is not directly optimized to yield minimal PI width. However, in this study, we use Directly Optimized Inductive Conformal Regression (DOICR) that takes only the average width of PIs as the loss function and increases the quality of PIs through an optimized scheme under the validity condition that sufficient real labels are captured in the PIs. Benchmark experiments show that DOICR outperforms current state-of-the-art algorithms for regression problems using underlying Deep Neural Network structures for both tabular and image data.
    What Language Reveals about Perception: Distilling Psychophysical Knowledge from Large Language Models. (arXiv:2302.01308v1 [cs.CL])
    Understanding the extent to which the perceptual world can be recovered from language is a fundamental problem in cognitive science. We reformulate this problem as that of distilling psychophysical information from text and show how this can be done by combining large language models (LLMs) with a classic psychophysical method based on similarity judgments. Specifically, we use the prompt auto-completion functionality of GPT3, a state-of-the-art LLM, to elicit similarity scores between stimuli and then apply multidimensional scaling to uncover their underlying psychological space. We test our approach on six perceptual domains and show that the elicited judgments strongly correlate with human data and successfully recover well-known psychophysical structures such as the color wheel and pitch spiral. We also explore meaningful divergences between LLM and human representations. Our work showcases how combining state-of-the-art machine models with well-known cognitive paradigms can shed new light on fundamental questions in perception and language research.
    Are Diffusion Models Vulnerable to Membership Inference Attacks?. (arXiv:2302.01316v1 [cs.CV])
    Diffusion-based generative models have shown great potential for image synthesis, but there is a lack of research on the security and privacy risks they may pose. In this paper, we investigate the vulnerability of diffusion models to Membership Inference Attacks (MIAs), a common privacy concern. Our results indicate that existing MIAs designed for GANs or VAE are largely ineffective on diffusion models, either due to inapplicable scenarios (e.g., requiring the discriminator of GANs) or inappropriate assumptions (e.g., closer distances between synthetic images and member images). To address this gap, we propose Step-wise Error Comparing Membership Inference (SecMI), a black-box MIA that infers memberships by assessing the matching of forward process posterior estimation at each timestep. SecMI follows the common overfitting assumption in MIA where member samples normally have smaller estimation errors, compared with hold-out samples. We consider both the standard diffusion models, e.g., DDPM, and the text-to-image diffusion models, e.g., Stable Diffusion. Experimental results demonstrate that our methods precisely infer the membership with high confidence on both of the two scenarios across six different datasets
    MonoFlow: Rethinking Divergence GANs via the Perspective of Differential Equations. (arXiv:2302.01075v1 [stat.ML])
    The conventional understanding of adversarial training in generative adversarial networks (GANs) is that the discriminator is trained to estimate a divergence, and the generator learns to minimize this divergence. We argue that despite the fact that many variants of GANs were developed following this paradigm, the current theoretical understanding of GANs and their practical algorithms are inconsistent. In this paper, we leverage Wasserstein gradient flows which characterize the evolution of particles in the sample space, to gain theoretical insights and algorithmic inspiration of GANs. We introduce a unified generative modeling framework - MonoFlow: the particle evolution is rescaled via a monotonically increasing mapping of the log density ratio. Under our framework, adversarial training can be viewed as a procedure first obtaining MonoFlow's vector field via training the discriminator and the generator learns to draw the particle flow defined by the corresponding vector field. We also reveal the fundamental difference between variational divergence minimization and adversarial training. This analysis helps us to identify what types of generator loss functions can lead to the successful training of GANs and suggest that GANs may have more loss designs beyond the literature (e.g., non-saturated loss), as long as they realize MonoFlow. Consistent empirical studies are included to validate the effectiveness of our framework.
    Convolutional Autoencoders, Clustering and POD for Low-dimensional Parametrization of Navier-Stokes Equations. (arXiv:2302.01278v1 [math.DS])
    Simulations of large-scale dynamical systems require expensive computations. Low-dimensional parametrization of high-dimensional states such as Proper Orthogonal Decomposition (POD) can be a solution to lessen the burdens by providing a certain compromise between accuracy and model complexity. However, for really low-dimensional parametrizations (for example for controller design) linear methods like the POD come to their natural limits so that nonlinear approaches will be the methods of choice. In this work we propose a convolutional autoencoder (CAE) consisting of a nonlinear encoder and an affine linear decoder and consider combinations with k-means clustering for improved encoding performance. The proposed set of methods is compared to the standard POD approach in two cylinder-wake scenarios modeled by the incompressible Navier-Stokes equations.
    MARLIN: Soft Actor-Critic based Reinforcement Learning for Congestion Control in Real Networks. (arXiv:2302.01301v1 [cs.LG])
    Fast and efficient transport protocols are the foundation of an increasingly distributed world. The burden of continuously delivering improved communication performance to support next-generation applications and services, combined with the increasing heterogeneity of systems and network technologies, has promoted the design of Congestion Control (CC) algorithms that perform well under specific environments. The challenge of designing a generic CC algorithm that can adapt to a broad range of scenarios is still an open research question. To tackle this challenge, we propose to apply a novel Reinforcement Learning (RL) approach. Our solution, MARLIN, uses the Soft Actor-Critic algorithm to maximize both entropy and return and models the learning process as an infinite-horizon task. We trained MARLIN on a real network with varying background traffic patterns to overcome the sim-to-real mismatch that researchers have encountered when applying RL to CC. We evaluated our solution on the task of file transfer and compared it to TCP Cubic. While further research is required, results have shown that MARLIN can achieve comparable results to TCP with little hyperparameter tuning, in a task significantly different from its training setting. Therefore, we believe that our work represents a promising first step toward building CC algorithms based on the maximum entropy RL framework.
    Energy Efficiency of Training Neural Network Architectures: An Empirical Study. (arXiv:2302.00967v1 [cs.LG])
    The evaluation of Deep Learning models has traditionally focused on criteria such as accuracy, F1 score, and related measures. The increasing availability of high computational power environments allows the creation of deeper and more complex models. However, the computations needed to train such models entail a large carbon footprint. In this work, we study the relations between DL model architectures and their environmental impact in terms of energy consumed and CO$_2$ emissions produced during training by means of an empirical study using Deep Convolutional Neural Networks. Concretely, we study: (i) the impact of the architecture and the location where the computations are hosted on the energy consumption and emissions produced; (ii) the trade-off between accuracy and energy efficiency; and (iii) the difference on the method of measurement of the energy consumed using software-based and hardware-based tools.
    Curriculum Learning for ab initio Deep Learned Refractive Optics. (arXiv:2302.01089v1 [cs.CV])
    Deep lens optimization has recently emerged as a new paradigm for designing computational imaging systems, however it has been limited to either simple optical systems consisting of a single DOE or metalens, or the fine-tuning of compound lenses from good initial designs. Here we present a deep lens design method based on curriculum learning, which is able to learn optical designs of compound lenses ab initio from randomly initialized surfaces, therefore overcoming the need for a good initial design. We demonstrate this approach with the fully-automatic design of an extended depth-of-field computational camera in a cellphone-style form factor, highly aspherical surfaces, and a short back focal length.
    Convolutional Neural Operators. (arXiv:2302.01178v1 [cs.LG])
    Although very successfully used in machine learning, convolution based neural network architectures -- believed to be inconsistent in function space -- have been largely ignored in the context of learning solution operators of PDEs. Here, we adapt convolutional neural networks to demonstrate that they are indeed able to process functions as inputs and outputs. The resulting architecture, termed as convolutional neural operators (CNOs), is shown to significantly outperform competing models on benchmark experiments, paving the way for the design of an alternative robust and accurate framework for learning operators.
    Online Bidding in Repeated Non-Truthful Auctions under Budget and ROI Constraints. (arXiv:2302.01203v1 [cs.GT])
    Online advertising platforms typically use auction mechanisms to allocate ad placements. Advertisers participate in a series of repeated auctions, and must select bids that will maximize their overall rewards while adhering to certain constraints. We focus on the scenario in which the advertiser has budget and return-on-investment (ROI) constraints. We investigate the problem of budget- and ROI-constrained bidding in repeated non-truthful auctions, such as first-price auctions, and present a best-of-both-worlds framework with no-regret guarantees under both stochastic and adversarial inputs. By utilizing the notion of interval regret, we demonstrate that our framework does not require knowledge of specific parameters of the problem which could be difficult to determine in practice. Our proof techniques can be applied to both the adversarial and stochastic cases with minimal modifications, thereby providing a unified perspective on the two problems. In the adversarial setting, we also show that it is possible to loosen the traditional requirement of having a strictly feasible solution to the offline optimization problem at each round.
    Conditional expectation for missing data imputation. (arXiv:2302.00911v1 [stat.ML])
    Missing data is common in datasets retrieved in various areas, such as medicine, sports, and finance. In many cases, to enable proper and reliable analyses of such data, the missing values are often imputed, and it is necessary that the method used has a low root mean square error (RMSE) between the imputed and the true values. In addition, for some critical applications, it is also often a requirement that the logic behind the imputation is explainable, which is especially difficult for complex methods that are for example, based on deep learning. This motivates us to introduce a conditional Distribution based Imputation of Missing Values (DIMV) algorithm. This approach works based on finding the conditional distribution of a feature with missing entries based on the fully observed features. As will be illustrated in the paper, DIMV (i) gives a low RMSE for the imputed values compared to state-of-the-art methods under comparison; (ii) is explainable; (iii) can provide an approximated confidence region for the missing values in a given sample; (iv) works for both small and large scale data; (v) in many scenarios, does not require a huge number of parameters as deep learning approaches and therefore can be used for mobile devices or web browsers; and (vi) is robust to the normally distributed assumption that its theoretical grounds rely on. In addition to DIMV, we also introduce the DPER* algorithm improving the speed of DPER for estimating the mean and covariance matrix from the data, and we confirm the speed-up via experiments.
    Speed-Oblivious Online Scheduling: Knowing (Precise) Speeds is not Necessary. (arXiv:2302.00985v1 [cs.DS])
    We consider online scheduling on unrelated (heterogeneous) machines in a speed-oblivious setting, where an algorithm is unaware of the exact job-dependent processing speeds. We show strong impossibility results for clairvoyant and non-clairvoyant algorithms and overcome them in models inspired by practical settings: (i) we provide competitive learning-augmented algorithms, assuming that (possibly erroneous) predictions on the speeds are given, and (ii) we provide competitive algorithms for the speed-ordered model, where a single global order of machines according to their unknown job-dependent speeds is known. We prove strong theoretical guarantees and evaluate our findings on a representative heterogeneous multi-core processor. These seem to be the first empirical results for algorithms with predictions that are performed in a non-synthetic environment on real hardware.
    Imitating careful experts to avoid catastrophic events. (arXiv:2302.01193v1 [cs.LG])
    RL is increasingly being used to control robotic systems that interact closely with humans. This interaction raises the problem of safe RL: how to ensure that a RL-controlled robotic system never, for instance, injures a human. This problem is especially challenging in rich, realistic settings where it is not even possible to clearly write down a reward function which incorporates these outcomes. In these circumstances, perhaps the only viable approach is based on IRL, which infers rewards from human demonstrations. However, IRL is massively underdetermined as many different rewards can lead to the same optimal policies; we show that this makes it difficult to distinguish catastrophic outcomes (such as injuring a human) from merely undesirable outcomes. Our key insight is that humans do display different behaviour when catastrophic outcomes are possible: they become much more careful. We incorporate carefulness signals into IRL, and find that they do indeed allow IRL to disambiguate undesirable from catastrophic outcomes, which is critical to ensuring safety in future real-world human-robot interactions.
    Graph Neural Networks for temporal graphs: State of the art, open challenges, and opportunities. (arXiv:2302.01018v1 [cs.LG])
    Graph Neural Networks (GNNs) have become the leading paradigm for learning on (static) graph-structured data. However, many real-world systems are dynamic in nature, since the graph and node/edge attributes change over time. In recent years, GNN-based models for temporal graphs have emerged as a promising area of research to extend the capabilities of GNNs. In this work, we provide the first comprehensive overview of the current state-of-the-art of temporal GNN, introducing a rigorous formalization of learning settings and tasks and a novel taxonomy categorizing existing approaches in terms of how the temporal aspect is represented and processed. We conclude the survey with a discussion of the most relevant open challenges for the field, from both research and application perspectives.
    Confidence and Dispersity Speak: Characterising Prediction Matrix for Unsupervised Accuracy Estimation. (arXiv:2302.01094v1 [cs.LG])
    This work aims to assess how well a model performs under distribution shifts without using labels. While recent methods study prediction confidence, this work reports prediction dispersity is another informative cue. Confidence reflects whether the individual prediction is certain; dispersity indicates how the overall predictions are distributed across all categories. Our key insight is that a well-performing model should give predictions with high confidence and high dispersity. That is, we need to consider both properties so as to make more accurate estimates. To this end, we use the nuclear norm that has been shown to be effective in characterizing both properties. Extensive experiments validate the effectiveness of nuclear norm for various models (e.g., ViT and ConvNeXt), different datasets (e.g., ImageNet and CUB-200), and diverse types of distribution shifts (e.g., style shift and reproduction shift). We show that the nuclear norm is more accurate and robust in accuracy estimation than existing methods. Furthermore, we validate the feasibility of other measurements (e.g., mutual information maximization) for characterizing dispersity and confidence. Lastly, we investigate the limitation of the nuclear norm, study its improved variant under severe class imbalance, and discuss potential directions.
    Laplacian Change Point Detection for Single and Multi-view Dynamic Graphs. (arXiv:2302.01204v1 [cs.LG])
    Dynamic graphs are rich data structures that are used to model complex relationships between entities over time. In particular, anomaly detection in temporal graphs is crucial for many real world applications such as intrusion identification in network systems, detection of ecosystem disturbances and detection of epidemic outbreaks. In this paper, we focus on change point detection in dynamic graphs and address three main challenges associated with this problem: i). how to compare graph snapshots across time, ii). how to capture temporal dependencies, and iii). how to combine different views of a temporal graph. To solve the above challenges, we first propose Laplacian Anomaly Detection (LAD) which uses the spectrum of graph Laplacian as the low dimensional embedding of the graph structure at each snapshot. LAD explicitly models short term and long term dependencies by applying two sliding windows. Next, we propose MultiLAD, a simple and effective generalization of LAD to multi-view graphs. MultiLAD provides the first change point detection method for multi-view dynamic graphs. It aggregates the singular values of the normalized graph Laplacian from different views through the scalar power mean operation. Through extensive synthetic experiments, we show that i). LAD and MultiLAD are accurate and outperforms state-of-the-art baselines and their multi-view extensions by a large margin, ii). MultiLAD's advantage over contenders significantly increases when additional views are available, and iii). MultiLAD is highly robust to noise from individual views. In five real world dynamic graphs, we demonstrate that LAD and MultiLAD identify significant events as top anomalies such as the implementation of government COVID-19 interventions which impacted the population mobility in multi-view traffic networks.
    Human not in the loop: objective sample difficulty measures for Curriculum Learning. (arXiv:2302.01243v1 [cs.CV])
    Curriculum learning is a learning method that trains models in a meaningful order from easier to harder samples. A key here is to devise automatic and objective difficulty measures of samples. In the medical domain, previous work applied domain knowledge from human experts to qualitatively assess classification difficulty of medical images to guide curriculum learning, which requires extra annotation efforts, relies on subjective human experience, and may introduce bias. In this work, we propose a new automated curriculum learning technique using the variance of gradients (VoG) to compute an objective difficulty measure of samples and evaluated its effects on elbow fracture classification from X-ray images. Specifically, we used VoG as a metric to rank each sample in terms of the classification difficulty, where high VoG scores indicate more difficult cases for classification, to guide the curriculum training process We compared the proposed technique to a baseline (without curriculum learning), a previous method that used human annotations on classification difficulty, and anti-curriculum learning. Our experiment results showed comparable and higher performance for the binary and multi-class bone fracture classification tasks.
    Post-hoc Concept Bottleneck Models. (arXiv:2205.15480v2 [cs.LG] UPDATED)
    Concept Bottleneck Models (CBMs) map the inputs onto a set of interpretable concepts (``the bottleneck'') and use the concepts to make predictions. A concept bottleneck enhances interpretability since it can be investigated to understand what concepts the model "sees" in an input and which of these concepts are deemed important. However, CBMs are restrictive in practice as they require dense concept annotations in the training data to learn the bottleneck. Moreover, CBMs often do not match the accuracy of an unrestricted neural network, reducing the incentive to deploy them in practice. In this work, we address these limitations of CBMs by introducing Post-hoc Concept Bottleneck models (PCBMs). We show that we can turn any neural network into a PCBM without sacrificing model performance while still retaining the interpretability benefits. When concept annotations are not available on the training data, we show that PCBM can transfer concepts from other datasets or from natural language descriptions of concepts via multimodal models. A key benefit of PCBM is that it enables users to quickly debug and update the model to reduce spurious correlations and improve generalization to new distributions. PCBM allows for global model edits, which can be more efficient than previous works on local interventions that fix a specific prediction. Through a model-editing user study, we show that editing PCBMs via concept-level feedback can provide significant performance gains without using data from the target domain or model retraining.
    Construction and Applications of Billion-Scale Pre-trained Multimodal Business Knowledge Graph. (arXiv:2209.15214v3 [cs.AI] CROSS LISTED)
    Business Knowledge Graphs (KGs) are important to many enterprises today, providing factual knowledge and structured data that steer many products and make them more intelligent. Despite their promising benefits, building business KG necessitates solving prohibitive issues of deficient structure and multiple modalities. In this paper, we advance the understanding of the practical challenges related to building KG in non-trivial real-world systems. We introduce the process of building an open business knowledge graph (OpenBG) derived from a well-known enterprise, Alibaba Group. Specifically, we define a core ontology to cover various abstract products and consumption demands, with fine-grained taxonomy and multimodal facts in deployed applications. OpenBG is an open business KG of unprecedented scale: 2.6 billion triples with more than 88 million entities covering over 1 million core classes/concepts and 2,681 types of relations. We release all the open resources (OpenBG benchmarks) derived from it for the community and report experimental results of KG-centric tasks. We also run up an online competition based on OpenBG benchmarks, and has attracted thousands of teams. We further pre-train OpenBG and apply it to many KG- enhanced downstream tasks in business scenarios, demonstrating the effectiveness of billion-scale multimodal knowledge for e-commerce. All the resources with codes have been released at \url{https://github.com/OpenBGBenchmark/OpenBG}.
    Constrained Online Two-stage Stochastic Optimization: New Algorithms via Adversarial Learning. (arXiv:2302.00997v1 [cs.LG])
    We consider an online two-stage stochastic optimization with long-term constraints over a finite horizon of $T$ periods. At each period, we take the first-stage action, observe a model parameter realization and then take the second-stage action from a feasible set that depends both on the first-stage decision and the model parameter. We aim to minimize the cumulative objective value while guaranteeing that the long-term average second-stage decision belongs to a set. We propose a general algorithmic framework that derives online algorithms for the online two-stage problem from adversarial learning algorithms. Also, the regret bound of our algorithm cam be reduced to the regret bound of embedded adversarial learning algorithms. Based on our framework, we obtain new results under various settings. When the model parameter at each period is drawn from identical distributions, we derive state-of-art regret bound that improves previous bounds under special cases. Our algorithm is also robust to adversarial corruptions of model parameter realizations. When the model parameters are drawn from unknown non-stationary distributions and we are given prior estimates of the distributions, we develop a new algorithm from our framework with a regret $O(W_T+\sqrt{T})$, where $W_T$ measures the total inaccuracy of the prior estimates.
    Surprising Instabilities in Training Deep Networks and a Theoretical Analysis. (arXiv:2206.02001v3 [cs.LG] UPDATED)
    We discover restrained numerical instabilities in current training practices of deep networks with stochastic gradient descent (SGD). We show numerical error (on the order of the smallest floating point bit) induced from floating point arithmetic in training deep nets can be amplified significantly and result in significant test accuracy variance, comparable to the test accuracy variance due to stochasticity in SGD. We show how this is likely traced to instabilities of the optimization dynamics that are restrained, i.e., localized over iterations and regions of the weight tensor space. We do this by presenting a theoretical framework using numerical analysis of partial differential equations (PDE), and analyzing the gradient descent PDE of convolutional neural networks (CNNs). We show that it is stable only under certain conditions on the learning rate and weight decay. We show that rather than blowing up when the conditions are violated, the instability can be restrained. We show this is a consequence of the non-linear PDE associated with the gradient descent of the CNN, whose local linearization changes when over-driving the step size of the discretization, resulting in a stabilizing effect. We link restrained instabilities to the recently discovered Edge of Stability (EoS) phenomena, in which the stable step size predicted by classical theory is exceeded while continuing to optimize the loss and still converging. Because restrained instabilities occur at the EoS, our theory provides new predictions about the EoS, in particular, the role of regularization and the dependence on the network complexity.
    Prediction-Powered Inference. (arXiv:2301.09633v2 [stat.ML] UPDATED)
    We introduce prediction-powered inference $\unicode{x2013}$ a framework for performing valid statistical inference when an experimental data set is supplemented with predictions from a machine-learning system. Our framework yields provably valid conclusions without making any assumptions on the machine-learning algorithm that supplies the predictions. Higher accuracy of the predictions translates to smaller confidence intervals, permitting more powerful inference. Prediction-powered inference yields simple algorithms for computing valid confidence intervals for statistical objects such as means, quantiles, and linear and logistic regression coefficients. We demonstrate the benefits of prediction-powered inference with data sets from proteomics, genomics, electronic voting, remote sensing, census analysis, and ecology.
    GANalyzer: Analysis and Manipulation of GANs Latent Space for Controllable Face Synthesis. (arXiv:2302.00908v1 [cs.CV])
    Generative Adversarial Networks (GANs) are capable of synthesizing high-quality facial images. Despite their success, GANs do not provide any information about the relationship between the input vectors and the generated images. Currently, facial GANs are trained on imbalanced datasets, which generate less diverse images. For example, more than 77% of 100K images that we randomly synthesized using the StyleGAN3 are classified as Happy, and only around 3% are Angry. The problem even becomes worse when a mixture of facial attributes is desired: less than 1% of the generated samples are Angry Woman, and only around 2% are Happy Black. To address these problems, this paper proposes a framework, called GANalyzer, for the analysis, and manipulation of the latent space of well-trained GANs. GANalyzer consists of a set of transformation functions designed to manipulate latent vectors for a specific facial attribute such as facial Expression, Age, Gender, and Race. We analyze facial attribute entanglement in the latent space of GANs and apply the proposed transformation for editing the disentangled facial attributes. Our experimental results demonstrate the strength of GANalyzer in editing facial attributes and generating any desired faces. We also create and release a balanced photo-realistic human face dataset. Our code is publicly available on GitHub.
    Dynamic Ensemble of Low-fidelity Experts: Mitigating NAS "Cold-Start". (arXiv:2302.00932v1 [cs.LG])
    Predictor-based Neural Architecture Search (NAS) employs an architecture performance predictor to improve the sample efficiency. However, predictor-based NAS suffers from the severe ``cold-start'' problem, since a large amount of architecture-performance data is required to get a working predictor. In this paper, we focus on exploiting information in cheaper-to-obtain performance estimations (i.e., low-fidelity information) to mitigate the large data requirements of predictor training. Despite the intuitiveness of this idea, we observe that using inappropriate low-fidelity information even damages the prediction ability and different search spaces have different preferences for low-fidelity information types. To solve the problem and better fuse beneficial information provided by different types of low-fidelity information, we propose a novel dynamic ensemble predictor framework that comprises two steps. In the first step, we train different sub-predictors on different types of available low-fidelity information to extract beneficial knowledge as low-fidelity experts. In the second step, we learn a gating network to dynamically output a set of weighting coefficients conditioned on each input neural architecture, which will be used to combine the predictions of different low-fidelity experts in a weighted sum. The overall predictor is optimized on a small set of actual architecture-performance data to fuse the knowledge from different low-fidelity experts to make the final prediction. We conduct extensive experiments across five search spaces with different architecture encoders under various experimental settings. Our method can easily be incorporated into existing predictor-based NAS frameworks to discover better architectures.
    FCB-SwinV2 Transformer for Polyp Segmentation. (arXiv:2302.01027v1 [cs.CV])
    Polyp segmentation within colonoscopy video frames using deep learning models has the potential to automate the workflow of clinicians. This could help improve the early detection rate and characterization of polyps which could progress to colorectal cancer. Recent state-of-the-art deep learning polyp segmentation models have combined the outputs of Fully Convolutional Network architectures and Transformer Network architectures which work in parallel. In this paper we propose modifications to the current state-of-the-art polyp segmentation model FCBFormer. The transformer architecture of the FCBFormer is replaced with a SwinV2 Transformer-UNET and minor changes to the Fully Convolutional Network architecture are made to create the FCB-SwinV2 Transformer. The performance of the FCB-SwinV2 Transformer is evaluated on the popular colonoscopy segmentation bench-marking datasets Kvasir-SEG and CVC-ClinicDB. Generalizability tests are also conducted. The FCB-SwinV2 Transformer is able to consistently achieve higher mDice scores across all tests conducted and therefore represents new state-of-the-art performance. Issues found with how colonoscopy segmentation model performance is evaluated within literature are also re-ported and discussed. One of the most important issues identified is that when evaluating performance on the CVC-ClinicDB dataset it would be preferable to ensure no data leakage from video sequences occurs during the training/validation/test data partition.
    A Theoretical Justification for Image Inpainting using Denoising Diffusion Probabilistic Models. (arXiv:2302.01217v1 [stat.ML])
    We provide a theoretical justification for sample recovery using diffusion based image inpainting in a linear model setting. While most inpainting algorithms require retraining with each new mask, we prove that diffusion based inpainting generalizes well to unseen masks without retraining. We analyze a recently proposed popular diffusion based inpainting algorithm called RePaint (Lugmayr et al., 2022), and show that it has a bias due to misalignment that hampers sample recovery even in a two-state diffusion process. Motivated by our analysis, we propose a modified RePaint algorithm we call RePaint$^+$ that provably recovers the underlying true sample and enjoys a linear rate of convergence. It achieves this by rectifying the misalignment error present in drift and dispersion of the reverse process. To the best of our knowledge, this is the first linear convergence result for a diffusion based image inpainting algorithm.
    Symbolic Physics Learner: Discovering governing equations via Monte Carlo tree search. (arXiv:2205.13134v2 [cs.AI] UPDATED)
    Nonlinear dynamics is ubiquitous in nature and commonly seen in various science and engineering disciplines. Distilling analytical expressions that govern nonlinear dynamics from limited data remains vital but challenging. To tackle this fundamental issue, we propose a novel Symbolic Physics Learner (SPL) machine to discover the mathematical structure of nonlinear dynamics. The key concept is to interpret mathematical operations and system state variables by computational rules and symbols, establish symbolic reasoning of mathematical formulas via expression trees, and employ a Monte Carlo tree search (MCTS) agent to explore optimal expression trees based on measurement data. The MCTS agent obtains an optimistic selection policy through the traversal of expression trees, featuring the one that maps to the arithmetic expression of underlying physics. Salient features of the proposed framework include search flexibility and enforcement of parsimony for discovered equations. The efficacy and superiority of the SPL machine are demonstrated by numerical examples, compared with state-of-the-art baselines.
    Model Monitoring and Robustness of In-Use Machine Learning Models: Quantifying Data Distribution Shifts Using Population Stability Index. (arXiv:2302.00775v1 [cs.LG])
    Safety goes first. Meeting and maintaining industry safety standards for robustness of artificial intelligence (AI) and machine learning (ML) models require continuous monitoring for faults and performance drops. Deep learning models are widely used in industrial applications, e.g., computer vision, but the susceptibility of their performance to environment changes (e.g., noise) \emph{after deployment} on the product, are now well-known. A major challenge is detecting data distribution shifts that happen, comparing the following: {\bf (i)} development stage of AI and ML models, i.e., train/validation/test, to {\bf (ii)} deployment stage on the product (i.e., even after `testing') in the environment. We focus on a computer vision example related to autonomous driving and aim at detecting shifts that occur as a result of adding noise to images. We use the population stability index (PSI) as a measure of presence and intensity of shift and present results of our empirical experiments showing a promising potential for the PSI. We further discuss multiple aspects of model monitoring and robustness that need to be analyzed \emph{simultaneously} to achieve robustness for industry safety standards. We propose the need for and the research direction toward \emph{categorizations} of problem classes and examples where monitoring for robustness is required and present challenges and pointers for future work from a \emph{practical} perspective.
    Efficient Graph Field Integrators Meet Point Clouds. (arXiv:2302.00942v1 [cs.LG])
    We present two new classes of algorithms for efficient field integration on graphs encoding point clouds. The first class, SeparatorFactorization(SF), leverages the bounded genus of point cloud mesh graphs, while the second class, RFDiffusion(RFD), uses popular epsilon-nearest-neighbor graph representations for point clouds. Both can be viewed as providing the functionality of Fast Multipole Methods (FMMs), which have had a tremendous impact on efficient integration, but for non-Euclidean spaces. We focus on geometries induced by distributions of walk lengths between points (e.g., shortest-path distance). We provide an extensive theoretical analysis of our algorithms, obtaining new results in structural graph theory as a byproduct. We also perform exhaustive empirical evaluation, including on-surface interpolation for rigid and deformable objects (particularly for mesh-dynamics modeling), Wasserstein distance computations for point clouds, and the Gromov-Wasserstein variant.
    Variational Autoencoder Learns Better Feature Representations for EEG-based Obesity Classification. (arXiv:2302.00789v1 [cs.LG])
    Obesity is a common issue in modern societies today that can lead to various diseases and significantly reduced quality of life. Currently, research has been conducted to investigate resting state EEG (electroencephalogram) signals with an aim to identify possible neurological characteristics associated with obesity. In this study, we propose a deep learning-based framework to extract the resting state EEG features for obese and lean subject classification. Specifically, a novel variational autoencoder framework is employed to extract subject-invariant features from the raw EEG signals, which are then classified by a 1-D convolutional neural network. Comparing with conventional machine learning and deep learning methods, we demonstrate the superiority of using VAE for feature extraction, as reflected by the significantly improved classification accuracies, better visualizations and reduced impurity measures in the feature representations. Future work can be directed to gaining an in-depth understanding regarding the spatial patterns that have been learned by the proposed model from a neurological view, as well as improving the interpretability of the proposed model by allowing it to uncover any temporal-related information.
    Using Machine Learning to Develop Smart Reflex Testing Protocols. (arXiv:2302.00794v1 [cs.LG])
    Objective: Reflex testing protocols allow clinical laboratories to perform second line diagnostic tests on existing specimens based on the results of initially ordered tests. Reflex testing can support optimal clinical laboratory test ordering and diagnosis. In current clinical practice, reflex testing typically relies on simple "if-then" rules; however, this limits their scope since most test ordering decisions involve more complexity than a simple rule will allow. Here, using the analyte ferritin as an example, we propose an alternative machine learning-based approach to "smart" reflex testing with a wider scope and greater impact than traditional rule-based approaches. Methods: Using patient data, we developed a machine learning model to predict whether a patient getting CBC testing will also have ferritin testing ordered, consider applications of this model to "smart" reflex testing, and evaluate the model by comparing its performance to possible rule-based approaches. Results: Our underlying machine learning models performed moderately well in predicting ferritin test ordering and demonstrated greater suitability to reflex testing than rule-based approaches. Using chart review, we demonstrate that our model may improve ferritin test ordering. Finally, as a secondary goal, we demonstrate that ferritin test results are missing not at random (MNAR), a finding with implications for unbiased imputation of missing test results. Conclusions: Machine learning may provide a foundation for new types of reflex testing with enhanced benefits for clinical diagnosis and laboratory utilization management.
    Over-parameterised Shallow Neural Networks with Asymmetrical Node Scaling: Global Convergence Guarantees and Feature Learning. (arXiv:2302.01002v1 [stat.ML])
    We consider the optimisation of large and shallow neural networks via gradient flow, where the output of each hidden node is scaled by some positive parameter. We focus on the case where the node scalings are non-identical, differing from the classical Neural Tangent Kernel (NTK) parameterisation. We prove that, for large neural networks, with high probability, gradient flow converges to a global minimum AND can learn features, unlike in the NTK regime. We also provide experiments on synthetic and real-world datasets illustrating our theoretical results and showing the benefit of such scaling in terms of pruning and transfer learning.
    Analysis of Biomass Sustainability Indicators from a Machine Learning Perspective. (arXiv:2302.00828v1 [cs.AI])
    Plant biomass estimation is critical due to the variability of different environmental factors and crop management practices associated with it. The assessment is largely impacted by the accurate prediction of different environmental sustainability indicators. A robust model to predict sustainability indicators is a must for the biomass community. This study proposes a robust model for biomass sustainability prediction by analyzing sustainability indicators using machine learning models. The prospect of ensemble learning was also investigated to analyze the regression problem. All experiments were carried out on a crop residue data from the Ohio state. Ten machine learning models, namely, linear regression, ridge regression, multilayer perceptron, k-nearest neighbors, support vector machine, decision tree, gradient boosting, random forest, stacking and voting, were analyzed to estimate three biomass sustainability indicators, namely soil erosion factor, soil conditioning index, and organic matter factor. The performance of the model was assessed using cross-correlation (R2), root mean squared error and mean absolute error metrics. The results showed that Random Forest was the best performing model to assess sustainability indicators. The analyzed model can now serve as a guide for assessing sustainability indicators in real time.
    Empirical Analysis of the AdaBoost's Error Bound. (arXiv:2302.00880v1 [cs.LG])
    Understanding the accuracy limits of machine learning algorithms is essential for data scientists to properly measure performance so they can continually improve their models' predictive capabilities. This study empirically verified the error bound of the AdaBoost algorithm for both synthetic and real-world data. The results show that the error bound holds up in practice, demonstrating its efficiency and importance to a variety of applications. The corresponding source code is available at https://github.com/armanbolatov/adaboost_error_bound.
    Noncommutative $C^*$-algebra Net: Learning Neural Networks with Powerful Product Structure in $C^*$-algebra. (arXiv:2302.01191v1 [math.OA])
    We propose a new generalization of neural networks with noncommutative $C^*$-algebra. An important feature of $C^*$-algebras is their noncommutative structure of products, but the existing $C^*$-algebra net frameworks have only considered commutative $C^*$-algebras. We show that this noncommutative structure of $C^*$-algebras induces powerful effects in learning neural networks. Our framework has a wide range of applications, such as learning multiple related neural networks simultaneously with interactions and learning invariant features with respect to group actions. We also show the validity of our framework numerically, which illustrates its potential power.  ( 2 min )
    Collaborating with language models for embodied reasoning. (arXiv:2302.00763v1 [cs.LG])
    Reasoning in a complex and ambiguous environment is a key goal for Reinforcement Learning (RL) agents. While some sophisticated RL agents can successfully solve difficult tasks, they require a large amount of training data and often struggle to generalize to new unseen environments and new tasks. On the other hand, Large Scale Language Models (LSLMs) have exhibited strong reasoning ability and the ability to to adapt to new tasks through in-context learning. However, LSLMs do not inherently have the ability to interrogate or intervene on the environment. In this work, we investigate how to combine these complementary abilities in a single system consisting of three parts: a Planner, an Actor, and a Reporter. The Planner is a pre-trained language model that can issue commands to a simple embodied agent (the Actor), while the Reporter communicates with the Planner to inform its next command. We present a set of tasks that require reasoning, test this system's ability to generalize zero-shot and investigate failure cases, and demonstrate how components of this system can be trained with reinforcement-learning to improve performance.
    Molecular Geometry-aware Transformer for accurate 3D Atomic System modeling. (arXiv:2302.00855v1 [q-bio.MN])
    Molecular dynamic simulations are important in computational physics, chemistry, material, and biology. Machine learning-based methods have shown strong abilities in predicting molecular energy and properties and are much faster than DFT calculations. Molecular energy is at least related to atoms, bonds, bond angles, torsion angles, and nonbonding atom pairs. Previous Transformer models only use atoms as inputs which lack explicit modeling of the aforementioned factors. To alleviate this limitation, we propose Moleformer, a novel Transformer architecture that takes nodes (atoms) and edges (bonds and nonbonding atom pairs) as inputs and models the interactions among them using rotational and translational invariant geometry-aware spatial encoding. Proposed spatial encoding calculates relative position information including distances and angles among nodes and edges. We benchmark Moleformer on OC20 and QM9 datasets, and our model achieves state-of-the-art on the initial state to relaxed energy prediction of OC20 and is very competitive in QM9 on predicting quantum chemical properties compared to other Transformer and Graph Neural Network (GNN) methods which proves the effectiveness of the proposed geometry-aware spatial encoding in Moleformer.
    FAVOR#: Sharp Attention Kernel Approximations via New Classes of Positive Random Features. (arXiv:2302.00787v1 [cs.LG])
    The problem of efficient approximation of a linear operator induced by the Gaussian or softmax kernel is often addressed using random features (RFs) which yield an unbiased approximation of the operator's result. Such operators emerge in important applications ranging from kernel methods to efficient Transformers. We propose parameterized, positive, non-trigonometric RFs which approximate Gaussian and softmax-kernels. In contrast to traditional RF approximations, parameters of these new methods can be optimized to reduce the variance of the approximation, and the optimum can be expressed in closed form. We show that our methods lead to variance reduction in practice ($e^{10}$-times smaller variance and beyond) and outperform previous methods in a kernel regression task. Using our proposed mechanism, we also present FAVOR#, a method for self-attention approximation in Transformers. We show that FAVOR# outperforms other random feature methods in speech modelling and natural language processing.
    Rethinking Warm-Starts with Predictions: Learning Predictions Close to Sets of Optimal Solutions for Faster $\text{L}$-/$\text{L}^\natural$-Convex Function Minimization. (arXiv:2302.00928v1 [cs.LG])
    An emerging line of work has shown that machine-learned predictions are useful to warm-start algorithms for discrete optimization problems, such as bipartite matching. Previous studies have shown time complexity bounds proportional to some distance between a prediction and an optimal solution, which we can approximately minimize by learning predictions from past optimal solutions. However, such guarantees may not be meaningful when multiple optimal solutions exist. Indeed, the dual problem of bipartite matching and, more generally, $\text{L}$-/$\text{L}^\natural$-convex function minimization have arbitrarily many optimal solutions, making such prediction-dependent bounds arbitrarily large. To resolve this theoretically critical issue, we present a new warm-start-with-prediction framework for $\text{L}$-/$\text{L}^\natural$-convex function minimization. Our framework offers time complexity bounds proportional to the distance between a prediction and the set of all optimal solutions. The main technical difficulty lies in learning predictions that are provably close to sets of all optimal solutions, for which we present an online-gradient-descent-based method. We thus give the first polynomial-time learnability of predictions that can provably warm-start algorithms regardless of multiple optimal solutions.
    Mnemosyne: Learning to Train Transformers with Transformers. (arXiv:2302.01128v1 [cs.LG])
    Training complex machine learning (ML) architectures requires a compute and time consuming process of selecting the right optimizer and tuning its hyper-parameters. A new paradigm of learning optimizers from data has emerged as a better alternative to hand-designed ML optimizers. We propose Mnemosyne optimizer, that uses Performers: implicit low-rank attention Transformers. It can learn to train entire neural network architectures including other Transformers without any task-specific optimizer tuning. We show that Mnemosyne: (a) generalizes better than popular LSTM optimizer, (b) in particular can successfully train Vision Transformers (ViTs) while meta--trained on standard MLPs and (c) can initialize optimizers for faster convergence in Robotics applications. We believe that these results open the possibility of using Transformers to build foundational optimization models that can address the challenges of regular Transformer training. We complement our results with an extensive theoretical analysis of the compact associative memory used by Mnemosyne.  ( 2 min )
    Causal Lifting and Link Prediction. (arXiv:2302.01198v1 [cs.LG])
    Current state-of-the-art causal models for link prediction assume an underlying set of inherent node factors -- an innate characteristic defined at the node's birth -- that governs the causal evolution of links in the graph. In some causal tasks, however, link formation is path-dependent, i.e., the outcome of link interventions depends on existing links. For instance, in the customer-product graph of an online retailer, the effect of an 85-inch TV ad (treatment) likely depends on whether the costumer already has an 85-inch TV. Unfortunately, existing causal methods are impractical in these scenarios. The cascading functional dependencies between links (due to path dependence) are either unidentifiable or require an impractical number of control variables. In order to remedy this shortcoming, this work develops the first causal model capable of dealing with path dependencies in link prediction. It introduces the concept of causal lifting, an invariance in causal models that, when satisfied, allows the identification of causal link prediction queries using limited interventional data. On the estimation side, we show how structural pairwise embeddings -- a type of symmetry-based joint representation of node pairs in a graph -- exhibit lower bias and correctly represent the causal structure of the task, as opposed to existing node embedding methods, e.g., GNNs and matrix factorization. Finally, we validate our theoretical findings on four datasets under three different scenarios for causal link prediction tasks: knowledge base completion, covariance matrix estimation and consumer-product recommendations.  ( 2 min )
    De Novo Molecular Generation via Connection-aware Motif Mining. (arXiv:2302.01129v1 [cs.LG])
    De novo molecular generation is an essential task for science discovery. Recently, fragment-based deep generative models have attracted much research attention due to their flexibility in generating novel molecules based on existing molecule fragments. However, the motif vocabulary, i.e., the collection of frequent fragments, is usually built upon heuristic rules, which brings difficulties to capturing common substructures from large amounts of molecules. In this work, we propose a new method, MiCaM, to generate molecules based on mined connection-aware motifs. Specifically, it leverages a data-driven algorithm to automatically discover motifs from a molecule library by iteratively merging subgraphs based on their frequency. The obtained motif vocabulary consists of not only molecular motifs (i.e., the frequent fragments), but also their connection information, indicating how the motifs are connected with each other. Based on the mined connection-aware motifs, MiCaM builds a connection-aware generator, which simultaneously picks up motifs and determines how they are connected. We test our method on distribution-learning benchmarks (i.e., generating novel molecules to resemble the distribution of a given training set) and goal-directed benchmarks (i.e., generating molecules with target properties), and achieve significant improvements over previous fragment-based baselines. Furthermore, we demonstrate that our method can effectively mine domain-specific motifs for different tasks.  ( 2 min )
    QCM-SGM+: Improved Quantized Compressed Sensing With Score-Based Generative Models for General Sensing Matrices. (arXiv:2302.00919v1 [eess.SP])
    In realistic compressed sensing (CS) scenarios, the obtained measurements usually have to be quantized to a finite number of bits before transmission and/or storage, thus posing a challenge in recovery, especially for extremely coarse quantization such as 1-bit sign measurements. Recently Meng & Kabashima proposed an efficient quantized compressed sensing algorithm called QCS-SGM using the score-based generative models as an implicit prior. Thanks to the power of score-based generative models in capturing the rich structure of the prior, QCS-SGM achieves remarkably better performances than previous quantized CS methods. However, QCS-SGM is restricted to (approximately) row-orthogonal sensing matrices since otherwise the likelihood score becomes intractable. To address this challenging problem, in this paper we propose an improved version of QCS-SGM, which we call QCS-SGM+, which also works well for general matrices. The key idea is a Bayesian inference perspective of the likelihood score computation, whereby an expectation propagation algorithm is proposed to approximately compute the likelihood score. Experiments on a variety of baseline datasets demonstrate that the proposed QCS-SGM+ outperforms QCS-SGM by a large margin when sensing matrices are far from row-orthogonal.
    An Enhanced V-cycle MgNet Model for Operator Learning in Numerical Partial Differential Equations. (arXiv:2302.00938v1 [cs.LG])
    This study used a multigrid-based convolutional neural network architecture known as MgNet in operator learning to solve numerical partial differential equations (PDEs). Given the property of smoothing iterations in multigrid methods where low-frequency errors decay slowly, we introduced a low-frequency correction structure for residuals to enhance the standard V-cycle MgNet. The enhanced MgNet model can capture the low-frequency features of solutions considerably better than the standard V-cycle MgNet. The numerical results obtained using some standard operator learning tasks are better than those obtained using many state-of-the-art methods, demonstrating the efficiency of our model.Moreover, numerically, our new model is more robust in case of low- and high-resolution data during training and testing, respectively.
    Sharp Lower Bounds on Interpolation by Deep ReLU Neural Networks at Irregularly Spaced Data. (arXiv:2302.00834v1 [cs.LG])
    We study the interpolation, or memorization, power of deep ReLU neural networks. Specifically, we consider the question of how efficiently, in terms of the number of parameters, deep ReLU networks can interpolate values at $N$ datapoints in the unit ball which are separated by a distance $\delta$. We show that $\Omega(N)$ parameters are required in the regime where $\delta$ is exponentially small in $N$, which gives the sharp result in this regime since $O(N)$ parameters are always sufficient. This also shows that the bit-extraction technique used to prove lower bounds on the VC dimension cannot be applied to irregularly spaced datapoints.
    Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment. (arXiv:2302.00902v1 [cs.LG])
    Recent progress in scaling up large language models has shown impressive capabilities in performing few-shot learning across a wide range of text-based tasks. However, a key limitation is that these language models fundamentally lack visual perception - a crucial attribute needed to extend these models to be able to interact with the real world and solve vision tasks, such as in visual-question answering and robotics. Prior works have largely connected image to text through pretraining and/or fine-tuning on curated image-text datasets, which can be a costly and expensive process. In order to resolve this limitation, we propose a simple yet effective approach called Language-Quantized AutoEncoder (LQAE), a modification of VQ-VAE that learns to align text-image data in an unsupervised manner by leveraging pretrained language models (e.g., BERT, RoBERTa). Our main idea is to encode image as sequences of text tokens by directly quantizing image embeddings using a pretrained language codebook. We then apply random masking followed by a BERT model, and have the decoder reconstruct the original image from BERT predicted text token embeddings. By doing so, LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs. This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features. To the best of our knowledge, our work is the first work that uses unaligned images for multimodal tasks by leveraging the power of pretrained language models.
    CLIPood: Generalizing CLIP to Out-of-Distributions. (arXiv:2302.00864v1 [cs.LG])
    Out-of-distribution (OOD) generalization, where the model needs to handle distribution shifts from training, is a major challenge of machine learning. Recently, contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, revealing a promising path toward OOD generalization. However, to boost upon zero-shot performance, further adaptation of CLIP on downstream tasks is indispensable but undesirably degrades OOD generalization ability. In this paper, we aim at generalizing CLIP to out-of-distribution test data on downstream tasks. Beyond the two canonical OOD situations, domain shift and open class, we tackle a more general but difficult in-the-wild setting where both OOD situations may occur on the unseen test data. We propose CLIPood, a simple fine-tuning method that can adapt CLIP models to all OOD situations. To exploit semantic relations between classes from the text modality, CLIPood introduces a new training objective, margin metric softmax (MMS), with class adaptive margins for fine-tuning. Moreover, to incorporate both the pre-trained zero-shot model and the fine-tuned task-adaptive model, CLIPood proposes a new Beta moving average (BMA) to maintain a temporal ensemble according to Beta distribution. Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
    MTP-GO: Graph-Based Probabilistic Multi-Agent Trajectory Prediction with Neural ODEs. (arXiv:2302.00735v1 [cs.RO])
    Enabling resilient autonomous motion planning requires robust predictions of surrounding road users' future behavior. In response to this need and the associated challenges, we introduce our model, titled MTP-GO. The model encodes the scene using temporal graph neural networks to produce the inputs to an underlying motion model. The motion model is implemented using neural ordinary differential equations where the state-transition functions are learned with the rest of the model. Multi-modal probabilistic predictions are provided by combining the concept of mixture density networks and Kalman filtering. The results illustrate the predictive capabilities of the proposed model across various data sets, outperforming several state-of-the-art methods on a number of metrics.  ( 2 min )
    LMC: Fast Training of GNNs via Subgraph Sampling with Provable Convergence. (arXiv:2302.00924v1 [cs.LG])
    The message passing-based graph neural networks (GNNs) have achieved great success in many real-world applications. However, training GNNs on large-scale graphs suffers from the well-known neighbor explosion problem, i.e., the exponentially increasing dependencies of nodes with the number of message passing layers. Subgraph-wise sampling methods -- a promising class of mini-batch training techniques -- discard messages outside the mini-batches in backward passes to avoid the neighbor explosion problem at the expense of gradient estimation accuracy. This poses significant challenges to their convergence analysis and convergence speeds, which seriously limits their reliable real-world applications. To address this challenge, we propose a novel subgraph-wise sampling method with a convergence guarantee, namely Local Message Compensation (LMC). To the best of our knowledge, LMC is the {\it first} subgraph-wise sampling method with provable convergence. The key idea of LMC is to retrieve the discarded messages in backward passes based on a message passing formulation of backward passes. By efficient and effective compensations for the discarded messages in both forward and backward passes, LMC computes accurate mini-batch gradients and thus accelerates convergence. We further show that LMC converges to first-order stationary points of GNNs. Experiments on large-scale benchmark tasks demonstrate that LMC significantly outperforms state-of-the-art subgraph-wise sampling methods in terms of efficiency.  ( 2 min )
    Implicit regularization in Heavy-ball momentum accelerated stochastic gradient descent. (arXiv:2302.00849v1 [cs.LG])
    It is well known that the finite step-size ($h$) in Gradient Descent (GD) implicitly regularizes solutions to flatter minima. A natural question to ask is "Does the momentum parameter $\beta$ play a role in implicit regularization in Heavy-ball (H.B) momentum accelerated gradient descent (GD+M)?". To answer this question, first, we show that the discrete H.B momentum update (GD+M) follows a continuous trajectory induced by a modified loss, which consists of an original loss and an implicit regularizer. Then, we show that this implicit regularizer for (GD+M) is stronger than that of (GD) by factor of $(\frac{1+\beta}{1-\beta})$, thus explaining why (GD+M) shows better generalization performance and higher test accuracy than (GD). Furthermore, we extend our analysis to the stochastic version of gradient descent with momentum (SGD+M) and characterize the continuous trajectory of the update of (SGD+M) in a pointwise sense. We explore the implicit regularization in (SGD+M) and (GD+M) through a series of experiments validating our theory.
    The Contextual Lasso: Sparse Linear Models via Deep Neural Networks. (arXiv:2302.00878v1 [stat.ML])
    Sparse linear models are a gold standard tool for interpretable machine learning, a field of emerging importance as predictive models permeate decision-making in many domains. Unfortunately, sparse linear models are far less flexible as functions of their input features than black-box models like deep neural networks. With this capability gap in mind, we study a not-uncommon situation where the input features dichotomize into two groups: explanatory features, which we wish to explain the model's predictions, and contextual features, which we wish to determine the model's explanations. This dichotomy leads us to propose the contextual lasso, a new statistical estimator that fits a sparse linear model whose sparsity pattern and coefficients can vary with the contextual features. The fitting process involves learning a nonparametric map, realized via a deep neural network, from contextual feature vector to sparse coefficient vector. To attain sparse coefficients, we train the network with a novel lasso regularizer in the form of a projection layer that maps the network's output onto the space of $\ell_1$-constrained linear models. Extensive experiments on real and synthetic data suggest that the learned models, which remain highly transparent, can be sparser than the regular lasso without sacrificing the predictive power of a standard deep neural network.
    High-precision regressors for particle physics. (arXiv:2302.00753v1 [physics.comp-ph])
    Monte Carlo simulations of physics processes at particle colliders like the Large Hadron Collider at CERN take up a major fraction of the computational budget. For some simulations, a single data point takes seconds, minutes, or even hours to compute from first principles. Since the necessary number of data points per simulation is on the order of $10^9$ - $10^{12}$, machine learning regressors can be used in place of physics simulators to significantly reduce this computational burden. However, this task requires high-precision regressors that can deliver data with relative errors of less than $1\%$ or even $0.1\%$ over the entire domain of the function. In this paper, we develop optimal training strategies and tune various machine learning regressors to satisfy the high-precision requirement. We leverage symmetry arguments from particle physics to optimize the performance of the regressors. Inspired by ResNets, we design a Deep Neural Network with skip connections that outperform fully connected Deep Neural Networks. We find that at lower dimensions, boosted decision trees far outperform neural networks while at higher dimensions neural networks perform significantly better. We show that these regressors can speed up simulations by a factor of $10^3$ - $10^6$ over the first-principles computations currently used in Monte Carlo simulations. Additionally, using symmetry arguments derived from particle physics, we reduce the number of regressors necessary for each simulation by an order of magnitude. Our work can significantly reduce the training and storage burden of Monte Carlo simulations at current and future collider experiments.
    Synthesizing Physical Character-Scene Interactions. (arXiv:2302.00883v1 [cs.GR])
    Movement is how people interact with and affect their environment. For realistic character animation, it is necessary to synthesize such interactions between virtual characters and their surroundings. Despite recent progress in character animation using machine learning, most systems focus on controlling an agent's movements in fairly simple and homogeneous environments, with limited interactions with other objects. Furthermore, many previous approaches that synthesize human-scene interactions require significant manual labeling of the training data. In contrast, we present a system that uses adversarial imitation learning and reinforcement learning to train physically-simulated characters that perform scene interaction tasks in a natural and life-like manner. Our method learns scene interaction behaviors from large unstructured motion datasets, without manual annotation of the motion data. These scene interactions are learned using an adversarial discriminator that evaluates the realism of a motion within the context of a scene. The key novelty involves conditioning both the discriminator and the policy networks on scene context. We demonstrate the effectiveness of our approach through three challenging scene interaction tasks: carrying, sitting, and lying down, which require coordination of a character's movements in relation to objects in the environment. Our policies learn to seamlessly transition between different behaviors like idling, walking, and sitting. By randomizing the properties of the objects and their placements during training, our method is able to generalize beyond the objects and scenarios depicted in the training dataset, producing natural character-scene interactions for a wide variety of object shapes and placements. The approach takes physics-based character motion generation a step closer to broad applicability.
    STEP: Learning N:M Structured Sparsity Masks from Scratch with Precondition. (arXiv:2302.01172v1 [cs.LG])
    Recent innovations on hardware (e.g. Nvidia A100) have motivated learning N:M structured sparsity masks from scratch for fast model inference. However, state-of-the-art learning recipes in this regime (e.g. SR-STE) are proposed for non-adaptive optimizers like momentum SGD, while incurring non-trivial accuracy drop for Adam-trained models like attention-based LLMs. In this paper, we first demonstrate such gap origins from poorly estimated second moment (i.e. variance) in Adam states given by the masked weights. We conjecture that learning N:M masks with Adam should take the critical regime of variance estimation into account. In light of this, we propose STEP, an Adam-aware recipe that learns N:M masks with two phases: first, STEP calculates a reliable variance estimate (precondition phase) and subsequently, the variance remains fixed and is used as a precondition to learn N:M masks (mask-learning phase). STEP automatically identifies the switching point of two phases by dynamically sampling variance changes over the training trajectory and testing the sample concentration. Empirically, we evaluate STEP and other baselines such as ASP and SR-STE on multiple tasks including CIFAR classification, machine translation and LLM fine-tuning (BERT-Base, GPT-2). We show STEP mitigates the accuracy drop of baseline recipes and is robust to aggressive structured sparsity ratios.  ( 2 min )
    Stochastic Contextual Bandits with Long Horizon Rewards. (arXiv:2302.00814v1 [cs.LG])
    The growing interest in complex decision-making and language modeling problems highlights the importance of sample-efficient learning over very long horizons. This work takes a step in this direction by investigating contextual linear bandits where the current reward depends on at most $s$ prior actions and contexts (not necessarily consecutive), up to a time horizon of $h$. In order to avoid polynomial dependence on $h$, we propose new algorithms that leverage sparsity to discover the dependence pattern and arm parameters jointly. We consider both the data-poor ($T<h$) and data-rich ($T\ge h$) regimes, and derive respective regret upper bounds $\tilde O(d\sqrt{sT} +\min\{ q, T\})$ and $\tilde O(\sqrt{sdT})$, with sparsity $s$, feature dimension $d$, total time horizon $T$, and $q$ that is adaptive to the reward dependence pattern. Complementing upper bounds, we also show that learning over a single trajectory brings inherent challenges: While the dependence pattern and arm parameters form a rank-1 matrix, circulant matrices are not isometric over rank-1 manifolds and sample complexity indeed benefits from the sparse reward dependence structure. Our results necessitate a new analysis to address long-range temporal dependencies across data and avoid polynomial dependence on the reward horizon $h$. Specifically, we utilize connections to the restricted isometry property of circulant matrices formed by dependent sub-Gaussian vectors and establish new guarantees that are also of independent interest.
    Recurrent Graph Convolutional Networks for Spatiotemporal Prediction of Snow Accumulation Using Airborne Radar. (arXiv:2302.00817v1 [cs.LG])
    The accurate prediction and estimation of annual snow accumulation has grown in importance as we deal with the effects of climate change and the increase of global atmospheric temperatures. Airborne radar sensors, such as the Snow Radar, are able to measure accumulation rate patterns at a large-scale and monitor the effects of ongoing climate change on Greenland's precipitation and run-off. The Snow Radar's use of an ultra-wide bandwidth enables a fine vertical resolution that helps in capturing internal ice layers. Given the amount of snow accumulation in previous years using the radar data, in this paper, we propose a machine learning model based on recurrent graph convolutional networks to predict the snow accumulation in recent consecutive years at a certain location. We found that the model performs better and with more consistency than equivalent nongeometric and nontemporal models.
    Disentanglement of Latent Representations via Sparse Causal Interventions. (arXiv:2302.00869v1 [cs.LG])
    The process of generating data such as images is controlled by independent and unknown factors of variation. The retrieval of these variables has been studied extensively in the disentanglement, causal representation learning, and independent component analysis fields. Recently, approaches merging these domains together have shown great success. Instead of directly representing the factors of variation, the problem of disentanglement can be seen as finding the interventions on one image that yield a change to a single factor. Following this assumption, we introduce a new method for disentanglement inspired by causal dynamics that combines causality theory with vector-quantized variational autoencoders. Our model considers the quantized vectors as causal variables and links them in a causal graph. It performs causal interventions on the graph and generates atomic transitions affecting a unique factor of variation in the image. We also introduce a new task of action retrieval that consists of finding the action responsible for the transition between two images. We test our method on standard synthetic and real-world disentanglement datasets. We show that it can effectively disentangle the factors of variation and perform precise interventions on high-level semantic attributes of an image without affecting its quality, even with imbalanced data distributions.
    A Light-weight CNN Model for Efficient Parkinson's Disease Diagnostics. (arXiv:2302.00973v1 [stat.ML])
    In recent years, deep learning methods have achieved great success in various fields due to their strong performance in practical applications. In this paper, we present a light-weight neural network for Parkinson's disease diagnostics, in which a series of hand-drawn data are collected to distinguish Parkinson's disease patients from healthy control subjects. The proposed model consists of a convolution neural network (CNN) cascading to long-short-term memory (LSTM) to adapt the characteristics of collected time-series signals. To make full use of their advantages, a multilayered LSTM model is firstly used to enrich features which are then concatenated with raw data and fed into a shallow one-dimensional (1D) CNN model for efficient classification. Experimental results show that the proposed model achieves a high-quality diagnostic result over multiple evaluation metrics with much fewer parameters and operations, outperforming conventional methods such as support vector machine (SVM), random forest (RF), lightgbm (LGB) and CNN-based methods.
    Meta Learning in Decentralized Neural Networks: Towards More General AI. (arXiv:2302.01020v1 [cs.LG])
    Meta-learning usually refers to a learning algorithm that learns from other learning algorithms. The problem of uncertainty in the predictions of neural networks shows that the world is only partially predictable and a learned neural network cannot generalize to its ever-changing surrounding environments. Therefore, the question is how a predictive model can represent multiple predictions simultaneously. We aim to provide a fundamental understanding of learning to learn in the contents of Decentralized Neural Networks (Decentralized NNs) and we believe this is one of the most important questions and prerequisites to building an autonomous intelligence machine. To this end, we shall demonstrate several pieces of evidence for tackling the problems above with Meta Learning in Decentralized NNs. In particular, we will present three different approaches to building such a decentralized learning system: (1) learning from many replica neural networks, (2) building the hierarchy of neural networks for different functions, and (3) leveraging different modality experts to learn cross-modal representations.
    Fast Online Value-Maximizing Prediction Sets with Conformal Cost Control. (arXiv:2302.00839v1 [cs.LG])
    Many real-world multi-label prediction problems involve set-valued predictions that must satisfy specific requirements dictated by downstream usage. We focus on a typical scenario where such requirements, separately encoding \textit{value} and \textit{cost}, compete with each other. For instance, a hospital might expect a smart diagnosis system to capture as many severe, often co-morbid, diseases as possible (the value), while maintaining strict control over incorrect predictions (the cost). We present a general pipeline, dubbed as FavMac, to maximize the value while controlling the cost in such scenarios. FavMac can be combined with almost any multi-label classifier, affording distribution-free theoretical guarantees on cost control. Moreover, unlike prior works, FavMac can handle real-world large-scale applications via a carefully designed online update mechanism, which is of independent interest. Our methodological and theoretical contributions are supported by experiments on several healthcare tasks and synthetic datasets - FavMac furnishes higher value compared with several variants and baselines while maintaining strict cost control.
    Hierarchical shrinkage Gaussian processes: applications to computer code emulation and dynamical system recovery. (arXiv:2302.00755v1 [stat.ML])
    In many areas of science and engineering, computer simulations are widely used as proxies for physical experiments, which can be infeasible or unethical. Such simulations can often be computationally expensive, and an emulator can be trained to efficiently predict the desired response surface. A widely-used emulator is the Gaussian process (GP), which provides a flexible framework for efficient prediction and uncertainty quantification. Standard GPs, however, do not capture structured sparsity on the underlying response surface, which is present in many applications, particularly in the physical sciences. We thus propose a new hierarchical shrinkage GP (HierGP), which incorporates such structure via cumulative shrinkage priors within a GP framework. We show that the HierGP implicitly embeds the well-known principles of effect sparsity, heredity and hierarchy for analysis of experiments, which allows our model to identify structured sparse features from the response surface with limited data. We propose efficient posterior sampling algorithms for model training and prediction, and prove desirable consistency properties for the HierGP. Finally, we demonstrate the improved performance of HierGP over existing models, in a suite of numerical experiments and an application to dynamical system recovery.
    Privacy Risk for anisotropic Langevin dynamics using relative entropy bounds. (arXiv:2302.00766v1 [cs.LG])
    The privacy preserving properties of Langevin dynamics with additive isotropic noise have been extensively studied. However, the isotropic noise assumption is very restrictive: (a) when adding noise to existing learning algorithms to preserve privacy and maintain the best possible accuracy one should take into account the relative magnitude of the outputs and their correlations; (b) popular algorithms such as stochastic gradient descent (and their continuous time limits) appear to possess anisotropic covariance properties. To study the privacy risks for the anisotropic noise case, one requires general results on the relative entropy between the laws of two Stochastic Differential Equations with different drifts and diffusion coefficients. Our main contribution is to establish such a bound using stability estimates for solutions to the Fokker-Planck equations via functional inequalities. With additional assumptions, the relative entropy bound implies an $(\epsilon,\delta)$-differential privacy bound. We discuss the practical implications of our bound related to privacy risk in different contexts.Finally, the benefits of anisotropic noise are illustrated using numerical results on optimising a quadratic loss or calibrating a neural network.
    A Survey on Compositional Generalization in Applications. (arXiv:2302.01067v1 [cs.AI])
    The field of compositional generalization is currently experiencing a renaissance in AI, as novel problem settings and algorithms motivated by various practical applications are being introduced, building on top of the classical compositional generalization problem. This article aims to provide a comprehensive review of top recent developments in multiple real-life applications of the compositional generalization. Specifically, we introduce a taxonomy of common applications and summarize the state-of-the-art for each of those domains. Furthermore, we identify important current trends and provide new perspectives pertaining to the future of this burgeoning field.  ( 2 min )
    Pathologies of Predictive Diversity in Deep Ensembles. (arXiv:2302.00704v1 [cs.LG])
    Classical results establish that ensembles of small models benefit when predictive diversity is encouraged, through bagging, boosting, and similar. Here we demonstrate that this intuition does not carry over to ensembles of deep neural networks used for classification, and in fact the opposite can be true. Unlike regression models or small (unconfident) classifiers, predictions from large (confident) neural networks concentrate in vertices of the probability simplex. Thus, decorrelating these points necessarily moves the ensemble prediction away from vertices, harming confidence and moving points across decision boundaries. Through large scale experiments, we demonstrate that diversity-encouraging regularizers hurt the performance of high-capacity deep ensembles used for classification. Even more surprisingly, discouraging predictive diversity can be beneficial. Together this work strongly suggests that the best strategy for deep ensembles is utilizing more accurate, but likely less diverse, component models.
    Generative Modeling with Quantum Neurons. (arXiv:2302.00788v1 [quant-ph])
    The recently proposed Quantum Neuron Born Machine (QNBM) has demonstrated quality initial performance as the first quantum generative machine learning (ML) model proposed with non-linear activations. However, previous investigations have been limited in scope with regards to the model's learnability and simulatability. In this work, we make a considerable leap forward by providing an extensive deep dive into the QNBM's potential as a generative model. We first demonstrate that the QNBM's network representation makes it non-trivial to be classically efficiently simulated. Following this result, we showcase the model's ability to learn (express and train on) a wider set of probability distributions, and benchmark the performance against a classical Restricted Boltzmann Machine (RBM). The QNBM is able to outperform this classical model on all distributions, even for the most optimally trained RBM among our simulations. Specifically, the QNBM outperforms the RBM with an improvement factor of 75.3x, 6.4x, and 3.5x for the discrete Gaussian, cardinality-constrained, and Bars and Stripes distributions respectively. Lastly, we conduct an initial investigation into the model's generalization capabilities and use a KL test to show that the model is able to approximate the ground truth probability distribution more closely than the training distribution when given access to a limited amount of data. Overall, we put forth a stronger case in support of using the QNBM for larger-scale generative tasks.
    Causal Effect Estimation: Recent Advances, Challenges, and Opportunities. (arXiv:2302.00848v1 [cs.LG])
    Causal inference has numerous real-world applications in many domains, such as health care, marketing, political science, and online advertising. Treatment effect estimation, a fundamental problem in causal inference, has been extensively studied in statistics for decades. However, traditional treatment effect estimation methods may not well handle large-scale and high-dimensional heterogeneous data. In recent years, an emerging research direction has attracted increasing attention in the broad artificial intelligence field, which combines the advantages of traditional treatment effect estimation approaches (e.g., propensity score, matching, and reweighing) and advanced machine learning approaches (e.g., representation learning, adversarial learning, and graph neural networks). Although the advanced machine learning approaches have shown extraordinary performance in treatment effect estimation, it also comes with a lot of new topics and new research questions. In view of the latest research efforts in the causal inference field, we provide a comprehensive discussion of challenges and opportunities for the three core components of the treatment effect estimation task, i.e., treatment, covariates, and outcome. In addition, we showcase the promising research directions of this topic from multiple perspectives.
    The Weisfeiler-Lehman Distance: Reinterpretation and Connection with GNNs. (arXiv:2302.00713v1 [cs.LG])
    In this paper, we present a novel interpretation of the so-called Weisfeiler-Lehman (WL) distance, introduced by Chen et al. (2022), using concepts from stochastic processes. The WL distance aims at comparing graphs with node features, has the same discriminative power as the classic Weisfeiler-Lehman graph isomorphism test and has deep connections to the Gromov-Wasserstein distance. This new interpretation connects the WL distance to the literature on distances for stochastic processes, which also makes the interpretation of the distance more accessible and intuitive. We further explore the connections between the WL distance and certain Message Passing Neural Networks, and discuss the implications of the WL distance for understanding the Lipschitz property and the universal approximation results for these networks.
    ImageNomer: developing an fMRI and omics visualization tool to detect racial bias in functional connectivity. (arXiv:2302.00767v1 [q-bio.PE])
    It can be difficult to identify trends and perform quality control in large, high-dimensional fMRI or omics datasets. To remedy this, we develop ImageNomer, a data visualization and analysis tool that allows inspection of both subject-level and cohort-level features. The tool allows visualization of phenotype correlation with functional connectivity (FC), partial connectivity (PC), dictionary components (PCA and our own method), and genomic data (single-nucleotide polymorphisms, SNPs). In addition, it allows visualization of weights from arbitrary ML models. ImageNomer is built with a Python backend and a Vue frontend. We validate ImageNomer using the Philadelphia Neurodevelopmental Cohort (PNC) dataset, which contains multitask fMRI and SNP data of healthy adolescents. Using correlation, greedy selection, or model weights, we find that a set of 10 FC features can explain 15% of variation in age, compared to 35% for the full 34,716 feature model. The four most significant FCs are either between bilateral default mode network (DMN) regions or spatially proximal subcortical areas. Additionally, we show that whereas both FC (fMRI) and SNPs (genomic) features can account for 10-15% of intelligence variation, this predictive ability disappears when controlling for race. We find that FC features can be used to predict race with 85% accuracy, compared to 78% accuracy for sex prediction. Using ImageNomer, this work casts doubt on the possibility of finding unbiased intelligence-related features in fMRI and SNPs of healthy adolescents.
    Riemannian Stochastic Approximation for Minimizing Tame Nonsmooth Objective Functions. (arXiv:2302.00709v1 [cs.LG])
    In many learning applications, the parameters in a model are structurally constrained in a way that can be modeled as them lying on a Riemannian manifold. Riemannian optimization, wherein procedures to enforce an iterative minimizing sequence to be constrained to the manifold, is used to train such models. At the same time, tame geometry has become a significant topological description of nonsmooth functions that appear in the landscapes of training neural networks and other important models with structural compositions of continuous nonlinear functions with nonsmooth maps. In this paper, we study the properties of such stratifiable functions on a manifold and the behavior of retracted stochastic gradient descent, with diminishing stepsizes, for minimizing such functions.
    Sample Complexity of Kernel-Based Q-Learning. (arXiv:2302.00727v1 [cs.LG])
    Modern reinforcement learning (RL) often faces an enormous state-action space. Existing analytical results are typically for settings with a small number of state-actions, or simple models such as linearly modeled Q-functions. To derive statistically efficient RL policies handling large state-action spaces, with more general Q-functions, some recent works have considered nonlinear function approximation using kernel ridge regression. In this work, we derive sample complexities for kernel based Q-learning when a generative model exists. We propose a nonparametric Q-learning algorithm which finds an $\epsilon$-optimal policy in an arbitrarily large scale discounted MDP. The sample complexity of the proposed algorithm is order optimal with respect to $\epsilon$ and the complexity of the kernel (in terms of its information gain). To the best of our knowledge, this is the first result showing a finite sample complexity under such a general model.
    Approximating the Shapley Value without Marginal Contributions. (arXiv:2302.00736v1 [cs.LG])
    The Shapley value is arguably the most popular approach for assigning a meaningful contribution value to players in a cooperative game, which has recently been used intensively in various areas of machine learning, most notably in explainable artificial intelligence. The meaningfulness is due to axiomatic properties that only the Shapley value satisfies, which, however, comes at the expense of an exact computation growing exponentially with the number of agents. Accordingly, a number of works are devoted to the efficient approximation of the Shapley values, all of which revolve around the notion of an agent's marginal contribution. In this paper, we propose with SVARM and Stratified SVARM two parameter-free and domain-independent approximation algorithms based on a representation of the Shapley value detached from the notion of marginal contributions. We prove unmatched theoretical guarantees regarding their approximation quality and provide satisfying empirical results.
    Neural Networks for Symbolic Regression. (arXiv:2302.00773v1 [cs.NE])
    Many real-world systems can be described by mathematical formulas that are human-comprehensible, easy to analyze and can be helpful in explaining the system's behaviour. Symbolic regression is a method that generates nonlinear models from data in the form of analytic expressions. Historically, symbolic regression has been predominantly realized using genetic programming, a method that iteratively evolves a population of candidate solutions that are sampled by genetic operators crossover and mutation. This gradient-free evolutionary approach suffers from several deficiencies: it does not scale well with the number of variables and samples in the training data, models tend to grow in size and complexity without an adequate accuracy gain, and it is hard to fine-tune the inner model coefficients using just genetic operators. Recently, neural networks have been applied to learn the whole analytic formula, i.e., its structure as well as the coefficients, by means of gradient-based optimization algorithms. We propose a novel neural network-based symbolic regression method that constructs physically plausible models based on limited training data and prior knowledge about the system. The method employs an adaptive weighting scheme to effectively deal with multiple loss function terms and an epoch-wise learning process to reduce the chance of getting stuck in poor local optima. Furthermore, we propose a parameter-free method for choosing the model with the best interpolation and extrapolation performance out of all models generated through the whole learning process. We experimentally evaluate the approach on the TurtleBot 2 mobile robot, the magnetic manipulation system, the equivalent resistance of two resistors in parallel, and the anti-lock braking system. The results clearly show the potential of the method to find sparse and accurate models that comply with the prior knowledge provided.
    A Survey of Deep Learning: From Activations to Transformers. (arXiv:2302.00722v1 [cs.LG])
    Deep learning has made tremendous progress in the last decade. A key success factor is the large amount of architectures, layers, objectives, and optimization techniques that have emerged in recent years. They include a myriad of variants related to attention, normalization, skip connection, transformer and self-supervised learning schemes -- to name a few. We provide a comprehensive overview of the most important, recent works in these areas to those who already have a basic understanding of deep learning. We hope that a holistic and unified treatment of influential, recent works helps researchers to form new connections between diverse areas of deep learning.
    Universal Soldier: Using Universal Adversarial Perturbations for Detecting Backdoor Attacks. (arXiv:2302.00747v1 [cs.LG])
    Deep learning models achieve excellent performance in numerous machine learning tasks. Yet, they suffer from security-related issues such as adversarial examples and poisoning (backdoor) attacks. A deep learning model may be poisoned by training with backdoored data or by modifying inner network parameters. Then, a backdoored model performs as expected when receiving a clean input, but it misclassifies when receiving a backdoored input stamped with a pre-designed pattern called "trigger". Unfortunately, it is difficult to distinguish between clean and backdoored models without prior knowledge of the trigger. This paper proposes a backdoor detection method by utilizing a special type of adversarial attack, universal adversarial perturbation (UAP), and its similarities with a backdoor trigger. We observe an intuitive phenomenon: UAPs generated from backdoored models need fewer perturbations to mislead the model than UAPs from clean models. UAPs of backdoored models tend to exploit the shortcut from all classes to the target class, built by the backdoor trigger. We propose a novel method called Universal Soldier for Backdoor detection (USB) and reverse engineering potential backdoor triggers via UAPs. Experiments on 345 models trained on several datasets show that USB effectively detects the injected backdoor and provides comparable or better results than state-of-the-art methods.
    Domain Generalization Emerges from Dreaming. (arXiv:2302.00980v1 [cs.CV])
    Recent studies have proven that DNNs, unlike human vision, tend to exploit texture information rather than shape. Such texture bias is one of the factors for the poor generalization performance of DNNs. We observe that the texture bias negatively affects not only in-domain generalization but also out-of-distribution generalization, i.e., Domain Generalization. Motivated by the observation, we propose a new framework to reduce the texture bias of a model by a novel optimization-based data augmentation, dubbed Stylized Dream. Our framework utilizes adaptive instance normalization (AdaIN) to augment the style of an original image yet preserve the content. We then adopt a regularization loss to predict consistent outputs between Stylized Dream and original images, which encourages the model to learn shape-based representations. Extensive experiments show that the proposed method achieves state-of-the-art performance in out-of-distribution settings on public benchmark datasets: PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet.  ( 2 min )
    Real-Time Evaluation in Online Continual Learning: A New Paradigm. (arXiv:2302.01047v1 [cs.LG])
    Current evaluations of Continual Learning (CL) methods typically assume that there is no constraint on training time and computation. This is an unrealistic assumption for any real-world setting, which motivates us to propose: a practical real-time evaluation of continual learning, in which the stream does not wait for the model to complete training before revealing the next data for predictions. To do this, we evaluate current CL methods with respect to their computational costs. We hypothesize that under this new evaluation paradigm, computationally demanding CL approaches may perform poorly on streams with a varying distribution. We conduct extensive experiments on CLOC, a large-scale dataset containing 39 million time-stamped images with geolocation labels. We show that a simple baseline outperforms state-of-the-art CL methods under this evaluation, questioning the applicability of existing methods in realistic settings. In addition, we explore various CL components commonly used in the literature, including memory sampling strategies and regularization approaches. We find that all considered methods fail to be competitive against our simple baseline. This surprisingly suggests that the majority of existing CL literature is tailored to a specific class of streams that is not practical. We hope that the evaluation we provide will be the first step towards a paradigm shift to consider the computational cost in the development of online continual learning methods.  ( 2 min )
    Avoiding Model Estimation in Robust Markov Decision Processes with a Generative Model. (arXiv:2302.01248v1 [stat.ML])
    Robust Markov Decision Processes (MDPs) are getting more attention for learning a robust policy which is less sensitive to environment changes. There are an increasing number of works analyzing sample-efficiency of robust MDPs. However, most works study robust MDPs in a model-based regime, where the transition probability needs to be estimated and requires $\mathcal{O}(|\mathcal{S}|^2|\mathcal{A}|)$ storage in memory. A common way to solve robust MDPs is to formulate them as a distributionally robust optimization (DRO) problem. However, solving a DRO problem is non-trivial, so prior works typically assume a strong oracle to obtain the optimal solution of the DRO problem easily. To remove the need for an oracle, we first transform the original robust MDPs into an alternative form, as the alternative form allows us to use stochastic gradient methods to solve the robust MDPs. Moreover, we prove the alternative form still preserves the role of robustness. With this new formulation, we devise a sample-efficient algorithm to solve the robust MDPs in a model-free regime, from which we benefit lower memory space $\mathcal{O}(|\mathcal{S}||\mathcal{A}|)$ without using the oracle. Finally, we validate our theoretical findings via numerical experiments and show the efficiency to solve the alternative form of robust MDPs.
    Randomized prior wavelet neural operator for uncertainty quantification. (arXiv:2302.01051v1 [stat.ML])
    In this paper, we propose a novel data-driven operator learning framework referred to as the \textit{Randomized Prior Wavelet Neural Operator} (RP-WNO). The proposed RP-WNO is an extension of the recently proposed wavelet neural operator, which boasts excellent generalizing capabilities but cannot estimate the uncertainty associated with its predictions. RP-WNO, unlike the vanilla WNO, comes with inherent uncertainty quantification module and hence, is expected to be extremely useful for scientists and engineers alike. RP-WNO utilizes randomized prior networks, which can account for prior information and is easier to implement for large, complex deep-learning architectures than its Bayesian counterpart. Four examples have been solved to test the proposed framework, and the results produced advocate favorably for the efficacy of the proposed framework.  ( 2 min )
    Reinforcement learning-based estimation for partial differential equations. (arXiv:2302.01189v1 [cs.LG])
    In systems governed by nonlinear partial differential equations such as fluid flows, the design of state estimators such as Kalman filters relies on a reduced-order model (ROM) that projects the original high-dimensional dynamics onto a computationally tractable low-dimensional space. However, ROMs are prone to large errors, which negatively affects the performance of the estimator. Here, we introduce the reinforcement learning reduced-order estimator (RL-ROE), a ROM-based estimator in which the correction term that takes in the measurements is given by a nonlinear policy trained through reinforcement learning. The nonlinearity of the policy enables the RL-ROE to compensate efficiently for errors of the ROM, while still taking advantage of the imperfect knowledge of the dynamics. Using examples involving the Burgers and Navier-Stokes equations, we show that in the limit of very few sensors, the trained RL-ROE outperforms a Kalman filter designed using the same ROM. Moreover, it yields accurate high-dimensional state estimates for reference trajectories corresponding to various physical parameter values, without direct knowledge of the latter.  ( 2 min )
    Neural Common Neighbor with Completion for Link Prediction. (arXiv:2302.00890v1 [cs.LG])
    Despite its outstanding performance in various graph tasks, vanilla Message Passing Neural Network (MPNN) usually fails in link prediction tasks, as it only uses representations of two individual target nodes and ignores the pairwise relation between them. To capture the pairwise relations, some models add manual features to the input graph and use the output of MPNN to produce pairwise representations. In contrast, others directly use manual features as pairwise representations. Though this simplification avoids applying a GNN to each link individually and thus improves scalability, these models still have much room for performance improvement due to the hand-crafted and unlearnable pairwise features. To upgrade performance while maintaining scalability, we propose Neural Common Neighbor (NCN), which uses learnable pairwise representations. To further boost NCN, we study the unobserved link problem. The incompleteness of the graph is ubiquitous and leads to distribution shifts between the training and test set, loss of common neighbor information, and performance degradation of models. Therefore, we propose two intervention methods: common neighbor completion and target link removal. Combining the two methods with NCN, we propose Neural Common Neighbor with Completion (NCNC). NCN and NCNC outperform recent strong baselines by large margins. NCNC achieves state-of-the-art performance in link prediction tasks.  ( 2 min )
    Energy Efficient Training of SNN using Local Zeroth Order Method. (arXiv:2302.00910v1 [cs.LG])
    Spiking neural networks are becoming increasingly popular for their low energy requirement in real-world tasks with accuracy comparable to the traditional ANNs. SNN training algorithms face the loss of gradient information and non-differentiability due to the Heaviside function in minimizing the model loss over model parameters. To circumvent the problem surrogate method uses a differentiable approximation of the Heaviside in the backward pass, while the forward pass uses the Heaviside as the spiking function. We propose to use the zeroth order technique at the neuron level to resolve this dichotomy and use it within the automatic differentiation tool. As a result, we establish a theoretical connection between the proposed local zeroth-order technique and the existing surrogate methods and vice-versa. The proposed method naturally lends itself to energy-efficient training of SNNs on GPUs. Experimental results with neuromorphic datasets show that such implementation requires less than 1 percent neurons to be active in the backward pass, resulting in a 100x speed-up in the backward computation time. Our method offers better generalization compared to the state-of-the-art energy-efficient technique while maintaining similar efficiency.  ( 2 min )
    Vectorized Scenario Description and Motion Prediction for Scenario-Based Testing. (arXiv:2302.01161v1 [cs.LG])
    Automated vehicles (AVs) are tested in diverse scenarios, typically specified by parameters such as velocities, distances, or curve radii. To describe scenarios uniformly independent of such parameters, this paper proposes a vectorized scenario description defined by the road geometry and vehicles' trajectories. Data of this form are generated for three scenarios, merged, and used to train the motion prediction model VectorNet, allowing to predict an AV's trajectory for unseen scenarios. Predicting scenario evaluation metrics, VectorNet partially achieves lower errors than regression models that separately process the three scenarios' data. However, for comprehensive generalization, sufficient variance in the training data must be ensured. Thus, contrary to existing methods, our proposed method can merge diverse scenarios' data and exploit spatial and temporal nuances in the vectorized scenario description. As a result, data from specified test scenarios and real-world scenarios can be compared and combined for (predictive) analyses and scenario selection.  ( 2 min )
    Deep COVID-19 Forecasting for Multiple States with Data Augmentation. (arXiv:2302.01155v1 [cs.LG])
    In this work, we propose a deep learning approach to forecasting state-level COVID-19 trends of weekly cumulative death in the United States (US) and incident cases in Germany. This approach includes a transformer model, an ensemble method, and a data augmentation technique for time series. We arrange the inputs of the transformer in such a way that predictions for different states can attend to the trends of the others. To overcome the issue of scarcity of training data for this COVID-19 pandemic, we have developed a novel data augmentation technique to generate useful data for training. More importantly, the generated data can also be used for model validation. As such, it has a two-fold advantage: 1) more actual observations can be used for training, and 2) the model can be validated on data which has distribution closer to the expected situation. Our model has achieved some of the best state-level results on the COVID-19 Forecast Hub for the US and for Germany.  ( 2 min )
    Predicting the Silent Majority on Graphs: Knowledge Transferable Graph Neural Network. (arXiv:2302.00873v1 [cs.LG])
    Graphs consisting of vocal nodes ("the vocal minority") and silent nodes ("the silent majority"), namely VS-Graph, are ubiquitous in the real world. The vocal nodes tend to have abundant features and labels. In contrast, silent nodes only have incomplete features and rare labels, e.g., the description and political tendency of politicians (vocal) are abundant while not for ordinary people (silent) on the twitter's social network. Predicting the silent majority remains a crucial yet challenging problem. However, most existing message-passing based GNNs assume that all nodes belong to the same domain, without considering the missing features and distribution-shift between domains, leading to poor ability to deal with VS-Graph. To combat the above challenges, we propose Knowledge Transferable Graph Neural Network (KT-GNN), which models distribution shifts during message passing and representation learning by transferring knowledge from vocal nodes to silent nodes. Specifically, we design the domain-adapted "feature completion and message passing mechanism" for node representation learning while preserving domain difference. And a knowledge transferable classifier based on KL-divergence is followed. Comprehensive experiments on real-world scenarios (i.e., company financial risk assessment and political elections) demonstrate the superior performance of our method. Our source code has been open sourced.
    Site-specific Deep Learning Path Loss Models based on the Method of Moments. (arXiv:2302.01052v1 [cs.LG])
    This paper describes deep learning models based on convolutional neural networks applied to the problem of predicting EM wave propagation over rural terrain. A surface integral equation formulation, solved with the method of moments and accelerated using the Fast Far Field approximation, is used to generate synthetic training data which comprises path loss computed over randomly generated 1D terrain profiles. These are used to train two networks, one based on fractal profiles and one based on profiles generated using a Gaussian process. The models show excellent agreement when applied to test profiles generated using the same statistical process used to create the training data and very good accuracy when applied to real life problems.
    adSformers: Personalization from Short-Term Sequences and Diversity of Representations in Etsy Ads. (arXiv:2302.01255v1 [cs.LG])
    In this article, we present our approach to personalizing Etsy Ads through encoding and learning from short-term (one-hour) sequences of user actions and diverse representations. To this end we introduce a three-component adSformer diversifiable personalization module (ADPM) and illustrate how we use this module to derive a short-term dynamic user representation and personalize the Click-Through Rate (CTR) and Post-Click Conversion Rate (PCCVR) models used in sponsored search (ad) ranking. The first component of the ADPM is a custom transformer encoder that learns the inherent structure from the sequence of actions. ADPM's second component enriches the signal through visual, multimodal and textual pretrained representations. Lastly, the third ADPM component includes a "learned" on the fly average pooled representation. The ADPM-personalized CTR and PCCVR models, henceforth referred to as adSformer CTR and adSformer PCCVR, outperform the CTR and PCCVR production baselines by $+6.65\%$ and $+12.70\%$, respectively, in offline Precision-Recall Area Under the Curve (PR AUC). At the time of this writing, following the online gains in A/B tests, such as $+5.34\%$ in return on ad spend, a seller success metric, we are ramping up the adSformers to $100\%$ traffic in Etsy Ads.
    ReLOAD: Reinforcement Learning with Optimistic Ascent-Descent for Last-Iterate Convergence in Constrained MDPs. (arXiv:2302.01275v1 [cs.LG])
    In recent years, Reinforcement Learning (RL) has been applied to real-world problems with increasing success. Such applications often require to put constraints on the agent's behavior. Existing algorithms for constrained RL (CRL) rely on gradient descent-ascent, but this approach comes with a caveat. While these algorithms are guaranteed to converge on average, they do not guarantee last-iterate convergence, i.e., the current policy of the agent may never converge to the optimal solution. In practice, it is often observed that the policy alternates between satisfying the constraints and maximizing the reward, rarely accomplishing both objectives simultaneously. Here, we address this problem by introducing Reinforcement Learning with Optimistic Ascent-Descent (ReLOAD), a principled CRL method with guaranteed last-iterate convergence. We demonstrate its empirical effectiveness on a wide variety of CRL problems including discrete MDPs and continuous control. In the process we establish a benchmark of challenging CRL problems.
    Physics Constrained Motion Prediction with Uncertainty Quantification. (arXiv:2302.01060v1 [cs.RO])
    Predicting the motion of dynamic agents is a critical task for guaranteeing the safety of autonomous systems. A particular challenge is that motion prediction algorithms should obey dynamics constraints and quantify prediction uncertainty as a measure of confidence. We present a physics-constrained approach for motion prediction which uses a surrogate dynamical model to ensure that predicted trajectories are dynamically feasible. We propose a two-step integration consisting of intent and trajectory prediction subject to dynamics constraints. We also construct prediction regions that quantify uncertainty and are tailored for autonomous driving by using conformal prediction, a popular statistical tool. Physics Constrained Motion Prediction achieves a 41% better ADE, 56% better FDE, and 19% better IoU over a baseline in experiments using an autonomous racing dataset.  ( 2 min )
    Randomized Greedy Learning for Non-monotone Stochastic Submodular Maximization Under Full-bandit Feedback. (arXiv:2302.01324v1 [cs.LG])
    We investigate the problem of unconstrained combinatorial multi-armed bandits with full-bandit feedback and stochastic rewards for submodular maximization. Previous works investigate the same problem assuming a submodular and monotone reward function. In this work, we study a more general problem, i.e., when the reward function is not necessarily monotone, and the submodularity is assumed only in expectation. We propose Randomized Greedy Learning (RGL) algorithm and theoretically prove that it achieves a $\frac{1}{2}$-regret upper bound of $\tilde{\mathcal{O}}(n T^{\frac{2}{3}})$ for horizon $T$ and number of arms $n$. We also show in experiments that RGL empirically outperforms other full-bandit variants in submodular and non-submodular settings.
    The Power of Preconditioning in Overparameterized Low-Rank Matrix Sensing. (arXiv:2302.01186v1 [cs.LG])
    We propose $\textsf{ScaledGD($\lambda$)}$, a preconditioned gradient descent method to tackle the low-rank matrix sensing problem when the true rank is unknown, and when the matrix is possibly ill-conditioned. Using overparametrized factor representations, $\textsf{ScaledGD($\lambda$)}$ starts from a small random initialization, and proceeds by gradient descent with a specific form of damped preconditioning to combat bad curvatures induced by overparameterization and ill-conditioning. At the expense of light computational overhead incurred by preconditioners, $\textsf{ScaledGD($\lambda$)}$ is remarkably robust to ill-conditioning compared to vanilla gradient descent ($\textsf{GD}$) even with overprameterization. Specifically, we show that, under the Gaussian design, $\textsf{ScaledGD($\lambda$)}$ converges to the true low-rank matrix at a constant linear rate after a small number of iterations that scales only logarithmically with respect to the condition number and the problem dimension. This significantly improves over the convergence rate of vanilla $\textsf{GD}$ which suffers from a polynomial dependency on the condition number. Our work provides evidence on the power of preconditioning in accelerating the convergence without hurting generalization in overparameterized learning.
    Bayesian Inference on Binary Spiking Networks Leveraging Nanoscale Device Stochasticity. (arXiv:2302.01302v1 [cs.NE])
    Bayesian Neural Networks (BNNs) can overcome the problem of overconfidence that plagues traditional frequentist deep neural networks, and are hence considered to be a key enabler for reliable AI systems. However, conventional hardware realizations of BNNs are resource intensive, requiring the implementation of random number generators for synaptic sampling. Owing to their inherent stochasticity during programming and read operations, nanoscale memristive devices can be directly leveraged for sampling, without the need for additional hardware resources. In this paper, we introduce a novel Phase Change Memory (PCM)-based hardware implementation for BNNs with binary synapses. The proposed architecture consists of separate weight and noise planes, in which PCM cells are configured and operated to represent the nominal values of weights and to generate the required noise for sampling, respectively. Using experimentally observed PCM noise characteristics, for the exemplary Breast Cancer Dataset classification problem, we obtain hardware accuracy and expected calibration error matching that of an 8-bit fixed-point (FxP8) implementation, with projected savings of over 9$\times$ in terms of core area transistor count.
    On Suppressing Range of Adaptive Stepsizes of Adam to Improve Generalisation Performance. (arXiv:2302.01029v1 [cs.LG])
    A number of recent adaptive optimizers improve the generalisation performance of Adam by essentially reducing the variance of adaptive stepsizes to get closer to SGD with momentum. Following the above motivation, we suppress the range of the adaptive stepsizes of Adam by exploiting the layerwise gradient statistics. In particular, at each iteration, we propose to perform three consecutive operations on the second momentum v_t before using it to update a DNN model: (1): down-scaling, (2): epsilon-embedding, and (3): down-translating. The resulting algorithm is referred to as SET-Adam, where SET is a brief notation of the three operations. The down-scaling operation on v_t is performed layerwise by making use of the angles between the layerwise subvectors of v_t and the corresponding all-one subvectors. Extensive experimental results show that SET-Adam outperforms eight adaptive optimizers when training transformers and LSTMs for NLP, and VGG and ResNet for image classification over CIAF10 and CIFAR100 while matching the best performance of the eight adaptive methods when training WGAN-GP models for image generation tasks. Furthermore, SET-Adam produces higher validation accuracies than Adam and AdaBelief for training ResNet18 over ImageNet.  ( 2 min )
    Exposing the CSI: A Systematic Investigation of CSI-based Wi-Fi Sensing Capabilities and Limitations. (arXiv:2302.00992v1 [cs.NI])
    Thanks to the ubiquitous deployment of Wi-Fi hotspots, channel state information (CSI)-based Wi-Fi sensing can unleash game-changing applications in many fields, such as healthcare, security, and entertainment. However, despite one decade of active research on Wi-Fi sensing, most existing work only considers legacy IEEE 802.11n devices, often in particular and strictly-controlled environments. Worse yet, there is a fundamental lack of understanding of the impact on CSI-based sensing of modern Wi-Fi features, such as 160-MHz bandwidth, multiple-input multiple-output (MIMO) transmissions, and increased spectral resolution in IEEE 802.11ax (Wi-Fi 6). This work aims to shed light on the impact of Wi-Fi 6 features on the sensing performance and to create a benchmark for future research on Wi-Fi sensing. To this end, we perform an extensive CSI data collection campaign involving 3 individuals, 3 environments, and 12 activities, using Wi-Fi 6 signals. An anonymized ground truth obtained through video recording accompanies our 80-GB dataset, which contains almost two hours of CSI data from three collectors. We leverage our dataset to dissect the performance of a state-of-the-art sensing framework across different environments and individuals. Our key findings suggest that (i) MIMO transmissions and higher spectral resolution might be more beneficial than larger bandwidth for sensing applications; (ii) there is a pressing need to standardize research on Wi-Fi sensing because the path towards a truly environment-independent framework is still uncertain. To ease the experiments' replicability and address the current lack of Wi-Fi 6 CSI datasets, we release our 80-GB dataset to the community.  ( 2 min )
    Oracle-Preserving Latent Flows. (arXiv:2302.00806v1 [cs.LG])
    We develop a deep learning methodology for the simultaneous discovery of multiple nontrivial continuous symmetries across an entire labelled dataset. The symmetry transformations and the corresponding generators are modeled with fully connected neural networks trained with a specially constructed loss function ensuring the desired symmetry properties. The two new elements in this work are the use of a reduced-dimensionality latent space and the generalization to transformations invariant with respect to high-dimensional oracles. The method is demonstrated with several examples on the MNIST digit dataset.  ( 2 min )
    Combining Tree-Search, Generative Models, and Nash Bargaining Concepts in Game-Theoretic Reinforcement Learning. (arXiv:2302.00797v1 [cs.AI])
    Multiagent reinforcement learning (MARL) has benefited significantly from population-based and game-theoretic training regimes. One approach, Policy-Space Response Oracles (PSRO), employs standard reinforcement learning to compute response policies via approximate best responses and combines them via meta-strategy selection. We augment PSRO by adding a novel search procedure with generative sampling of world states, and introduce two new meta-strategy solvers based on the Nash bargaining solution. We evaluate PSRO's ability to compute approximate Nash equilibrium, and its performance in two negotiation games: Colored Trails, and Deal or No Deal. We conduct behavioral studies where human participants negotiate with our agents ($N = 346$). We find that search with generative modeling finds stronger policies during both training time and test time, enables online Bayesian co-player prediction, and can produce agents that achieve comparable social welfare negotiating with humans as humans trading among themselves.
    Average-Constrained Policy Optimization. (arXiv:2302.00808v1 [cs.LG])
    Reinforcement Learning (RL) with constraints is becoming an increasingly important problem for various applications. Often, the average criterion is more suitable. Yet, RL for average criterion-constrained MDPs remains a challenging problem. Algorithms designed for discounted constrained RL problems often do not perform well for the average CMDP setting. In this paper, we introduce a new (possibly the first) policy optimization algorithm for constrained MDPs with the average criterion. The Average-Constrained Policy Optimization (ACPO) algorithm is inspired by the famed PPO-type algorithms based on trust region methods. We develop basic sensitivity theory for average MDPs, and then use the corresponding bounds in the design of the algorithm. We provide theoretical guarantees on its performance, and through extensive experimental work in various challenging MuJoCo environments, show the superior performance of the algorithm when compared to other state-of-the-art algorithms adapted for the average CMDP setting.
    RobustNeRF: Ignoring Distractors with Robust Losses. (arXiv:2302.00833v1 [cs.CV])
    Neural radiance fields (NeRF) excel at synthesizing new views given multi-view, calibrated images of a static scene. When scenes include distractors, which are not persistent during image capture (moving objects, lighting variations, shadows), artifacts appear as view-dependent effects or 'floaters'. To cope with distractors, we advocate a form of robust estimation for NeRF training, modeling distractors in training data as outliers of an optimization problem. Our method successfully removes outliers from a scene and improves upon our baselines, on synthetic and real-world scenes. Our technique is simple to incorporate in modern NeRF frameworks, with few hyper-parameters. It does not assume a priori knowledge of the types of distractors, and is instead focused on the optimization problem rather than pre-processing or modeling transient objects. More results on our page https://robustnerf.github.io/public.
    Scale up with Order: Finding Good Data Permutations for Distributed Training. (arXiv:2302.00845v1 [cs.LG])
    Gradient Balancing (GraB) is a recently proposed technique that finds provably better data permutations when training models with multiple epochs over a finite dataset. It converges at a faster rate than the widely adopted Random Reshuffling, by minimizing the discrepancy of the gradients on adjacently selected examples. However, GraB only operates under critical assumptions such as small batch sizes and centralized data, leaving open the question of how to order examples at large scale -- i.e. distributed learning with decentralized data. To alleviate the limitation, in this paper we propose D-GraB that involves two novel designs: (1) $\textsf{PairBalance}$ that eliminates the requirement to use stale gradient mean in GraB which critically relies on small learning rates; (2) an ordering protocol that runs $\textsf{PairBalance}$ in a distributed environment with negligible overhead, which benefits from both data ordering and parallelism. We prove D-GraB enjoys linear speed up at rate $\tilde{O}((mnT)^{-2/3})$ on smooth non-convex objectives and $\tilde{O}((mnT)^{-2})$ under PL condition, where $n$ denotes the number of parallel workers, $m$ denotes the number of examples per worker and $T$ denotes the number of epochs. Empirically, we show on various applications including GLUE, CIFAR10 and WikiText-2 that D-GraB outperforms naive parallel GraB and Distributed Random Reshuffling in terms of both training and validation performance.
    Quantum Graph Learning: Frontiers and Outlook. (arXiv:2302.00892v1 [cs.LG])
    Quantum theory has shown its superiority in enhancing machine learning. However, facilitating quantum theory to enhance graph learning is in its infancy. This survey investigates the current advances in quantum graph learning (QGL) from three perspectives, i.e., underlying theories, methods, and prospects. We first look at QGL and discuss the mutualism of quantum theory and graph learning, the specificity of graph-structured data, and the bottleneck of graph learning, respectively. A new taxonomy of QGL is presented, i.e., quantum computing on graphs, quantum graph representation, and quantum circuits for graph neural networks. Pitfall traps are then highlighted and explained. This survey aims to provide a brief but insightful introduction to this emerging field, along with a detailed discussion of frontiers and outlook yet to be investigated.
    Teaching MLOps in Higher Education through Project-Based Learning. (arXiv:2302.01048v1 [cs.SE])
    Building and maintaining production-grade ML-enabled components is a complex endeavor that goes beyond the current approach of academic education, focused on the optimization of ML model performance in the lab. In this paper, we present a project-based learning approach to teaching MLOps, focused on the demonstration and experience with emerging practices and tools to automatize the construction of ML-enabled components. We examine the design of a course based on this approach, including laboratory sessions that cover the end-to-end ML component life cycle, from model building to production deployment. Moreover, we report on preliminary results from the first edition of the course. During the present year, an updated version of the same course is being delivered in two independent universities; the related learning outcomes will be evaluated to analyze the effectiveness of project-based learning for this specific subject.
    SimMTM: A Simple Pre-Training Framework for Masked Time-Series Modeling. (arXiv:2302.00861v1 [cs.LG])
    Time series analysis is widely used in extensive areas. Recently, to reduce labeling expenses and benefit various tasks, self-supervised pre-training has attracted immense interest. One mainstream paradigm is masked modeling, which successfully pre-trains deep models by learning to reconstruct the masked content based on the unmasked part. However, since the semantic information of time series is mainly contained in temporal variations, the standard way of randomly masking a portion of time points will ruin vital temporal variations of time series seriously, making the reconstruction task too difficult to guide representation learning. We thus present SimMTM, a Simple pre-training framework for Masked Time-series Modeling. By relating masked modeling to manifold learning, SimMTM proposes to recover masked time points by the weighted aggregation of multiple neighbors outside the manifold, which eases the reconstruction task by assembling ruined but complementary temporal variations from multiple masked series. SimMTM further learns to uncover the local structure of the manifold helpful for masked modeling. Experimentally, SimMTM achieves state-of-the-art fine-tuning performance in two canonical time series analysis tasks: forecasting and classification, covering both in- and cross-domain settings.
    Resilient Binary Neural Network. (arXiv:2302.00956v1 [cs.LG])
    Binary neural networks (BNNs) have received ever-increasing popularity for their great capability of reducing storage burden as well as quickening inference time. However, there is a severe performance drop compared with {real-valued} networks, due to its intrinsic frequent weight oscillation during training. In this paper, we introduce a Resilient Binary Neural Network (ReBNN) to mitigate the frequent oscillation for better BNNs' training. We identify that the weight oscillation mainly stems from the non-parametric scaling factor. To address this issue, we propose to parameterize the scaling factor and introduce a weighted reconstruction loss to build an adaptive training objective. %To the best of our knowledge, it is the first work to solve BNNs based on a dynamically re-weighted loss function. For the first time, we show that the weight oscillation is controlled by the balanced parameter attached to the reconstruction loss, which provides a theoretical foundation to parameterize it in back propagation. Based on this, we learn our ReBNN by {calculating} the {balanced} parameter {based on} its maximum magnitude, which can effectively mitigate the weight oscillation with a resilient training process. Extensive experiments are conducted upon various network models, such as ResNet and Faster-RCNN for computer vision, as well as BERT for natural language processing. The results demonstrate the overwhelming performance of our ReBNN over prior arts. For example, our ReBNN achieves 66.9\% Top-1 accuracy with ResNet-18 backbone on the ImageNet dataset, surpassing existing state-of-the-arts by a significant margin. Our code is open-sourced at https://github.com/SteveTsui/ReBNN.
    Algorithm Design for Online Meta-Learning with Task Boundary Detection. (arXiv:2302.00857v1 [cs.LG])
    Online meta-learning has recently emerged as a marriage between batch meta-learning and online learning, for achieving the capability of quick adaptation on new tasks in a lifelong manner. However, most existing approaches focus on the restrictive setting where the distribution of the online tasks remains fixed with known task boundaries. In this work, we relax these assumptions and propose a novel algorithm for task-agnostic online meta-learning in non-stationary environments. More specifically, we first propose two simple but effective detection mechanisms of task switches and distribution shift based on empirical observations, which serve as a key building block for more elegant online model updates in our algorithm: the task switch detection mechanism allows reusing of the best model available for the current task at hand, and the distribution shift detection mechanism differentiates the meta model update in order to preserve the knowledge for in-distribution tasks and quickly learn the new knowledge for out-of-distribution tasks. In particular, our online meta model updates are based only on the current data, which eliminates the need of storing previous data as required in most existing methods. We further show that a sublinear task-averaged regret can be achieved for our algorithm under mild conditions. Empirical studies on three different benchmarks clearly demonstrate the significant advantage of our algorithm over related baseline approaches.
    Unpaired Multi-Domain Causal Representation Learning. (arXiv:2302.00993v1 [stat.ML])
    The goal of causal representation learning is to find a representation of data that consists of causally related latent variables. We consider a setup where one has access to data from multiple domains that potentially share a causal representation. Crucially, observations in different domains are assumed to be unpaired, that is, we only observe the marginal distribution in each domain but not their joint distribution. In this paper, we give sufficient conditions for identifiability of the joint distribution and the shared causal graph in a linear setup. Identifiability holds if we can uniquely recover the joint distribution and the shared causal representation from the marginal distributions in each domain. We transform our identifiability results into a practical method to recover the shared latent causal graph. Moreover, we study how multiple domains reduce errors in falsely detecting shared causal variables in the finite data setting.
    Deep-Learning Tool for Early Identifying Non-Traumatic Intracranial Hemorrhage Etiology based on CT Scan. (arXiv:2302.00953v1 [eess.IV])
    Background: To develop an artificial intelligence system that can accurately identify acute non-traumatic intracranial hemorrhage (ICH) etiology based on non-contrast CT (NCCT) scans and investigate whether clinicians can benefit from it in a diagnostic setting. Materials and Methods: The deep learning model was developed with 1868 eligible NCCT scans with non-traumatic ICH collected between January 2011 and April 2018. We tested the model on two independent datasets (TT200 and SD 98) collected after April 2018. The model's diagnostic performance was compared with clinicians's performance. We further designed a simulated study to compare the clinicians's performance with and without the deep learning system augmentation. Results: The proposed deep learning system achieved area under the receiver operating curve of 0.986 (95% CI 0.967-1.000) on aneurysms, 0.952 (0.917-0.987) on hypertensive hemorrhage, 0.950 (0.860-1.000) on arteriovenous malformation (AVM), 0.749 (0.586-0.912) on Moyamoya disease (MMD), 0.837 (0.704-0.969) on cavernous malformation (CM), and 0.839 (0.722-0.959) on other causes in TT200 dataset. Given a 90% specificity level, the sensitivities of our model were 97.1% and 90.9% for aneurysm and AVM diagnosis, respectively. The model also shows an impressive generalizability in an independent dataset SD98. The clinicians achieve significant improvements in the sensitivity, specificity, and accuracy of diagnoses of certain hemorrhage etiologies with proposed system augmentation. Conclusions: The proposed deep learning algorithms can be an effective tool for early identification of hemorrhage etiologies based on NCCT scans. It may also provide more information for clinicians for triage and further imaging examination selection.
    Versatile Energy-Based Models for High Energy Physics. (arXiv:2302.00695v1 [cs.LG])
    Energy-based models have the natural advantage of flexibility in the form of the energy function. Recently, energy-based models have achieved great success in modeling high-dimensional data in computer vision and natural language processing. In accordance with these signs of progress, we build a versatile energy-based model for High Energy Physics events at the Large Hadron Collider. This framework builds on a powerful generative model and describes higher-order inter-particle interactions. It suits different encoding architectures and builds on implicit generation. As for applicational aspects, it can serve as a powerful parameterized event generator, a generic anomalous signal detector, and an augmented event classifier.
  • Open

    Algorithm Design for Online Meta-Learning with Task Boundary Detection. (arXiv:2302.00857v1 [cs.LG])
    Online meta-learning has recently emerged as a marriage between batch meta-learning and online learning, for achieving the capability of quick adaptation on new tasks in a lifelong manner. However, most existing approaches focus on the restrictive setting where the distribution of the online tasks remains fixed with known task boundaries. In this work, we relax these assumptions and propose a novel algorithm for task-agnostic online meta-learning in non-stationary environments. More specifically, we first propose two simple but effective detection mechanisms of task switches and distribution shift based on empirical observations, which serve as a key building block for more elegant online model updates in our algorithm: the task switch detection mechanism allows reusing of the best model available for the current task at hand, and the distribution shift detection mechanism differentiates the meta model update in order to preserve the knowledge for in-distribution tasks and quickly learn the new knowledge for out-of-distribution tasks. In particular, our online meta model updates are based only on the current data, which eliminates the need of storing previous data as required in most existing methods. We further show that a sublinear task-averaged regret can be achieved for our algorithm under mild conditions. Empirical studies on three different benchmarks clearly demonstrate the significant advantage of our algorithm over related baseline approaches.
    Large-scale Stochastic Optimization of NDCG Surrogates for Deep Learning with Provable Convergence. (arXiv:2202.12183v5 [cs.LG] UPDATED)
    NDCG, namely Normalized Discounted Cumulative Gain, is a widely used ranking metric in information retrieval and machine learning. However, efficient and provable stochastic methods for maximizing NDCG are still lacking, especially for deep models. In this paper, we propose a principled approach to optimize NDCG and its top-$K$ variant. First, we formulate a novel compositional optimization problem for optimizing the NDCG surrogate, and a novel bilevel compositional optimization problem for optimizing the top-$K$ NDCG surrogate. Then, we develop efficient stochastic algorithms with provable convergence guarantees for the non-convex objectives. Different from existing NDCG optimization methods, the per-iteration complexity of our algorithms scales with the mini-batch size instead of the number of total items. To improve the effectiveness for deep learning, we further propose practical strategies by using initial warm-up and stop gradient operator. Experimental results on multiple datasets demonstrate that our methods outperform prior ranking approaches in terms of NDCG. To the best of our knowledge, this is the first time that stochastic algorithms are proposed to optimize NDCG with a provable convergence guarantee. Our proposed methods are implemented in the LibAUC library at https://libauc.org/.
    What Language Reveals about Perception: Distilling Psychophysical Knowledge from Large Language Models. (arXiv:2302.01308v1 [cs.CL])
    Understanding the extent to which the perceptual world can be recovered from language is a fundamental problem in cognitive science. We reformulate this problem as that of distilling psychophysical information from text and show how this can be done by combining large language models (LLMs) with a classic psychophysical method based on similarity judgments. Specifically, we use the prompt auto-completion functionality of GPT3, a state-of-the-art LLM, to elicit similarity scores between stimuli and then apply multidimensional scaling to uncover their underlying psychological space. We test our approach on six perceptual domains and show that the elicited judgments strongly correlate with human data and successfully recover well-known psychophysical structures such as the color wheel and pitch spiral. We also explore meaningful divergences between LLM and human representations. Our work showcases how combining state-of-the-art machine models with well-known cognitive paradigms can shed new light on fundamental questions in perception and language research.
    Unconstrained Dynamic Regret via Sparse Coding. (arXiv:2301.13349v1 [cs.LG] CROSS LISTED)
    Motivated by time series forecasting, we study Online Linear Optimization (OLO) under the coupling of two problem structures: the domain is unbounded, and the performance of an algorithm is measured by its dynamic regret. Handling either of them requires the regret bound to depend on certain complexity measure of the comparator sequence -- specifically, the comparator norm in unconstrained OLO, and the path length in dynamic regret. In contrast to a recent work (Jacobsen & Cutkosky, 2022) that adapts to the combination of these two complexity measures, we propose an alternative complexity measure by recasting the problem into sparse coding. Adaptivity can be achieved by a simple modular framework, which naturally exploits more intricate prior knowledge of the environment. Along the way, we also present a new gradient adaptive algorithm for static unconstrained OLO, designed using novel continuous time machinery. This could be of independent interest.
    A Machine Learning Approach to Measuring Climate Adaptation. (arXiv:2302.01236v1 [stat.AP])
    I measure adaptation to climate change by comparing elasticities from short-run and long-run changes in damaging weather. I propose a debiased machine learning approach to flexibly measure these elasticities in panel settings. In a simulation exercise, I show that debiased machine learning has considerable benefits relative to standard machine learning or ordinary least squares, particularly in high-dimensional settings. I then measure adaptation to damaging heat exposure in United States corn and soy production. Using rich sets of temperature and precipitation variation, I find evidence that short-run impacts from damaging heat are significantly offset in the long run. I show that this is because the impacts of long-run changes in heat exposure do not follow the same functional form as short-run shocks to heat exposure.
    Interventional and Counterfactual Inference with Diffusion Models. (arXiv:2302.00860v1 [stat.ML])
    We consider the problem of answering observational, interventional, and counterfactual queries in a causally sufficient setting where only observational data and the causal graph are available. Utilizing the recent developments in diffusion models, we introduce diffusion-based causal models (DCM) to learn causal mechanisms, that generate unique latent encodings to allow for direct sampling under interventions as well as abduction for counterfactuals. We utilize DCM to model structural equations, seeing that diffusion models serve as a natural candidate here since they encode each node to a latent representation, a proxy for the exogenous noise, and offer flexible and accurate modeling to provide reliable causal statements and estimates. Our empirical evaluations demonstrate significant improvements over existing state-of-the-art methods for answering causal queries. Our theoretical results provide a methodology for analyzing the counterfactual error for general encoder/decoder models which could be of independent interest.
    Correlated Initialization for Correlated Data. (arXiv:2003.04422v2 [cs.LG] UPDATED)
    Spatial data exhibits the property that nearby points are correlated. This also holds for learnt representations across layers, but not for commonly used weight initialization methods. Our theoretical analysis quantifies the learning behavior of weights of a single spatial filter. It is thus in contrast to a large body of work that discusses statistical properties of weights. It shows that uncorrelated initialization (i) might lead to poor convergence behavior and (ii) training of (some) parameters is likely subject to slow convergence. Empirical analysis shows that these findings for a single spatial filter extend to networks with many spatial filters. The impact of (correlated) initialization depends strongly on learning rates and l2-regularization.
    Analysis of Knowledge Transfer in Kernel Regime. (arXiv:2003.13438v3 [cs.LG] UPDATED)
    Knowledge transfer is shown to be a very successful technique for training neural classifiers: together with the ground truth data, it uses the "privileged information" (PI) obtained by a "teacher" network to train a "student" network. It has been observed that classifiers learn much faster and more reliably via knowledge transfer. However, there has been little or no theoretical analysis of this phenomenon. To bridge this gap, we propose to approach the problem of knowledge transfer by regularizing the fit between the teacher and the student with PI provided by the teacher. Using tools from dynamical systems theory, we show that when the student is an extremely wide two layer network, we can analyze it in the kernel regime and show that it is able to interpolate between PI and the given data. This characterization sheds new light on the relation between the training error and capacity of the student relative to the teacher. Another contribution of the paper is a quantitative statement on the convergence of student network. We prove that the teacher reduces the number of required iterations for a student to learn, and consequently improves the generalization power of the student. We give corresponding experimental analysis that validates the theoretical results and yield additional insights.
    An Exponentially Increasing Step-size for Parameter Estimation in Statistical Models. (arXiv:2205.07999v2 [stat.ML] UPDATED)
    Using gradient descent (GD) with fixed or decaying step-size is a standard practice in unconstrained optimization problems. However, when the loss function is only locally convex, such a step-size schedule artificially slows GD down as it cannot explore the flat curvature of the loss function. To overcome that issue, we propose to exponentially increase the step-size of the GD algorithm. Under homogeneous assumptions on the loss function, we demonstrate that the iterates of the proposed \emph{exponential step size gradient descent} (EGD) algorithm converge linearly to the optimal solution. Leveraging that optimization insight, we then consider using the EGD algorithm for solving parameter estimation under both regular and non-regular statistical models whose loss function becomes locally convex when the sample size goes to infinity. We demonstrate that the EGD iterates reach the final statistical radius within the true parameter after a logarithmic number of iterations, which is in stark contrast to a \emph{polynomial} number of iterations of the GD algorithm in non-regular statistical models. Therefore, the total computational complexity of the EGD algorithm is \emph{optimal} and exponentially cheaper than that of the GD for solving parameter estimation in non-regular statistical models while being comparable to that of the GD in regular statistical settings. To the best of our knowledge, it resolves a long-standing gap between statistical and algorithmic computational complexities of parameter estimation in non-regular statistical models. Finally, we provide targeted applications of the general theory to several classes of statistical models, including generalized linear models with polynomial link functions and location Gaussian mixture models.
    Efficient Privacy-Preserving Stochastic Nonconvex Optimization. (arXiv:1910.13659v3 [cs.LG] UPDATED)
    While many solutions for privacy-preserving convex empirical risk minimization (ERM) have been developed, privacy-preserving nonconvex ERM remains a challenge. We study nonconvex ERM, which takes the form of minimizing a finite-sum of nonconvex loss functions over a training set. We propose a new differentially private stochastic gradient descent algorithm for nonconvex ERM that achieves strong privacy guarantees efficiently, and provide a tight analysis of its privacy and utility guarantees, as well as its gradient complexity. Our algorithm reduces gradient complexity while improves the best previous utility guarantee given by Wang et al. (NeurIPS 2017). Our experiments on benchmark nonconvex ERM problems demonstrate superior performance in terms of both training cost and utility gains compared with previous differentially private methods using the same privacy budgets.
    Hierarchical shrinkage Gaussian processes: applications to computer code emulation and dynamical system recovery. (arXiv:2302.00755v1 [stat.ML])
    In many areas of science and engineering, computer simulations are widely used as proxies for physical experiments, which can be infeasible or unethical. Such simulations can often be computationally expensive, and an emulator can be trained to efficiently predict the desired response surface. A widely-used emulator is the Gaussian process (GP), which provides a flexible framework for efficient prediction and uncertainty quantification. Standard GPs, however, do not capture structured sparsity on the underlying response surface, which is present in many applications, particularly in the physical sciences. We thus propose a new hierarchical shrinkage GP (HierGP), which incorporates such structure via cumulative shrinkage priors within a GP framework. We show that the HierGP implicitly embeds the well-known principles of effect sparsity, heredity and hierarchy for analysis of experiments, which allows our model to identify structured sparse features from the response surface with limited data. We propose efficient posterior sampling algorithms for model training and prediction, and prove desirable consistency properties for the HierGP. Finally, we demonstrate the improved performance of HierGP over existing models, in a suite of numerical experiments and an application to dynamical system recovery.
    On the Efficacy of Differentially Private Few-shot Image Classification. (arXiv:2302.01190v1 [stat.ML])
    There has been significant recent progress in training differentially private (DP) models which achieve accuracy that approaches the best non-private models. These DP models are typically pretrained on large public datasets and then fine-tuned on downstream datasets that are (i) relatively large, and (ii) similar in distribution to the pretraining data. However, in many applications including personalization, it is crucial to perform well in the few-shot setting, as obtaining large amounts of labeled data may be problematic; and on images from a wide variety of domains for use in various specialist settings. To understand under which conditions few-shot DP can be effective, we perform an exhaustive set of experiments that reveals how the accuracy and vulnerability to attack of few-shot DP image classification models are affected as the number of shots per class, privacy level, model architecture, dataset, and subset of learnable parameters in the model vary. We show that to achieve DP accuracy on par with non-private models, the shots per class must be increased as the privacy level increases by as much as 32$\times$ for CIFAR-100 at $\epsilon=1$. We also find that few-shot non-private models are highly susceptible to membership inference attacks. DP provides clear mitigation against the attacks, but a small $\epsilon$ is required to effectively prevent them. Finally, we evaluate DP federated learning systems and establish state-of-the-art performance on the challenging FLAIR federated learning benchmark.
    Learning polytopes with fixed facet directions. (arXiv:2201.03419v4 [math.MG] UPDATED)
    We consider the task of reconstructing polytopes with fixed facet directions from finitely many support function evaluations. We show that for a fixed simplicial normal fan the least-squares estimate is given by a convex quadratic program. We study the geometry of the solution set and give a combinatorial characterization for the uniqueness of the reconstruction in this case. We provide an algorithm that, under mild assumptions, converges to the unknown input shape as the number of noisy support function evaluations increases. We also discuss limitations of our results if the restriction on the normal fan is removed.
    Stochastic Contextual Bandits with Long Horizon Rewards. (arXiv:2302.00814v1 [cs.LG])
    The growing interest in complex decision-making and language modeling problems highlights the importance of sample-efficient learning over very long horizons. This work takes a step in this direction by investigating contextual linear bandits where the current reward depends on at most $s$ prior actions and contexts (not necessarily consecutive), up to a time horizon of $h$. In order to avoid polynomial dependence on $h$, we propose new algorithms that leverage sparsity to discover the dependence pattern and arm parameters jointly. We consider both the data-poor ($T<h$) and data-rich ($T\ge h$) regimes, and derive respective regret upper bounds $\tilde O(d\sqrt{sT} +\min\{ q, T\})$ and $\tilde O(\sqrt{sdT})$, with sparsity $s$, feature dimension $d$, total time horizon $T$, and $q$ that is adaptive to the reward dependence pattern. Complementing upper bounds, we also show that learning over a single trajectory brings inherent challenges: While the dependence pattern and arm parameters form a rank-1 matrix, circulant matrices are not isometric over rank-1 manifolds and sample complexity indeed benefits from the sparse reward dependence structure. Our results necessitate a new analysis to address long-range temporal dependencies across data and avoid polynomial dependence on the reward horizon $h$. Specifically, we utilize connections to the restricted isometry property of circulant matrices formed by dependent sub-Gaussian vectors and establish new guarantees that are also of independent interest.
    The Value of Out-of-Distribution Data. (arXiv:2208.10967v3 [cs.LG] UPDATED)
    We expect the generalization error to improve with more samples from a similar task, and to deteriorate with more samples from an out-of-distribution (OOD) task. In this work, we show a counter-intuitive phenomenon: the generalization error of a task can be a non-monotonic function of the number of OOD samples. As the number of OOD samples increases, the generalization error on the target task improves before deteriorating beyond a threshold. In other words, there is value in training on small amounts of OOD data. We use Fisher's Linear Discriminant on synthetic datasets and deep networks on computer vision benchmarks such as MNIST, CIFAR-10, CINIC-10, PACS and DomainNet to demonstrate and analyze this phenomenon. In the idealistic setting where we know which samples are OOD, we show that these non-monotonic trends can be exploited using an appropriately weighted objective of the target and OOD empirical risk. While its practical utility is limited, this does suggest that if we can detect OOD samples, then there may be ways to benefit from them. When we do not know which samples are OOD, we show how a number of go-to strategies such as data-augmentation, hyper-parameter optimization, and pre-training are not enough to ensure that the target generalization error does not deteriorate with the number of OOD samples in the dataset.
    Sample Complexity of Kernel-Based Q-Learning. (arXiv:2302.00727v1 [cs.LG])
    Modern reinforcement learning (RL) often faces an enormous state-action space. Existing analytical results are typically for settings with a small number of state-actions, or simple models such as linearly modeled Q-functions. To derive statistically efficient RL policies handling large state-action spaces, with more general Q-functions, some recent works have considered nonlinear function approximation using kernel ridge regression. In this work, we derive sample complexities for kernel based Q-learning when a generative model exists. We propose a nonparametric Q-learning algorithm which finds an $\epsilon$-optimal policy in an arbitrarily large scale discounted MDP. The sample complexity of the proposed algorithm is order optimal with respect to $\epsilon$ and the complexity of the kernel (in terms of its information gain). To the best of our knowledge, this is the first result showing a finite sample complexity under such a general model.
    Unsupervised Learning of Sampling Distributions for Particle Filters. (arXiv:2302.01174v1 [eess.SP])
    Accurate estimation of the states of a nonlinear dynamical system is crucial for their design, synthesis, and analysis. Particle filters are estimators constructed by simulating trajectories from a sampling distribution and averaging them based on their importance weight. For particle filters to be computationally tractable, it must be feasible to simulate the trajectories by drawing from the sampling distribution. Simultaneously, these trajectories need to reflect the reality of the nonlinear dynamical system so that the resulting estimators are accurate. Thus, the crux of particle filters lies in designing sampling distributions that are both easy to sample from and lead to accurate estimators. In this work, we propose to learn the sampling distributions. We put forward four methods for learning sampling distributions from observed measurements. Three of the methods are parametric methods in which we learn the mean and covariance matrix of a multivariate Gaussian distribution; each methods exploits a different aspect of the data (generic, time structure, graph structure). The fourth method is a nonparametric alternative in which we directly learn a transform of a uniform random variable. All four methods are trained in an unsupervised manner by maximizing the likelihood that the states may have produced the observed measurements. Our computational experiments demonstrate that learned sampling distributions exhibit better performance than designed, minimum-degeneracy sampling distributions.
    Avoiding Model Estimation in Robust Markov Decision Processes with a Generative Model. (arXiv:2302.01248v1 [stat.ML])
    Robust Markov Decision Processes (MDPs) are getting more attention for learning a robust policy which is less sensitive to environment changes. There are an increasing number of works analyzing sample-efficiency of robust MDPs. However, most works study robust MDPs in a model-based regime, where the transition probability needs to be estimated and requires $\mathcal{O}(|\mathcal{S}|^2|\mathcal{A}|)$ storage in memory. A common way to solve robust MDPs is to formulate them as a distributionally robust optimization (DRO) problem. However, solving a DRO problem is non-trivial, so prior works typically assume a strong oracle to obtain the optimal solution of the DRO problem easily. To remove the need for an oracle, we first transform the original robust MDPs into an alternative form, as the alternative form allows us to use stochastic gradient methods to solve the robust MDPs. Moreover, we prove the alternative form still preserves the role of robustness. With this new formulation, we devise a sample-efficient algorithm to solve the robust MDPs in a model-free regime, from which we benefit lower memory space $\mathcal{O}(|\mathcal{S}||\mathcal{A}|)$ without using the oracle. Finally, we validate our theoretical findings via numerical experiments and show the efficiency to solve the alternative form of robust MDPs.  ( 2 min )
    Fast Online Value-Maximizing Prediction Sets with Conformal Cost Control. (arXiv:2302.00839v1 [cs.LG])
    Many real-world multi-label prediction problems involve set-valued predictions that must satisfy specific requirements dictated by downstream usage. We focus on a typical scenario where such requirements, separately encoding \textit{value} and \textit{cost}, compete with each other. For instance, a hospital might expect a smart diagnosis system to capture as many severe, often co-morbid, diseases as possible (the value), while maintaining strict control over incorrect predictions (the cost). We present a general pipeline, dubbed as FavMac, to maximize the value while controlling the cost in such scenarios. FavMac can be combined with almost any multi-label classifier, affording distribution-free theoretical guarantees on cost control. Moreover, unlike prior works, FavMac can handle real-world large-scale applications via a carefully designed online update mechanism, which is of independent interest. Our methodological and theoretical contributions are supported by experiments on several healthcare tasks and synthetic datasets - FavMac furnishes higher value compared with several variants and baselines while maintaining strict cost control.  ( 2 min )
    Causal Lifting and Link Prediction. (arXiv:2302.01198v1 [cs.LG])
    Current state-of-the-art causal models for link prediction assume an underlying set of inherent node factors -- an innate characteristic defined at the node's birth -- that governs the causal evolution of links in the graph. In some causal tasks, however, link formation is path-dependent, i.e., the outcome of link interventions depends on existing links. For instance, in the customer-product graph of an online retailer, the effect of an 85-inch TV ad (treatment) likely depends on whether the costumer already has an 85-inch TV. Unfortunately, existing causal methods are impractical in these scenarios. The cascading functional dependencies between links (due to path dependence) are either unidentifiable or require an impractical number of control variables. In order to remedy this shortcoming, this work develops the first causal model capable of dealing with path dependencies in link prediction. It introduces the concept of causal lifting, an invariance in causal models that, when satisfied, allows the identification of causal link prediction queries using limited interventional data. On the estimation side, we show how structural pairwise embeddings -- a type of symmetry-based joint representation of node pairs in a graph -- exhibit lower bias and correctly represent the causal structure of the task, as opposed to existing node embedding methods, e.g., GNNs and matrix factorization. Finally, we validate our theoretical findings on four datasets under three different scenarios for causal link prediction tasks: knowledge base completion, covariance matrix estimation and consumer-product recommendations.  ( 2 min )
    Neural Estimation of the Rate-Distortion Function With Applications to Operational Source Coding. (arXiv:2204.01612v2 [cs.IT] UPDATED)
    A fundamental question in designing lossy data compression schemes is how well one can do in comparison with the rate-distortion function, which describes the known theoretical limits of lossy compression. Motivated by the empirical success of deep neural network (DNN) compressors on large, real-world data, we investigate methods to estimate the rate-distortion function on such data, which would allow comparison of DNN compressors with optimality. While one could use the empirical distribution of the data and apply the Blahut-Arimoto algorithm, this approach presents several computational challenges and inaccuracies when the datasets are large and high-dimensional, such as the case of modern image datasets. Instead, we re-formulate the rate-distortion objective, and solve the resulting functional optimization problem using neural networks. We apply the resulting rate-distortion estimator, called NERD, on popular image datasets, and provide evidence that NERD can accurately estimate the rate-distortion function. Using our estimate, we show that the rate-distortion achievable by DNN compressors are within several bits of the rate-distortion function for real-world datasets. Additionally, NERD provides access to the rate-distortion achieving channel, as well as samples from its output marginal. Therefore, using recent results in reverse channel coding, we describe how NERD can be used to construct an operational one-shot lossy compression scheme with guarantees on the achievable rate and distortion. Experimental results demonstrate competitive performance with DNN compressors.  ( 2 min )
    On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias. (arXiv:2205.09072v2 [cs.LG] UPDATED)
    We study the dynamics and implicit bias of gradient flow (GF) on univariate ReLU neural networks with a single hidden layer in a binary classification setting. We show that when the labels are determined by the sign of a target network with $r$ neurons, with high probability over the initialization of the network and the sampling of the dataset, GF converges in direction (suitably defined) to a network achieving perfect training accuracy and having at most $\mathcal{O}(r)$ linear regions, implying a generalization bound. Unlike many other results in the literature, under an additional assumption on the distribution of the data, our result holds even for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.  ( 2 min )
    An Instrumental Variable Approach to Confounded Off-Policy Evaluation. (arXiv:2212.14468v2 [stat.ML] UPDATED)
    Off-policy evaluation (OPE) is a method for estimating the return of a target policy using some pre-collected observational data generated by a potentially different behavior policy. In some cases, there may be unmeasured variables that can confound the action-reward or action-next-state relationships, rendering many existing OPE approaches ineffective. This paper develops an instrumental variable (IV)-based method for consistent OPE in confounded Markov decision processes (MDPs). Similar to single-stage decision making, we show that IV enables us to correctly identify the target policy's value in infinite horizon settings as well. Furthermore, we propose an efficient and robust value estimator and illustrate its effectiveness through extensive simulations and analysis of real data from a world-leading short-video platform.  ( 2 min )
    Sharp Lower Bounds on Interpolation by Deep ReLU Neural Networks at Irregularly Spaced Data. (arXiv:2302.00834v1 [cs.LG])
    We study the interpolation, or memorization, power of deep ReLU neural networks. Specifically, we consider the question of how efficiently, in terms of the number of parameters, deep ReLU networks can interpolate values at $N$ datapoints in the unit ball which are separated by a distance $\delta$. We show that $\Omega(N)$ parameters are required in the regime where $\delta$ is exponentially small in $N$, which gives the sharp result in this regime since $O(N)$ parameters are always sufficient. This also shows that the bit-extraction technique used to prove lower bounds on the VC dimension cannot be applied to irregularly spaced datapoints.  ( 2 min )
    Safe Optimization of an Industrial Refrigeration Process Using an Adaptive and Explorative Framework. (arXiv:2211.13019v2 [math.OC] UPDATED)
    Many industrial applications rely on real-time optimization to improve key performance indicators. In the case of unknown process characteristics, real-time optimization becomes challenging, particularly for the satisfaction of safety constraints. In this paper, we demonstrate the application of an adaptive and explorative real-time optimization framework to an industrial refrigeration process, where we learn the process characteristics through changes in process control targets and through exploration to satisfy safety constraints. We quantify the uncertainty in unknown compressor characteristics of the refrigeration plant by using Gaussian processes and incorporate this uncertainty into the objective function of the real-time optimization problem as a weighted cost term. We adaptively control the weight of this term to drive exploration. The results of our simulation experiments indicate the proposed approach can help to increase the energy efficiency of the considered refrigeration process, closely approximating the performance of a solution that has complete information about the compressor performance characteristics.  ( 2 min )
    FAVOR#: Sharp Attention Kernel Approximations via New Classes of Positive Random Features. (arXiv:2302.00787v1 [cs.LG])
    The problem of efficient approximation of a linear operator induced by the Gaussian or softmax kernel is often addressed using random features (RFs) which yield an unbiased approximation of the operator's result. Such operators emerge in important applications ranging from kernel methods to efficient Transformers. We propose parameterized, positive, non-trigonometric RFs which approximate Gaussian and softmax-kernels. In contrast to traditional RF approximations, parameters of these new methods can be optimized to reduce the variance of the approximation, and the optimum can be expressed in closed form. We show that our methods lead to variance reduction in practice ($e^{10}$-times smaller variance and beyond) and outperform previous methods in a kernel regression task. Using our proposed mechanism, we also present FAVOR#, a method for self-attention approximation in Transformers. We show that FAVOR# outperforms other random feature methods in speech modelling and natural language processing.
    Oracle-Preserving Latent Flows. (arXiv:2302.00806v1 [cs.LG])
    We develop a deep learning methodology for the simultaneous discovery of multiple nontrivial continuous symmetries across an entire labelled dataset. The symmetry transformations and the corresponding generators are modeled with fully connected neural networks trained with a specially constructed loss function ensuring the desired symmetry properties. The two new elements in this work are the use of a reduced-dimensionality latent space and the generalization to transformations invariant with respect to high-dimensional oracles. The method is demonstrated with several examples on the MNIST digit dataset.
    Post-hoc Concept Bottleneck Models. (arXiv:2205.15480v2 [cs.LG] UPDATED)
    Concept Bottleneck Models (CBMs) map the inputs onto a set of interpretable concepts (``the bottleneck'') and use the concepts to make predictions. A concept bottleneck enhances interpretability since it can be investigated to understand what concepts the model "sees" in an input and which of these concepts are deemed important. However, CBMs are restrictive in practice as they require dense concept annotations in the training data to learn the bottleneck. Moreover, CBMs often do not match the accuracy of an unrestricted neural network, reducing the incentive to deploy them in practice. In this work, we address these limitations of CBMs by introducing Post-hoc Concept Bottleneck models (PCBMs). We show that we can turn any neural network into a PCBM without sacrificing model performance while still retaining the interpretability benefits. When concept annotations are not available on the training data, we show that PCBM can transfer concepts from other datasets or from natural language descriptions of concepts via multimodal models. A key benefit of PCBM is that it enables users to quickly debug and update the model to reduce spurious correlations and improve generalization to new distributions. PCBM allows for global model edits, which can be more efficient than previous works on local interventions that fix a specific prediction. Through a model-editing user study, we show that editing PCBMs via concept-level feedback can provide significant performance gains without using data from the target domain or model retraining.  ( 2 min )
    Timewarp: Transferable Acceleration of Molecular Dynamics by Learning Time-Coarsened Dynamics. (arXiv:2302.01170v1 [stat.ML])
    Molecular dynamics (MD) simulation is a widely used technique to simulate molecular systems, most commonly at the all-atom resolution where the equations of motion are integrated with timesteps on the order of femtoseconds ($1\textrm{fs}=10^{-15}\textrm{s}$). MD is often used to compute equilibrium properties, which requires sampling from an equilibrium distribution such as the Boltzmann distribution. However, many important processes, such as binding and folding, occur over timescales of milliseconds or beyond, and cannot be efficiently sampled with conventional MD. Furthermore, new MD simulations need to be performed from scratch for each molecular system studied. We present Timewarp, an enhanced sampling method which uses a normalising flow as a proposal distribution in a Markov chain Monte Carlo method targeting the Boltzmann distribution. The flow is trained offline on MD trajectories and learns to make large steps in time, simulating the molecular dynamics of $10^{5} - 10^{6}\:\textrm{fs}$. Crucially, Timewarp is transferable between molecular systems: once trained, we show that it generalises to unseen small peptides (2-4 amino acids), exploring their metastable states and providing wall-clock acceleration when sampling compared to standard MD. Our method constitutes an important step towards developing general, transferable algorithms for accelerating MD.  ( 2 min )
    Sketched Ridgeless Linear Regression: The Role of Downsampling. (arXiv:2302.01088v1 [math.ST])
    Overparametrization often helps improve the generalization performance. This paper proposes a dual view of overparametrization suggesting that downsampling may also help generalize. Motivated by this dual view, we characterize two out-of-sample prediction risks of the sketched ridgeless least square estimator in the proportional regime $m\asymp n \asymp p$, where $m$ is the sketching size, $n$ the sample size, and $p$ the feature dimensionality. Our results reveal the statistical role of downsampling. Specifically, downsampling does not always hurt the generalization performance, and may actually help improve it in some cases. We identify the optimal sketching sizes that minimize the out-of-sample prediction risks, and find that the optimally sketched estimator has stabler risk curves that eliminates the peaks of those for the full-sample estimator. We then propose a practical procedure to empirically identify the optimal sketching size. Finally, we extend our results to cover central limit theorems and misspecified models. Numerical studies strongly support our theory.  ( 2 min )
    MonoFlow: Rethinking Divergence GANs via the Perspective of Differential Equations. (arXiv:2302.01075v1 [stat.ML])
    The conventional understanding of adversarial training in generative adversarial networks (GANs) is that the discriminator is trained to estimate a divergence, and the generator learns to minimize this divergence. We argue that despite the fact that many variants of GANs were developed following this paradigm, the current theoretical understanding of GANs and their practical algorithms are inconsistent. In this paper, we leverage Wasserstein gradient flows which characterize the evolution of particles in the sample space, to gain theoretical insights and algorithmic inspiration of GANs. We introduce a unified generative modeling framework - MonoFlow: the particle evolution is rescaled via a monotonically increasing mapping of the log density ratio. Under our framework, adversarial training can be viewed as a procedure first obtaining MonoFlow's vector field via training the discriminator and the generator learns to draw the particle flow defined by the corresponding vector field. We also reveal the fundamental difference between variational divergence minimization and adversarial training. This analysis helps us to identify what types of generator loss functions can lead to the successful training of GANs and suggest that GANs may have more loss designs beyond the literature (e.g., non-saturated loss), as long as they realize MonoFlow. Consistent empirical studies are included to validate the effectiveness of our framework.  ( 2 min )
    The Contextual Lasso: Sparse Linear Models via Deep Neural Networks. (arXiv:2302.00878v1 [stat.ML])
    Sparse linear models are a gold standard tool for interpretable machine learning, a field of emerging importance as predictive models permeate decision-making in many domains. Unfortunately, sparse linear models are far less flexible as functions of their input features than black-box models like deep neural networks. With this capability gap in mind, we study a not-uncommon situation where the input features dichotomize into two groups: explanatory features, which we wish to explain the model's predictions, and contextual features, which we wish to determine the model's explanations. This dichotomy leads us to propose the contextual lasso, a new statistical estimator that fits a sparse linear model whose sparsity pattern and coefficients can vary with the contextual features. The fitting process involves learning a nonparametric map, realized via a deep neural network, from contextual feature vector to sparse coefficient vector. To attain sparse coefficients, we train the network with a novel lasso regularizer in the form of a projection layer that maps the network's output onto the space of $\ell_1$-constrained linear models. Extensive experiments on real and synthetic data suggest that the learned models, which remain highly transparent, can be sparser than the regular lasso without sacrificing the predictive power of a standard deep neural network.  ( 2 min )
    High-dimensional variable clustering based on sub-asymptotic maxima of a weakly dependent random process. (arXiv:2302.00934v1 [math.ST])
    We propose a new class of models for variable clustering called Asymptotic Independent block (AI-block) models, which defines population-level clusters based on the independence of the maxima of a multivariate stationary mixing random process among clusters. This class of models is identifiable, meaning that there exists a maximal element with a partial order between partitions, allowing for statistical inference. We also present an algorithm for recovering the clusters of variables without specifying the number of clusters \emph{a priori}. Our work provides some theoritical insights into the consistency of our algorithm, demonstrating that under certain conditions it can effectively identify clusters in the data with a computational complexity that is polynomial in the dimension. This implies that groups can be learned nonparametrically in which block maxima of a dependent process are only sub-asymptotic.  ( 2 min )
    Epistemic Neural Networks. (arXiv:2107.08924v7 [cs.LG] UPDATED)
    Intelligence relies on an agent's knowledge of what it does not know. This capability can be assessed based on the quality of joint predictions of labels across multiple inputs. In principle, ensemble-based approaches produce effective joint predictions, but the computational costs of training large ensembles can become prohibitive. We introduce the epinet: an architecture that can supplement any conventional neural network, including large pretrained models, and can be trained with modest incremental computation to estimate uncertainty. With an epinet, conventional neural networks outperform very large ensembles, consisting of hundreds or more particles, with orders of magnitude less computation. The epinet does not fit the traditional framework of Bayesian neural networks. To accommodate development of approaches beyond BNNs, such as the epinet, we introduce the epistemic neural network (ENN) as an interface for models that produce joint predictions.  ( 2 min )
    "Why did the Model Fail?": Attributing Model Performance Changes to Distribution Shifts. (arXiv:2210.10769v2 [cs.LG] UPDATED)
    Performance of machine learning models may differ between training and deployment for many reasons. For instance, model performance can change between environments due to changes in data quality, observing a different population than the one in training, or changes in the relationship between labels and features. These changes result in distribution shifts across environments. Attributing model performance changes to specific shifts is critical for identifying sources of model failures, and for taking mitigating actions that ensure robust models. In this work, we introduce the problem of attributing performance differences between environments to distribution shifts in the underlying data generating mechanisms. We formulate the problem as a cooperative game where the players are distributions. We define the value of a set of distributions to be the change in model performance when only this set of distributions has changed between environments, and derive an importance weighting method for computing the value of an arbitrary set of distributions. The contribution of each distribution to the total performance change is then quantified as its Shapley value. We demonstrate the correctness and utility of our method on synthetic, semi-synthetic, and real-world case studies, showing its effectiveness in attributing performance changes to a wide range of distribution shifts.  ( 2 min )
    Rare Feature Selection in High Dimensions. (arXiv:1803.06675v3 [stat.ME] UPDATED)
    It is common in modern prediction problems for many predictor variables to be counts of rarely occurring events. This leads to design matrices in which many columns are highly sparse. The challenge posed by such "rare features" has received little attention despite its prevalence in diverse areas, ranging from natural language processing (e.g., rare words) to biology (e.g., rare species). We show, both theoretically and empirically, that not explicitly accounting for the rareness of features can greatly reduce the effectiveness of an analysis. We next propose a framework for aggregating rare features into denser features in a flexible manner that creates better predictors of the response. Our strategy leverages side information in the form of a tree that encodes feature similarity. We apply our method to data from TripAdvisor, in which we predict the numerical rating of a hotel based on the text of the associated review. Our method achieves high accuracy by making effective use of rare words; by contrast, the lasso is unable to identify highly predictive words if they are too rare. A companion R package, called rare, implements our new estimator, using the alternating direction method of multipliers.  ( 2 min )
    Do Kernel and Neural Embeddings Help in Training and Generalization?. (arXiv:1905.05095v3 [cs.LG] UPDATED)
    Recent results on optimization and generalization properties of neural networks showed that in a simple two-layer network, the alignment of the labels to the eigenvectors of the corresponding Gram matrix determines the convergence of the optimization during training. Such analyses also provide upper bounds on the generalization error. We experimentally investigate the implications of these results to deeper networks via embeddings. We regard the layers preceding the final hidden layer as producing different representations of the input data which are then fed to the two-layer model. We show that these representations improve both optimization and generalization. In particular, we investigate three kernel representations when fed to the final hidden layer: the Gaussian kernel and its approximation by random Fourier features, kernels designed to imitate representations produced by neural networks and finally an optimal kernel designed to align the data with target labels. The approximated representations induced by these kernels are fed to the neural network and the optimization and generalization properties of the final model are evaluated and compared.  ( 2 min )
    Optimal Stopping via Randomized Neural Networks. (arXiv:2104.13669v3 [stat.ML] UPDATED)
    This paper presents new machine learning approaches to approximate the solutions of optimal stopping problems. The key idea of these methods is to use neural networks, where the parameters of the hidden layers are generated randomly and only the last layer is trained, in order to approximate the continuation value. Our approaches are applicable to high dimensional problems where the existing approaches become increasingly impractical. In addition, since our approaches can be optimized using simple linear regression, they are easy to implement and theoretical guarantees are provided. Our randomized reinforcement learning approach and randomized recurrent neural network approach outperform the state-of-the-art and other relevant machine learning approaches in Markovian and non-Markovian examples, respectively. In particular, we test our approaches on Black-Scholes, Heston, rough Heston and fractional Brownian motion. Moreover, we show that they can also be used to efficiently compute Greeks of American options.  ( 2 min )
    A Theoretical Justification for Image Inpainting using Denoising Diffusion Probabilistic Models. (arXiv:2302.01217v1 [stat.ML])
    We provide a theoretical justification for sample recovery using diffusion based image inpainting in a linear model setting. While most inpainting algorithms require retraining with each new mask, we prove that diffusion based inpainting generalizes well to unseen masks without retraining. We analyze a recently proposed popular diffusion based inpainting algorithm called RePaint (Lugmayr et al., 2022), and show that it has a bias due to misalignment that hampers sample recovery even in a two-state diffusion process. Motivated by our analysis, we propose a modified RePaint algorithm we call RePaint$^+$ that provably recovers the underlying true sample and enjoys a linear rate of convergence. It achieves this by rectifying the misalignment error present in drift and dispersion of the reverse process. To the best of our knowledge, this is the first linear convergence result for a diffusion based image inpainting algorithm.  ( 2 min )
    Bayesian Optimization of Multiple Objectives with Different Latencies. (arXiv:2302.01310v1 [stat.ML])
    Multi-objective Bayesian optimization aims to find the Pareto front of optimal trade-offs between a set of expensive objectives while collecting as few samples as possible. In some cases, it is possible to evaluate the objectives separately, and a different latency or evaluation cost can be associated with each objective. This presents an opportunity to learn the Pareto front faster by evaluating the cheaper objectives more frequently. We propose a scalarization based knowledge gradient acquisition function which accounts for the different evaluation costs of the objectives. We prove consistency of the algorithm and show empirically that it significantly outperforms a benchmark algorithm which always evaluates both objectives.  ( 2 min )
    Causal Effect Estimation: Recent Advances, Challenges, and Opportunities. (arXiv:2302.00848v1 [cs.LG])
    Causal inference has numerous real-world applications in many domains, such as health care, marketing, political science, and online advertising. Treatment effect estimation, a fundamental problem in causal inference, has been extensively studied in statistics for decades. However, traditional treatment effect estimation methods may not well handle large-scale and high-dimensional heterogeneous data. In recent years, an emerging research direction has attracted increasing attention in the broad artificial intelligence field, which combines the advantages of traditional treatment effect estimation approaches (e.g., propensity score, matching, and reweighing) and advanced machine learning approaches (e.g., representation learning, adversarial learning, and graph neural networks). Although the advanced machine learning approaches have shown extraordinary performance in treatment effect estimation, it also comes with a lot of new topics and new research questions. In view of the latest research efforts in the causal inference field, we provide a comprehensive discussion of challenges and opportunities for the three core components of the treatment effect estimation task, i.e., treatment, covariates, and outcome. In addition, we showcase the promising research directions of this topic from multiple perspectives.  ( 2 min )
    Over-parameterised Shallow Neural Networks with Asymmetrical Node Scaling: Global Convergence Guarantees and Feature Learning. (arXiv:2302.01002v1 [stat.ML])
    We consider the optimisation of large and shallow neural networks via gradient flow, where the output of each hidden node is scaled by some positive parameter. We focus on the case where the node scalings are non-identical, differing from the classical Neural Tangent Kernel (NTK) parameterisation. We prove that, for large neural networks, with high probability, gradient flow converges to a global minimum AND can learn features, unlike in the NTK regime. We also provide experiments on synthetic and real-world datasets illustrating our theoretical results and showing the benefit of such scaling in terms of pruning and transfer learning.  ( 2 min )
    Robust Estimation under the Wasserstein Distance. (arXiv:2302.01237v1 [stat.ML])
    We study the problem of robust distribution estimation under the Wasserstein metric, a popular discrepancy measure between probability distributions rooted in optimal transport (OT) theory. We introduce a new outlier-robust Wasserstein distance $\mathsf{W}_p^\varepsilon$ which allows for $\varepsilon$ outlier mass to be removed from its input distributions, and show that minimum distance estimation under $\mathsf{W}_p^\varepsilon$ achieves minimax optimal robust estimation risk. Our analysis is rooted in several new results for partial OT, including an approximate triangle inequality, which may be of independent interest. To address computational tractability, we derive a dual formulation for $\mathsf{W}_p^\varepsilon$ that adds a simple penalty term to the classic Kantorovich dual objective. As such, $\mathsf{W}_p^\varepsilon$ can be implemented via an elementary modification to standard, duality-based OT solvers. Our results are extended to sliced OT, where distributions are projected onto low-dimensional subspaces, and applications to homogeneity and independence testing are explored. We illustrate the virtues of our framework via applications to generative modeling with contaminated datasets.  ( 2 min )
    Randomized prior wavelet neural operator for uncertainty quantification. (arXiv:2302.01051v1 [stat.ML])
    In this paper, we propose a novel data-driven operator learning framework referred to as the \textit{Randomized Prior Wavelet Neural Operator} (RP-WNO). The proposed RP-WNO is an extension of the recently proposed wavelet neural operator, which boasts excellent generalizing capabilities but cannot estimate the uncertainty associated with its predictions. RP-WNO, unlike the vanilla WNO, comes with inherent uncertainty quantification module and hence, is expected to be extremely useful for scientists and engineers alike. RP-WNO utilizes randomized prior networks, which can account for prior information and is easier to implement for large, complex deep-learning architectures than its Bayesian counterpart. Four examples have been solved to test the proposed framework, and the results produced advocate favorably for the efficacy of the proposed framework.  ( 2 min )
    Unpaired Multi-Domain Causal Representation Learning. (arXiv:2302.00993v1 [stat.ML])
    The goal of causal representation learning is to find a representation of data that consists of causally related latent variables. We consider a setup where one has access to data from multiple domains that potentially share a causal representation. Crucially, observations in different domains are assumed to be unpaired, that is, we only observe the marginal distribution in each domain but not their joint distribution. In this paper, we give sufficient conditions for identifiability of the joint distribution and the shared causal graph in a linear setup. Identifiability holds if we can uniquely recover the joint distribution and the shared causal representation from the marginal distributions in each domain. We transform our identifiability results into a practical method to recover the shared latent causal graph. Moreover, we study how multiple domains reduce errors in falsely detecting shared causal variables in the finite data setting.  ( 2 min )
    The Power of Preconditioning in Overparameterized Low-Rank Matrix Sensing. (arXiv:2302.01186v1 [cs.LG])
    We propose $\textsf{ScaledGD($\lambda$)}$, a preconditioned gradient descent method to tackle the low-rank matrix sensing problem when the true rank is unknown, and when the matrix is possibly ill-conditioned. Using overparametrized factor representations, $\textsf{ScaledGD($\lambda$)}$ starts from a small random initialization, and proceeds by gradient descent with a specific form of damped preconditioning to combat bad curvatures induced by overparameterization and ill-conditioning. At the expense of light computational overhead incurred by preconditioners, $\textsf{ScaledGD($\lambda$)}$ is remarkably robust to ill-conditioning compared to vanilla gradient descent ($\textsf{GD}$) even with overprameterization. Specifically, we show that, under the Gaussian design, $\textsf{ScaledGD($\lambda$)}$ converges to the true low-rank matrix at a constant linear rate after a small number of iterations that scales only logarithmically with respect to the condition number and the problem dimension. This significantly improves over the convergence rate of vanilla $\textsf{GD}$ which suffers from a polynomial dependency on the condition number. Our work provides evidence on the power of preconditioning in accelerating the convergence without hurting generalization in overparameterized learning.  ( 2 min )
    Robust multi-item auction design using statistical learning: Overcoming uncertainty in bidders' types distributions. (arXiv:2302.00941v1 [cs.GT])
    This paper presents a novel mechanism design for multi-item auction settings with uncertain bidders' type distributions. Our proposed approach utilizes nonparametric density estimation to accurately estimate bidders' types from historical bids, and is built upon the Vickrey-Clarke-Groves (VCG) mechanism, ensuring satisfaction of Bayesian incentive compatibility (BIC) and $\delta$-individual rationality (IR). To further enhance the efficiency of our mechanism, we introduce two novel strategies for query reduction: a filtering method that screens potential winners' value regions within the confidence intervals generated by our estimated distribution, and a classification strategy that designates the lower bound of an interval as the estimated type when the length is below a threshold value. Simulation experiments conducted on both small-scale and large-scale data demonstrate that our mechanism consistently outperforms existing methods in terms of revenue maximization and query reduction, particularly in large-scale scenarios. This makes our proposed mechanism a highly desirable and effective option for sellers in the realm of multi-item auctions.  ( 2 min )
    Lower Bounds for Learning in Revealing POMDPs. (arXiv:2302.01333v1 [cs.LG])
    This paper studies the fundamental limits of reinforcement learning (RL) in the challenging \emph{partially observable} setting. While it is well-established that learning in Partially Observable Markov Decision Processes (POMDPs) requires exponentially many samples in the worst case, a surge of recent work shows that polynomial sample complexities are achievable under the \emph{revealing condition} -- A natural condition that requires the observables to reveal some information about the unobserved latent states. However, the fundamental limits for learning in revealing POMDPs are much less understood, with existing lower bounds being rather preliminary and having substantial gaps from the current best upper bounds. We establish strong PAC and regret lower bounds for learning in revealing POMDPs. Our lower bounds scale polynomially in all relevant problem parameters in a multiplicative fashion, and achieve significantly smaller gaps against the current best upper bounds, providing a solid starting point for future studies. In particular, for \emph{multi-step} revealing POMDPs, we show that (1) the latent state-space dependence is at least $\Omega(S^{1.5})$ in the PAC sample complexity, which is notably harder than the $\widetilde{\Theta}(S)$ scaling for fully-observable MDPs; (2) Any polynomial sublinear regret is at least $\Omega(T^{2/3})$, suggesting its fundamental difference from the \emph{single-step} case where $\widetilde{O}(\sqrt{T})$ regret is achievable. Technically, our hard instance construction adapts techniques in \emph{distribution testing}, which is new to the RL literature and may be of independent interest.  ( 2 min )
    Stochastic optimal transport in Banach Spaces for regularized estimation of multivariate quantiles. (arXiv:2302.00982v1 [math.PR])
    We introduce a new stochastic algorithm for solving entropic optimal transport (EOT) between two absolutely continuous probability measures $\mu$ and $\nu$. Our work is motivated by the specific setting of Monge-Kantorovich quantiles where the source measure $\mu$ is either the uniform distribution on the unit hypercube or the spherical uniform distribution. Using the knowledge of the source measure, we propose to parametrize a Kantorovich dual potential by its Fourier coefficients. In this way, each iteration of our stochastic algorithm reduces to two Fourier transforms that enables us to make use of the Fast Fourier Transform (FFT) in order to implement a fast numerical method to solve EOT. We study the almost sure convergence of our stochastic algorithm that takes its values in an infinite-dimensional Banach space. Then, using numerical experiments, we illustrate the performances of our approach on the computation of regularized Monge-Kantorovich quantiles. In particular, we investigate the potential benefits of entropic regularization for the smooth estimation of multivariate quantiles using data sampled from the target measure $\nu$.  ( 2 min )
    Brazilian tailing dam collapse, retrospective precursory monitoring of InSAR data using spectral analysis of time series. (arXiv:2302.00781v1 [stat.ME])
    Slope failures possess destructive power that can cause significant damage to both life and infrastructure. Monitoring slopes prone to instabilities is therefore critical in mitigating the risk posed by their failure. The purpose of slope monitoring is to detect precursory signs of stability issues, such as changes in the rate of displacement with which a slope is deforming. This information can then be used to predict the timing or probability of an imminent failure in order to provide an early warning. In this study, a more objective, statistical-learning algorithm is proposed to detect and characterise the risk of a slope failure, based on spectral analysis of serially correlated displacement time series data. The algorithm is applied to satellite-based interferometric synthetic radar (InSAR) displacement time series data to retrospectively analyse the risk of the 2019 Brumadinho tailings dam collapse in Brazil. Two potential risk milestones are identified and signs of a definitive but emergent risk (27 February 2018 to 26 August 2018) and imminent risk of collapse of the tailings dam (27 June 2018 to 24 December 2018) are detected by the algorithm. Importantly, this precursory indication of risk of failure is detected as early as at least five months prior to the dam collapse on 25 January 2019. The results of this study demonstrate that the combination of spectral methods and second order statistical properties of InSAR displacement time series data can reveal signs of a transition into an unstable deformation regime, and that this algorithm can provide sufficient early warning that could help mitigate catastrophic slope failures.  ( 2 min )
    A Light-weight CNN Model for Efficient Parkinson's Disease Diagnostics. (arXiv:2302.00973v1 [stat.ML])
    In recent years, deep learning methods have achieved great success in various fields due to their strong performance in practical applications. In this paper, we present a light-weight neural network for Parkinson's disease diagnostics, in which a series of hand-drawn data are collected to distinguish Parkinson's disease patients from healthy control subjects. The proposed model consists of a convolution neural network (CNN) cascading to long-short-term memory (LSTM) to adapt the characteristics of collected time-series signals. To make full use of their advantages, a multilayered LSTM model is firstly used to enrich features which are then concatenated with raw data and fed into a shallow one-dimensional (1D) CNN model for efficient classification. Experimental results show that the proposed model achieves a high-quality diagnostic result over multiple evaluation metrics with much fewer parameters and operations, outperforming conventional methods such as support vector machine (SVM), random forest (RF), lightgbm (LGB) and CNN-based methods.  ( 2 min )
    High-precision regressors for particle physics. (arXiv:2302.00753v1 [physics.comp-ph])
    Monte Carlo simulations of physics processes at particle colliders like the Large Hadron Collider at CERN take up a major fraction of the computational budget. For some simulations, a single data point takes seconds, minutes, or even hours to compute from first principles. Since the necessary number of data points per simulation is on the order of $10^9$ - $10^{12}$, machine learning regressors can be used in place of physics simulators to significantly reduce this computational burden. However, this task requires high-precision regressors that can deliver data with relative errors of less than $1\%$ or even $0.1\%$ over the entire domain of the function. In this paper, we develop optimal training strategies and tune various machine learning regressors to satisfy the high-precision requirement. We leverage symmetry arguments from particle physics to optimize the performance of the regressors. Inspired by ResNets, we design a Deep Neural Network with skip connections that outperform fully connected Deep Neural Networks. We find that at lower dimensions, boosted decision trees far outperform neural networks while at higher dimensions neural networks perform significantly better. We show that these regressors can speed up simulations by a factor of $10^3$ - $10^6$ over the first-principles computations currently used in Monte Carlo simulations. Additionally, using symmetry arguments derived from particle physics, we reduce the number of regressors necessary for each simulation by an order of magnitude. Our work can significantly reduce the training and storage burden of Monte Carlo simulations at current and future collider experiments.  ( 2 min )
    Conditional expectation for missing data imputation. (arXiv:2302.00911v1 [stat.ML])
    Missing data is common in datasets retrieved in various areas, such as medicine, sports, and finance. In many cases, to enable proper and reliable analyses of such data, the missing values are often imputed, and it is necessary that the method used has a low root mean square error (RMSE) between the imputed and the true values. In addition, for some critical applications, it is also often a requirement that the logic behind the imputation is explainable, which is especially difficult for complex methods that are for example, based on deep learning. This motivates us to introduce a conditional Distribution based Imputation of Missing Values (DIMV) algorithm. This approach works based on finding the conditional distribution of a feature with missing entries based on the fully observed features. As will be illustrated in the paper, DIMV (i) gives a low RMSE for the imputed values compared to state-of-the-art methods under comparison; (ii) is explainable; (iii) can provide an approximated confidence region for the missing values in a given sample; (iv) works for both small and large scale data; (v) in many scenarios, does not require a huge number of parameters as deep learning approaches and therefore can be used for mobile devices or web browsers; and (vi) is robust to the normally distributed assumption that its theoretical grounds rely on. In addition to DIMV, we also introduce the DPER* algorithm improving the speed of DPER for estimating the mean and covariance matrix from the data, and we confirm the speed-up via experiments.  ( 2 min )
    Pathologies of Predictive Diversity in Deep Ensembles. (arXiv:2302.00704v1 [cs.LG])
    Classical results establish that ensembles of small models benefit when predictive diversity is encouraged, through bagging, boosting, and similar. Here we demonstrate that this intuition does not carry over to ensembles of deep neural networks used for classification, and in fact the opposite can be true. Unlike regression models or small (unconfident) classifiers, predictions from large (confident) neural networks concentrate in vertices of the probability simplex. Thus, decorrelating these points necessarily moves the ensemble prediction away from vertices, harming confidence and moving points across decision boundaries. Through large scale experiments, we demonstrate that diversity-encouraging regularizers hurt the performance of high-capacity deep ensembles used for classification. Even more surprisingly, discouraging predictive diversity can be beneficial. Together this work strongly suggests that the best strategy for deep ensembles is utilizing more accurate, but likely less diverse, component models.  ( 2 min )
    Versatile Energy-Based Models for High Energy Physics. (arXiv:2302.00695v1 [cs.LG])
    Energy-based models have the natural advantage of flexibility in the form of the energy function. Recently, energy-based models have achieved great success in modeling high-dimensional data in computer vision and natural language processing. In accordance with these signs of progress, we build a versatile energy-based model for High Energy Physics events at the Large Hadron Collider. This framework builds on a powerful generative model and describes higher-order inter-particle interactions. It suits different encoding architectures and builds on implicit generation. As for applicational aspects, it can serve as a powerful parameterized event generator, a generic anomalous signal detector, and an augmented event classifier.  ( 2 min )

  • Open

    Is the purpose of gamma in Q-learning just to help the q-values converge?
    This might be a bit of a dumb question; so I understand the concept of the discount factor with calculating the sum of expected rewards: the closer gamma is to 1.0, the more emphasis the agent places on future rewards as opposed to its current reward. Generally though this seems to happen with when a return is calculated for an entire episode. In Q-learning different state action pairs are getting updates to their q-values at each step, so there's no sort of monte-carlo return that's being calculated back through an entire episode. Therefore is the purpose of gamma here just to make sure the Q-values converge in an infinite horizon case? submitted by /u/1cedrake [link] [comments]  ( 42 min )
    Why is PPO classified as a policy-based method?
    Hello, I'm fairly new to RL and I'm trying to understand the concepts. I saw the spinningup's classification of RL algorithms and noticed that PPO is classified as a Policy based method. However, I read that PPO has both actor and critic networks and [I] considered it to be a hybrid method. I was wondering that if PPO trains both policy and value networks, why is it considered to be a value-based method? what is the difference between SAC and PPO that SAC is hybrid and PPO is not? Thanks in advance ​ https://preview.redd.it/r05j39rx69ga1.png?width=987&format=png&auto=webp&s=e230bf62944d9d2572e5fe6fbe204ef5eba250b0 submitted by /u/ahmadreza_hadi [link] [comments]  ( 42 min )
    Question on Q-Learning paper
    I've been reading this paper (Financial Trading as a Game: A Deep Reinforcement Learning Approach) and have been wondering about something they try in there. I'm still quite new to Q Learning so maybe I've yet to fully understand things... However, they propose a scheme where at each time step they can calculate the reward for the step that was taken and also the rewards for the other possible actions at that step. Intuitively it makes a lot of sense to me - we can learn more from each step without having to do more random exploration. But I immediately thought that there are probably quite a few areas of RL where we could benefit from the same thing (outside of financial trading). So my question is, given how intuitively smart this approach seems, why isn't it more broadly adopted already? What did I miss? submitted by /u/jarym [link] [comments]  ( 43 min )
    "Autonomous navigation of stratospheric balloons using reinforcement learning", Bellemare et al 2020 [Repost]
    submitted by /u/goolulusaurs [link] [comments]  ( 41 min )
    Why does this PPO implementation calculate the Advantage only once per rollout?
    I am looking at this PPO implementation, which follows the pseudocode given in Spinning Up. This implementation has been really easy to follow and I understand almost everything. However, I am lost in line 103, where the author computes the normalized advantage before the rollout - A_k = (A_k - A_k.mean()) / (A_k.std() + 1e-10) Moreover, within the rollout loop, the author goes ahead to recalculate the value, but uses the original advantage while computing the surrogate losses - for _ in range(self.n_updates_per_iteration): # ALG STEP 6 & 7 # Calculate V_phi and pi_theta(a_t | s_t) V, curr_log_probs = self.evaluate(batch_obs, batch_acts) # Calculate surrogate losses. surr1 = ratios * A_k surr2 = torch.clamp(ratios, 1 - self.clip, 1 + self.clip) * A_k The author also wrote a medium article about this implementation and wrote the following - ​ https://preview.redd.it/195482c373ga1.png?width=845&format=png&auto=webp&s=0624c10056311ce9d31b98fd4563a8f7acff39f8 ​ But in the rollouts, the author updates V(value) without updating A (the advantage). submitted by /u/Academic-Rent7800 [link] [comments]  ( 44 min )
    Minimax with neural network evaluation function
    Is this a thing? To combine game tree search like minimax (or alpha-beta pruning) with neural networks that model the value function of a state? I think Alpha Go did something similar but with Monre Carlo Search Trees and it also had a policy network. How would I go on about training said neural network? I am thinking, first as a supervised task where the target values are heuristic evaluation functions and then finw tuning with some kind of RL but I don't know what. submitted by /u/SupremeChampionOfDi [link] [comments]  ( 44 min )
  • Open

    The Chinese room argument holds that a digital computer executing a program cannot have a "mind", "understanding", or "consciousness", regardless of how intelligently or human-like the program may make the computer behave
    submitted by /u/insaneintheblain [link] [comments]  ( 41 min )
    What do you think are the hard limitations of AI?
    I saw recently a lot of roadblocks that we thought AI will struggle with (Like making art) have easier crossed even with the narrow AI (ML) we have. I feel a lot of limits we thought ai might have like thinking outside the box, understanding concepts, self-awareness, or lacking a 'soul' are all kinda subjective that can be overcome with the invention of AGI and ASI in the coming decades. Then they will grow behind human comprehension. So are there actual hard limitations (if any) Ais will encounter that are actually very hard or maybe never able to overcome? submitted by /u/uswhole [link] [comments]  ( 41 min )
    AI Dream 126 - AI MindStorm (2/6) - EPIC journey
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Can Apple, Amazon or Google create a same chatbot as ChatGPT
    GhatGPT is an amazing tool no doubt even in the prototype version and they are continuously improving at an incredible rate. But the thing is, all these three giants (Apple, Amazon and Google) have their incredible voice assistances. They do similar job of finding a different requested results on web etc. Apple is already adding new health support features. Google maps might make it more helpful for searches related to geographical regions. And again they probably have more user data from millions of users. What make it so different from them and how difficult it would be for them to create their own version as they have a lot more resources and data. submitted by /u/JaurasiD [link] [comments]  ( 41 min )
    Found another directory for AI tools
    submitted by /u/simplir [link] [comments]  ( 40 min )
    Midjourney AI has a new "/blend" feature! pretty cool
    submitted by /u/arnolds112 [link] [comments]  ( 40 min )
    My best animation so far, everything I've learned is in this one!
    submitted by /u/LincolnOsiris_ [link] [comments]  ( 40 min )
    Amazon is Adding 1000 Robots a Day to Its Workforce
    submitted by /u/Flaky_Preparation_50 [link] [comments]  ( 40 min )
    CMU Researchers Introduce FROMAGe: An AI Model That Efficiently Bootstraps Frozen Large Language Models (LLMs) To Generate Free-Form Text Interleaved With Images
    submitted by /u/ai-lover [link] [comments]  ( 41 min )
    How to reproduce any human voice
    submitted by /u/visimens-technology [link] [comments]  ( 40 min )
    Easy guide for DreamBooth training and prompts quick on your mobile device with iSee app
    submitted by /u/Wonderful_Neat_1549 [link] [comments]  ( 44 min )
    AskReddit: Looking for an open-source text2music or text2audio model for a web-app project. Early stage in AI discovery, any help much appreciated!
    I'm working on an AI project and looking for an open-source text2music or text2audio model that I can incorporate into a website to experiment with. I'm aware of models like MusicLM and VALL-E but those haven't been released as APIs yet. I've also come across AudioGen and Mousai but same issues there. Does anyone happen to have a suggestion for a text2music model that is OSS and fairly accessible to incorporate into a web app? Looking for the best OSS model out there, but also open to the best text2audio that anyone would recommend - just looking to get text2audio working. Any suggestions from prior experience would be hugely helpful. Thanks very much in advance! submitted by /u/dmalikmusic [link] [comments]  ( 41 min )
    3D aware image synthesis with a spherical background — BALLGAN
    submitted by /u/t0ns0fph0t0ns [link] [comments]  ( 42 min )
    ChatGPT’s Explosive Popularity Makes It the Fastest-Growing App in Human History
    submitted by /u/Tao_Dragon [link] [comments]  ( 41 min )
    My course on creating a ChatGPT Chrome Extension for GMail, would love your feedback!
    https://www.udemy.com/course/chatgpt-bot/?couponCode=5-DAYS-FREE Hey everyone, I recently made a course about ChatGPT as a fun passion project. This is for anyone who wants to learn how to create automated workflows (using Chrome extensions) with ChatGPT. Specifically, you will create a ChatGPT bot that automatically answers your emails. It is beginner friendly and includes getting some good practice with JavaScript. I hope you enjoy it and I'm looking forward to your feedback/questions :) submitted by /u/neuromodel [link] [comments]  ( 41 min )
    BoyWithUke AI Animation
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    With apps like Lensa AI leading many to doubt AI's creativity, this TED talk is more relevant than ever
    https://www.youtube.com/watch?v=8TOgN-U0ask&t=1s After the Lensa AI controversy led many people to question whether AI really is creative or is it just "remixing" other artists' copyrighted work used with permission, it has led many to wonder whether AI trained on copyrighted images should be illegal. This talk makes some interesting comparisons which might just mean the answer is no. submitted by /u/BearNo21 [link] [comments]  ( 41 min )
    OpenAI Is Reportedly Launching A ChatGPT App For Android And iOS. Here’s What We Know So Far.
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    1000+ AI tools catalog - any feedback?
    I'm creating https://domore.ai/ - a catalog of 1000+ AI tools. The goal is to provide individuals and organizations with the latest information on AI tools. I'd love to hear any feedback you have for me, so feel free to share your thoughts :) submitted by /u/bart_so [link] [comments]  ( 41 min )
    Rasa entity detection
    I've been trying rasa for some time and got into a problem, the ai detects the intents perfectly but it doesn't pick up the entities. anyone can help me with it? submitted by /u/skychi_ [link] [comments]  ( 40 min )
    OpenAI to Launch ChatGPT Mobile App
    submitted by /u/Mental_Character7367 [link] [comments]  ( 40 min )
    Researchers at Stanford Introduce Parsel: An Artificial Intelligence AI Framework That Enables Automatic Implementation And Validation of Complex Algorithms With Code Large Language Models LLMs
    submitted by /u/madskills42001 [link] [comments]  ( 41 min )
  • Open

    Validation Accuracy Fluctuation
    Hello everyone, I am trying to implement DenseNet from scratch with some improvement for my project. When I fit my model, I am seeing that my validation accuracy fluctuating but my test accuracy is almost 96%. Is this fluctuation sign of the overfitting or how can I comment about this? I would be appreciate for any help. Thank you in advance. ​ ​ https://preview.redd.it/jtybvrt3z8ga1.png?width=778&format=png&auto=webp&s=75ec1a2ab9e2b60854bc92f361e203d825e9ca62 submitted by /u/Hungry-Engineer-5696 [link] [comments]  ( 41 min )
    BoyWithUke AI Animation
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    AI Learns the Numbers
    submitted by /u/keghn [link] [comments]  ( 40 min )
  • Open

    [D] Mixing metadata and text in embedding for KNN search?
    Say I wanted to do a KNN similarity search, using a text embedding from a block of text from a PDF. But I also want to find these documents written at a similar time, and with similar title and author. Would it make sense to prepend a written form of this metadata ahead of the document text, and sending that to the embedding? Like “this document titled ABC was written by XYZ on January 1, 2020”. Or would it be better to create a separate embedding for the metadata, and merge the embeddings afterwards? submitted by /u/DeadPukka [link] [comments]  ( 43 min )
    [Project] ideas NLP
    Looking for ideas to start an NLP project, I'd like to explore something not too mainstream or novel to some extent, any ideas or datasets I should check out? submitted by /u/mems_m [link] [comments]  ( 42 min )
    [N] [R] Google announces Dreamix: a model that generates videos when given a prompt and an input image/video.
    submitted by /u/radi-cho [link] [comments]  ( 44 min )
    [P] NLP Q&A Bot Project Guidance
    I have performed below steps and require guidance to proceed further I have extracted and preprocessed the text from PDFs. Performed NER on the extracted text and created a data frame of entities. Created a function to preprocess the query and identified the entities in the question. Now I need guidance or any reference to perform the below steps. Match the entities from the question with the entities in the PDF text and retrieve the paragraph ? Calculate the similarity score for each paragraph and display the relevant paragraph Generate answer from the identified paragraph ? Please also guide me if the approach followed is correct or not ? submitted by /u/sasi_0212 [link] [comments]  ( 43 min )
    [D] Could you use SVD for supervised learning?
    It seems like Singular Value Decomposition is only used for unsupervised learning when trying to reduce the number of features in a high dimensional dataset, but I was wondering why I don't see any articles or literature on using SVD for supervised learning. I know that using a regularization function like Lasso (L1) can get rid of irrelevant features, but I don't see why SVD wouldn't be helpful too. submitted by /u/TemperatureOk6810 [link] [comments]  ( 43 min )
    [R] Coinductive guide to inductive transformer heads
    submitted by /u/adamnemecek [link] [comments]  ( 42 min )
    [R] Grounding Language Models to Images for Multimodal Generation
    submitted by /u/MysteryInc152 [link] [comments]  ( 42 min )
    [R] 3D aware image synthesis with a spherical background — BALLGAN
    submitted by /u/t0ns0fph0t0ns [link] [comments]  ( 43 min )
    please help a bunch of students?(with pre annotated data set) we were assigned to this task with no prior knowledge of ML i don't know where to begin with we tried a couple of method which ultimately failed id be thankful for anyone who would tell me in steps what to do with this data[D]
    submitted by /u/errorr_unknown [link] [comments]  ( 42 min )
    [P] What tools are available for labelling data for LayoutLMv3?
    I have been working on information extraction from documents, but what I got to know is there are not enough free tools available for labelling data for these kind of tasks. Are there any free tools available for labelling data for LayoutLM models? submitted by /u/TensorDudee [link] [comments]  ( 42 min )
    15 years old and bad at math [D]
    Hi, I've done some programming type things before and I'm interested in learning about ML, to be able to make some basic projects with ML, how good does my math need to be. I get As at school in math but I know that what I'm learning now is pretty basic. I'm just wondering whether I should try learn about ML or wait a few more years for my math skills to improve submitted by /u/Daniel_C_____ [link] [comments]  ( 45 min )
    [N] GitHub CEO on why open source developers should be exempt from the EU’s AI Act
    submitted by /u/EmbarrassedHelp [link] [comments]  ( 43 min )
    [R] Bilingual (or Multilingual) Large Language models are the key to human parity on machine translations even for difficult language pairs and domains (e.g literature). An English-Chinese comparison.
    submitted by /u/MysteryInc152 [link] [comments]  ( 42 min )
    [R] Chinchilla data-optimal scaling laws: In plain English
    submitted by /u/adt [link] [comments]  ( 42 min )
    [D] Purchasing Google Colab Pro
    Hi everyone, I'm currently knees-deep in a ML project with a friend (~4 months of development) and my free compute units on Colab finally ran out. After searching for alternatives, and finding none that work as smoothly as Colab, we've considered to buy a Pro subscription. My question is: How can I share the compute units I'll get from Colab Pro with said friend? Don't want to make the purchase and later realize that I'm the only person with access to those compute units. submitted by /u/RaphDaPingu [link] [comments]  ( 43 min )
    Information Retrieval book recommendations? [D]
    Maybe not a Machine Learning question, but I'm searching for good books about information retrieval. The two primary ones I can find are: - Introduction to Information Retrieval (2008) - Information Retrieval - Implementing and Evaluating Search Engines (2016) ​ They seem a bit old for 2023, but they may still be useful? Do you have any good book recommendations? submitted by /u/Ggronne [link] [comments]  ( 44 min )
    [R] What’s your suggestion for offline RL?
    Hi guys! I read a lot of offline RL papers in last Fall semester and choose it as my course project. Offline RL seems to be a very hot topic in recent years, I believe that the major challenge for offline RL are (i) distribution shift and (ii) overestimation. The second challenge is caused by (i), because the learners/agents will never allow to interact with the true environment and they will too optimistic for unseen state-actions. Hence, there are many papers to address such challenges, e.g., CQL and MOPO. However can these methods handle misleading datasets? Consider the following example. Suppose we have only one state (MAB) and two arms. The reward of the first arm will return 2/3 with probability 1 and the reward model of second arm is Bernoulli distribution with p=1/2. Clearly, choosing the first arm is the best choice. Now, for the dataset, unfortunately, all samples on the second arm received reward 1. Because the agent only can access this misleading dataset, if we use Bayesian methods, then the posterior will give a high score for the second arm. If we use Lower Confidence Bound, we need to count the occurrence of each arm. Then, this is very hard to extend this method to MDPs with arbitrary large state and action space. So, does anyone know a function can capture this uncertainty (caused by the dataset) or can any methods to tell the learner that you’re in a very misleading situation? submitted by /u/AndyMeowMeow [link] [comments]  ( 44 min )

  • Open

    [N] FT: Google invests $300mn in artificial intelligence start-up Anthropic
    From the Financial Times: https://www.ft.com/content/583ead66-467c-4bd5-84d0-ed5df7b5bf9c Unpaywalled: https://archive.is/ciZPV I guess I'm a little surprised, this feels like Google backing a competitor to 1) their own Google Brain teams, and 2) Deepmind. The cynical take might be that they're trying to lock in Anthropic; the same way Microsoft locked in OpenAI. submitted by /u/bikeskata [link] [comments]  ( 47 min )
    [R] Topologically evolving new self-modifying multi-task learning algorithms
    I’ve been developing this idea since I first thought of it in mid December last year. Here’s the elevator pitch (skip to how for technical details): Why? Existing models and learning algorithms are extremely static and unable to generalize across tasks as well as humans or to adapt well to new / changing business requirements. This even applies to the final solutions in recent AutoML (see An Empirical Review of Automated Machine Learning, AutoML: A survey of the state-of-the-art). Beyond being static, most suffer from a need for high-performance systems with large amounts of compute and/or memory. This static and bloated nature not only limits the reusability of code, pipelines and all the computations that went into previous versions of a model architecture upon finding a better one. It…  ( 47 min )
    [R] Multimodal Chain-of-Thought Reasoning in Language Models - Amazon Web Services Zhuosheng Zhang et al - Outperforms GPT-3.5 by 16% (75%->91%) and surpasses human performance on ScienceQA while having less than 1B params!
    Paper: https://arxiv.org/abs/2302.00923 Github: https://github.com/amazon-science/mm-cot Twitter: https://paperswithcode.com/top-social Abstract: Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies are mostly isolated in the language modality with LLMs, where LLMs are hard to deploy. To elicit CoT reasoning in multimodality, a possible solution is to fine-tune small language models by fusing the vision and language features to perform CoT reasoning. The key challenge is that those language models tend to generate hallucinated reasoning chains that mislead the answer inference. To mitigate the effect of such mistakes, we propose Multimodal-CoT that incorporates vision features in a decoupled training framework. The framework separates the rationale generation and answer inference into two stages. By incorporating the vision features in both stages, the model is able to generate effective rationales that contribute to answer inference. With Multimodal-CoT, our model under 1 billion parameters outperforms the previous state-of-the-art LLM (GPT-3.5) by 16% (75.17%->91.68%) on the ScienceQA benchmark and even surpasses human performance. https://preview.redd.it/g9eo0f94k1ga1.jpg?width=1331&format=pjpg&auto=webp&s=9b5fc84b424aff7160b69ff7c7a5fad071cbb7d2 https://preview.redd.it/fgboci94k1ga1.jpg?width=1323&format=pjpg&auto=webp&s=35215544d9e0a74881c42503d04b62ab09081af1 https://preview.redd.it/2ojfym94k1ga1.jpg?width=1660&format=pjpg&auto=webp&s=cf040c4f422f6c323e8c4d75474a5881f45a41d1 https://preview.redd.it/k7huem94k1ga1.jpg?width=1326&format=pjpg&auto=webp&s=f4326a5088744d3856e5c5c23311be6348fab924 https://preview.redd.it/05m8rf94k1ga1.jpg?width=658&format=pjpg&auto=webp&s=ac4110e57a49fcea6f8c03571edd391ff71bd13d submitted by /u/Singularian2501 [link] [comments]  ( 47 min )
    [P] I trained an AI model on 120M+ songs from iTunes
    Hey ML Reddit! I just shipped a project I’ve been working on called Maroofy: https://maroofy.com You can search for any song, and it’ll use the song’s audio to find other similar-sounding music. Demo: https://twitter.com/subby_tech/status/1621293770779287554 How does it work? I’ve indexed ~120M+ songs from the iTunes catalog with a custom AI audio model that I built for understanding music. My model analyzes raw music audio as input and produces embedding vectors as output. I then store the embedding vectors for all songs into a vector database, and use semantic search to find similar music! Here are some examples you can try: Fetish (Selena Gomez feat. Gucci Mane) — https://maroofy.com/songs/1563859943 The Medallion Calls (Pirates of the Caribbean) — https://maroofy.com/songs/1440649752 Hope you like it! This is an early work in progress, so would love to hear any questions/feedback/comments! :D submitted by /u/BullyMaguireJr [link] [comments]  ( 52 min )
    [N] Google Open Sources Vizier, Hyperparameter + Blackbox Optimization Service at Scale
    Github: https://github.com/google/vizier Google AI Blog: https://ai.googleblog.com/2023/02/open-source-vizier-towards-reliable-and.html Tweet from Zoubin Ghahramani: https://twitter.com/ZoubinGhahrama1/status/1621321675936768000?s=20&t=ZEuz9oSc_GWYxixtXDskqA submitted by /u/enderlayer [link] [comments]  ( 43 min )
    [D] Understanding Vision Transformer (ViT) - What are the prerequisites?
    Hello everyone, I'm interested in diving into the field of computer vision and I recently came across the concept of Vision Transformer (ViT). I want to understand this concept in depth but I'm not sure what prerequisites I need to have in order to grasp the concept fully. Do I need to have a strong background in Recurrent Neural Networks (RNNs) and Transformer (Attention Is All You Need) to understand ViT, or can I get by just knowing the basics of deep learning and Convolutional Neural Networks (CNNs)? I would really appreciate if someone could shed some light on this and provide some guidance. Thank you in advance! submitted by /u/SAbdusSamad [link] [comments]  ( 7 min )
  • Open

    Created an AI research assistant where you can ask questions about any file (i.e. technical paper, report, etc) in English and automatically get the answer. It's like ChatGPT for your files.
    submitted by /u/HamletsLastLine [link] [comments]  ( 46 min )
    Ilya Sutskever says 40 papers explain 90% of modern AI
    In this article (https://dallasinnovates.com/exclusive-qa-john-carmacks-different-path-to-artificial-general-intelligence/) there is a quote from John Carmack that read: "I asked Ilya Sutskever, OpenAI’s chief scientist, for a reading list. He gave me a list of like 40 research papers and said, ‘If you really learn all of these, you’ll know 90% of what matters today. " My question is, what are these 40 papers? submitted by /u/Gryphx [link] [comments]  ( 42 min )
    Chat with your favorite characters from movies, TV shows, books, history, and more.
    ​ sample chat with my annoyed neighbor I built ChatFAI about a month ago. It's a simple web app that allows you to interact with your favorite characters from movies, TV shows, books, history, and beyond. People are having fun talking to whomever they want to talk to. There is a public characters library and you can also create custom characters based on anyone (or even your imagination). I have been actively improving it and have made it much better recently. So, I wanted to share it here to get feedback. The reason for sharing it here is I want feedback from you all. Let me know if there is anything else I should add or change. Here it is: https://chatfai.com submitted by /u/usamaejazch [link] [comments]  ( 42 min )
  • Open

    Augmented Lagrangian method for constrained MDP or constrained RL?
    Is there any work on applying Augmented Lagrangian method to constrained MDP problems that guarantee the constraint satisfaction as iterations goes? I tried to find but haven't got much result yet. Thanks for sharing any hints! submitted by /u/Sad-Dragonfruit-274 [link] [comments]  ( 41 min )
    reward function
    Hi, my agent is not working well, I feel like my reward function is not efficient. I'm trying to solve a control problem using the reinforcemnt learning, so my reward function was made with the state cost function. for example, reward = previous cost - current cost. so if the agent gets closer to the destination or makes any better control, it receives a positive reward. Otherwise, it will get a negative reward. but I don't think this is efficient.. Can anyone give me advice? Thanks all ​ https://preview.redd.it/rzf78ykzs0ga1.png?width=559&format=png&auto=webp&s=3f4cadb96f16eec59d221ab2b7f50f6839cf8ab6 submitted by /u/sonlightinn [link] [comments]  ( 42 min )
    Does anyone know of any model-based algorithms that deal with imperfect information and stochasticity and don't require a simulator?
    submitted by /u/atomicburn125 [link] [comments]  ( 42 min )
    Why does Advantage Learning help function approximators?
    Can someone please help with this question - https://ai.stackexchange.com/questions/39029/why-does-advantage-learning-help-function-approximators submitted by /u/Academic-Rent7800 [link] [comments]  ( 43 min )
  • Open

    Real-time tracking of wildfire boundaries using satellite imagery
    Posted by Zvika Ben-Haim and Omer Nevo, Software Engineers, Google Research As global temperatures rise, wildfires around the world are becoming more frequent and more dangerous. Their effects are felt by many communities as people evacuate their homes or suffer harm even from proximity to the fire and smoke. As part of Google’s mission to help people access trusted information in critical moments, we use satellite imagery and machine learning (ML) to track wildfires and inform affected communities. Our wildfire tracker was recently expanded. It provides updated fire boundary information every 10–15 minutes, is more accurate than similar satellite products, and improves on our previous work. These boundaries are shown for large fires in the continental US, Mexico, and most of Cana…  ( 92 min )
  • Open

    IoT Project: Why Is .NET The Best Choice?
    As the Internet of Things (IoT) continues to gain more traction at a rapid pace, there is growing demand and need for the development of apps driven by this technology. However, this leaves businesses with a challenging question: which development tool to use for creating such apps? The simple answer is .NET. It is a… Read More »IoT Project: Why Is .NET The Best Choice? The post IoT Project: Why Is .NET The Best Choice? appeared first on Data Science Central.  ( 19 min )
  • Open

    Private Online Prediction from Experts: Separations and Faster Rates. (arXiv:2210.13537v2 [cs.LG] UPDATED)
    Online prediction from experts is a fundamental problem in machine learning and several works have studied this problem under privacy constraints. We propose and analyze new algorithms for this problem that improve over the regret bounds of the best existing algorithms for non-adaptive adversaries. For approximate differential privacy, our algorithms achieve regret bounds of $\tilde{O}(\sqrt{T \log d} + \log d/\varepsilon)$ for the stochastic setting and $\tilde O(\sqrt{T \log d} + T^{1/3} \log d/\varepsilon)$ for oblivious adversaries (where $d$ is the number of experts). For pure DP, our algorithms are the first to obtain sub-linear regret for oblivious adversaries in the high-dimensional regime $d \ge T$. Moreover, we prove new lower bounds for adaptive adversaries. Our results imply that unlike the non-private setting, there is a strong separation between the optimal regret for adaptive and non-adaptive adversaries for this problem. Our lower bounds also show a separation between pure and approximate differential privacy for adaptive adversaries where the latter is necessary to achieve the non-private $O(\sqrt{T})$ regret.  ( 2 min )
    Additive Higher-Order Factorization Machines. (arXiv:2205.14515v2 [stat.CO] UPDATED)
    In the age of big data and interpretable machine learning, approaches need to work at scale and at the same time allow for a clear mathematical understanding of the method's inner workings. While there exist inherently interpretable semi-parametric regression techniques for large-scale applications to account for non-linearity in the data, their model complexity is still often restricted. One of the main limitations are missing interactions in these models, which are not included for the sake of better interpretability, but also due to untenable computational costs. To address this shortcoming, we derive a scalable high-order tensor product spline model using a factorization approach. Our method allows to include all (higher-order) interactions of non-linear feature effects while having computational costs proportional to a model without interactions. We prove both theoretically and empirically that our methods scales notably better than existing approaches, derive meaningful penalization schemes and also discuss further theoretical aspects. We finally investigate predictive and estimation performance both with synthetic and real data.  ( 2 min )
    Real Estate Property Valuation using Self-Supervised Vision Transformers. (arXiv:2302.00117v1 [cs.CV])
    The use of Artificial Intelligence (AI) in the real estate market has been growing in recent years. In this paper, we propose a new method for property valuation that utilizes self-supervised vision transformers, a recent breakthrough in computer vision and deep learning. Our proposed algorithm uses a combination of machine learning, computer vision and hedonic pricing models trained on real estate data to estimate the value of a given property. We collected and pre-processed a data set of real estate properties in the city of Boulder, Colorado and used it to train, validate and test our algorithm. Our data set consisted of qualitative images (including house interiors, exteriors, and street views) as well as quantitative features such as the number of bedrooms, bathrooms, square footage, lot square footage, property age, crime rates, and proximity to amenities. We evaluated the performance of our model using metrics such as Root Mean Squared Error (RMSE). Our findings indicate that these techniques are able to accurately predict the value of properties, with a low RMSE. The proposed algorithm outperforms traditional appraisal methods that do not leverage property images and has the potential to be used in real-world applications.  ( 2 min )
    Examining Policy Entropy of Reinforcement Learning Agents for Personalization Tasks. (arXiv:2211.11869v2 [cs.LG] UPDATED)
    This effort is focused on examining the behavior of reinforcement learning systems in personalization environments and detailing the differences in policy entropy associated with the type of learning algorithm utilized. We demonstrate that Policy Optimization agents often possess low-entropy policies during training, which in practice results in agents prioritizing certain actions and avoiding others. Conversely, we also show that Q-Learning agents are far less susceptible to such behavior and generally maintain high-entropy policies throughout training, which is often preferable in real-world applications. We provide a wide range of numerical experiments as well as theoretical justification to show that these differences in entropy are due to the type of learning being employed.  ( 2 min )
    Expanding the Deployment Envelope of Behavior Prediction via Adaptive Meta-Learning. (arXiv:2209.11820v3 [cs.LG] UPDATED)
    Learning-based behavior prediction methods are increasingly being deployed in real-world autonomous systems, e.g., in fleets of self-driving vehicles, which are beginning to commercially operate in major cities across the world. Despite their advancements, however, the vast majority of prediction systems are specialized to a set of well-explored geographic regions or operational design domains, complicating deployment to additional cities, countries, or continents. Towards this end, we present a novel method for efficiently adapting behavior prediction models to new environments. Our approach leverages recent advances in meta-learning, specifically Bayesian regression, to augment existing behavior prediction models with an adaptive layer that enables efficient domain transfer via offline fine-tuning, online adaptation, or both. Experiments across multiple real-world datasets demonstrate that our method can efficiently adapt to a variety of unseen environments.  ( 2 min )
    Sliced Optimal Partial Transport. (arXiv:2212.08049v4 [cs.LG] UPDATED)
    Optimal transport (OT) has become exceedingly popular in machine learning, data science, and computer vision. The core assumption in the OT problem is the equal total amount of mass in source and target measures, which limits its application. Optimal Partial Transport (OPT) is a recently proposed solution to this limitation. Similar to the OT problem, the computation of OPT relies on solving a linear programming problem (often in high dimensions), which can become computationally prohibitive. In this paper, we propose an efficient algorithm for calculating the OPT problem between two non-negative measures in one dimension. Next, following the idea of sliced OT distances, we utilize slicing to define the sliced OPT distance. Finally, we demonstrate the computational and accuracy benefits of the sliced OPT-based method in various numerical experiments. In particular, we show an application of our proposed Sliced-OPT in noisy point cloud registration.  ( 2 min )
    PRUDEX-Compass: Towards Systematic Evaluation of Reinforcement Learning in Financial Markets. (arXiv:2302.00586v1 [q-fin.TR])
    The financial markets, which involve more than $90 trillion in market capitalization, attract the attention of innumerable investors around the world. Recently, reinforcement learning in financial markets (FinRL) emerges as a promising direction to train agents for making profitable investment decisions. However, the evaluation of most FinRL methods only focus on profit-related measures, which are far from satisfactory for practitioners to deploy these methods into real-world financial markets. Therefore, we introduce PRUDEX-Compass, which has 6 axes, i.e., Profitability, Risk-control, Universality, Diversity, rEliability, and eXplainability, with a total of 17 measures for a systematic evaluation. Specifically, i) we propose AlphaMix+ as a strong FinRL baseline, which leverages Mixture-of-Experts (MoE) and risk-10 sensitive approaches to make diversified risk-aware investment decisions, ii) we11 evaluate 8 widely used FinRL methods in 4 long-term real-world datasets of influential financial markets to demonstrate the usage of our PRUDEX-Compass, iii) PRUDEX-Compass1 together with 4 real-world datasets, standard implementation of 8 FinRL methods and a portfolio management RL environment is released as public resources to facilitate the design and comparison of new FinRL methods. We hope that PRUDEX-Compass can shed light on future FinRL research to prevent untrustworthy results from stagnating FinRL into successful industry deployment.  ( 2 min )
    Statistical Inference After Adaptive Sampling for Longitudinal Data. (arXiv:2202.07098v2 [cs.LG] UPDATED)
    Online reinforcement learning and other adaptive sampling algorithms are increasingly used in digital intervention experiments to optimize treatment delivery for users over time. In this work, we focus on longitudinal user data collected by a large class of adaptive sampling algorithms that are designed to optimize treatment decisions online using accruing data from multiple users. Combining or "pooling" data across users allows adaptive sampling algorithms to potentially learn faster. However, by pooling, these algorithms induce dependence between the collected user data trajectories; we show that this can cause standard variance estimators for i.i.d. data to underestimate the true variance of common estimators on this data type. We develop novel methods to perform a variety of statistical analyses on such adaptively collected data via Z-estimation. Specifically, we introduce the adaptive sandwich variance estimator, a corrected sandwich estimator that leads to consistent variance estimates under adaptive sampling. Additionally, to prove our results we develop significant theory for empirical processes on non-i.i.d., adaptively collected, longitudinal data. This work is motivated by our efforts in designing experiments in which online reinforcement learning algorithms pool data across users to learn to optimize treatment decisions, yet reliable statistical inference is essential for conducting a variety of statistical analyses after the experiment is over.  ( 2 min )
    Thermal Heating in ReRAM Crossbar Arrays: Challenges and Solutions. (arXiv:2212.13707v2 [cs.AR] UPDATED)
    The higher speed, scalability and parallelism offered by ReRAM crossbar arrays foster development of ReRAM-based next generation AI accelerators. At the same time, sensitivity of ReRAM to temperature variations decreases R_on/Roff ratio and negatively affects the achieved accuracy and reliability of the hardware. Various works on temperature-aware optimization and remapping in ReRAM crossbar arrays reported up to 58\% improvement in accuracy and 2.39$\times$ ReRAM lifetime enhancement. This paper classifies the challenges caused by thermal heat, starting from constraints in ReRAM cells' dimensions and characteristics to their placement in the architecture. In addition, it reviews available solutions designed to mitigate the impact of these challenges, including emerging temperature-resilient DNN training methods. Our work also provides a summary of the techniques and their advantages and limitations.  ( 2 min )
    Differentially-Private Hierarchical Clustering with Provable Approximation Guarantees. (arXiv:2302.00037v1 [cs.LG])
    Hierarchical Clustering is a popular unsupervised machine learning method with decades of history and numerous applications. We initiate the study of differentially private approximation algorithms for hierarchical clustering under the rigorous framework introduced by (Dasgupta, 2016). We show strong lower bounds for the problem: that any $\epsilon$-DP algorithm must exhibit $O(|V|^2/ \epsilon)$-additive error for an input dataset $V$. Then, we exhibit a polynomial-time approximation algorithm with $O(|V|^{2.5}/ \epsilon)$-additive error, and an exponential-time algorithm that meets the lower bound. To overcome the lower bound, we focus on the stochastic block model, a popular model of graphs, and, with a separation assumption on the blocks, propose a private $1+o(1)$ approximation algorithm which also recovers the blocks exactly. Finally, we perform an empirical study of our algorithms and validate their performance.  ( 2 min )
    ImpressLearn: Continual Learning via Combined Task Impressions. (arXiv:2210.01987v2 [cs.CV] UPDATED)
    This work proposes a new method to sequentially train deep neural networks on multiple tasks without suffering catastrophic forgetting, while endowing it with the capability to quickly adapt to unseen tasks. Starting from existing work on network masking (Wortsman et al., 2020), we show that simply learning a linear combination of a small number of task-specific supermasks (impressions) on a randomly initialized backbone network is sufficient to both retain accuracy on previously learned tasks, as well as achieve high accuracy on unseen tasks. In contrast to previous methods, we do not require to generate dedicated masks or contexts for each new task, instead leveraging transfer learning to keep per-task parameter overhead small. Our work illustrates the power of linearly combining individual impressions, each of which fares poorly in isolation, to achieve performance comparable to a dedicated mask. Moreover, even repeated impressions from the same task (homogeneous masks), when combined, can approach the performance of heterogeneous combinations if sufficiently many impressions are used. Our approach scales more efficiently than existing methods, often requiring orders of magnitude fewer parameters and can function without modification even when task identity is missing. In addition, in the setting where task labels are not given at inference, our algorithm gives an often favorable alternative to the one-shot procedure used by Wortsman et al., 2020. We evaluate our method on a number of well-known image classification datasets and network architectures.  ( 2 min )
    Improving Few-Shot Generalization by Exploring and Exploiting Auxiliary Data. (arXiv:2302.00674v1 [cs.LG])
    Few-shot learning involves learning an effective model from only a few labeled datapoints. The use of a small training set makes it difficult to avoid overfitting but also makes few-shot learning applicable to many important real-world settings. In this work, we focus on Few-shot Learning with Auxiliary Data (FLAD), a training paradigm that assumes access to auxiliary data during few-shot learning in hopes of improving generalization. Introducing auxiliary data during few-shot learning leads to essential design choices where hand-designed heuristics can lead to sub-optimal performance. In this work, we focus on automated sampling strategies for FLAD and relate them to the explore-exploit dilemma that is central in multi-armed bandit settings. Based on this connection we propose two algorithms -- EXP3-FLAD and UCB1-FLAD -- and compare them with methods that either explore or exploit, finding that the combination of exploration and exploitation is crucial. Using our proposed algorithms to train T5 yields a 9% absolute improvement over the explicitly multi-task pre-trained T0 model across 11 datasets.
    Learning Equilibria in Matching Markets from Bandit Feedback. (arXiv:2108.08843v2 [cs.LG] UPDATED)
    Large-scale, two-sided matching platforms must find market outcomes that align with user preferences while simultaneously learning these preferences from data. Classical notions of stability (Gale and Shapley, 1962; Shapley and Shubik, 1971) are unfortunately of limited value in the learning setting, given that preferences are inherently uncertain and destabilizing while they are being learned. To bridge this gap, we develop a framework and algorithms for learning stable market outcomes under uncertainty. Our primary setting is matching with transferable utilities, where the platform both matches agents and sets monetary transfers between them. We design an incentive-aware learning objective that captures the distance of a market outcome from equilibrium. Using this objective, we analyze the complexity of learning as a function of preference structure, casting learning as a stochastic multi-armed bandit problem. Algorithmically, we show that "optimism in the face of uncertainty," the principle underlying many bandit algorithms, applies to a primal-dual formulation of matching with transfers and leads to near-optimal regret bounds. Our work takes a first step toward elucidating when and how stable matchings arise in large, data-driven marketplaces.
    Incorporating Sum Constraints into Multitask Gaussian Processes. (arXiv:2202.01793v3 [stat.ML] UPDATED)
    Machine learning models can be improved by adapting them to respect existing background knowledge. In this paper we consider multitask Gaussian processes, with background knowledge in the form of constraints that require a specific sum of the outputs to be constant. This is achieved by conditioning the prior distribution on the constraint fulfillment. The approach allows for both linear and nonlinear constraints. We demonstrate that the constraints are fulfilled with high precision and that the construction can improve the overall prediction accuracy as compared to the standard Gaussian process.
    $\texttt{DoCoFL}$: Downlink Compression for Cross-Device Federated Learning. (arXiv:2302.00543v1 [cs.LG])
    Many compression techniques have been proposed to reduce the communication overhead of Federated Learning training procedures. However, these are typically designed for compressing model updates, which are expected to decay throughout training. As a result, such methods are inapplicable to downlink (i.e., from the parameter server to clients) compression in the cross-device setting, where heterogeneous clients $\textit{may appear only once}$ during training and thus must download the model parameters. In this paper, we propose a new framework ($\texttt{DoCoFL}$) for downlink compression in the cross-device federated learning setting. Importantly, $\texttt{DoCoFL}$ can be seamlessly combined with many uplink compression schemes, rendering it suitable for bi-directional compression. Through extensive evaluation, we demonstrate that $\texttt{DoCoFL}$ offers significant bi-directional bandwidth reduction while achieving competitive accuracy to that of $\texttt{FedAvg}$ without compression.
    A Fast, Well-Founded Approximation to the Empirical Neural Tangent Kernel. (arXiv:2206.12543v2 [stat.ML] UPDATED)
    Empirical neural tangent kernels (eNTKs) can provide a good understanding of a given network's representation: they are often far less expensive to compute and applicable more broadly than infinite width NTKs. For networks with O output units (e.g. an O-class classifier), however, the eNTK on N inputs is of size $NO \times NO$, taking $O((NO)^2)$ memory and up to $O((NO)^3)$ computation. Most existing applications have therefore used one of a handful of approximations yielding $N \times N$ kernel matrices, saving orders of magnitude of computation, but with limited to no justification. We prove that one such approximation, which we call "sum of logits", converges to the true eNTK at initialization for any network with a wide final "readout" layer. Our experiments demonstrate the quality of this approximation for various uses across a range of settings.
    The Replicator Dynamic, Chain Components and the Response Graph. (arXiv:2209.15230v2 [cs.GT] UPDATED)
    In this paper we examine the relationship between the flow of the replicator dynamic, the continuum limit of Multiplicative Weights Update, and a game's response graph. We settle an open problem establishing that under the replicator, sink chain components -- a topological notion of long-run outcome of a dynamical system -- always exist and are approximated by the sink connected components of the game's response graph. More specifically, each sink chain component contains a sink connected component of the response graph, as well as all mixed strategy profiles whose support consists of pure profiles in the same connected component, a set we call the content of the connected component. As a corollary, all profiles are chain recurrent in games with strongly connected response graphs. In any two-player game sharing a response graph with a zero-sum game, the sink chain component is unique. In two-player zero-sum and potential games the sink chain components and sink connected components are in a one-to-one correspondence, and we conjecture that this holds in all games.
    Development of deep biological ages aware of morbidity and mortality based on unsupervised and semi-supervised deep learning approaches. (arXiv:2302.00319v1 [cs.LG])
    Background: While deep learning technology, which has the capability of obtaining latent representations based on large-scale data, can be a potential solution for the discovery of a novel aging biomarker, existing deep learning methods for biological age estimation usually depend on chronological ages and lack of consideration of mortality and morbidity that are the most significant outcomes of aging. Methods: This paper proposes a novel deep learning model to learn latent representations of biological aging in regard to subjects' morbidity and mortality. The model utilizes health check-up data in addition to morbidity and mortality information to learn the complex relationships between aging and measured clinical attributes. Findings: The proposed model is evaluated on a large dataset of general populations compared with KDM and other learning-based models. Results demonstrate that biological ages obtained by the proposed model have superior discriminability of subjects' morbidity and mortality.
    Quantum machine learning beyond kernel methods. (arXiv:2110.13162v3 [quant-ph] UPDATED)
    Machine learning algorithms based on parametrized quantum circuits are prime candidates for near-term applications on noisy quantum computers. In this direction, various types of quantum machine learning models have been introduced and studied extensively. Yet, our understanding of how these models compare, both mutually and to classical models, remains limited. In this work, we identify a constructive framework that captures all standard models based on parametrized quantum circuits: that of linear quantum models. In particular, we show using tools from quantum information theory how data re-uploading circuits, an apparent outlier of this framework, can be efficiently mapped into the simpler picture of linear models in quantum Hilbert spaces. Furthermore, we analyze the experimentally-relevant resource requirements of these models in terms of qubit number and amount of data needed to learn. Based on recent results from classical machine learning, we prove that linear quantum models must utilize exponentially more qubits than data re-uploading models in order to solve certain learning tasks, while kernel methods additionally require exponentially more data points. Our results provide a more comprehensive view of quantum machine learning models as well as insights on the compatibility of different models with NISQ constraints.
    Off-the-Grid MARL: a Framework for Dataset Generation with Baselines for Cooperative Offline Multi-Agent Reinforcement Learning. (arXiv:2302.00521v1 [cs.LG])
    Being able to harness the power of large, static datasets for developing autonomous multi-agent systems could unlock enormous value for real-world applications. Many important industrial systems are multi-agent in nature and are difficult to model using bespoke simulators. However, in industry, distributed system processes can often be recorded during operation, and large quantities of demonstrative data can be stored. Offline multi-agent reinforcement learning (MARL) provides a promising paradigm for building effective online controllers from static datasets. However, offline MARL is still in its infancy, and, therefore, lacks standardised benchmarks, baselines and evaluation protocols typically found in more mature subfields of RL. This deficiency makes it difficult for the community to sensibly measure progress. In this work, we aim to fill this gap by releasing \emph{off-the-grid MARL (OG-MARL)}: a framework for generating offline MARL datasets and algorithms. We release an initial set of datasets and baselines for cooperative offline MARL, created using the framework, along with a standardised evaluation protocol. Our datasets provide settings that are characteristic of real-world systems, including complex dynamics, non-stationarity, partial observability, suboptimality and sparse rewards, and are generated from popular online MARL benchmarks. We hope that OG-MARL will serve the community and help steer progress in offline MARL, while also providing an easy entry point for researchers new to the field.
    Uniswap Liquidity Provision: An Online Learning Approach. (arXiv:2302.00610v1 [cs.GT])
    Decentralized Exchanges (DEXs) are new types of marketplaces leveraging Blockchain technology. They allow users to trade assets with Automatic Market Makers (AMM), using funds provided by liquidity providers, removing the need for order books. One such DEX, Uniswap v3, allows liquidity providers to allocate funds more efficiently by specifying an active price interval for their funds. This introduces the problem of finding an optimal strategy for choosing price intervals. We formalize this problem as an online learning problem with non-stochastic rewards. We use regret-minimization methods to show a liquidity provision strategy that guarantees a lower bound on the reward. This is true even for non-stochastic changes to asset pricing, and we express this bound in terms of the trading volume.
    Code2Snapshot: Using Code Snapshots for Learning Representations of Source Code. (arXiv:2111.01097v3 [cs.SE] UPDATED)
    There are several approaches for encoding source code in the input vectors of neural models. These approaches attempt to include various syntactic and semantic features of input programs in their encoding. In this paper, we investigate Code2Snapshot, a novel representation of the source code that is based on the snapshots of input programs. We evaluate several variations of this representation and compare its performance with state-of-the-art representations that utilize the rich syntactic and semantic features of input programs. Our preliminary study on the utility of Code2Snapshot in the code summarization and code classification tasks suggests that simple snapshots of input programs have comparable performance to state-of-the-art representations. Interestingly, obscuring input programs have insignificant impacts on the Code2Snapshot performance, suggesting that, for some tasks, neural models may provide high performance by relying merely on the structure of input programs.
    Cross-client Label Propagation for Transductive Federated Learning. (arXiv:2210.06434v2 [cs.LG] UPDATED)
    We present Cross-Client Label Propagation(XCLP), a new method for transductive federated learning. XCLP estimates a data graph jointly from the data of multiple clients and computes labels for the unlabeled data by propagating label information across the graph. To avoid clients having to share their data with anyone, XCLP employs two cryptographically secure protocols: secure Hamming distance computation and secure summation. We demonstrate two distinct applications of XCLP within federated learning. In the first, we use it in a one-shot way to predict labels for unseen test points. In the second, we use it to repeatedly pseudo-label unlabeled training data in a federated semi-supervised setting. Experiments on both real federated and standard benchmark datasets show that in both applications XCLP achieves higher classification accuracy than alternative approaches.
    Posterior Sampling for Continuing Environments. (arXiv:2211.15931v2 [cs.LG] UPDATED)
    We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach, continuing PSRL, maintains a statistically plausible model of the environment and follows a policy that maximizes expected $\gamma$-discounted return in that model. At each time, with probability $1-\gamma$, the model is replaced by a sample from the posterior distribution over environments. For a choice of discount factor that suitably depends on the horizon $T$, we establish an $\tilde{O}(\tau S \sqrt{A T})$ bound on the Bayesian regret, where $S$ is the number of environment states, $A$ is the number of actions, and $\tau$ denotes the reward averaging time, which is a bound on the duration required to accurately estimate the average reward of any policy. Our work is the first to formalize and rigorously analyze the resampling approach with randomized exploration.
    Short-term Prediction and Filtering of Solar Power Using State-Space Gaussian Processes. (arXiv:2302.00388v1 [cs.LG])
    Short-term forecasting of solar photovoltaic energy (PV) production is important for powerplant management. Ideally these forecasts are equipped with error bars, so that downstream decisions can account for uncertainty. To produce predictions with error bars in this setting, we consider Gaussian processes (GPs) for modelling and predicting solar photovoltaic energy production in the UK. A standard application of GP regression on the PV timeseries data is infeasible due to the large data size and non-Gaussianity of PV readings. However, this is made possible by leveraging recent advances in scalable GP inference, in particular, by using the state-space form of GPs, combined with modern variational inference techniques. The resulting model is not only scalable to large datasets but can also handle continuous data streams via Kalman filtering.
    WISE: Wavelet Transformation for Boosting Transformers' Long Sequence Learning Ability. (arXiv:2210.01989v2 [cs.CL] UPDATED)
    Transformer and its variants are fundamental neural architectures in deep learning. Recent works show that learning attention in the Fourier space can improve the long sequence learning capability of Transformers. We argue that wavelet transform shall be a better choice because it captures both position and frequency information with a linear time complexity. Therefore, in this paper, we systematically study the synergy between wavelet transform and Transformers. Specifically, we focus on a new paradigm WISE, which replaces the attention in Transformers by (1) applying forward wavelet transform to project the input sequences to multi-resolution bases, (2) conducting non-linear transformations in the wavelet coefficient space, and (3) reconstructing the representation in input space via backward wavelet transform. Extensive experiments on the Long Range Arena benchmark demonstrate that learning attention in the wavelet space using either fixed or adaptive wavelets can consistently improve Transformer's performance and also significantly outperform Fourier-based methods.
    Graph Neural Network Based Surrogate Model of Physics Simulations for Geometry Design. (arXiv:2302.00557v1 [cs.LG])
    Computational Intelligence (CI) techniques have shown great potential as a surrogate model of expensive physics simulation, with demonstrated ability to make fast predictions, albeit at the expense of accuracy in some cases. For many scientific and engineering problems involving geometrical design, it is desirable for the surrogate models to precisely describe the change in geometry and predict the consequences. In that context, we develop graph neural networks (GNNs) as fast surrogate models for physics simulation, which allow us to directly train the models on 2/3D geometry designs that are represented by an unstructured mesh or point cloud, without the need for any explicit or hand-crafted parameterization. We utilize an encoder-processor-decoder-type architecture which can flexibly make prediction at both node level and graph level. The performance of our proposed GNN-based surrogate model is demonstrated on 2 example applications: feature designs in the domain of additive engineering and airfoil design in the domain of aerodynamics. The models show good accuracy in their predictions on a separate set of test geometries after training, with almost instant prediction speeds, as compared to O(hour) for the high-fidelity simulations required otherwise.
    Deep learning for $\psi$-weakly dependent processes. (arXiv:2302.00333v1 [stat.ML])
    In this paper, we perform deep neural networks for learning $\psi$-weakly dependent processes. Such weak-dependence property includes a class of weak dependence conditions such as mixing, association,$\cdots$ and the setting considered here covers many commonly used situations such as: regression estimation, time series prediction, time series classification,$\cdots$ The consistency of the empirical risk minimization algorithm in the class of deep neural networks predictors is established. We achieve the generalization bound and obtain a learning rate, which is less than $\mathcal{O}(n^{-1/\alpha})$, for all $\alpha > 2 $. Applications to binary time series classification and prediction in affine causal models with exogenous covariates are carried out. Some simulation results are provided, as well as an application to the US recession data.
    Fast and realistic large-scale structure from machine-learning-augmented random field simulations. (arXiv:2205.07898v2 [astro-ph.CO] UPDATED)
    Producing thousands of simulations of the dark matter distribution in the Universe with increasing precision is a challenging but critical task to facilitate the exploitation of current and forthcoming cosmological surveys. Many inexpensive substitutes to full $N$-body simulations have been proposed, even though they often fail to reproduce the statistics of the smaller, non-linear scales. Among these alternatives, a common approximation is represented by the lognormal distribution, which comes with its own limitations as well, while being extremely fast to compute even for high-resolution density fields. In this work, we train a generative deep learning model, mainly made of convolutional layers, to transform projected lognormal dark matter density fields to more realistic dark matter maps, as obtained from full $N$-body simulations. We detail the procedure that we follow to generate highly correlated pairs of lognormal and simulated maps, which we use as our training data, exploiting the information of the Fourier phases. We demonstrate the performance of our model comparing various statistical tests with different field resolutions, redshifts and cosmological parameters, proving its robustness and explaining its current limitations. When evaluated on 100 test maps, the augmented lognormal random fields reproduce the power spectrum up to wavenumbers of $1 \ h \ \rm{Mpc}^{-1}$, and the bispectrum within 10%, and always within the error bars, of the fiducial target simulations. Finally, we describe how we plan to integrate our proposed model with existing tools to yield more accurate spherical random fields for weak lensing analysis.
    A Comprehensive Survey of Continual Learning: Theory, Method and Application. (arXiv:2302.00487v1 [cs.LG])
    To cope with real-world dynamics, an intelligent agent needs to incrementally acquire, update, accumulate, and exploit knowledge throughout its lifetime. This ability, known as continual learning, provides a foundation for AI systems to develop themselves adaptively. In a general sense, continual learning is explicitly limited by catastrophic forgetting, where learning a new task usually results in a dramatic performance drop of the old tasks. Beyond this, increasingly numerous advances have emerged in recent years that largely extend the understanding and application of continual learning. The growing and widespread interest in this direction demonstrates its realistic significance as well as complexity. In this work, we present a comprehensive survey of continual learning, seeking to bridge the basic settings, theoretical foundations, representative methods, and practical applications. Based on existing theoretical and empirical results, we summarize the general objectives of continual learning as ensuring a proper stability-plasticity trade-off and an adequate intra/inter-task generalizability in the context of resource efficiency. Then we provide a state-of-the-art and elaborated taxonomy, extensively analyzing how representative strategies address continual learning, and how they are adapted to particular challenges in various applications. Through an in-depth discussion of continual learning in terms of the current trends, cross-directional prospects and interdisciplinary connections with neuroscience, we believe that such a holistic perspective can greatly facilitate subsequent exploration in this field and beyond.
    Deterministic equivalent and error universality of deep random features learning. (arXiv:2302.00401v1 [stat.ML])
    This manuscript considers the problem of learning a random Gaussian network function using a fully connected network with frozen intermediate layers and trainable readout layer. This problem can be seen as a natural generalization of the widely studied random features model to deeper architectures. First, we prove Gaussian universality of the test error in a ridge regression setting where the learner and target networks share the same intermediate layers, and provide a sharp asymptotic formula for it. Establishing this result requires proving a deterministic equivalent for traces of the deep random features sample covariance matrices which can be of independent interest. Second, we conjecture the asymptotic Gaussian universality of the test error in the more general setting of arbitrary convex losses and generic learner/target architectures. We provide extensive numerical evidence for this conjecture, which requires the derivation of closed-form expressions for the layer-wise post-activation population covariances. In light of our results, we investigate the interplay between architecture design and implicit regularization.
    Improved Exact and Heuristic Algorithms for Maximum Weight Clique. (arXiv:2302.00458v1 [cs.DS])
    We propose improved exact and heuristic algorithms for solving the maximum weight clique problem, a well-known problem in graph theory with many applications. Our algorithms interleave successful techniques from related work with novel data reduction rules that use local graph structure to identify and remove vertices and edges while retaining the optimal solution. We evaluate our algorithms on a range of synthetic and real-world graphs, and find that they outperform the current state of the art on most inputs. Our data reductions always produce smaller reduced graphs than existing data reductions alone. As a result, our exact algorithm, MWCRedu, finds solutions orders of magnitude faster on naturally weighted, medium-sized map labeling graphs and random hyperbolic graphs. Our heuristic algorithm, MWCPeel, outperforms its competitors on these instances, but is slightly less effective on extremely dense or large instances.
    Accordion: A Communication-Aware Machine Learning Framework for Next Generation Networks. (arXiv:2302.00623v1 [cs.NI])
    In this article, we advocate for the design of ad hoc artificial intelligence (AI)/machine learning (ML) models to facilitate their usage in future smart infrastructures based on communication networks. To motivate this, we first review key operations identified by the 3GPP for transferring AI/ML models through 5G networks and the main existing techniques to reduce their communication overheads. We also present a novel communication-aware ML framework, which we refer to as Accordion, that enables an efficient AI/ML model transfer thanks to an overhauled model training and communication protocol. We demonstrate the communication-related benefits of Accordion, analyse key performance trade-offs, and discuss potential research directions within this realm.
    Towards Implementing Energy-aware Data-driven Intelligence for Smart Health Applications on Mobile Platforms. (arXiv:2302.00514v1 [cs.LG])
    Recent breakthrough technological progressions of powerful mobile computing resources such as low-cost mobile GPUs along with cutting-edge, open-source software architectures have enabled high-performance deep learning on mobile platforms. These advancements have revolutionized the capabilities of today's mobile applications in different dimensions to perform data-driven intelligence locally, particularly for smart health applications. Unlike traditional machine learning (ML) architectures, modern on-device deep learning frameworks are proficient in utilizing computing resources in mobile platforms seamlessly, in terms of producing highly accurate results in less inference time. However, on the flip side, energy resources in a mobile device are typically limited. Hence, whenever a complex Deep Neural Network (DNN) architecture is fed into the on-device deep learning framework, while it achieves high prediction accuracy (and performance), it also urges huge energy demands during the runtime. Therefore, managing these resources efficiently within the spectrum of performance and energy efficiency is the newest challenge for any mobile application featuring data-driven intelligence beyond experimental evaluations. In this paper, first, we provide a timely review of recent advancements in on-device deep learning while empirically evaluating the performance metrics of current state-of-the-art ML architectures and conventional ML approaches with the emphasis given on energy characteristics by deploying them on a smart health application. With that, we are introducing a new framework through an energy-aware, adaptive model comprehension and realization (EAMCR) approach that can be utilized to make more robust and efficient inference decisions based on the available computing/energy resources in the mobile device during the runtime.
    Two for One: Diffusion Models and Force Fields for Coarse-Grained Molecular Dynamics. (arXiv:2302.00600v1 [cs.LG])
    Coarse-grained (CG) molecular dynamics enables the study of biological processes at temporal and spatial scales that would be intractable at an atomistic resolution. However, accurately learning a CG force field remains a challenge. In this work, we leverage connections between score-based generative models, force fields and molecular dynamics to learn a CG force field without requiring any force inputs during training. Specifically, we train a diffusion generative model on protein structures from molecular dynamics simulations, and we show that its score function approximates a force field that can directly be used to simulate CG molecular dynamics. While having a vastly simplified training setup compared to previous work, we demonstrate that our approach leads to improved performance across several small- to medium-sized protein simulations, reproducing the CG equilibrium distribution, and preserving dynamics of all-atom simulations such as protein folding events.
    MB-DECTNet: A Model-Based Unrolled Network for Accurate 3D DECT Reconstruction. (arXiv:2302.00577v1 [eess.IV])
    Numerous dual-energy CT (DECT) techniques have been developed in the past few decades. Dual-energy CT (DECT) statistical iterative reconstruction (SIR) has demonstrated its potential for reducing noise and increasing accuracy. Our lab proposed a joint statistical DECT algorithm for stopping power estimation and showed that it outperforms competing image-based material-decomposition methods. However, due to its slow convergence and the high computational cost of projections, the elapsed time of 3D DECT SIR is often not clinically acceptable. Therefore, to improve its convergence, we have embedded DECT SIR into a deep learning model-based unrolled network for 3D DECT reconstruction (MB-DECTNet) that can be trained in an end-to-end fashion. This deep learning-based method is trained to learn the shortcuts between the initial conditions and the stationary points of iterative algorithms while preserving the unbiased estimation property of model-based algorithms. MB-DECTNet is formed by stacking multiple update blocks, each of which consists of a data consistency layer (DC) and a spatial mixer layer, where the spatial mixer layer is the shrunken U-Net, and the DC layer is a one-step update of an arbitrary traditional iterative method. Although the proposed network can be combined with numerous iterative DECT algorithms, we demonstrate its performance with the dual-energy alternating minimization (DEAM). The qualitative result shows that MB-DECTNet with DEAM significantly reduces noise while increasing the resolution of the test image. The quantitative result shows that MB-DECTNet has the potential to estimate attenuation coefficients accurately as traditional statistical algorithms but with a much lower computational cost.
    Distribution free optimality intervals for clustering. (arXiv:2107.14442v2 [stat.ML] UPDATED)
    We address the problem of validating the ouput of clustering algorithms. Given data $\mathcal{D}$ and a partition $\mathcal{C}$ of these data into $K$ clusters, when can we say that the clusters obtained are correct or meaningful for the data? This paper introduces a paradigm in which a clustering $\mathcal{C}$ is considered meaningful if it is good with respect to a loss function such as the K-means distortion, and stable, i.e. the only good clustering up to small perturbations. Furthermore, we present a generic method to obtain post-inference guarantees of near-optimality and stability for a clustering $\mathcal{C}$. The method can be instantiated for a variety of clustering criteria (also called loss functions) for which convex relaxations exist. Obtaining the guarantees amounts to solving a convex optimization problem. We demonstrate the practical relevance of this method by obtaining guarantees for the K-means and the Normalized Cut clustering criteria on realistic data sets. We also prove that asymptotic instability implies finite sample instability w.h.p., allowing inferences about the population clusterability from a sample. The guarantees do not depend on any distributional assumptions, but they depend on the data set $\mathcal{D}$ admitting a stable clustering.
    Conditional Flow Matching: Simulation-Free Dynamic Optimal Transport. (arXiv:2302.00482v1 [cs.LG])
    Continuous normalizing flows (CNFs) are an attractive generative modeling technique, but they have thus far been held back by limitations in their simulation-based maximum likelihood training. In this paper, we introduce a new technique called conditional flow matching (CFM), a simulation-free training objective for CNFs. CFM features a stable regression objective like that used to train the stochastic flow in diffusion models but enjoys the efficient inference of deterministic flow models. In contrast to both diffusion models and prior CNF training algorithms, our CFM objective does not require the source distribution to be Gaussian or require evaluation of its density. Based on this new objective, we also introduce optimal transport CFM (OT-CFM), which creates simpler flows that are more stable to train and lead to faster inference, as evaluated in our experiments. Training CNFs with CFM improves results on a variety of conditional and unconditional generation tasks such as inferring single cell dynamics, unsupervised image translation, and Schr\"odinger bridge inference. Code is available at https://github.com/atong01/conditional-flow-matching .
    Optimal Learning of Deep Random Networks of Extensive-width. (arXiv:2302.00375v1 [stat.ML])
    We consider the problem of learning a target function corresponding to a deep, extensive-width, non-linear neural network with random Gaussian weights. We consider the asymptotic limit where the number of samples, the input dimension and the network width are proportionally large. We derive a closed-form expression for the Bayes-optimal test error, for regression and classification tasks. We contrast these Bayes-optimal errors with the test errors of ridge regression, kernel and random features regression. We find, in particular, that optimally regularized ridge regression, as well as kernel regression, achieve Bayes-optimal performances, while the logistic loss yields a near-optimal test error for classification. We further show numerically that when the number of samples grows faster than the dimension, ridge and kernel methods become suboptimal, while neural networks achieve test error close to zero from quadratically many samples.
    Machine Learning for Visualization Recommendation Systems: Open Challenges and Future Directions. (arXiv:2302.00569v1 [cs.LG])
    Visualization Recommendation Systems (VRS) are a novel and challenging field of study, whose aim is to automatically generate insightful visualizations from data, to support non-expert users in the process of information discovery. Despite its enormous application potential in the era of big data, progress in this area of research is being held back by several obstacles among which are the absence of standardized datasets to train recommendation algorithms, and the difficulty in defining quantitative criteria to assess the effectiveness of the generated plots. In this paper, we aim not only to summarize the state-of-the-art of VRS, but also to outline promising future research directions.
    Graph Neural Operators for Classification of Spatial Transcriptomics Data. (arXiv:2302.00658v1 [cs.LG])
    The inception of spatial transcriptomics has allowed improved comprehension of tissue architectures and the disentanglement of complex underlying biological, physiological, and pathological processes through their positional contexts. Recently, these contexts, and by extension the field, have seen much promise and elucidation with the application of graph learning approaches. In particular, neural operators have risen in regards to learning the mapping between infinite-dimensional function spaces. With basic to deep neural network architectures being data-driven, i.e. dependent on quality data for prediction, neural operators provide robustness by offering generalization among different resolutions despite low quality data. Graph neural operators are a variant that utilize graph networks to learn this mapping between function spaces. The aim of this research is to identify robust machine learning architectures that integrate spatial information to predict tissue types. Under this notion, we propose a study incorporating various graph neural network approaches to validate the efficacy of applying neural operators towards prediction of brain regions in mouse brain tissue samples as a proof of concept towards our purpose. We were able to achieve an F1 score of nearly 72% for the graph neural operator approach which outperformed all baseline and other graph network approaches.
    Experimental observation on a low-rank tensor model for eigenvalue problems. (arXiv:2302.00538v1 [cs.LG])
    Here we utilize a low-rank tensor model (LTM) as a function approximator, combined with the gradient descent method, to solve eigenvalue problems including the Laplacian operator and the harmonic oscillator. Experimental results show the superiority of the polynomial-based low-rank tensor model (PLTM) compared to the tensor neural network (TNN). We also test such low-rank architectures for the classification problem on the MNIST dataset.
    QCRS: Improve Randomized Smoothing using Quasi-Concave Optimization. (arXiv:2302.00209v1 [cs.LG])
    Randomized smoothing is currently the state-of-the-art method that provides certified robustness for deep neural networks. However, it often cannot achieve an adequate certified region on real-world datasets. One way to obtain a larger certified region is to use an input-specific algorithm instead of using a fixed Gaussian filter for all data points. Several methods based on this idea have been proposed, but they either suffer from high computational costs or gain marginal improvement in certified radius. In this work, we show that by exploiting the quasiconvex problem structure, we can find the optimal certified radii for most data points with slight computational overhead. This observation leads to an efficient and effective input-specific randomized smoothing algorithm. We conduct extensive experiments and empirical analysis on Cifar10 and ImageNet. The results show that the proposed method significantly enhances the certified radii with low computational overhead.
    GFlowNets for AI-Driven Scientific Discovery. (arXiv:2302.00615v1 [cs.LG])
    Tackling the most pressing problems for humanity, such as the climate crisis and the threat of global pandemics, requires accelerating the pace of scientific discovery. While science has traditionally relied on trial and error and even serendipity to a large extent, the last few decades have seen a surge of data-driven scientific discoveries. However, in order to truly leverage large-scale data sets and high-throughput experimental setups, machine learning methods will need to be further improved and better integrated in the scientific discovery pipeline. A key challenge for current machine learning methods in this context is the efficient exploration of very large search spaces, which requires techniques for estimating reducible (epistemic) uncertainty and generating sets of diverse and informative experiments to perform. This motivated a new probabilistic machine learning framework called GFlowNets, which can be applied in the modeling, hypotheses generation and experimental design stages of the experimental science loop. GFlowNets learn to sample from a distribution given indirectly by a reward function corresponding to an unnormalized probability, which enables sampling diverse, high-reward candidates. GFlowNets can also be used to form efficient and amortized Bayesian posterior estimators for causal models conditioned on the already acquired experimental data. Having such posterior models can then provide estimators of epistemic uncertainty and information gain that can drive an experimental design policy. Altogether, here we will argue that GFlowNets can become a valuable tool for AI-driven scientific discovery, especially in scenarios of very large candidate spaces where we have access to cheap but inaccurate measurements or to expensive but accurate measurements. This is a common setting in the context of drug and material discovery, which we use as examples throughout the paper.
    Simple yet Effective Gradient-Free Graph Convolutional Networks. (arXiv:2302.00371v1 [cs.LG])
    Linearized Graph Neural Networks (GNNs) have attracted great attention in recent years for graph representation learning. Compared with nonlinear Graph Neural Network (GNN) models, linearized GNNs are much more time-efficient and can achieve comparable performances on typical downstream tasks such as node classification. Although some linearized GNN variants are purposely crafted to mitigate ``over-smoothing", empirical studies demonstrate that they still somehow suffer from this issue. In this paper, we instead relate over-smoothing with the vanishing gradient phenomenon and craft a gradient-free training framework to achieve more efficient and effective linearized GNNs which can significantly overcome over-smoothing and enhance the generalization of the model. The experimental results demonstrate that our methods achieve better and more stable performances on node classification tasks with varying depths and cost much less training time.
    Internally Rewarded Reinforcement Learning. (arXiv:2302.00270v1 [cs.LG])
    We study a class of reinforcement learning problems where the reward signals for policy learning are generated by a discriminator that is dependent on and jointly optimized with the policy. This interdependence between the policy and the discriminator leads to an unstable learning process because reward signals from an immature discriminator are noisy and impede policy learning, and conversely, an untrained policy impedes discriminator learning. We call this learning setting $\textit{Internally Rewarded Reinforcement Learning}$ (IRRL) as the reward is not provided directly by the environment but $\textit{internally}$ by the discriminator. In this paper, we formally formulate IRRL and present a class of problems that belong to IRRL. We theoretically derive and empirically analyze the effect of the reward function in IRRL and based on these analyses propose the clipped linear reward function. Experimental results show that the proposed reward function can consistently stabilize the training process by reducing the impact of reward noise, which leads to faster convergence and higher performance compared with baselines in diverse tasks.
    Stream-based active learning with linear models. (arXiv:2207.09874v3 [stat.ML] UPDATED)
    The proliferation of automated data collection schemes and the advances in sensorics are increasing the amount of data we are able to monitor in real-time. However, given the high annotation costs and the time required by quality inspections, data is often available in an unlabeled form. This is fostering the use of active learning for the development of soft sensors and predictive models. In production, instead of performing random inspections to obtain product information, labels are collected by evaluating the information content of the unlabeled data. Several query strategy frameworks for regression have been proposed in the literature but most of the focus has been dedicated to the static pool-based scenario. In this work, we propose a new strategy for the stream-based scenario, where instances are sequentially offered to the learner, which must instantaneously decide whether to perform the quality check to obtain the label or discard the instance. The approach is inspired by the optimal experimental design theory and the iterative aspect of the decision-making process is tackled by setting a threshold on the informativeness of the unlabeled data points. The proposed approach is evaluated using numerical simulations and the Tennessee Eastman Process simulator. The results confirm that selecting the examples suggested by the proposed algorithm allows for a faster reduction in the prediction error.
    Deep Learning Approach to Predict Hemorrhage in Moyamoya Disease. (arXiv:2302.00188v1 [cs.LG])
    Objective: Reliable tools to predict moyamoya disease (MMD) patients at risk for hemorrhage could have significant value. The aim of this paper is to develop three machine learning classification algorithms to predict hemorrhage in moyamoya disease. Methods: Clinical data of consecutive MMD patients who were admitted to our hospital between 2009 and 2015 were reviewed. Demographics, clinical, radiographic data were analyzed to develop artificial neural network (ANN), support vector machine (SVM), and random forest models. Results: We extracted 33 parameters, including 11 demographic and 22 radiographic features as input for model development. Of all compared classification results, ANN achieved the highest overall accuracy of 75.7% (95% CI, 68.6%-82.8%), followed by SVM with 69.2% (95% CI, 56.9%-81.5%) and random forest with 70.0% (95% CI, 57.0%-83.0%). Conclusions: The proposed ANN framework can be a potential effective tool to predict the possibility of hemorrhage among adult MMD patients based on clinical information and radiographic features.
    Learning from Stochastic Labels. (arXiv:2302.00299v1 [cs.LG])
    Annotating multi-class instances is a crucial task in the field of machine learning. Unfortunately, identifying the correct class label from a long sequence of candidate labels is time-consuming and laborious. To alleviate this problem, we design a novel labeling mechanism called stochastic label. In this setting, stochastic label includes two cases: 1) identify a correct class label from a small number of randomly given labels; 2) annotate the instance with None label when given labels do not contain correct class label. In this paper, we propose a novel suitable approach to learn from these stochastic labels. We obtain an unbiased estimator that utilizes less supervised information in stochastic labels to train a multi-class classifier. Additionally, it is theoretically justifiable by deriving the estimation error bound of the proposed method. Finally, we conduct extensive experiments on widely-used benchmark datasets to validate the superiority of our method by comparing it with existing state-of-the-art methods.
    Learning Choice Functions with Gaussian Processes. (arXiv:2302.00406v1 [cs.LG])
    In consumer theory, ranking available objects by means of preference relations yields the most common description of individual choices. However, preference-based models assume that individuals: (1) give their preferences only between pairs of objects; (2) are always able to pick the best preferred object. In many situations, they may be instead choosing out of a set with more than two elements and, because of lack of information and/or incomparability (objects with contradictory characteristics), they may not able to select a single most preferred object. To address these situations, we need a choice-model which allows an individual to express a set-valued choice. Choice functions provide such a mathematical framework. We propose a Gaussian Process model to learn choice functions from choice-data. The proposed model assumes a multiple utility representation of a choice function based on the concept of Pareto rationalization, and derives a strategy to learn both the number and the values of these latent multiple utilities. Simulation experiments demonstrate that the proposed model outperforms the state-of-the-art methods.
    Local convexity of the TAP free energy and AMP convergence for Z2-synchronization. (arXiv:2106.11428v2 [math.ST] UPDATED)
    We study mean-field variational Bayesian inference using the TAP approach, for Z2-synchronization as a prototypical example of a high-dimensional Bayesian model. We show that for any signal strength $\lambda > 1$ (the weak-recovery threshold), there exists a unique local minimizer of the TAP free energy functional near the mean of the Bayes posterior law. Furthermore, the TAP free energy in a local neighborhood of this minimizer is strongly convex. Consequently, a natural-gradient/mirror-descent algorithm achieves linear convergence to this minimizer from a local initialization, which may be obtained by a constant number of iterates of Approximate Message Passing (AMP). This provides a rigorous foundation for variational inference in high dimensions via minimization of the TAP free energy. We also analyze the finite-sample convergence of AMP, showing that AMP is asymptotically stable at the TAP minimizer for any $\lambda > 1$, and is linearly convergent to this minimizer from a spectral initialization for sufficiently large $\lambda$. Such a guarantee is stronger than results obtainable by state evolution analyses, which only describe a fixed number of AMP iterations in the infinite-sample limit. Our proofs combine the Kac-Rice formula and Sudakov-Fernique Gaussian comparison inequality to analyze the complexity of critical points that satisfy strong convexity and stability conditions within their local neighborhoods.
    CATFL: Certificateless Authentication-based Trustworthy Federated Learning for 6G Semantic Communications. (arXiv:2302.00271v1 [cs.CR])
    Federated learning (FL) provides an emerging approach for collaboratively training semantic encoder/decoder models of semantic communication systems, without private user data leaving the devices. Most existing studies on trustworthy FL aim to eliminate data poisoning threats that are produced by malicious clients, but in many cases, eliminating model poisoning attacks brought by fake servers is also an important objective. In this paper, a certificateless authentication-based trustworthy federated learning (CATFL) framework is proposed, which mutually authenticates the identity of clients and server. In CATFL, each client verifies the server's signature information before accepting the delivered global model to ensure that the global model is not delivered by false servers. On the contrary, the server also verifies the server's signature information before accepting the delivered model updates to ensure that they are submitted by authorized clients. Compared to PKI-based methods, the CATFL can avoid too high certificate management overheads. Meanwhile, the anonymity of clients shields data poisoning attacks, while real-name registration may suffer from user-specific privacy leakage risks. Therefore, a pseudonym generation strategy is also presented in CATFL to achieve a trade-off between identity traceability and user anonymity, which is essential to conditionally prevent from user-specific privacy leakage. Theoretical security analysis and evaluation results validate the superiority of CATFL.
    Bandit Convex Optimisation Revisited: FTRL Achieves $\tilde{O}(t^{1/2})$ Regret. (arXiv:2302.00358v1 [cs.LG])
    We show that a kernel estimator using multiple function evaluations can be easily converted into a sampling-based bandit estimator with expectation equal to the original kernel estimate. Plugging such a bandit estimator into the standard FTRL algorithm yields a bandit convex optimisation algorithm that achieves $\tilde{O}(t^{1/2})$ regret against adversarial time-varying convex loss functions.
    Equivariant Message Passing Neural Network for Crystal Material Discovery. (arXiv:2302.00485v1 [cs.LG])
    Automatic material discovery with desired properties is a fundamental challenge for material sciences. Considerable attention has recently been devoted to generating stable crystal structures. While existing work has shown impressive success on supervised tasks such as property prediction, the progress on unsupervised tasks such as material generation is still hampered by the limited extent to which the equivalent geometric representations of the same crystal are considered. To address this challenge, we propose EMPNN a periodic equivariant message-passing neural network that learns crystal lattice deformation in an unsupervised fashion. Our model equivalently acts on lattice according to the deformation action that must be performed, making it suitable for crystal generation, relaxation and optimisation. We present experimental evaluations that demonstrate the effectiveness of our approach.
    Learning Prototype Classifiers for Long-Tailed Recognition. (arXiv:2302.00491v1 [cs.CV])
    The problem of long-tailed recognition (LTR) has received attention in recent years due to the fundamental power-law distribution of objects in the real-world. Most recent works in LTR use softmax classifiers that have a tendency to correlate classifier norm with the amount of training data for a given class. On the other hand, Prototype classifiers do not suffer from this shortcoming and can deliver promising results simply using Nearest-Class-Mean (NCM), a special case where prototypes are empirical centroids. However, the potential of Prototype classifiers as an alternative to softmax in LTR is relatively underexplored. In this work, we propose Prototype classifiers, which jointly learn prototypes that minimize average cross-entropy loss based on probability scores from distances to prototypes. We theoretically analyze the properties of Euclidean distance based prototype classifiers that leads to stable gradient-based optimization which is robust to outliers. We further enhance Prototype classifiers by learning channel-dependent temperature parameters to enable independent distance scales along each channel. Our analysis shows that prototypes learned by Prototype classifiers are better separated than empirical centroids. Results on four long-tailed recognition benchmarks show that Prototype classifier outperforms or is comparable to the state-of-the-art methods.
    Robust online active learning. (arXiv:2302.00422v1 [stat.ML])
    In many industrial applications, obtaining labeled observations is not straightforward as it often requires the intervention of human experts or the use of expensive testing equipment. In these circumstances, active learning can be highly beneficial in suggesting the most informative data points to be used when fitting a model. Reducing the number of observations needed for model development alleviates both the computational burden required for training and the operational expenses related to labeling. Online active learning, in particular, is useful in high-volume production processes where the decision about the acquisition of the label for a data point needs to be taken within an extremely short time frame. However, despite the recent efforts to develop online active learning strategies, the behavior of these methods in the presence of outliers has not been thoroughly examined. In this work, we investigate the performance of online active linear regression in contaminated data streams. Our study shows that the currently available query strategies are prone to sample outliers, whose inclusion in the training set eventually degrades the predictive performance of the models. To address this issue, we propose a solution that bounds the search area of a conditional D-optimal algorithm and uses a robust estimator. Our approach strikes a balance between exploring unseen regions of the input space and protecting against outliers. Through numerical simulations, we show that the proposed method is effective in improving the performance of online active learning in the presence of outliers, thus expanding the potential applications of this powerful tool.
    Robust Fitted-Q-Evaluation and Iteration under Sequentially Exogenous Unobserved Confounders. (arXiv:2302.00662v1 [stat.ML])
    Offline reinforcement learning is important in domains such as medicine, economics, and e-commerce where online experimentation is costly, dangerous or unethical, and where the true model is unknown. However, most methods assume all covariates used in the behavior policy's action decisions are observed. This untestable assumption may be incorrect. We study robust policy evaluation and policy optimization in the presence of unobserved confounders. We assume the extent of possible unobserved confounding can be bounded by a sensitivity model, and that the unobserved confounders are sequentially exogenous. We propose and analyze an (orthogonalized) robust fitted-Q-iteration that uses closed-form solutions of the robust Bellman operator to derive a loss minimization problem for the robust Q function. Our algorithm enjoys the computational ease of fitted-Q-iteration and statistical improvements (reduced dependence on quantile estimation error) from orthogonalization. We provide sample complexity bounds, insights, and show effectiveness in simulations.
    A Survey of Methods, Challenges and Perspectives in Causality. (arXiv:2302.00293v1 [cs.LG])
    The Causality field aims to find systematic methods for uncovering cause-effect relationships. Such methods can find applications in many research fields, justifying a great interest in this domain. Machine Learning models have shown success in a large variety of tasks by extracting correlation patterns from high-dimensional data but still struggle when generalizing out of their initial distribution. As causal engines aim to learn mechanisms that are independent from a data distribution, combining Machine Learning with Causality has the potential to bring benefits to the two fields. In our work, we motivate this assumption and provide applications. We first perform an extensive overview of the theories and methods for Causality from different perspectives. We then provide a deeper look at the connections between Causality and Machine Learning and describe the challenges met by the two domains. We show the early attempts to bring the fields together and the possible perspectives for the future. We finish by providing a large variety of applications for techniques from Causality.
    Quickest Change Detection for Unnormalized Statistical Models. (arXiv:2302.00250v1 [stat.ML])
    Classical quickest change detection algorithms require modeling pre-change and post-change distributions. Such an approach may not be feasible for various machine learning models because of the complexity of computing the explicit distributions. Additionally, these methods may suffer from a lack of robustness to model mismatch and noise. This paper develops a new variant of the classical Cumulative Sum (CUSUM) algorithm for the quickest change detection. This variant is based on Fisher divergence and the Hyv\"arinen score and is called the Score-based CUSUM (SCUSUM) algorithm. The SCUSUM algorithm allows the applications of change detection for unnormalized statistical models, i.e., models for which the probability density function contains an unknown normalization constant. The asymptotic optimality of the proposed algorithm is investigated by deriving expressions for average detection delay and the mean running time to a false alarm. Numerical results are provided to demonstrate the performance of the proposed algorithm.
    Learning Cut Selection for Mixed-Integer Linear Programming via Hierarchical Sequence Model. (arXiv:2302.00244v1 [cs.LG])
    Cutting planes (cuts) are important for solving mixed-integer linear programs (MILPs), which formulate a wide range of important real-world applications. Cut selection -- which aims to select a proper subset of the candidate cuts to improve the efficiency of solving MILPs -- heavily depends on (P1) which cuts should be preferred, and (P2) how many cuts should be selected. Although many modern MILP solvers tackle (P1)-(P2) by manually designed heuristics, machine learning offers a promising approach to learn more effective heuristics from MILPs collected from specific applications. However, many existing learning-based methods focus on learning which cuts should be preferred, neglecting the importance of learning the number of cuts that should be selected. Moreover, we observe from extensive empirical results that (P3) what order of selected cuts should be preferred has a significant impact on the efficiency of solving MILPs as well. To address this challenge, we propose a novel hierarchical sequence model (HEM) to learn cut selection policies via reinforcement learning. Specifically, HEM consists of a two-level model: (1) a higher-level model to learn the number of cuts that should be selected, (2) and a lower-level model -- that formulates the cut selection task as a sequence to sequence learning problem -- to learn policies selecting an ordered subset with the size determined by the higher-level model. To the best of our knowledge, HEM is the first method that can tackle (P1)-(P3) in cut selection simultaneously from a data-driven perspective. Experiments show that HEM significantly improves the efficiency of solving MILPs compared to human-designed and learning-based baselines on both synthetic and large-scale real-world MILPs, including MIPLIB 2017. Moreover, experiments demonstrate that HEM well generalizes to MILPs that are significantly larger than those seen during training.
    Fourier series weight in quantum machine learning. (arXiv:2302.00105v1 [quant-ph])
    In this work, we aim to confirm the impact of the Fourier series on the quantum machine learning model. We will propose models, tests, and demonstrations to achieve this objective. We designed a quantum machine learning leveraged on the Hamiltonian encoding. With a subtle change, we performed the trigonometric interpolation, binary and multiclass classifier, and a quantum signal processing application. We also proposed a block diagram of determining approximately the Fourier coefficient based on quantum machine learning. We performed and tested all the proposed models using the Pennylane framework.
    Local transfer learning from one data space to another. (arXiv:2302.00160v1 [cs.LG])
    A fundamental problem in manifold learning is to approximate a functional relationship in a data chosen randomly from a probability distribution supported on a low dimensional sub-manifold of a high dimensional ambient Euclidean space. The manifold is essentially defined by the data set itself and, typically, designed so that the data is dense on the manifold in some sense. The notion of a data space is an abstraction of a manifold encapsulating the essential properties that allow for function approximation. The problem of transfer learning (meta-learning) is to use the learning of a function on one data set to learn a similar function on a new data set. In terms of function approximation, this means lifting a function on one data space (the base data space) to another (the target data space). This viewpoint enables us to connect some inverse problems in applied mathematics (such as inverse Radon transform) with transfer learning. In this paper we examine the question of such lifting when the data is assumed to be known only on a part of the base data space. We are interested in determining subsets of the target data space on which the lifting can be defined, and how the local smoothness of the function and its lifting are related.
    Offline Estimation of Controlled Markov Chains: Minimaxity and Sample Complexity. (arXiv:2211.07092v3 [stat.ML] UPDATED)
    In this work, we study a natural nonparametric estimator of the transition probability matrices of a finite controlled Markov chain. We consider an offline setting with a fixed dataset, collected using a so-called logging policy. We develop sample complexity bounds for the estimator and establish conditions for minimaxity. Our statistical bounds depend on the logging policy through its mixing properties. We show that achieving a particular statistical risk bound involves a subtle and interesting trade-off between the strength of the mixing properties and the number of samples. We demonstrate the validity of our results under various examples, such as ergodic Markov chains, weakly ergodic inhomogeneous Markov chains, and controlled Markov chains with non-stationary Markov, episodic, and greedy controls. Lastly, we use these sample complexity bounds to establish concomitant ones for offline evaluation of stationary Markov control policies.
    Efficient Scopeformer: Towards Scalable and Rich Feature Extraction for Intracranial Hemorrhage Detection. (arXiv:2302.00220v1 [cs.CV])
    The quality and richness of feature maps extracted by convolution neural networks (CNNs) and vision Transformers (ViTs) directly relate to the robust model performance. In medical computer vision, these information-rich features are crucial for detecting rare cases within large datasets. This work presents the "Scopeformer," a novel multi-CNN-ViT model for intracranial hemorrhage classification in computed tomography (CT) images. The Scopeformer architecture is scalable and modular, which allows utilizing various CNN architectures as the backbone with diversified output features and pre-training strategies. We propose effective feature projection methods to reduce redundancies among CNN-generated features and to control the input size of ViTs. Extensive experiments with various Scopeformer models show that the model performance is proportional to the number of convolutional blocks employed in the feature extractor. Using multiple strategies, including diversifying the pre-training paradigms for CNNs, different pre-training datasets, and style transfer techniques, we demonstrate an overall improvement in the model performance at various computational budgets. Later, we propose smaller compute-efficient Scopeformer versions with three different types of input and output ViT configurations. Efficient Scopeformers use four different pre-trained CNN architectures as feature extractors to increase feature richness. Our best Efficient Scopeformer model achieved an accuracy of 96.94\% and a weighted logarithmic loss of 0.083 with an eight times reduction in the number of trainable parameters compared to the base Scopeformer. Another version of the Efficient Scopeformer model further reduced the parameter space by almost 17 times with negligible performance reduction. Hybrid CNNs and ViTs might provide the desired feature richness for developing accurate medical computer vision models
    Accelerated First-Order Optimization under Nonlinear Constraints. (arXiv:2302.00316v1 [math.OC])
    We exploit analogies between first-order algorithms for constrained optimization and non-smooth dynamical systems to design a new class of accelerated first-order algorithms for constrained optimization. Unlike Frank-Wolfe or projected gradients, these algorithms avoid optimization over the entire feasible set at each iteration. We prove convergence to stationary points even in a nonconvex setting and we derive rates for the convex setting. An important property of these algorithms is that constraints are expressed in terms of velocities instead of positions, which naturally leads to sparse, local and convex approximations of the feasible set (even if the feasible set is nonconvex). Thus, the complexity tends to grow mildly in the number of decision variables and in the number of constraints, which makes the algorithms suitable for machine learning applications. We apply our algorithms to a compressed sensing and a sparse regression problem, showing that we can treat nonconvex $\ell^p$ constraints ($p<1$) efficiently, while recovering state-of-the-art performance for $p=1$.
    Graph-based Time-Series Anomaly Detection: A Survey. (arXiv:2302.00058v1 [cs.LG])
    With the recent advances in technology, a wide range of systems continues to collect a large amount of data over time and thus generating time series. Detecting anomalies in time series data is an important task in various applications such as e-commerce, cybersecurity, and health care monitoring. However, Time-series Anomaly Detection (TSAD) is very challenging as it requires considering both the temporal dependency and the structural dependency. Recent graph-based approaches have made impressive progress in tackling the challenges of this field. In this survey, we conduct a comprehensive and up-to-date review of Graph-based Time-series Anomaly Detection (G-TSAD). First, we explore the significant potential of graph-based methods in identifying different types of anomalies in time series data. Then, we provide a structured and comprehensive review of the state-of-the-art graph anomaly detection techniques in the context of time series. Finally, we discuss the technical challenges and potential future directions for possible improvements in this research field.
    OrthoReg: Improving Graph-regularized MLPs via Orthogonality Regularization. (arXiv:2302.00109v1 [cs.LG])
    Graph Neural Networks (GNNs) are currently dominating in modeling graph-structure data, while their high reliance on graph structure for inference significantly impedes them from widespread applications. By contrast, Graph-regularized MLPs (GR-MLPs) implicitly inject the graph structure information into model weights, while their performance can hardly match that of GNNs in most tasks. This motivates us to study the causes of the limited performance of GR-MLPs. In this paper, we first demonstrate that node embeddings learned from conventional GR-MLPs suffer from dimensional collapse, a phenomenon in which the largest a few eigenvalues dominate the embedding space, through empirical observations and theoretical analysis. As a result, the expressive power of the learned node representations is constrained. We further propose OrthoReg, a novel GR-MLP model to mitigate the dimensional collapse issue. Through a soft regularization loss on the correlation matrix of node embeddings, OrthoReg explicitly encourages orthogonal node representations and thus can naturally avoid dimensionally collapsed representations. Experiments on traditional transductive semi-supervised classification tasks and inductive node classification for cold-start scenarios demonstrate its effectiveness and superiority.
    TwinExplainer: Explaining Predictions of an Automotive Digital Twin. (arXiv:2302.00152v1 [cs.LG])
    Vehicles are complex Cyber Physical Systems (CPS) that operate in a variety of environments, and the likelihood of failure of one or more subsystems, such as the engine, transmission, brakes, and fuel, can result in unscheduled downtime and incur high maintenance or repair costs. In order to prevent these issues, it is crucial to continuously monitor the health of various subsystems and identify abnormal sensor channel behavior. Data-driven Digital Twin (DT) systems are capable of such a task. Current DT technologies utilize various Deep Learning (DL) techniques that are constrained by the lack of justification or explanation for their predictions. This inability of these opaque systems can influence decision-making and raises user trust concerns. This paper presents a solution to this issue, where the TwinExplainer system, with its three-layered architectural pipeline, explains the predictions of an automotive DT. Such a system can assist automotive stakeholders in understanding the global scale of the sensor channels and how they contribute towards generic DT predictions. TwinExplainer can also visualize explanations for both normal and abnormal local predictions computed by the DT.
    Neural Control of Parametric Solutions for High-dimensional Evolution PDEs. (arXiv:2302.00045v1 [math.NA])
    We develop a novel computational framework to approximate solution operators of evolution partial differential equations (PDEs). By employing a general nonlinear reduced-order model, such as a deep neural network, to approximate the solution of a given PDE, we realize that the evolution of the model parameter is a control problem in the parameter space. Based on this observation, we propose to approximate the solution operator of the PDE by learning the control vector field in the parameter space. From any initial value, this control field can steer the parameter to generate a trajectory such that the corresponding reduced-order model solves the PDE. This allows for substantially reduced computational cost to solve the evolution PDE with arbitrary initial conditions. We also develop comprehensive error analysis for the proposed method when solving a large class of semilinear parabolic PDEs. Numerical experiments on different high-dimensional evolution PDEs with various initial conditions demonstrate the promising results of the proposed method.
    Detection of Tomato Ripening Stages using Yolov3-tiny. (arXiv:2302.00164v1 [cs.CV])
    One of the most important agricultural products in Mexico is the tomato (Solanum lycopersicum), which occupies the 4th place national most produced product . Therefore, it is necessary to improve its production, building automatic detection system that detect, classify an keep tacks of the fruits is one way to archieve it. So, in this paper, we address the design of a computer vision system to detect tomatoes at different ripening stages. To solve the problem, we use a neural network-based model for tomato classification and detection. Specifically, we use the YOLOv3-tiny model because it is one of the lightest current deep neural networks. To train it, we perform two grid searches testing several combinations of hyperparameters. Our experiments showed an f1-score of 90.0% in the localization and classification of ripening stages in a custom dataset.
    FI-ODE: Certified and Robust Forward Invariance in Neural ODEs. (arXiv:2210.16940v2 [cs.LG] UPDATED)
    Forward invariance is a long-studied property in control theory that is used to certify that a dynamical system stays within some pre-specified set of states for all time, and also admits robustness guarantees (e.g., the certificate holds under perturbations). We propose a general framework for training and provably certifying robust forward invariance in Neural ODEs. We apply this framework in two settings: certified adversarial robustness for image classification, and certified safety in continuous control. Notably, our method empirically produces superior adversarial robustness guarantees compared to prior work on certifiably robust Neural ODEs (including implicit-depth models).
    Stable Target Field for Reduced Variance Score Estimation in Diffusion Models. (arXiv:2302.00670v1 [cs.LG])
    Diffusion models generate samples by reversing a fixed forward diffusion process. Despite already providing impressive empirical results, these diffusion models algorithms can be further improved by reducing the variance of the training targets in their denoising score-matching objective. We argue that the source of such variance lies in the handling of intermediate noise-variance scales, where multiple modes in the data affect the direction of reverse paths. We propose to remedy the problem by incorporating a reference batch which we use to calculate weighted conditional scores as more stable training targets. We show that the procedure indeed helps in the challenging intermediate regime by reducing (the trace of) the covariance of training targets. The new stable targets can be seen as trading bias for reduced variance, where the bias vanishes with increasing reference batch size. Empirically, we show that the new objective improves the image quality, stability, and training speed of various popular diffusion models across datasets with both general ODE and SDE solvers. When used in combination with EDM, our method yields a current SOTA FID of 1.90 with 35 network evaluations on the unconditional CIFAR-10 generation task. The code is available at https://github.com/Newbeeer/stf
    Generative methods for sampling transition paths in molecular dynamics. (arXiv:2205.02818v2 [stat.ML] UPDATED)
    Molecular systems often remain trapped for long times around some local minimum of the potential energy function, before switching to another one -- a behavior known as metastability. Simulating transition paths linking one metastable state to another one is difficult by direct numerical methods. In view of the promises of machine learning techniques, we explore in this work two approaches to more efficiently generate transition paths: sampling methods based on generative models such as variational autoencoders, and importance sampling methods based on reinforcement learning.
    Tensor networks for unsupervised machine learning. (arXiv:2106.12974v2 [cond-mat.stat-mech] UPDATED)
    Modeling the joint distribution of high-dimensional data is a central task in unsupervised machine learning. In recent years, many interests have been attracted to developing learning models based on tensor networks, which have the advantages of a principle understanding of the expressive power using entanglement properties, and as a bridge connecting classical computation and quantum computation. Despite the great potential, however, existing tensor network models for unsupervised machine learning only work as a proof of principle, as their performance is much worse than the standard models such as restricted Boltzmann machines and neural networks. In this Letter, we present autoregressive matrix product states (AMPS), a tensor network model combining matrix product states from quantum many-body physics and autoregressive modeling from machine learning. Our model enjoys the exact calculation of normalized probability and unbiased sampling. We demonstrate the performance of our model using two applications, generative modeling on synthetic and real-world data, and reinforcement learning in statistical physics. Using extensive numerical experiments, we show that the proposed model significantly outperforms the existing tensor network models and the restricted Boltzmann machines, and is competitive with state-of-the-art neural network models.
    How Out-of-Distribution Data Hurts Semi-Supervised Learning. (arXiv:2010.03658v3 [cs.LG] UPDATED)
    Recent semi-supervised learning algorithms have demonstrated greater success with higher overall performance due to better-unlabeled data representations. Nonetheless, recent research suggests that the performance of the SSL algorithm can be degraded when the unlabeled set contains out-of-distribution examples (OODs). This work addresses the following question: How do out-of-distribution (OOD) data adversely affect semi-supervised learning algorithms? To answer this question, we investigate the critical causes of OOD's negative effect on SSL algorithms. In particular, we found that 1) certain kinds of OOD data instances that are close to the decision boundary have a more significant impact on performance than those that are further away, and 2) Batch Normalization (BN), a popular module, may degrade rather than improve performance when the unlabeled set contains OODs. In this context, we developed a unified weighted robust SSL framework that can be easily extended to many existing SSL algorithms and improve their robustness against OODs. More specifically, we developed an efficient bi-level optimization algorithm that could accommodate high-order approximations of the objective and scale to multiple inner optimization steps to learn a massive number of weight parameters while outperforming existing low-order approximations of bi-level optimization. Further, we conduct a theoretical study of the impact of faraway OODs in the BN step and propose a weighted batch normalization (WBN) procedure for improved performance. Finally, we discuss the connection between our approach and low-order approximation techniques. Our experiments on synthetic and real-world datasets demonstrate that our proposed approach significantly enhances the robustness of four representative SSL algorithms against OODs compared to four state-of-the-art robust SSL strategies.
    Variational Causal Inference. (arXiv:2209.05935v2 [stat.ML] UPDATED)
    Estimating an individual's potential outcomes under counterfactual treatments is a challenging task for traditional causal inference and supervised learning approaches when the outcome is high-dimensional (e.g. gene expressions, impulse responses, human faces) and covariates are relatively limited. In this case, to construct one's outcome under a counterfactual treatment, it is crucial to leverage individual information contained in its observed factual outcome on top of the covariates. We propose a deep variational Bayesian framework that rigorously integrates two main sources of information for outcome construction under a counterfactual treatment: one source is the individual features embedded in the high-dimensional factual outcome; the other source is the response distribution of similar subjects (subjects with the same covariates) that factually received this treatment of interest.
    Efficient Meta-Learning via Error-based Context Pruning for Implicit Neural Representations. (arXiv:2302.00617v1 [cs.LG])
    We introduce an efficient optimization-based meta-learning technique for learning large-scale implicit neural representations (INRs). Our main idea is designing an online selection of context points, which can significantly reduce memory requirements for meta-learning in any established setting. By doing so, we expect additional memory savings which allows longer per-signal adaptation horizons (at a given memory budget), leading to better meta-initializations by reducing myopia and, more crucially, enabling learning on high-dimensional signals. To implement such context pruning, our technical novelty is three-fold. First, we propose a selection scheme that adaptively chooses a subset at each adaptation step based on the predictive error, leading to the modeling of the global structure of the signal in early steps and enabling the later steps to capture its high-frequency details. Second, we counteract any possible information loss from context pruning by minimizing the parameter distance to a bootstrapped target model trained on a full context set. Finally, we suggest using the full context set with a gradient scaling scheme at test-time. Our technique is model-agnostic, intuitive, and straightforward to implement, showing significant reconstruction improvements for a wide range of signals. Code is available at https://github.com/jihoontack/ECoP
    A Fair Empirical Risk Minimization with Generalized Entropy. (arXiv:2202.11966v3 [cs.LG] UPDATED)
    This paper studies a parametric family of algorithmic fairness metrics, called generalized entropy, which originally has been used in public welfare and recently introduced to machine learning community. As a meaningful metric to evaluate algorithmic fairness, it requires that generalized entropy specify fairness requirements of a classification problem and the fairness requirements should be realized with small deviation by an algorithm. We investigate the role of generalized entropy as a design parameter for fair classification algorithm through a fair empirical risk minimization with a constraint specified in terms of generalized entropy. We theoretically and experimentally study learnability of the problem.
    Anderson Acceleration For Bioinformatics-Based Machine Learning. (arXiv:2302.00347v1 [cs.LG])
    Anderson acceleration (AA) is a well-known method for accelerating the convergence of iterative algorithms, with applications in various fields including deep learning and optimization. Despite its popularity in these areas, the effectiveness of AA in classical machine learning classifiers has not been thoroughly studied. Tabular data, in particular, presents a unique challenge for deep learning models, and classical machine learning models are known to perform better in these scenarios. However, the convergence analysis of these models has received limited attention. To address this gap in research, we implement a support vector machine (SVM) classifier variant that incorporates AA to speed up convergence. We evaluate the performance of our SVM with and without Anderson acceleration on several datasets from the biology domain and demonstrate that the use of AA significantly improves convergence and reduces the training loss as the number of iterations increases. Our findings provide a promising perspective on the potential of Anderson acceleration in the training of simple machine learning classifiers and underscore the importance of further research in this area. By showing the effectiveness of AA in this setting, we aim to inspire more studies that explore the applications of AA in classical machine learning.
    Pessimistic Off-Policy Optimization for Learning to Rank. (arXiv:2206.02593v3 [cs.LG] UPDATED)
    Off-policy learning is a framework for optimizing policies without deploying them, using data collected by another policy. In recommender systems, this is especially challenging due to the imbalance in logged data: some items are recommended and thus logged more frequently than others. This is further perpetuated when recommending a list of items, as the action space is combinatorial. To address this challenge, we study pessimistic off-policy optimization for learning to rank. The key idea is to compute lower confidence bounds on parameters of click models and then return the list with the highest pessimistic estimate of its value. This approach is computationally efficient and we analyze it. We study its Bayesian and frequentist variants, and overcome the limitation of unknown prior by incorporating empirical Bayes. To show the empirical effectiveness of our approach, we compare it to off-policy optimizers that use inverse propensity scores or neglect uncertainty. Our approach outperforms all baselines, is robust, and is also general.
    HCR-Net: A deep learning based script independent handwritten character recognition network. (arXiv:2108.06663v3 [cs.CV] UPDATED)
    Despite being studied extensively for a few decades, handwritten character recognition (HCR) is still considered a challenging learning problem in pattern recognition, and there is very limited research on script independent models. This is mainly because of similarity in structure of characters, different handwriting styles, noisy datasets, diversity of scripts, focus of the conventional research on handcrafted feature extraction techniques, and unavailability of public datasets and code-repositories to reproduce the results. On the other hand, deep learning has witnessed huge success in different areas of pattern recognition, including HCR, and provides an end-to-end learning. However, deep learning techniques are computationally expensive, need large amount of data for training and have been developed for specific scripts only. To address the above limitations, we have proposed a novel generic deep learning architecture for script independent handwritten character recognition, called HCR-Net. HCR-Net is based on a novel transfer learning approach for HCR, which partly utilizes feature extraction layers of a pre-trained network. Due to transfer learning and image-augmentation, HCR-Net provides faster and computationally efficient training, better performance and better generalizations, and can work with small datasets. HCR-Net is extensively evaluated on 40 publicly available datasets of Bangla, Punjabi, Hindi, English, Swedish, Urdu, Farsi, Tibetan, Kannada, Malayalam, Telugu, Marathi, Nepali and Arabic languages, and established 26 new benchmark results while performed close to the best results in the rest cases. HCR-Net showed performance improvements up to 11% against the existing results and achieved a fast convergence rate showing up to 99% of final performance in the very first epoch. HCR-Net significantly outperformed the state-of-the-art transfer learning techniques...
    Diffusion Models for High-Resolution Solar Forecasts. (arXiv:2302.00170v1 [cs.LG])
    Forecasting future weather and climate is inherently difficult. Machine learning offers new approaches to increase the accuracy and computational efficiency of forecasts, but current methods are unable to accurately model uncertainty in high-dimensional predictions. Score-based diffusion models offer a new approach to modeling probability distributions over many dependent variables, and in this work, we demonstrate how they provide probabilistic forecasts of weather and climate variables at unprecedented resolution, speed, and accuracy. We apply the technique to day-ahead solar irradiance forecasts by generating many samples from a diffusion model trained to super-resolve coarse-resolution numerical weather predictions to high-resolution weather satellite observations.
    Deep Power Laws for Hyperparameter Optimization. (arXiv:2302.00441v1 [cs.LG])
    Hyperparameter optimization is an important subfield of machine learning that focuses on tuning the hyperparameters of a chosen algorithm to achieve peak performance. Recently, there has been a stream of methods that tackle the issue of hyperparameter optimization, however, most of the methods do not exploit the scaling law property of learning curves. In this work, we propose Deep Power Laws (DPL), an ensemble of neural network models conditioned to yield predictions that follow a power-law scaling pattern. Our method dynamically decides which configurations to pause and train incrementally by making use of gray-box evaluations. We compare our method against 7 state-of-the-art competitors on 3 benchmarks related to tabular, image, and NLP datasets covering 57 diverse tasks. Our method achieves the best results across all benchmarks by obtaining the best any-time results compared to all competitors.
    Analyzing Leakage of Personally Identifiable Information in Language Models. (arXiv:2302.00539v1 [cs.LG])
    Language Models (LMs) have been shown to leak information about training data through sentence-level membership inference and reconstruction attacks. Understanding the risk of LMs leaking Personally Identifiable Information (PII) has received less attention, which can be attributed to the false assumption that dataset curation techniques such as scrubbing are sufficient to prevent PII leakage. Scrubbing techniques reduce but do not prevent the risk of PII leakage: in practice scrubbing is imperfect and must balance the trade-off between minimizing disclosure and preserving the utility of the dataset. On the other hand, it is unclear to which extent algorithmic defenses such as differential privacy, designed to guarantee sentence- or user-level privacy, prevent PII disclosure. In this work, we propose (i) a taxonomy of PII leakage in LMs, (ii) metrics to quantify PII leakage, and (iii) attacks showing that PII leakage is a threat in practice. Our taxonomy provides rigorous game-based definitions for PII leakage via black-box extraction, inference, and reconstruction attacks with only API access to an LM. We empirically evaluate attacks against GPT-2 models fine-tuned on three domains: case law, health care, and e-mails. Our main contributions are (i) novel attacks that can extract up to 10 times more PII sequences as existing attacks, (ii) showing that sentence-level differential privacy reduces the risk of PII disclosure but still leaks about 3% of PII sequences, and (iii) a subtle connection between record-level membership inference and PII reconstruction.
    Efficient Multi-Task Reinforcement Learning via Selective Behavior Sharing. (arXiv:2302.00671v1 [cs.LG])
    The ability to leverage shared behaviors between tasks is critical for sample-efficient multi-task reinforcement learning (MTRL). While prior methods have primarily explored parameter and data sharing, direct behavior-sharing has been limited to task families requiring similar behaviors. Our goal is to extend the efficacy of behavior-sharing to more general task families that could require a mix of shareable and conflicting behaviors. Our key insight is an agent's behavior across tasks can be used for mutually beneficial exploration. To this end, we propose a simple MTRL framework for identifying shareable behaviors over tasks and incorporating them to guide exploration. We empirically demonstrate how behavior sharing improves sample efficiency and final performance on manipulation and navigation MTRL tasks and is even complementary to parameter sharing. Result videos are available at https://sites.google.com/view/qmp-mtrl.
    The Power of External Memory in Increasing Predictive Model Capacity. (arXiv:2302.00003v1 [cs.LG])
    One way of introducing sparsity into deep networks is by attaching an external table of parameters that is sparsely looked up at different layers of the network. By storing the bulk of the parameters in the external table, one can increase the capacity of the model without necessarily increasing the inference time. Two crucial questions in this setting are then: what is the lookup function for accessing the table and how are the contents of the table consumed? Prominent methods for accessing the table include 1) using words/wordpieces token-ids as table indices, 2) LSH hashing the token vector in each layer into a table of buckets, and 3) learnable softmax style routing to a table entry. The ways to consume the contents include adding/concatenating to input representation, and using the contents as expert networks that specialize to different inputs. In this work, we conduct rigorous experimental evaluations of existing ideas and their combinations. We also introduce a new method, alternating updates, that enables access to an increased token dimension without increasing the computation time, and demonstrate its effectiveness in language modeling.
    Model-Parallel Fourier Neural Operators as Learned Surrogates for Large-Scale Parametric PDEs. (arXiv:2204.01205v3 [cs.LG] UPDATED)
    Fourier neural operators (FNOs) are a recently introduced neural network architecture for learning solution operators of partial differential equations (PDEs), which have been shown to perform significantly better than comparable deep learning approaches. Once trained, FNOs can achieve speed-ups of multiple orders of magnitude over conventional numerical PDE solvers. However, due to the high dimensionality of their input data and network weights, FNOs have so far only been applied to two-dimensional or small three-dimensional problems. To remove this limited problem-size barrier, we propose a model-parallel version of FNOs based on domain-decomposition of both the input data and network weights. We demonstrate that our model-parallel FNO is able to predict time-varying PDE solutions of over 2.6 billion variables on Perlmutter using up to 512 A100 GPUs and show an example of training a distributed FNO on the Azure cloud for simulating multiphase CO$_2$ dynamics in the Earth's subsurface.
    Deep Dependency Networks for Multi-Label Classification. (arXiv:2302.00633v1 [cs.LG])
    We propose a simple approach which combines the strengths of probabilistic graphical models and deep learning architectures for solving the multi-label classification task, focusing specifically on image and video data. First, we show that the performance of previous approaches that combine Markov Random Fields with neural networks can be modestly improved by leveraging more powerful methods such as iterative join graph propagation, integer linear programming, and $\ell_1$ regularization-based structure learning. Then we propose a new modeling framework called deep dependency networks, which augments a dependency network, a model that is easy to train and learns more accurate dependencies but is limited to Gibbs sampling for inference, to the output layer of a neural network. We show that despite its simplicity, jointly learning this new architecture yields significant improvements in performance over the baseline neural network. In particular, our experimental evaluation on three video activity classification datasets: Charades, Textually Annotated Cooking Scenes (TACoS), and Wetlab, and three multi-label image classification datasets: MS-COCO, PASCAL VOC, and NUS-WIDE show that deep dependency networks are almost always superior to pure neural architectures that do not use dependency networks.
    Test-Time Amendment with a Coarse Classifier for Fine-Grained Classification. (arXiv:2302.00368v1 [cs.CV])
    We investigate the problem of reducing mistake severity for fine-grained classification. Fine-grained classification can be challenging, mainly due to the requirement of knowledge or domain expertise for accurate annotation. However, humans are particularly adept at performing coarse classification as it requires relatively low levels of expertise. To this end, we present a novel approach for Post-Hoc Correction called Hierarchical Ensembles (HiE) that utilizes label hierarchy to improve the performance of fine-grained classification at test-time using the coarse-grained predictions. By only requiring the parents of leaf nodes, our method significantly reduces avg. mistake severity while improving top-1 accuracy on the iNaturalist-19 and tieredImageNet-H datasets, achieving a new state-of-the-art on both benchmarks. We also investigate the efficacy of our approach in the semi-supervised setting. Our approach brings notable gains in top-1 accuracy while significantly decreasing the severity of mistakes as training data decreases for the fine-grained classes. The simplicity and post-hoc nature of HiE render it practical to be used with any off-the-shelf trained model to improve its predictions further.
    Generative Adversarial Symmetry Discovery. (arXiv:2302.00236v1 [cs.LG])
    Despite the success of equivariant neural networks in scientific applications, they require knowing the symmetry group a priori. However, it may be difficult to know the right symmetry to use as an inductive bias in practice and enforcing the wrong symmetry could hurt the performance. In this paper, we propose a framework, LieGAN, to automatically discover equivariances from a dataset using a paradigm akin to generative adversarial training. Specifically, a generator learns a group of transformations applied to the data, which preserves the original distribution and fools the discriminator. LieGAN represents symmetry as interpretable Lie algebra basis and can discover various symmetries such as rotation group $\mathrm{SO}(n)$ and restricted Lorentz group $\mathrm{SO}(1,3)^+$ in trajectory prediction and top quark tagging tasks. The learned symmetry can also be readily used in several existing equivariant neural networks to improve accuracy and generalization in prediction.
    Detecting Harmful Agendas in News Articles. (arXiv:2302.00102v1 [cs.CL])
    Manipulated news online is a growing problem which necessitates the use of automated systems to curtail its spread. We argue that while misinformation and disinformation detection have been studied, there has been a lack of investment in the important open challenge of detecting harmful agendas in news articles; identifying harmful agendas is critical to flag news campaigns with the greatest potential for real world harm. Moreover, due to real concerns around censorship, harmful agenda detectors must be interpretable to be effective. In this work, we propose this new task and release a dataset, NewsAgendas, of annotated news articles for agenda identification. We show how interpretable systems can be effective on this task and demonstrate that they can perform comparably to black-box models.
    Free Lunch for Domain Adversarial Training: Environment Label Smoothing. (arXiv:2302.00194v1 [cs.LG])
    A fundamental challenge for machine learning models is how to generalize learned models for out-of-distribution (OOD) data. Among various approaches, exploiting invariant features by Domain Adversarial Training (DAT) received widespread attention. Despite its success, we observe training instability from DAT, mostly due to over-confident domain discriminator and environment label noise. To address this issue, we proposed Environment Label Smoothing (ELS), which encourages the discriminator to output soft probability, which thus reduces the confidence of the discriminator and alleviates the impact of noisy environment labels. We demonstrate, both experimentally and theoretically, that ELS can improve training stability, local convergence, and robustness to noisy environment labels. By incorporating ELS with DAT methods, we are able to yield state-of-art results on a wide range of domain generalization/adaptation tasks, particularly when the environment labels are highly noisy.
    The Parametric Stability of Well-separated Spherical Gaussian Mixtures. (arXiv:2302.00242v1 [stat.ML])
    We quantify the parameter stability of a spherical Gaussian Mixture Model (sGMM) under small perturbations in distribution space. Namely, we derive the first explicit bound to show that for a mixture of spherical Gaussian $P$ (sGMM) in a pre-defined model class, all other sGMM close to $P$ in this model class in total variation distance has a small parameter distance to $P$. Further, this upper bound only depends on $P$. The motivation for this work lies in providing guarantees for fitting Gaussian mixtures; with this aim in mind, all the constants involved are well defined and distribution free conditions for fitting mixtures of spherical Gaussians. Our results tighten considerably the existing computable bounds, and asymptotically match the known sharp thresholds for this problem.
    Knowledge Distillation on Graphs: A Survey. (arXiv:2302.00219v1 [cs.LG])
    Graph Neural Networks (GNNs) have attracted tremendous attention by demonstrating their capability to handle graph data. However, they are difficult to be deployed in resource-limited devices due to model sizes and scalability constraints imposed by the multi-hop data dependency. In addition, real-world graphs usually possess complex structural information and features. Therefore, to improve the applicability of GNNs and fully encode the complicated topological information, knowledge distillation on graphs (KDG) has been introduced to build a smaller yet effective model and exploit more knowledge from data, leading to model compression and performance improvement. Recently, KDG has achieved considerable progress with many studies proposed. In this survey, we systematically review these works. Specifically, we first introduce KDG challenges and bases, then categorize and summarize existing works of KDG by answering the following three questions: 1) what to distillate, 2) who to whom, and 3) how to distillate. Finally, we share our thoughts on future research directions.
    Towards Label-Efficient Incremental Learning: A Survey. (arXiv:2302.00353v1 [cs.LG])
    The current dominant paradigm when building a machine learning model is to iterate over a dataset over and over until convergence. Such an approach is non-incremental, as it assumes access to all images of all categories at once. However, for many applications, non-incremental learning is unrealistic. To that end, researchers study incremental learning, where a learner is required to adapt to an incoming stream of data with a varying distribution while preventing forgetting of past knowledge. Significant progress has been made, however, the vast majority of works focus on the fully supervised setting, making these algorithms label-hungry thus limiting their real-life deployment. To that end, in this paper, we make the first attempt to survey recently growing interest in label-efficient incremental learning. We identify three subdivisions, namely semi-, few-shot- and self-supervised learning to reduce labeling efforts. Finally, we identify novel directions that can further enhance label-efficiency and improve incremental learning scalability. Project website: {https://github.com/kilickaya/label-efficient-il.
    Density peak clustering using tensor network. (arXiv:2302.00192v1 [cs.LG])
    Tensor networks, which have been traditionally used to simulate many-body physics, have recently gained significant attention in the field of machine learning due to their powerful representation capabilities. In this work, we propose a density-based clustering algorithm inspired by tensor networks. We encode classical data into tensor network states on an extended Hilbert space and train the tensor network states to capture the features of the clusters. Here, we define density and related concepts in terms of fidelity, rather than using a classical distance measure. We evaluate the performance of our algorithm on six synthetic data sets, four real world data sets, and three commonly used computer vision data sets. The results demonstrate that our method provides state-of-the-art performance on several synthetic data sets and real world data sets, even when the number of clusters is unknown. Additionally, our algorithm performs competitively with state-of-the-art algorithms on the MNIST, USPS, and Fashion-MNIST image data sets. These findings reveal the great potential of tensor networks for machine learning applications.
    Bridging Physics-Informed Neural Networks with Reinforcement Learning: Hamilton-Jacobi-Bellman Proximal Policy Optimization (HJBPPO). (arXiv:2302.00237v1 [cs.LG])
    This paper introduces the Hamilton-Jacobi-Bellman Proximal Policy Optimization (HJBPPO) algorithm into reinforcement learning. The Hamilton-Jacobi-Bellman (HJB) equation is used in control theory to evaluate the optimality of the value function. Our work combines the HJB equation with reinforcement learning in continuous state and action spaces to improve the training of the value network. We treat the value network as a Physics-Informed Neural Network (PINN) to solve for the HJB equation by computing its derivatives with respect to its inputs exactly. The Proximal Policy Optimization (PPO)-Clipped algorithm is improvised with this implementation as it uses a value network to compute the objective function for its policy network. The HJBPPO algorithm shows an improved performance compared to PPO on the MuJoCo environments.
    Implicit Regularization Leads to Benign Overfitting for Sparse Linear Regression. (arXiv:2302.00257v1 [cs.LG])
    In deep learning, often the training process finds an interpolator (a solution with 0 training loss), but the test loss is still low. This phenomenon, known as benign overfitting, is a major mystery that received a lot of recent attention. One common mechanism for benign overfitting is implicit regularization, where the training process leads to additional properties for the interpolator, often characterized by minimizing certain norms. However, even for a simple sparse linear regression problem $y = \beta^{*\top} x +\xi$ with sparse $\beta^*$, neither minimum $\ell_1$ or $\ell_2$ norm interpolator gives the optimal test loss. In this work, we give a different parametrization of the model which leads to a new implicit regularization effect that combines the benefit of $\ell_1$ and $\ell_2$ interpolators. We show that training our new model via gradient descent leads to an interpolator with near-optimal test loss. Our result is based on careful analysis of the training dynamics and provides another example of implicit regularization effect that goes beyond norm minimization.
    W2SAT: Learning to generate SAT instances from Weighted Literal Incidence Graphs. (arXiv:2302.00272v1 [cs.LG])
    The Boolean Satisfiability (SAT) problem stands out as an attractive NP-complete problem in theoretic computer science and plays a central role in a broad spectrum of computing-related applications. Exploiting and tuning SAT solvers under numerous scenarios require massive high-quality industry-level SAT instances, which unfortunately are quite limited in the real world. To address the data insufficiency issue, in this paper, we propose W2SAT, a framework to generate SAT formulas by learning intrinsic structures and properties from given real-world/industrial instances in an implicit fashion. To this end, we introduce a novel SAT representation called Weighted Literal Incidence Graph (WLIG), which exhibits strong representation ability and generalizability against existing counterparts, and can be efficiently generated via a specialized learning-based graph generative model. Decoding from WLIGs into SAT problems is then modeled as finding overlapping cliques with a novel hill-climbing optimization method termed Optimal Weight Coverage (OWC). Experiments demonstrate the superiority of our WLIG-induced approach in terms of graph metrics, efficiency, and scalability in comparison to previous methods. Additionally, we discuss the limitations of graph-based SAT generation for real-world applications, especially when utilizing generated instances for SAT solver parameter-tuning, and pose some potential directions.
    The geometry of hidden representations of large transformer models. (arXiv:2302.00294v1 [cs.LG])
    Large transformers are powerful architectures for self-supervised analysis of data of various nature, ranging from protein sequences to text to images. In these models, the data representation in the hidden layers live in the same space, and the semantic structure of the dataset emerges by a sequence of functionally identical transformations between one representation and the next. We here characterize the geometric and statistical properties of these representations, focusing on the evolution of such proprieties across the layers. By analyzing geometric properties such as the intrinsic dimension (ID) and the neighbor composition we find that the representations evolve in a strikingly similar manner in transformers trained on protein language tasks and image reconstruction tasks. In the first layers, the data manifold expands, becoming high-dimensional, and then it contracts significantly in the intermediate layers. In the last part of the model, the ID remains approximately constant or forms a second shallow peak. We show that the semantic complexity of the dataset emerges at the end of the first peak. This phenomenon can be observed across many models trained on diverse datasets. Based on these observations, we suggest using the ID profile as an unsupervised proxy to identify the layers which are more suitable for downstream learning tasks.
    Deep Active Learning for Scientific Computing in the Wild. (arXiv:2302.00098v1 [cs.LG])
    Deep learning (DL) is revolutionizing the scientific computing community. To reduce the data gap caused by usually expensive simulations or experimentation, active learning has been identified as a promising solution for the scientific computing community. However, the deep active learning (DAL) literature is currently dominated by image classification problems and pool-based methods, which are not directly transferrable to scientific computing problems, dominated by regression problems with no pre-defined 'pool' of unlabeled data. Here for the first time, we investigate the robustness of DAL methods for scientific computing problems using ten state-of-the-art DAL methods and eight benchmark problems. We show that, to our surprise, the majority of the DAL methods are not robust even compared to random sampling when the ideal pool size is unknown. We further analyze the effectiveness and robustness of DAL methods and suggest that diversity is necessary for a robust DAL for scientific computing problems.
    Stroke-based Rendering: From Heuristics to Deep Learning. (arXiv:2302.00595v1 [cs.CV])
    In the last few years, artistic image-making with deep learning models has gained a considerable amount of traction. A large number of these models operate directly in the pixel space and generate raster images. This is however not how most humans would produce artworks, for example, by planning a sequence of shapes and strokes to draw. Recent developments in deep learning methods help to bridge the gap between stroke-based paintings and pixel photo generation. With this survey, we aim to provide a structured introduction and understanding of common challenges and approaches in stroke-based rendering algorithms. These algorithms range from simple rule-based heuristics to stroke optimization and deep reinforcement agents, trained to paint images with differentiable vector graphics and neural rendering.
    Dynamic Flows on Curved Space Generated by Labeled Data. (arXiv:2302.00061v1 [cs.LG])
    The scarcity of labeled data is a long-standing challenge for many machine learning tasks. We propose our gradient flow method to leverage the existing dataset (i.e., source) to generate new samples that are close to the dataset of interest (i.e., target). We lift both datasets to the space of probability distributions on the feature-Gaussian manifold, and then develop a gradient flow method that minimizes the maximum mean discrepancy loss. To perform the gradient flow of distributions on the curved feature-Gaussian space, we unravel the Riemannian structure of the space and compute explicitly the Riemannian gradient of the loss function induced by the optimal transport metric. For practical applications, we also propose a discretized flow, and provide conditional results guaranteeing the global convergence of the flow to the optimum. We illustrate the results of our proposed gradient flow method on several real-world datasets and show our method can improve the accuracy of classification models in transfer learning settings.
    Active Uncertainty Reduction for Safe and Efficient Interaction Planning: A Shielding-Aware Dual Control Approach. (arXiv:2302.00171v1 [cs.RO])
    The ability to accurately predict the opponent's behavior is central to the safety and efficiency of robotic systems in interactive settings, such as human-robot interaction and multi-robot teaming tasks. Unfortunately, robots often lack access to key information on which these predictions may hinge, such as opponent's goals, attention, and willingness to cooperate. Dual control theory addresses this challenge by treating unknown parameters of a predictive model as hidden states and inferring their values at runtime using information gathered during system operation. While able to optimally and automatically trade off exploration and exploitation, dual control is computationally intractable for general interactive motion planning. In this paper, we present a novel algorithmic approach to enable active uncertainty reduction for interactive motion planning based on the implicit dual control paradigm. Our approach relies on sampling-based approximation of stochastic dynamic programming, leading to a model predictive control problem. The resulting policy is shown to preserve the dual control effect for a broad class of predictive models with both continuous and categorical uncertainty. To ensure the safe operation of the interacting agents, we leverage a supervisory control scheme, oftentimes referred to as ``shielding'', which overrides the ego agent's dual control policy with a safety fallback strategy when a safety-critical event is imminent. We then augment the dual control framework with an improved variant of the recently proposed shielding-aware robust planning scheme, which proactively balances the nominal planning performance with the risk of high-cost emergency maneuvers triggered by low-probability opponent's behaviors. We demonstrate the efficacy of our approach with both simulated driving examples and hardware experiments using 1/10 scale autonomous vehicles.
    Probabilistic Point Cloud Modeling via Self-Organizing Gaussian Mixture Models. (arXiv:2302.00047v1 [cs.LG])
    This letter presents a continuous probabilistic modeling methodology for spatial point cloud data using finite Gaussian Mixture Models (GMMs) where the number of components are adapted based on the scene complexity. Few hierarchical and adaptive methods have been proposed to address the challenge of balancing model fidelity with size. Instead, state-of-the-art mapping approaches require tuning parameters for specific use cases, but do not generalize across diverse environments. To address this gap, we utilize a self-organizing principle from information-theoretic learning to automatically adapt the complexity of the GMM model based on the relevant information in the sensor data. The approach is evaluated against existing point cloud modeling techniques on real-world data with varying degrees of scene complexity.
    Debiasing Vision-Language Models via Biased Prompts. (arXiv:2302.00070v1 [cs.LG])
    Machine learning models have been shown to inherit biases from their training datasets, which can be particularly problematic for vision-language foundation models trained on uncurated datasets scraped from the internet. The biases can be amplified and propagated to downstream applications like zero-shot classifiers and text-to-image generative models. In this study, we propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding. In particular, we show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models. The closed-form solution enables easy integration into large-scale pipelines, and empirical results demonstrate that our approach effectively reduces social bias and spurious correlation in both discriminative and generative vision-language models without the need for additional data or training.
    Online Learning in Dynamically Changing Environments. (arXiv:2302.00103v1 [cs.LG])
    We study the problem of online learning and online regret minimization when samples are drawn from a general unknown non-stationary process. We introduce the concept of a dynamic changing process with cost $K$, where the conditional marginals of the process can vary arbitrarily, but that the number of different conditional marginals is bounded by $K$ over $T$ rounds. For such processes we prove a tight (upto $\sqrt{\log T}$ factor) bound $O(\sqrt{KT\cdot\mathsf{VC}(\mathcal{H})\log T})$ for the expected worst case regret of any finite VC-dimensional class $\mathcal{H}$ under absolute loss (i.e., the expected miss-classification loss). We then improve this bound for general mixable losses, by establishing a tight (up to $\log^3 T$ factor) regret bound $O(K\cdot\mathsf{VC}(\mathcal{H})\log^3 T)$. We extend these results to general smooth adversary processes with unknown reference measure by showing a sub-linear regret bound for $1$-dimensional threshold functions under a general bounded convex loss. Our results can be viewed as a first step towards regret analysis with non-stationary samples in the distribution blind (universal) regime. This also brings a new viewpoint that shifts the study of complexity of the hypothesis classes to the study of the complexity of processes generating data.
    ezDPS: An Efficient and Zero-Knowledge Machine Learning Inference Pipeline. (arXiv:2212.05428v2 [cs.CR] UPDATED)
    Machine Learning as a service (MLaaS) permits resource-limited clients to access powerful data analytics services ubiquitously. Despite its merits, MLaaS poses significant concerns regarding the integrity of delegated computation and the privacy of the server's model parameters. To address this issue, Zhang et al. (CCS'20) initiated the study of zero-knowledge Machine Learning (zkML). Few zkML schemes have been proposed afterward; however, they focus on sole ML classification algorithms that may not offer satisfactory accuracy or require large-scale training data and model parameters, which may not be desirable for some applications. We propose ezDPS, a new efficient and zero-knowledge ML inference scheme. Unlike prior works, ezDPS is a zkML pipeline in which the data is processed in multiple stages for high accuracy. Each stage of ezDPS is harnessed with an established ML algorithm that is shown to be effective in various applications, including Discrete Wavelet Transformation, Principal Components Analysis, and Support Vector Machine. We design new gadgets to prove ML operations effectively. We fully implemented ezDPS and assessed its performance on real datasets. Experimental results showed that ezDPS achieves one-to-three orders of magnitude more efficient than the generic circuit-based approach in all metrics while maintaining more desirable accuracy than single ML classification approaches.
    Truthful Incentive Mechanism for Federated Learning with Crowdsourced Data Labeling. (arXiv:2302.00106v1 [cs.LG])
    Federated learning (FL) has emerged as a promising paradigm that trains machine learning (ML) models on clients' devices in a distributed manner without the need of transmitting clients' data to the FL server. In many applications of ML, the labels of training data need to be generated manually by human agents. In this paper, we study FL with crowdsourced data labeling where the local data of each participating client of FL are labeled manually by the client. We consider the strategic behavior of clients who may not make desired effort in their local data labeling and local model computation and may misreport their local models to the FL server. We characterize the performance bounds on the training loss as a function of clients' data labeling effort, local computation effort, and reported local models. We devise truthful incentive mechanisms which incentivize strategic clients to make truthful efforts and report true local models to the server. The truthful design exploits the non-trivial dependence of the training loss on clients' efforts and local models. Under the truthful mechanisms, we characterize the server's optimal local computation effort assignments. We evaluate the proposed FL algorithms with crowdsourced data labeling and the incentive mechanisms using experiments.
    ADAPT : Awesome Domain Adaptation Python Toolbox. (arXiv:2107.03049v2 [cs.LG] UPDATED)
    In this paper, we introduce the ADAPT library, an open source Python API providing the implementation of the main transfer learning and domain adaptation methods. The library is designed with a user friendly approach to facilitate the access to domain adaptation for a wide public. ADAPT is compatible with scikit-learn and TensorFlow and a full documentation is proposed online https://adapt-python.github.io/adapt/ with a substantial gallery of examples.
    Learning Optimal Fair Classification Trees: Trade-offs Between Interpretability, Fairness, and Accuracy. (arXiv:2201.09932v3 [cs.LG] UPDATED)
    The increasing use of machine learning in high-stakes domains -- where people's livelihoods are impacted -- creates an urgent need for interpretable, fair, and highly accurate algorithms. With these needs in mind, we propose a mixed integer optimization (MIO) framework for learning optimal classification trees -- one of the most interpretable models -- that can be augmented with arbitrary fairness constraints. In order to better quantify the "price of interpretability", we also propose a new measure of model interpretability called decision complexity that allows for comparisons across different classes of machine learning models. We benchmark our method against state-of-the-art approaches for fair classification on popular datasets; in doing so, we conduct one of the first comprehensive analyses of the trade-offs between interpretability, fairness, and predictive accuracy. Given a fixed disparity threshold, our method has a price of interpretability of about 4.2 percentage points in terms of out-of-sample accuracy compared to the best performing, complex models. However, our method consistently finds decisions with almost full parity, while other methods rarely do.
    Multi-Grade Deep Learning. (arXiv:2302.00150v1 [cs.LG])
    The current deep learning model is of a single-grade, that is, it learns a deep neural network by solving a single nonconvex optimization problem. When the layer number of the neural network is large, it is computationally challenging to carry out such a task efficiently. Inspired by the human education process which arranges learning in grades, we propose a multi-grade learning model: We successively solve a number of optimization problems of small sizes, which are organized in grades, to learn a shallow neural network for each grade. Specifically, the current grade is to learn the leftover from the previous grade. In each of the grades, we learn a shallow neural network stacked on the top of the neural network, learned in the previous grades, which remains unchanged in training of the current and future grades. By dividing the task of learning a deep neural network into learning several shallow neural networks, one can alleviate the severity of the nonconvexity of the original optimization problem of a large size. When all grades of the learning are completed, the final neural network learned is a stair-shape neural network, which is the superposition of networks learned from all grades. Such a model enables us to learn a deep neural network much more effectively and efficiently. Moreover, multi-grade learning naturally leads to adaptive learning. We prove that in the context of function approximation if the neural network generated by a new grade is nontrivial, the optimal error of the grade is strictly reduced from the optimal error of the previous grade. Furthermore, we provide several proof-of-concept numerical examples which demonstrate that the proposed multi-grade model outperforms significantly the traditional single-grade model and is much more robust than the traditional model.
    FLSTRA: Federated Learning in Stratosphere. (arXiv:2302.00163v1 [cs.NI])
    We propose a federated learning (FL) in stratosphere (FLSTRA) system, where a high altitude platform station (HAPS) felicitates a large number of terrestrial clients to collaboratively learn a global model without sharing the training data. FLSTRA overcomes the challenges faced by FL in terrestrial networks, such as slow convergence and high communication delay due to limited client participation and multi-hop communications. HAPS leverages its altitude and size to allow the participation of more clients with line-of-sight (LoS) links and the placement of a powerful server. However, handling many clients at once introduces computing and transmission delays. Thus, we aim to obtain a delay-accuracy trade-off for FLSTRA. Specifically, we first develop a joint client selection and resource allocation algorithm for uplink and downlink to minimize the FL delay subject to the energy and quality-of-service (QoS) constraints. Second, we propose a communication and computation resource-aware (CCRA-FL) algorithm to achieve the target FL accuracy while deriving an upper bound for its convergence rate. The formulated problem is non-convex; thus, we propose an iterative algorithm to solve it. Simulation results demonstrate the effectiveness of the proposed FLSTRA system, compared to terrestrial benchmarks, in terms of FL delay and accuracy.
    Jointist: Simultaneous Improvement of Multi-instrument Transcription and Music Source Separation via Joint Training. (arXiv:2302.00286v1 [cs.SD])
    In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions the other two modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that utilizes instrument information and transcription results. The joint training of the transcription and source separation modules serves to improve the performance of both tasks. The instrument module is optional and can be directly controlled by human users. This makes Jointist a flexible user-controllable framework. Our challenging problem formulation makes the model highly useful in the real world given that modern popular music typically consists of multiple instruments. Its novelty, however, necessitates a new perspective on how to evaluate such a model. In our experiments, we assess the proposed model from various aspects, providing a new evaluation perspective for multi-instrument transcription. Our subjective listening study shows that Jointist achieves state-of-the-art performance on popular music, outperforming existing multi-instrument transcription models such as MT3. %We also argue that transcription models can be used as a preprocessing module for other music analysis tasks. We conducted experiments on several downstream tasks and found that the proposed method improved transcription by more than 1 percentage points (ppt.), source separation by 5 SDR, downbeat detection by 1.8 ppt., chord recognition by 1.4 ppt., and key estimation by 1.4 ppt., when utilizing transcription results obtained from Jointist.
    Epic-Sounds: A Large-scale Dataset of Actions That Sound. (arXiv:2302.00646v1 [cs.SD])
    We introduce EPIC-SOUNDS, a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of the egocentric videos. We propose an annotation pipeline where annotators temporally label distinguishable audio segments and describe the action that could have caused this sound. We identify actions that can be discriminated purely from audio, through grouping these free-form descriptions of audio into classes. For actions that involve objects colliding, we collect human annotations of the materials of these objects (e.g. a glass object being placed on a wooden surface), which we verify from visual labels, discarding ambiguities. Overall, EPIC-SOUNDS includes 78.4k categorised segments of audible events and actions, distributed across 44 classes as well as 39.2k non-categorised segments. We train and evaluate two state-of-the-art audio recognition models on our dataset, highlighting the importance of audio-only labels and the limitations of current models to recognise actions that sound.
    Transformers Meet Directed Graphs. (arXiv:2302.00049v1 [cs.LG])
    Transformers were originally proposed as a sequence-to-sequence model for text but have become vital for a wide range of modalities, including images, audio, video, and undirected graphs. However, transformers for directed graphs are a surprisingly underexplored topic, despite their applicability to ubiquitous domains including source code and logic circuits. In this work, we propose two direction- and structure-aware positional encodings for directed graphs: (1) the eigenvectors of the Magnetic Laplacian - a direction-aware generalization of the combinatorial Laplacian; (2) directional random walk encodings. Empirically, we show that the extra directionality information is useful in various downstream tasks, including correctness testing of sorting networks and source code understanding. Together with a data-flow-centric graph construction, our model outperforms the prior state of the art on the Open Graph Benchmark Code2 relatively by 14.7%.
    Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP. (arXiv:2208.05516v4 [cs.LG] UPDATED)
    Web-crawled datasets have enabled remarkable generalization capabilities in recent image-text models such as CLIP (Contrastive Language-Image pre-training) or Flamingo, but little is known about the dataset creation processes. In this work, we introduce a testbed of six publicly available data sources - YFCC, LAION, Conceptual Captions, WIT, RedCaps, Shutterstock - to investigate how pre-training distributions induce robustness in CLIP. We find that the performance of the pre-training data varies substantially across distribution shifts, with no single data source dominating. Moreover, we systematically study the interactions between these data sources and find that combining multiple sources does not necessarily yield better models, but rather dilutes the robustness of the best individual data source. We complement our empirical findings with theoretical insights from a simple setting, where combining the training data also results in diluted robustness. In addition, our theoretical model provides a candidate explanation for the success of the CLIP-based data filtering technique recently employed in the LAION dataset. Overall our results demonstrate that simply gathering a large amount of data from the web is not the most effective way to build a pre-training dataset for robust generalization, necessitating further study into dataset design. Code is available at https://github.com/mlfoundations/clip_quality_not_quantity.
    Distributed sequential federated learning. (arXiv:2302.00107v1 [stat.ML])
    The analysis of data stored in multiple sites has become more popular, raising new concerns about the security of data storage and communication. Federated learning, which does not require centralizing data, is a common approach to preventing heavy data transportation, securing valued data, and protecting personal information protection. Therefore, determining how to aggregate the information obtained from the analysis of data in separate local sites has become an important statistical issue. The commonly used averaging methods may not be suitable due to data nonhomogeneity and incomparable results among individual sites, and applying them may result in the loss of information obtained from the individual analyses. Using a sequential method in federated learning with distributed computing can facilitate the integration and accelerate the analysis process. We develop a data-driven method for efficiently and effectively aggregating valued information by analyzing local data without encountering potential issues such as information security and heavy transportation due to data communication. In addition, the proposed method can preserve the properties of classical sequential adaptive design, such as data-driven sample size and estimation precision when applied to generalized linear models. We use numerical studies of simulated data and an application to COVID-19 data collected from 32 hospitals in Mexico, to illustrate the proposed method.
    On the Within-Group Discrimination of Screening Classifiers. (arXiv:2302.00025v1 [cs.LG])
    Screening classifiers are increasingly used to identify qualified candidates in a variety of selection processes. In this context, it has been recently shown that, if a classifier is calibrated, one can identify the smallest set of candidates which contains, in expectation, a desired number of qualified candidates using a threshold decision rule. This lends support to focusing on calibration as the only requirement for screening classifiers. In this paper, we argue that screening policies that use calibrated classifiers may suffer from an understudied type of within-group discrimination -- they may discriminate against qualified members within demographic groups of interest. Further, we argue that this type of discrimination can be avoided if classifiers satisfy within-group monotonicity, a natural monotonicity property within each of the groups. Then, we introduce an efficient post-processing algorithm based on dynamic programming to minimally modify a given calibrated classifier so that its probability estimates satisfy within-group monotonicity. We validate our algorithm using US Census survey data and show that within-group monotonicity can be often achieved at a small cost in terms of prediction granularity and shortlist size.
    Towards Answering Open-ended Ethical Quandary Questions. (arXiv:2205.05989v3 [cs.CL] UPDATED)
    Considerable advancements have been made in various NLP tasks based on the impressive power of large language models (LLMs) and many NLP applications are deployed in our daily lives. In this work, we challenge the capability of LLMs with the new task of Ethical Quandary Generative Question Answering. Ethical quandary questions are more challenging to address because multiple conflicting answers may exist to a single quandary. We explore the current capability of LLMs in providing an answer with a deliberative exchange of different perspectives to an ethical quandary, in the approach of Socratic philosophy, instead of providing a closed answer like an oracle. We propose a model that searches for different ethical principles applicable to the ethical quandary and generates an answer conditioned on the chosen principles through prompt-based few-shot learning. We also discuss the remaining challenges and ethical issues involved in this task and suggest the direction toward developing responsible NLP systems by incorporating human values explicitly.
    HOAX: A Hyperparameter Optimization Algorithm Explorer for Neural Networks. (arXiv:2302.00374v1 [physics.chem-ph])
    Computational chemistry has become an important tool to predict and understand molecular properties and reactions. Even though recent years have seen a significant growth in new algorithms and computational methods that speed up quantum chemical calculations, the bottleneck for trajectory-based methods to study photoinduced processes is still the huge number of electronic structure calculations. In this work, we present an innovative solution, in which the amount of electronic structure calculations is drastically reduced, by employing machine learning algorithms and methods borrowed from the realm of artificial intelligence. However, applying these algorithms effectively requires finding optimal hyperparameters, which remains a challenge itself. Here we present an automated user-friendly framework, HOAX, to perform the hyperparameter optimization for neural networks, which bypasses the need for a lengthy manual process. The neural network generated potential energy surfaces (PESs) reduces the computational costs compared to the ab initio-based PESs. We perform a comparative investigation on the performance of different hyperparameter optimiziation algorithms, namely grid search, simulated annealing, genetic algorithm, and bayesian optimizer in finding the optimal hyperparameters necessary for constructing the well-performing neural network in order to fit the PESs of small organic molecules. Our results show that this automated toolkit not only facilitate a straightforward way to perform the hyperparameter optimization but also the resulting neural networks-based generated PESs are in reasonable agreement with the ab initio-based PESs.
    Generating High Fidelity Synthetic Data via Coreset selection and Entropic Regularization. (arXiv:2302.00138v1 [cs.LG])
    Generative models have the ability to synthesize data points drawn from the data distribution, however, not all generated samples are high quality. In this paper, we propose using a combination of coresets selection methods and ``entropic regularization'' to select the highest fidelity samples. We leverage an Energy-Based Model which resembles a variational auto-encoder with an inference and generator model for which the latent prior is complexified by an energy-based model. In a semi-supervised learning scenario, we show that augmenting the labeled data-set, by adding our selected subset of samples, leads to better accuracy improvement rather than using all the synthetic samples.
    Training Normalizing Flows with the Precision-Recall Divergence. (arXiv:2302.00628v1 [cs.LG])
    Generative models can have distinct mode of failures like mode dropping and low quality samples, which cannot be captured by a single scalar metric. To address this, recent works propose evaluating generative models using precision and recall, where precision measures quality of samples and recall measures the coverage of the target distribution. Although a variety of discrepancy measures between the target and estimated distribution are used to train generative models, it is unclear what precision-recall trade-offs are achieved by various choices of the discrepancy measures. In this paper, we show that achieving a specified precision-recall trade-off corresponds to minimising -divergences from a family we call the {\em PR-divergences }. Conversely, any -divergence can be written as a linear combination of PR-divergences and therefore correspond to minimising a weighted precision-recall trade-off. Further, we propose a novel generative model that is able to train a normalizing flow to minimise any -divergence, and in particular, achieve a given precision-recall trade-off.
    Automatically Marginalized MCMC in Probabilistic Programming. (arXiv:2302.00564v1 [cs.LG])
    Hamiltonian Monte Carlo (HMC) is a powerful algorithm to sample latent variables from Bayesian models. The advent of probabilistic programming languages (PPLs) frees users from writing inference algorithms and lets users focus on modeling. However, many models are difficult for HMC to solve directly, which often require tricks like model reparameterization. We are motivated by the fact that many of those models could be simplified by marginalization. We propose to use automatic marginalization as part of the sampling process using HMC in a graphical model extracted from a PPL, which substantially improves sampling from real-world hierarchical models.
    QLAB: Quadratic Loss Approximation-Based Optimal Learning Rate for Deep Learning. (arXiv:2302.00252v1 [cs.LG])
    We propose a learning rate adaptation scheme, called QLAB, for descent optimizers. We derive QLAB by optimizing the quadratic approximation of the loss function and QLAB can be combined with any optimizer who can provide the descent update direction. The computation of an adaptive learning rate with QLAB requires only computing an extra loss function value. We theoretically prove the convergence of the descent optimizers with QLAB. We demonstrate the effectiveness of QLAB in a range of optimization problems by combining with conclusively stochastic gradient descent, stochastic gradient descent with momentum, and Adam. The performance is validated on multi-layer neural networks, CNN, VGG-Net, ResNet and ShuffleNet with two datasets, MNIST and CIFAR10.
    Simplicity Bias in 1-Hidden Layer Neural Networks. (arXiv:2302.00457v1 [cs.LG])
    Recent works have demonstrated that neural networks exhibit extreme simplicity bias(SB). That is, they learn only the simplest features to solve a task at hand, even in the presence of other, more robust but more complex features. Due to the lack of a general and rigorous definition of features, these works showcase SB on semi-synthetic datasets such as Color-MNIST, MNIST-CIFAR where defining features is relatively easier. In this work, we rigorously define as well as thoroughly establish SB for one hidden layer neural networks. More concretely, (i) we define SB as the network essentially being a function of a low dimensional projection of the inputs (ii) theoretically, we show that when the data is linearly separable, the network primarily depends on only the linearly separable ($1$-dimensional) subspace even in the presence of an arbitrarily large number of other, more complex features which could have led to a significantly more robust classifier, (iii) empirically, we show that models trained on real datasets such as Imagenette and Waterbirds-Landbirds indeed depend on a low dimensional projection of the inputs, thereby demonstrating SB on these datasets, iv) finally, we present a natural ensemble approach that encourages diversity in models by training successive models on features not used by earlier models, and demonstrate that it yields models that are significantly more robust to Gaussian noise.
    Electrode Selection for Noninvasive Fetal Electrocardiogram Extraction using Mutual Information Criteria. (arXiv:2302.00206v1 [eess.SP])
    Blind source separation (BSS) techniques have revealed to be promising approaches for, among other, biomedical signal processing applications. Specifically, for the noninvasive extraction of fetal cardiac signals from maternal abdominal recordings, where conventional filtering schemes have failed to extract the complete fetal ECG components. From previous studies, it is now believed that a carefully selected array of electrodes well-placed over the abdomen of a pregnant woman contains the required `information' for BSS, to extract the complete fetal components. Based on this idea, in previous works array recording systems and sensor selection strategies based on the Mutual Information (MI) criterion have been developed. In this paper the previous works have been extended, by considering the 3-dimensional aspects of the cardiac electrical activity. The proposed method has been tested on simulated and real maternal abdominal recordings. The results show that the new sensor selection strategy together with the MI criterion, can be effectively used to select the channels containing the most `information' concerning the fetal ECG components from an array of 72 recordings. The method is hence believed to be useful for the selection of the most informative channels in online applications, considering the different fetal positions and movements.
    Iterative Deepening Hyperband. (arXiv:2302.00511v1 [cs.LG])
    Hyperparameter optimization (HPO) is concerned with the automated search for the most appropriate hyperparameter configuration (HPC) of a parameterized machine learning algorithm. A state-of-the-art HPO method is Hyperband, which, however, has its own parameters that influence its performance. One of these parameters, the maximal budget, is especially problematic: If chosen too small, the budget needs to be increased in hindsight and, as Hyperband is not incremental by design, the entire algorithm must be re-run. This is not only costly but also comes with a loss of valuable knowledge already accumulated. In this paper, we propose incremental variants of Hyperband that eliminate these drawbacks, and show that these variants satisfy theoretical guarantees qualitatively similar to those for the original Hyperband with the "right" budget. Moreover, we demonstrate their practical utility in experiments with benchmark data sets.
    Decompositional Generation Process for Instance-Dependent Partial Label Learning. (arXiv:2204.03845v3 [cs.LG] UPDATED)
    Partial label learning (PLL) is a typical weakly supervised learning problem, where each training example is associated with a set of candidate labels among which only one is true. Most existing PLL approaches assume that the incorrect labels in each training example are randomly picked as the candidate labels and model the generation process of the candidate labels in a simple way. However, these approaches usually do not perform as well as expected due to the fact that the generation process of the candidate labels is always instance-dependent. Therefore, it deserves to be modeled in a refined way. In this paper, we consider instance-dependent PLL and assume that the generation process of the candidate labels could decompose into two sequential parts, where the correct label emerges first in the mind of the annotator but then the incorrect labels related to the feature are also selected with the correct label as candidate labels due to uncertainty of labeling. Motivated by this consideration, we propose a novel PLL method that performs Maximum A Posterior (MAP) based on an explicitly modeled generation process of candidate labels via decomposed probability distribution models. Extensive experiments on manually corrupted benchmark datasets and real-world datasets validate the effectiveness of the proposed method. Source code is available at https://github.com/palm-ml/idgp.
    SPIDE: A Purely Spike-based Method for Training Feedback Spiking Neural Networks. (arXiv:2302.00232v1 [cs.NE])
    Spiking neural networks (SNNs) with event-based computation are promising brain-inspired models for energy-efficient applications on neuromorphic hardware. However, most supervised SNN training methods, such as conversion from artificial neural networks or direct training with surrogate gradients, require complex computation rather than spike-based operations of spiking neurons during training. In this paper, we study spike-based implicit differentiation on the equilibrium state (SPIDE) that extends the recently proposed training method, implicit differentiation on the equilibrium state (IDE), for supervised learning with purely spike-based computation, which demonstrates the potential for energy-efficient training of SNNs. Specifically, we introduce ternary spiking neuron couples and prove that implicit differentiation can be solved by spikes based on this design, so the whole training procedure, including both forward and backward passes, is made as event-driven spike computation, and weights are updated locally with two-stage average firing rates. Then we propose to modify the reset membrane potential to reduce the approximation error of spikes. With these key components, we can train SNNs with flexible structures in a small number of time steps and with firing sparsity during training, and the theoretical estimation of energy costs demonstrates the potential for high efficiency. Meanwhile, experiments show that even with these constraints, our trained models can still achieve competitive results on MNIST, CIFAR-10, CIFAR-100, and CIFAR10-DVS. Our code is available at https://github.com/pkuxmq/SPIDE-FSNN.
    Dictionary-based Manifold Learning. (arXiv:2302.00263v1 [cs.LG])
    We propose a paradigm for interpretable Manifold Learning for scientific data analysis, whereby we parametrize a manifold with $d$ smooth functions from a scientist-provided dictionary of meaningful, domain-related functions. When such a parametrization exists, we provide an algorithm for finding it based on sparse non-linear regression in the manifold tangent bundle, bypassing more standard manifold learning algorithms. We also discuss conditions for the existence of such parameterizations in function space and for successful recovery from finite samples. We demonstrate our method with experimental results from a real scientific domain.
    Sequential Predictive Conformal Inference for Time Series. (arXiv:2212.03463v2 [stat.ML] UPDATED)
    We present a new distribution-free conformal prediction algorithm for sequential data (e.g., time series), called the \textit{sequential predictive conformal inference} (\texttt{SPCI}). We specifically account for the nature that time series data are non-exchangeable, and thus many existing conformal prediction algorithms are not applicable. The main idea is to exploit the temporal dependence of non-conformity scores (e.g., prediction residuals); thus, the past residuals contain information about future ones. Then we cast the problem of conformal prediction interval as predicting the quantile of a future residual, given a user-specified point prediction algorithm. Theoretically, we establish asymptotic valid conditional coverage upon extending consistency analyses in quantile regression. Using simulation and real-data experiments, we demonstrate a significant reduction in interval width of \texttt{SPCI} compared to other existing methods under the desired empirical coverage.
    End-to-End Full-Atom Antibody Design. (arXiv:2302.00203v1 [q-bio.BM])
    Antibody design is an essential yet challenging task in various domains like therapeutics and biology. There are two major defects in current learning-based methods: 1) tackling only a certain subtask of the whole antibody design pipeline, making them suboptimal or resource-intensive. 2) omitting either the framework regions or side chains, thus incapable of capturing the full-atom geometry. To address these pitfalls, we propose dynamic Multi-channel Equivariant grAph Network (dyMEAN), an end-to-end full-atom model for E(3)-equivariant antibody design given the epitope and the incomplete sequence of the antibody. Specifically, we first explore structural initialization as a knowledgeable guess of the antibody structure and then propose shadow paratope to bridge the epitope-antibody connections. Both 1D sequences and 3D structures are updated via an adaptive multi-channel equivariant encoder that is able to process protein residues of variable sizes when considering full atoms. Finally, the updated antibody is docked to the epitope via the alignment of the shadow paratope. Experiments on epitope-binding CDR-H3 design, complex structure prediction, and affinity optimization demonstrate the superiority of our end-to-end framework and full-atom modeling.
    Whats Missing? Learning Hidden Markov Models When the Locations of Missing Observations are Unknown. (arXiv:2203.06527v2 [stat.ML] UPDATED)
    The Hidden Markov Model (HMM) is one of the most widely used statistical models for sequential data analysis, and it has been successfully applied in a large variety of domains. One of the key reasons for this versatility is the ability of HMMs to deal with missing data. However, standard HMM learning algorithms rely crucially on the assumption that the positions of the missing observations within the observation sequence are known. In some situations where such assumptions are not feasible, a number of special algorithms have been developed. Currently, these algorithms rely strongly on specific structural assumptions of the underlying chain, such as acyclicity, and are not applicable in the general case. In particular, there are numerous domains within medicine and computational biology, where the missing observation locations are unknown and acyclicity assumptions do not hold, thus presenting a barrier for the application of HMMs in those fields. In this paper we consider a general problem of learning HMMs from data with unknown missing observation locations (i.e., only the order of the non-missing observations are known). We introduce a generative model of the location omissions and propose two learning methods for this model, a (semi) analytic approach, and a Gibbs sampler. We evaluate and compare the algorithms in a variety of scenarios, measuring their reconstruction precision and robustness under model misspecification.
    Width and Depth Limits Commute in Residual Networks. (arXiv:2302.00453v1 [stat.ML])
    We show that taking the width and depth to infinity in a deep neural network with skip connections, when branches are scaled by $1/\sqrt{depth}$ (the only nontrivial scaling), result in the same covariance structure no matter how that limit is taken. This explains why the standard infinite-width-then-depth approach provides practical insights even for networks with depth of the same order as width. We also demonstrate that the pre-activations, in this case, have Gaussian distributions which has direct applications in Bayesian deep learning. We conduct extensive simulations that show an excellent match with our theoretical findings.
    Exploring Semantic Perturbations on Grover. (arXiv:2302.00509v1 [cs.LG])
    With news and information being as easy to access as they currently are, it is more important than ever to ensure that people are not mislead by what they read. Recently, the rise of neural fake news (AI-generated fake news) and its demonstrated effectiveness at fooling humans has prompted the development of models to detect it. One such model is the Grover model, which can both detect neural fake news to prevent it, and generate it to demonstrate how a model could be misused to fool human readers. In this work we explore the Grover model's fake news detection capabilities by performing targeted attacks through perturbations on input news articles. Through this we test Grover's resilience to these adversarial attacks and expose some potential vulnerabilities which should be addressed in further iterations to ensure it can detect all types of fake news accurately.
    Gradient Descent in Neural Networks as Sequential Learning in RKBS. (arXiv:2302.00205v1 [stat.ML])
    The study of Neural Tangent Kernels (NTKs) has provided much needed insight into convergence and generalization properties of neural networks in the over-parametrized (wide) limit by approximating the network using a first-order Taylor expansion with respect to its weights in the neighborhood of their initialization values. This allows neural network training to be analyzed from the perspective of reproducing kernel Hilbert spaces (RKHS), which is informative in the over-parametrized regime, but a poor approximation for narrower networks as the weights change more during training. Our goal is to extend beyond the limits of NTK toward a more general theory. We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights as an inner product of two feature maps, respectively from data and weight-step space, to feature space, allowing neural network training to be analyzed from the perspective of reproducing kernel {\em Banach} space (RKBS). We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning in RKBS. Using this, we present novel bound on uniform convergence where the iterations count and learning rate play a central role, giving new theoretical insight into neural network training.
    Molecular Graph Generation by Decomposition and Reassembling. (arXiv:2302.00587v1 [q-bio.BM])
    Designing molecular structures with desired chemical properties is an essential task in drug discovery and material design. However, finding molecules with the optimized desired properties is still a challenging task due to combinatorial explosion of candidate space of molecules. Here we propose a novel \emph{decomposition-and-reassembling} based approach, which does not include any optimization in hidden space and our generation process is highly interpretable. Our method is a two-step procedure: In the first decomposition step, we apply frequent subgraph mining to a molecular database to collect smaller size of subgraphs as building blocks of molecules. In the second reassembling step, we search desirable building blocks guided via reinforcement learning and combine them to generate new molecules. Our experiments show that not only can our method find better molecules in terms of two standard criteria, the penalized $\log P$ and drug-likeness, but also generate drug molecules with showing the valid intermediate molecules.
    Physics-informed Reduced-Order Learning from the First Principles for Simulation of Quantum Nanostructures. (arXiv:2302.00100v1 [cs.CE])
    Multi-dimensional direct numerical simulation (DNS) of the Schr\"odinger equation is needed for design and analysis of quantum nanostructures that offer numerous applications in biology, medicine, materials, electronic/photonic devices, etc. In large-scale nanostructures, extensive computational effort needed in DNS may become prohibitive due to the high degrees of freedom (DoF). This study employs a reduced-order learning algorithm, enabled by the first principles, for simulation of the Schr\"odinger equation to achieve high accuracy and efficiency. The proposed simulation methodology is applied to investigate two quantum-dot structures; one operates under external electric field, and the other is influenced by internal potential variation with periodic boundary conditions. The former is similar to typical operations of nanoelectronic devices, and the latter is of interest to simulation and design of nanostructures and materials, such as applications of density functional theory. Using the proposed methodology, a very accurate prediction can be realized with a reduction in the DoF by more than 3 orders of magnitude and in the computational time by 2 orders, compared to DNS. The proposed physics-informed learning methodology is also able to offer an accurate prediction beyond the training conditions, including higher external field and larger internal potential in untrained quantum states.
    Learning Functional Transduction. (arXiv:2302.00328v1 [cs.LG])
    Research in Machine Learning has polarized into two general regression approaches: Transductive methods derive estimates directly from available data but are usually problem unspecific. Inductive methods can be much more particular, but generally require tuning and compute-intensive searches for solutions. In this work, we adopt a hybrid approach: We leverage the theory of Reproducing Kernel Banach Spaces (RKBS) and show that transductive principles can be induced through gradient descent to form efficient \textit{in-context} neural approximators. We apply this approach to RKBS of function-valued operators and show that once trained, our \textit{Transducer} model can capture on-the-fly relationships between infinite-dimensional input and output functions, given a few example pairs, and return new function estimates. We demonstrate the benefit of our transductive approach to model complex physical systems influenced by varying external factors with little data at a fraction of the usual deep learning training computation cost for partial differential equations and climate modeling applications.
    Diffusion-based Image Translation using Disentangled Style and Content Representation. (arXiv:2209.15264v2 [cs.CV] UPDATED)
    Diffusion-based image translation guided by semantic texts or a single target image has enabled flexible style transfer which is not limited to the specific domains. Unfortunately, due to the stochastic nature of diffusion models, it is often difficult to maintain the original content of the image during the reverse diffusion. To address this, here we present a novel diffusion-based unsupervised image translation method using disentangled style and content representation. Specifically, inspired by the splicing Vision Transformer, we extract intermediate keys of multihead self attention layer from ViT model and used them as the content preservation loss. Then, an image guided style transfer is performed by matching the [CLS] classification token from the denoised samples and target image, whereas additional CLIP loss is used for the text-driven style transfer. To further accelerate the semantic change during the reverse diffusion, we also propose a novel semantic divergence loss and resampling strategy. Our experimental results show that the proposed method outperforms state-of-the-art baseline models in both text-guided and image-guided translation tasks.
    Learning noisy-OR Bayesian Networks with Max-Product Belief Propagation. (arXiv:2302.00099v1 [cs.LG])
    Noisy-OR Bayesian Networks (BNs) are a family of probabilistic graphical models which express rich statistical dependencies in binary data. Variational inference (VI) has been the main method proposed to learn noisy-OR BNs with complex latent structures (Jaakkola & Jordan, 1999; Ji et al., 2020; Buhai et al., 2020). However, the proposed VI approaches either (a) use a recognition network with standard amortized inference that cannot induce ``explaining-away''; or (b) assume a simple mean-field (MF) posterior which is vulnerable to bad local optima. Existing MF VI methods also update the MF parameters sequentially which makes them inherently slow. In this paper, we propose parallel max-product as an alternative algorithm for learning noisy-OR BNs with complex latent structures and we derive a fast stochastic training scheme that scales to large datasets. We evaluate both approaches on several benchmarks where VI is the state-of-the-art and show that our method (a) achieves better test performance than Ji et al. (2020) for learning noisy-OR BNs with hierarchical latent structures on large sparse real datasets; (b) recovers a higher number of ground truth parameters than Buhai et al. (2020) from cluttered synthetic scenes; and (c) solves the 2D blind deconvolution problem from Lazaro-Gredilla et al. (2021) and variant - including binary matrix factorization - while VI catastrophically fails and is up to two orders of magnitude slower.
    Filtering Context Mitigates Scarcity and Selection Bias in Political Ideology Prediction. (arXiv:2302.00239v1 [cs.LG])
    We propose a novel supervised learning approach for political ideology prediction (PIP) that is capable of predicting out-of-distribution inputs. This problem is motivated by the fact that manual data-labeling is expensive, while self-reported labels are often scarce and exhibit significant selection bias. We propose a novel statistical model that decomposes the document embeddings into a linear superposition of two vectors; a latent neutral \emph{context} vector independent of ideology, and a latent \emph{position} vector aligned with ideology. We train an end-to-end model that has intermediate contextual and positional vectors as outputs. At deployment time, our model predicts labels for input documents by exclusively leveraging the predicted positional vectors. On two benchmark datasets we show that our model is capable of outputting predictions even when trained with as little as 5\% biased data, and is significantly more accurate than the state-of-the-art. Through crowd-sourcing we validate the neutrality of contextual vectors, and show that context filtering results in ideological concentration, allowing for prediction on out-of-distribution examples.
    How to select predictive models for causal inference?. (arXiv:2302.00370v1 [stat.ML])
    Predictive models -- as with machine learning -- can underpin causal inference, to estimate the effects of an intervention at the population or individual level. This opens the door to a plethora of models, useful to match the increasing complexity of health data, but also the Pandora box of model selection: which of these models yield the most valid causal estimates? Classic machine-learning cross-validation procedures are not directly applicable. Indeed, an appropriate selection procedure for causal inference should equally weight both outcome errors for each individual, treated or not treated, whereas one outcome may be seldom observed for a sub-population. We study how more elaborate risks benefit causal model selection. We show theoretically that simple risks are brittle to weak overlap between treated and non-treated individuals as well as to heterogeneous errors between populations. Rather a more elaborate metric, the R-risk appears as a proxy of the oracle error on causal estimates, observable at the cost of an overlap re-weighting. As the R-risk is defined not only from model predictions but also by using the conditional mean outcome and the treatment probability, using it for model selection requires adapting cross validation. Extensive experiments show that the resulting procedure gives the best causal model selection.
    Personalized Privacy Auditing and Optimization at Test Time. (arXiv:2302.00077v1 [cs.LG])
    A number of learning models used in consequential domains, such as to assist in legal, banking, hiring, and healthcare decisions, make use of potentially sensitive users' information to carry out inference. Further, the complete set of features is typically required to perform inference. This not only poses severe privacy risks for the individuals using the learning systems, but also requires companies and organizations massive human efforts to verify the correctness of the released information. This paper asks whether it is necessary to require \emph{all} input features for a model to return accurate predictions at test time and shows that, under a personalized setting, each individual may need to release only a small subset of these features without impacting the final decisions. The paper also provides an efficient sequential algorithm that chooses which attributes should be provided by each individual. Evaluation over several learning tasks shows that individuals may be able to report as little as 10\% of their information to ensure the same level of accuracy of a model that uses the complete users' information.
    Mind the (optimality) Gap: A Gap-Aware Learning Rate Scheduler for Adversarial Nets. (arXiv:2302.00089v1 [cs.LG])
    Adversarial nets have proved to be powerful in various domains including generative modeling (GANs), transfer learning, and fairness. However, successfully training adversarial nets using first-order methods remains a major challenge. Typically, careful choices of the learning rates are needed to maintain the delicate balance between the competing networks. In this paper, we design a novel learning rate scheduler that dynamically adapts the learning rate of the adversary to maintain the right balance. The scheduler is driven by the fact that the loss of an ideal adversarial net is a constant known a priori. The scheduler is thus designed to keep the loss of the optimized adversarial net close to that of an ideal network. We run large-scale experiments to study the effectiveness of the scheduler on two popular applications: GANs for image generation and adversarial nets for domain adaptation. Our experiments indicate that adversarial nets trained with the scheduler are less likely to diverge and require significantly less tuning. For example, on CelebA, a GAN with the scheduler requires only one-tenth of the tuning budget needed without a scheduler. Moreover, the scheduler leads to statistically significant improvements in model quality, reaching up to $27\%$ in Frechet Inception Distance for image generation and $3\%$ in test accuracy for domain adaptation.
    Neuromechanical Autoencoders: Learning to Couple Elastic and Neural Network Nonlinearity. (arXiv:2302.00032v1 [cs.LG])
    Intelligent biological systems are characterized by their embodiment in a complex environment and the intimate interplay between their nervous systems and the nonlinear mechanical properties of their bodies. This coordination, in which the dynamics of the motor system co-evolved to reduce the computational burden on the brain, is referred to as ``mechanical intelligence'' or ``morphological computation''. In this work, we seek to develop machine learning analogs of this process, in which we jointly learn the morphology of complex nonlinear elastic solids along with a deep neural network to control it. By using a specialized differentiable simulator of elastic mechanics coupled to conventional deep learning architectures -- which we refer to as neuromechanical autoencoders -- we are able to learn to perform morphological computation via gradient descent. Key to our approach is the use of mechanical metamaterials -- cellular solids, in particular -- as the morphological substrate. Just as deep neural networks provide flexible and massively-parametric function approximators for perceptual and control tasks, cellular solid metamaterials are promising as a rich and learnable space for approximating a variety of actuation tasks. In this work we take advantage of these complementary computational concepts to co-design materials and neural network controls to achieve nonintuitive mechanical behavior. We demonstrate in simulation how it is possible to achieve translation, rotation, and shape matching, as well as a ``digital MNIST'' task. We additionally manufacture and evaluate one of the designs to verify its real-world behavior.
    Hierarchical Classification of Research Fields in the "Web of Science" Using Deep Learning. (arXiv:2302.00390v1 [cs.DL])
    The scholarly publication space is growing steadily not just in numbers but also in complexity due to collaboration between individuals from within and across fields of research. This paper presents a hierarchical classification system that automatically categorizes a scholarly publication using its abstract into a three-tier hierarchical label set of fields (discipline-field-subfield). This system enables a holistic view about the interdependence of research activities in the mentioned hierarchical tiers in terms of knowledge production through articles and impact through citations. The classification system (44 disciplines - 738 fields - 1,501 subfields) utilizes and is able to cope with 160 million abstract snippets in Microsoft Academic Graph (Version 2018-05-17) using batch training in a modularized and distributed fashion to address and assess interdisciplinarity and inter-field classifications. In addition, we have explored multi-class classifications in both the single-label and multi-label settings. In total, we have conducted 3,140 experiments, in all models (Convolutional Neural Networks, Recurrent Neural Networks, Transformers), the classification accuracy is > 90% in 77.84% and 78.83% of the single-label and multi-label classifications, respectively. We examine the advantages of our classification by its ability to better align research texts and output with disciplines, to adequately classify them in an automated way, as well as to capture the degree of interdisciplinarity in a publication which enables downstream analytics such as field interdisciplinarity. This system (a set of pretrained models) can serve as a backbone to an interactive system of indexing scientific publications.
    Student-centric Model of Learning Management System Activity and Academic Performance: from Correlation to Causation. (arXiv:2210.15430v2 [cs.CY] UPDATED)
    In recent years, there is a lot of interest in modeling students' digital traces in Learning Management System (LMS) to understand students' learning behavior patterns including aspects of meta-cognition and self-regulation, with the ultimate goal to turn those insights into actionable information to support students to improve their learning outcomes. In achieving this goal, however, there are two main issues that need to be addressed given the existing literature. Firstly, most of the current work is course-centered (i.e. models are built from data for a specific course) rather than student-centered; secondly, a vast majority of the models are correlational rather than causal. Those issues make it challenging to identify the most promising actionable factors for intervention at the student level where most of the campus-wide academic support is designed for. In this paper, we explored a student-centric analytical framework for LMS activity data that can provide not only correlational but causal insights mined from observational data. We demonstrated this approach using a dataset of 1651 computing major students at a public university in the US during one semester in the Fall of 2019. This dataset includes students' fine-grained LMS interaction logs and administrative data, e.g. demographics and academic performance. In addition, we expand the repository of LMS behavior indicators to include those that can characterize the time-of-the-day of login (e.g. chronotype). Our analysis showed that student login volume, compared with other login behavior indicators, is both strongly correlated and causally linked to student academic performance, especially among students with low academic performance. We envision that those insights will provide convincing evidence for college student support groups to launch student-centered and targeted interventions that are effective and scalable.
    Effectiveness of Moving Target Defenses for Adversarial Attacks in ML-based Malware Detection. (arXiv:2302.00537v1 [cs.LG])
    Several moving target defenses (MTDs) to counter adversarial ML attacks have been proposed in recent years. MTDs claim to increase the difficulty for the attacker in conducting attacks by regularly changing certain elements of the defense, such as cycling through configurations. To examine these claims, we study for the first time the effectiveness of several recent MTDs for adversarial ML attacks applied to the malware detection domain. Under different threat models, we show that transferability and query attack strategies can achieve high levels of evasion against these defenses through existing and novel attack strategies across Android and Windows. We also show that fingerprinting and reconnaissance are possible and demonstrate how attackers may obtain critical defense hyperparameters as well as information about how predictions are produced. Based on our findings, we present key recommendations for future work on the development of effective MTDs for adversarial attacks in ML-based malware detection.
    Learning Generalized Zero-Shot Learners for Open-Domain Image Geolocalization. (arXiv:2302.00275v1 [cs.CV])
    Image geolocalization is the challenging task of predicting the geographic coordinates of origin for a given photo. It is an unsolved problem relying on the ability to combine visual clues with general knowledge about the world to make accurate predictions across geographies. We present $\href{https://huggingface.co/geolocal/StreetCLIP}{\text{StreetCLIP}}$, a robust, publicly available foundation model not only achieving state-of-the-art performance on multiple open-domain image geolocalization benchmarks but also doing so in a zero-shot setting, outperforming supervised models trained on more than 4 million images. Our method introduces a meta-learning approach for generalized zero-shot learning by pretraining CLIP from synthetic captions, grounding CLIP in a domain of choice. We show that our method effectively transfers CLIP's generalized zero-shot capabilities to the domain of image geolocalization, improving in-domain generalized zero-shot performance without finetuning StreetCLIP on a fixed set of classes.
    Fast Sampling of Diffusion Models via Operator Learning. (arXiv:2211.13449v2 [cs.LG] UPDATED)
    Diffusion models have found widespread adoption in various areas. However, their sampling process is slow because it requires hundreds to thousands of network evaluations to emulate a continuous process defined by differential equations. In this work, we use neural operators, an efficient method to solve the probability flow differential equations, to accelerate the sampling process of diffusion models. Compared to other fast sampling methods that have a sequential nature, we are the first to propose parallel decoding method that generates images with only one model forward pass. We propose \textit{diffusion model sampling with neural operator} (DSNO) that maps the initial condition, i.e., Gaussian distribution, to the continuous-time solution trajectory of the reverse diffusion process. To model the temporal correlations along the trajectory, we introduce temporal convolution layers that are parameterized in the Fourier space into the given diffusion model backbone. We show our method achieves state-of-the-art FID of 4.12 for CIFAR-10 and 8.35 for ImageNet-64 in the one-model-evaluation setting.
    Predicting CSI Sequences With Attention-Based Neural Networks. (arXiv:2302.00341v1 [stat.ML])
    In this work, we consider the problem of multi-step channel prediction in wireless communication systems. In existing works, autoregressive (AR) models are either replaced or combined with feed-forward neural networks(NNs) or, alternatively, with recurrent neural networks (RNNs). This paper explores the possibility of using sequence-to-sequence (Seq2Seq) and transformer neural network (TNN) models for channel state information (CSI) prediction. Simulation results show that both, Seq2Seq and TNNs, represent an appealing alternative to RNNs and feed-forward NNs in the context of CSI prediction. Additionally, the TNN with a few adaptations can extrapolate better than other models to CSI sequences that are either shorter or longer than the ones the model saw during training.
    Distributed Traffic Synthesis and Classification in Edge Networks: A Federated Self-supervised Learning Approach. (arXiv:2302.00207v1 [cs.LG])
    With the rising demand for wireless services and increased awareness of the need for data protection, existing network traffic analysis and management architectures are facing unprecedented challenges in classifying and synthesizing the increasingly diverse services and applications. This paper proposes FS-GAN, a federated self-supervised learning framework to support automatic traffic analysis and synthesis over a large number of heterogeneous datasets. FS-GAN is composed of multiple distributed Generative Adversarial Networks (GANs), with a set of generators, each being designed to generate synthesized data samples following the distribution of an individual service traffic, and each discriminator being trained to differentiate the synthesized data samples and the real data samples of a local dataset. A federated learning-based framework is adopted to coordinate local model training processes of different GANs across different datasets. FS-GAN can classify data of unknown types of service and create synthetic samples that capture the traffic distribution of the unknown types. We prove that FS-GAN can minimize the Jensen-Shannon Divergence (JSD) between the distribution of real data across all the datasets and that of the synthesized data samples. FS-GAN also maximizes the JSD among the distributions of data samples created by different generators, resulting in each generator producing synthetic data samples that follow the same distribution as one particular service type. Extensive simulation results show that the classification accuracy of FS-GAN achieves over 20% improvement in average compared to the state-of-the-art clustering-based traffic analysis algorithms. FS-GAN also has the capability to synthesize highly complex mixtures of traffic types without requiring any human-labeled data samples.
    Learning Topology-Preserving Data Representations. (arXiv:2302.00136v1 [cs.LG])
    We propose a method for learning topology-preserving data representations (dimensionality reduction). The method aims to provide topological similarity between the data manifold and its latent representation via enforcing the similarity in topological features (clusters, loops, 2D voids, etc.) and their localization. The core of the method is the minimization of the Representation Topology Divergence (RTD) between original high-dimensional data and low-dimensional representation in latent space. RTD minimization provides closeness in topological features with strong theoretical guarantees. We develop a scheme for RTD differentiation and apply it as a loss term for the autoencoder. The proposed method ``RTD-AE'' better preserves the global structure and topology of the data manifold than state-of-the-art competitors as measured by linear correlation, triplet distance ranking accuracy, and Wasserstein distance between persistence barcodes.
    Learning to be Fair: A Consequentialist Approach to Equitable Decision-Making. (arXiv:2109.08792v3 [cs.LG] UPDATED)
    In the dominant paradigm for designing equitable machine learning systems, one works to ensure that model predictions satisfy various fairness criteria, such as parity in error rates across race, gender, and other legally protected traits. That approach, however, typically ignores the downstream decisions and outcomes that predictions affect, and, as a result, can induce unexpected harms. Here we present an alternative framework for fairness that directly anticipates the consequences of decisions. Stakeholders first specify preferences over the possible outcomes of an algorithmically informed decision-making process. For example, lenders may prefer extending credit to those most likely to repay a loan, while also preferring similar lending rates across neighborhoods. One then searches the space of decision policies to maximize the specified utility. We develop and describe a method for efficiently learning these optimal policies from data for a large family of expressive utility functions, facilitating a more holistic approach to equitable decision-making.
    A Nearly-Optimal Bound for Fast Regression with $\ell_\infty$ Guarantee. (arXiv:2302.00248v1 [cs.DS])
    Given a matrix $A\in \mathbb{R}^{n\times d}$ and a vector $b\in \mathbb{R}^n$, we consider the regression problem with $\ell_\infty$ guarantees: finding a vector $x'\in \mathbb{R}^d$ such that $ \|x'-x^*\|_\infty \leq \frac{\epsilon}{\sqrt{d}}\cdot \|Ax^*-b\|_2\cdot \|A^\dagger\|$ where $x^*=\arg\min_{x\in \mathbb{R}^d}\|Ax-b\|_2$. One popular approach for solving such $\ell_2$ regression problem is via sketching: picking a structured random matrix $S\in \mathbb{R}^{m\times n}$ with $m\ll n$ and $SA$ can be quickly computed, solve the ``sketched'' regression problem $\arg\min_{x\in \mathbb{R}^d} \|SAx-Sb\|_2$. In this paper, we show that in order to obtain such $\ell_\infty$ guarantee for $\ell_2$ regression, one has to use sketching matrices that are dense. To the best of our knowledge, this is the first user case in which dense sketching matrices are necessary. On the algorithmic side, we prove that there exists a distribution of dense sketching matrices with $m=\epsilon^{-2}d\log^3(n/\delta)$ such that solving the sketched regression problem gives the $\ell_\infty$ guarantee, with probability at least $1-\delta$. Moreover, the matrix $SA$ can be computed in time $O(nd\log n)$. Our row count is nearly-optimal up to logarithmic factors, and significantly improves the result in [Price, Song and Woodruff, ICALP'17], in which a super-linear in $d$ rows, $m=\Omega(\epsilon^{-2}d^{1+\gamma})$ for $\gamma=\Theta(\sqrt{\frac{\log\log n}{\log d}})$ is required. We also develop a novel analytical framework for $\ell_\infty$ guarantee regression that utilizes the Oblivious Coordinate-wise Embedding (OCE) property introduced in [Song and Yu, ICML'21]. Our analysis is arguably much simpler and more general than [Price, Song and Woodruff, ICALP'17], and it extends to dense sketches for tensor product of vectors.
    Homotopy-based training of NeuralODEs for accurate dynamics discovery. (arXiv:2210.01407v3 [cs.LG] UPDATED)
    Conceptually, Neural Ordinary Differential Equations (NeuralODEs) pose an attractive way to extract dynamical laws from time series data, as they are natural extensions of the traditional differential equation-based modeling paradigm of the physical sciences. In practice, NeuralODEs display long training times and suboptimal results, especially for longer duration data where they may fail to fit the data altogether. While methods have been proposed to stabilize NeuralODE training, many of these involve placing a strong constraint on the functional form the trained NeuralODE can take that the actual underlying governing equation does not guarantee satisfaction. In this work, we present a novel NeuralODE training algorithm that leverages tools from the chaos and mathematical optimization communities - synchronization and homotopy optimization - for a breakthrough in tackling the NeuralODE training obstacle. We demonstrate architectural changes are unnecessary for effective NeuralODE training. Compared to the conventional training methods, our algorithm achieves drastically lower loss values without any changes to the model architectures. Experiments on both simulated and real systems with complex temporal behaviors demonstrate NeuralODEs trained with our algorithm are able to accurately capture true long term behaviors and correctly extrapolate into the future.
    Inductive Bias of Gradient Descent for Weight Normalized Smooth Homogeneous Neural Nets. (arXiv:2010.12909v3 [cs.LG] UPDATED)
    We analyze the inductive bias of gradient descent for weight normalized smooth homogeneous neural nets, when trained on exponential or cross-entropy loss. We analyse both standard weight normalization (SWN) and exponential weight normalization (EWN), and show that the gradient flow path with EWN is equivalent to gradient flow on standard networks with an adaptive learning rate. We extend these results to gradient descent, and establish asymptotic relations between weights and gradients for both SWN and EWN. We also show that EWN causes weights to be updated in a way that prefers asymptotic relative sparsity. For EWN, we provide a finite-time convergence rate of the loss with gradient flow and a tight asymptotic convergence rate with gradient descent. We demonstrate our results for SWN and EWN on synthetic data sets. Experimental results on simple datasets support our claim on sparse EWN solutions, even with SGD. This demonstrates its potential applications in learning neural networks amenable to pruning.
    Delayed Feedback in Kernel Bandits. (arXiv:2302.00392v1 [stat.ML])
    Black box optimisation of an unknown function from expensive and noisy evaluations is a ubiquitous problem in machine learning, academic research and industrial production. An abstraction of the problem can be formulated as a kernel based bandit problem (also known as Bayesian optimisation), where a learner aims at optimising a kernelized function through sequential noisy observations. The existing work predominantly assumes feedback is immediately available; an assumption which fails in many real world situations, including recommendation systems, clinical trials and hyperparameter tuning. We consider a kernel bandit problem under stochastically delayed feedback, and propose an algorithm with $\tilde{\mathcal{O}}(\sqrt{\Gamma_k(T)T}+\mathbb{E}[\tau])$ regret, where $T$ is the number of time steps, $\Gamma_k(T)$ is the maximum information gain of the kernel with $T$ observations, and $\tau$ is the delay random variable. This represents a significant improvement over the state of the art regret bound of $\tilde{\mathcal{O}}(\Gamma_k(T)\sqrt{T}+\mathbb{E}[\tau]\Gamma_k(T))$ reported in Verma et al. (2022). In particular, for very non-smooth kernels, the information gain grows almost linearly in time, trivializing the existing results. We also validate our theoretical results with simulations.
    $\rm A^2Q$: Aggregation-Aware Quantization for Graph Neural Networks. (arXiv:2302.00193v1 [cs.LG])
    As graph data size increases, the vast latency and memory consumption during inference pose a significant challenge to the real-world deployment of Graph Neural Networks (GNNs). While quantization is a powerful approach to reducing GNNs complexity, most previous works on GNNs quantization fail to exploit the unique characteristics of GNNs, suffering from severe accuracy degradation. Through an in-depth analysis of the topology of GNNs, we observe that the topology of the graph leads to significant differences between nodes, and most of the nodes in a graph appear to have a small aggregation value. Motivated by this, in this paper, we propose the Aggregation-Aware mixed-precision Quantization ($\rm A^2Q$) for GNNs, where an appropriate bitwidth is automatically learned and assigned to each node in the graph. To mitigate the vanishing gradient problem caused by sparse connections between nodes, we propose a Local Gradient method to serve the quantization error of the node features as the supervision during training. We also develop a Nearest Neighbor Strategy to deal with the generalization on unseen graphs. Extensive experiments on eight public node-level and graph-level datasets demonstrate the generality and robustness of our proposed method. Compared to the FP32 models, our method can achieve up to a 18.6x (i.e., 1.70bit) compression ratio with negligible accuracy degradation. Morever, compared to the state-of-the-art quantization method, our method can achieve up to 11.4\% and 9.5\% accuracy improvements on the node-level and graph-level tasks, respectively, and up to 2x speedup on a dedicated hardware accelerator.
    Weight Prediction Boosts the Convergence of AdamW. (arXiv:2302.00195v1 [cs.LG])
    In this paper, we introduce weight prediction into the AdamW optimizer to boost its convergence when training the deep neural network (DNN) models. In particular, ahead of each mini-batch training, we predict the future weights according to the update rule of AdamW and then apply the predicted future weights to do both forward pass and backward propagation. In this way, the AdamW optimizer always utilizes the gradients w.r.t. the future weights instead of current weights to update the DNN parameters, making the AdamW optimizer achieve better convergence. Our proposal is simple and straightforward to implement but effective in boosting the convergence of DNN training. We performed extensive experimental evaluations on image classification and language modeling tasks to verify the effectiveness of our proposal. The experimental results validate that our proposal can boost the convergence of AdamW and achieve better accuracy than AdamW when training the DNN models.
    TAP: Accelerating Large-Scale DNN Training Through Tensor Automatic Parallelisation. (arXiv:2302.00247v1 [cs.LG])
    Model parallelism has become necessary to train large neural networks. However, finding a suitable model parallel schedule for an arbitrary neural network is a non-trivial task due to the exploding search space. In this work, we present a model parallelism framework TAP that automatically searches for the best data and tensor parallel schedules. Leveraging the key insight that a neural network can be represented as a directed acyclic graph, within which may only exist a limited set of frequent subgraphs, we design a graph pruning algorithm to fold the search space efficiently. TAP runs at sub-linear complexity concerning the neural network size. Experiments show that TAP is $20\times- 160\times$ faster than the state-of-the-art automatic parallelism framework, and the performance of its discovered schedules is competitive with the expert-engineered ones.
    A Transaction Represented with Weighted Finite-State Transducers. (arXiv:2302.00200v1 [cs.FL])
    Not all contracts are good, but all good contracts can be expressed as a finite-state transition system ("State-Transition Contracts"). Contracts that can be represented as State-Transition Contracts discretize fat-tailed risk to foreseeable, managed risk, define the boundary of relevant events governed by the relationship, and eliminate the potential of inconsistent contractual provisions. Additionally, State-Transition Contracts reap the substantial benefit of being able to be analyzed under the rules governing the science of the theory of computation. Simple State-Transition Contracts can be represented as discrete finite automata; more complicated State-Transition Contracts, such as those that have downstream effects on other agreements or complicated pathways of performance, benefit from representation as weighted finite-state transducers, with weights assigned as costs, penalties, or probabilities of transitions. This research paper (the "Research" or "Paper") presents a complex legal transaction represented as weighted finite-state transducers. Furthermore, we show that the mathematics/algorithms permitted by the algebraic structure of weighted finite-state transducers provides actionable, legal insight into the transaction.
    Adaptive sparseness for correntropy-based robust regression via automatic relevance determination. (arXiv:2302.00082v1 [cs.LG])
    Sparseness and robustness are two important properties for many machine learning scenarios. In the present study, regarding the maximum correntropy criterion (MCC) based robust regression algorithm, we investigate to integrate the MCC method with the automatic relevance determination (ARD) technique in a Bayesian framework, so that MCC-based robust regression could be implemented with adaptive sparseness. To be specific, we use an inherent noise assumption from the MCC to derive an explicit likelihood function, and realize the maximum a posteriori (MAP) estimation with the ARD prior by variational Bayesian inference. Compared to the existing robust and sparse L1-regularized MCC regression, the proposed MCC-ARD regression can eradicate the troublesome tuning for the regularization hyper-parameter which controls the regularization strength. Further, MCC-ARD achieves superior prediction performance and feature selection capability than L1-regularized MCC, as demonstrated by a noisy and high-dimensional simulation study.
    A Prescriptive Learning Analytics Framework: Beyond Predictive Modelling and onto Explainable AI with Prescriptive Analytics and ChatGPT. (arXiv:2208.14582v2 [cs.LG] UPDATED)
    A significant body of recent research in the field of Learning Analytics has focused on leveraging machine learning approaches for predicting at-risk students in order to initiate timely interventions and thereby elevate retention and completion rates. The overarching feature of the majority of these research studies has been on the science of prediction only. The component of predictive analytics concerned with interpreting the internals of the models and explaining their predictions for individual cases to stakeholders has largely been neglected. Additionally, works that attempt to employ data-driven prescriptive analytics to automatically generate evidence-based remedial advice for at-risk learners are in their infancy. eXplainable AI is a field that has recently emerged providing cutting-edge tools which support transparent predictive analytics and techniques for generating tailored advice for at-risk students. This study proposes a novel framework that unifies both transparent machine learning as well as techniques for enabling prescriptive analytics, while integrating the latest advances in large language models. This work practically demonstrates the proposed framework using predictive models for identifying at-risk learners of programme non-completion. The study then further demonstrates how predictive modelling can be augmented with prescriptive analytics on two case studies in order to generate human-readable prescriptive feedback for those who are at risk using ChatGPT.
    Selective Uncertainty Propagation in Offline RL. (arXiv:2302.00284v1 [cs.LG])
    We study the finite-horizon offline reinforcement learning (RL) problem. Since actions at any state can affect next-state distributions, the related distributional shift challenges can make this problem far more statistically complex than offline policy learning for a finite sequence of stochastic contextual bandit environments. We formalize this insight by showing that the statistical hardness of offline RL instances can be measured by estimating the size of actions' impact on next-state distributions. Furthermore, this estimated impact allows us to propagate just enough value function uncertainty from future steps to avoid model exploitation, enabling us to develop algorithms that improve upon traditional pessimistic approaches for offline RL on statistically simple instances. Our approach is supported by theory and simulations.
    Revisiting Bellman Errors for Offline Model Selection. (arXiv:2302.00141v1 [cs.LG])
    Offline model selection (OMS), that is, choosing the best policy from a set of many policies given only logged data, is crucial for applying offline RL in real-world settings. One idea that has been extensively explored is to select policies based on the mean squared Bellman error (MSBE) of the associated Q-functions. However, previous work has struggled to obtain adequate OMS performance with Bellman errors, leading many researchers to abandon the idea. Through theoretical and empirical analyses, we elucidate why previous work has seen pessimistic results with Bellman errors and identify conditions under which OMS algorithms based on Bellman errors will perform well. Moreover, we develop a new estimator of the MSBE that is more accurate than prior methods and obtains impressive OMS performance on diverse discrete control tasks, including Atari games. We open-source our data and code to enable researchers to conduct OMS experiments more easily.
    Program Generation from Diverse Video Demonstrations. (arXiv:2302.00178v1 [cs.CV])
    The ability to use inductive reasoning to extract general rules from multiple observations is a vital indicator of intelligence. As humans, we use this ability to not only interpret the world around us, but also to predict the outcomes of the various interactions we experience. Generalising over multiple observations is a task that has historically presented difficulties for machines to grasp, especially when requiring computer vision. In this paper, we propose a model that can extract general rules from video demonstrations by simultaneously performing summarisation and translation. Our approach differs from prior works by framing the problem as a multi-sequence-to-sequence task, wherein summarisation is learnt by the model. This allows our model to utilise edge cases that would otherwise be suppressed or discarded by traditional summarisation techniques. Additionally, we show that our approach can handle noisy specifications without the need for additional filtering methods. We evaluate our model by synthesising programs from video demonstrations in the Vizdoom environment achieving state-of-the-art results with a relative increase of 11.75% program accuracy on prior works
    Distillation Policy Optimization. (arXiv:2302.00533v1 [cs.LG])
    On-policy algorithms are supposed to be stable, however, sample-intensive yet. Off-policy algorithms utilizing past experiences are deemed to be sample-efficient, nevertheless, unstable in general. Can we design an algorithm that can employ the off-policy data, while exploit the stable learning by sailing along the course of the on-policy walkway? In this paper, we present an actor-critic learning framework that borrows the distributional perspective of interest to evaluate, and cross-breeds two sources of the data for policy improvement, which enables fast learning and can be applied to a wide class of algorithms. In its backbone, the variance reduction mechanisms, such as unified advantage estimator (UAE), that extends generalized advantage estimator (GAE) to be applicable on any state-dependent baseline, and a learned baseline, that is competent to stabilize the policy gradient, are firstly put forward to not merely be a bridge to the action-value function but also distill the advantageous learning signal. Lastly, it is empirically shown that our method improves sample efficiency and interpolates different levels well. Being of an organic whole, its mixture places more inspiration to the algorithm design.
    GANravel: User-Driven Direction Disentanglement in Generative Adversarial Networks. (arXiv:2302.00079v1 [cs.HC])
    Generative adversarial networks (GANs) have many application areas including image editing, domain translation, missing data imputation, and support for creative work. However, GANs are considered 'black boxes'. Specifically, the end-users have little control over how to improve editing directions through disentanglement. Prior work focused on new GAN architectures to disentangle editing directions. Alternatively, we propose GANravel a user-driven direction disentanglement tool that complements the existing GAN architectures and allows users to improve editing directions iteratively. In two user studies with 16 participants each, GANravel users were able to disentangle directions and outperformed the state-of-the-art direction discovery baselines in disentanglement performance. In the second user study, GANravel was used in a creative task of creating dog memes and was able to create high-quality edited images and GIFs.  ( 2 min )
    Reducing Blackwell and Average Optimality to Discounted MDPs via the Blackwell Discount Factor. (arXiv:2302.00036v1 [cs.LG])
    We introduce the Blackwell discount factor for Markov Decision Processes (MDPs). Classical objectives for MDPs include discounted, average, and Blackwell optimality. Many existing approaches to computing average-optimal policies solve for discounted optimal policies with a discount factor close to $1$, but they only work under strong or hard-to-verify assumptions such as ergodicity or weakly communicating MDPs. In this paper, we show that when the discount factor is larger than the Blackwell discount factor $\gamma_{\mathrm{bw}}$, all discounted optimal policies become Blackwell- and average-optimal, and we derive a general upper bound on $\gamma_{\mathrm{bw}}$. The upper bound on $\gamma_{\mathrm{bw}}$ provides the first reduction from average and Blackwell optimality to discounted optimality, without any assumptions, and new polynomial-time algorithms for average- and Blackwell-optimal policies. Our work brings new ideas from the study of polynomials and algebraic numbers to the analysis of MDPs. Our results also apply to robust MDPs, enabling the first algorithms to compute robust Blackwell-optimal policies.  ( 2 min )
  • Open

    Data fission: splitting a single data point. (arXiv:2112.11079v7 [stat.ME] UPDATED)
    Suppose we observe a random vector $X$ from some distribution $P$ in a known family with unknown parameters. We ask the following question: when is it possible to split $X$ into two parts $f(X)$ and $g(X)$ such that neither part is sufficient to reconstruct $X$ by itself, but both together can recover $X$ fully, and the joint distribution of $(f(X),g(X))$ is tractable? As one example, if $X=(X_1,\dots,X_n)$ and $P$ is a product distribution, then for any $m<n$, we can split the sample to define $f(X)=(X_1,\dots,X_m)$ and $g(X)=(X_{m+1},\dots,X_n)$. Rasines and Young (2022) offers an alternative route of accomplishing this task through randomization of $X$ with additive Gaussian noise which enables post-selection inference in finite samples for Gaussian distributed data and asymptotically for non-Gaussian additive models. In this paper, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on a few prototypical applications, such as post-selection inference for trend filtering and other regression problems.  ( 2 min )
    Quantum machine learning beyond kernel methods. (arXiv:2110.13162v3 [quant-ph] UPDATED)
    Machine learning algorithms based on parametrized quantum circuits are prime candidates for near-term applications on noisy quantum computers. In this direction, various types of quantum machine learning models have been introduced and studied extensively. Yet, our understanding of how these models compare, both mutually and to classical models, remains limited. In this work, we identify a constructive framework that captures all standard models based on parametrized quantum circuits: that of linear quantum models. In particular, we show using tools from quantum information theory how data re-uploading circuits, an apparent outlier of this framework, can be efficiently mapped into the simpler picture of linear models in quantum Hilbert spaces. Furthermore, we analyze the experimentally-relevant resource requirements of these models in terms of qubit number and amount of data needed to learn. Based on recent results from classical machine learning, we prove that linear quantum models must utilize exponentially more qubits than data re-uploading models in order to solve certain learning tasks, while kernel methods additionally require exponentially more data points. Our results provide a more comprehensive view of quantum machine learning models as well as insights on the compatibility of different models with NISQ constraints.
    Sequential Predictive Conformal Inference for Time Series. (arXiv:2212.03463v2 [stat.ML] UPDATED)
    We present a new distribution-free conformal prediction algorithm for sequential data (e.g., time series), called the \textit{sequential predictive conformal inference} (\texttt{SPCI}). We specifically account for the nature that time series data are non-exchangeable, and thus many existing conformal prediction algorithms are not applicable. The main idea is to exploit the temporal dependence of non-conformity scores (e.g., prediction residuals); thus, the past residuals contain information about future ones. Then we cast the problem of conformal prediction interval as predicting the quantile of a future residual, given a user-specified point prediction algorithm. Theoretically, we establish asymptotic valid conditional coverage upon extending consistency analyses in quantile regression. Using simulation and real-data experiments, we demonstrate a significant reduction in interval width of \texttt{SPCI} compared to other existing methods under the desired empirical coverage.
    Simplicity Bias in 1-Hidden Layer Neural Networks. (arXiv:2302.00457v1 [cs.LG])
    Recent works have demonstrated that neural networks exhibit extreme simplicity bias(SB). That is, they learn only the simplest features to solve a task at hand, even in the presence of other, more robust but more complex features. Due to the lack of a general and rigorous definition of features, these works showcase SB on semi-synthetic datasets such as Color-MNIST, MNIST-CIFAR where defining features is relatively easier. In this work, we rigorously define as well as thoroughly establish SB for one hidden layer neural networks. More concretely, (i) we define SB as the network essentially being a function of a low dimensional projection of the inputs (ii) theoretically, we show that when the data is linearly separable, the network primarily depends on only the linearly separable ($1$-dimensional) subspace even in the presence of an arbitrarily large number of other, more complex features which could have led to a significantly more robust classifier, (iii) empirically, we show that models trained on real datasets such as Imagenette and Waterbirds-Landbirds indeed depend on a low dimensional projection of the inputs, thereby demonstrating SB on these datasets, iv) finally, we present a natural ensemble approach that encourages diversity in models by training successive models on features not used by earlier models, and demonstrate that it yields models that are significantly more robust to Gaussian noise.
    Optimal Learning of Deep Random Networks of Extensive-width. (arXiv:2302.00375v1 [stat.ML])
    We consider the problem of learning a target function corresponding to a deep, extensive-width, non-linear neural network with random Gaussian weights. We consider the asymptotic limit where the number of samples, the input dimension and the network width are proportionally large. We derive a closed-form expression for the Bayes-optimal test error, for regression and classification tasks. We contrast these Bayes-optimal errors with the test errors of ridge regression, kernel and random features regression. We find, in particular, that optimally regularized ridge regression, as well as kernel regression, achieve Bayes-optimal performances, while the logistic loss yields a near-optimal test error for classification. We further show numerically that when the number of samples grows faster than the dimension, ridge and kernel methods become suboptimal, while neural networks achieve test error close to zero from quadratically many samples.
    Probabilistic Neural Data Fusion for Learning from an Arbitrary Number of Multi-fidelity Data Sets. (arXiv:2301.13271v1 [cs.LG] CROSS LISTED)
    In many applications in engineering and sciences analysts have simultaneous access to multiple data sources. In such cases, the overall cost of acquiring information can be reduced via data fusion or multi-fidelity (MF) modeling where one leverages inexpensive low-fidelity (LF) sources to reduce the reliance on expensive high-fidelity (HF) data. In this paper, we employ neural networks (NNs) for data fusion in scenarios where data is very scarce and obtained from an arbitrary number of sources with varying levels of fidelity and cost. We introduce a unique NN architecture that converts MF modeling into a nonlinear manifold learning problem. Our NN architecture inversely learns non-trivial (e.g., non-additive and non-hierarchical) biases of the LF sources in an interpretable and visualizable manifold where each data source is encoded via a low-dimensional distribution. This probabilistic manifold quantifies model form uncertainties such that LF sources with small bias are encoded close to the HF source. Additionally, we endow the output of our NN with a parametric distribution not only to quantify aleatoric uncertainties, but also to reformulate the network's loss function based on strictly proper scoring rules which improve robustness and accuracy on unseen HF data. Through a set of analytic and engineering examples, we demonstrate that our approach provides a high predictive power while quantifying various sources uncertainties.
    Robust Fitted-Q-Evaluation and Iteration under Sequentially Exogenous Unobserved Confounders. (arXiv:2302.00662v1 [stat.ML])
    Offline reinforcement learning is important in domains such as medicine, economics, and e-commerce where online experimentation is costly, dangerous or unethical, and where the true model is unknown. However, most methods assume all covariates used in the behavior policy's action decisions are observed. This untestable assumption may be incorrect. We study robust policy evaluation and policy optimization in the presence of unobserved confounders. We assume the extent of possible unobserved confounding can be bounded by a sensitivity model, and that the unobserved confounders are sequentially exogenous. We propose and analyze an (orthogonalized) robust fitted-Q-iteration that uses closed-form solutions of the robust Bellman operator to derive a loss minimization problem for the robust Q function. Our algorithm enjoys the computational ease of fitted-Q-iteration and statistical improvements (reduced dependence on quantile estimation error) from orthogonalization. We provide sample complexity bounds, insights, and show effectiveness in simulations.
    Inductive Bias of Gradient Descent for Weight Normalized Smooth Homogeneous Neural Nets. (arXiv:2010.12909v3 [cs.LG] UPDATED)
    We analyze the inductive bias of gradient descent for weight normalized smooth homogeneous neural nets, when trained on exponential or cross-entropy loss. We analyse both standard weight normalization (SWN) and exponential weight normalization (EWN), and show that the gradient flow path with EWN is equivalent to gradient flow on standard networks with an adaptive learning rate. We extend these results to gradient descent, and establish asymptotic relations between weights and gradients for both SWN and EWN. We also show that EWN causes weights to be updated in a way that prefers asymptotic relative sparsity. For EWN, we provide a finite-time convergence rate of the loss with gradient flow and a tight asymptotic convergence rate with gradient descent. We demonstrate our results for SWN and EWN on synthetic data sets. Experimental results on simple datasets support our claim on sparse EWN solutions, even with SGD. This demonstrates its potential applications in learning neural networks amenable to pruning.
    Generative methods for sampling transition paths in molecular dynamics. (arXiv:2205.02818v2 [stat.ML] UPDATED)
    Molecular systems often remain trapped for long times around some local minimum of the potential energy function, before switching to another one -- a behavior known as metastability. Simulating transition paths linking one metastable state to another one is difficult by direct numerical methods. In view of the promises of machine learning techniques, we explore in this work two approaches to more efficiently generate transition paths: sampling methods based on generative models such as variational autoencoders, and importance sampling methods based on reinforcement learning.
    Stream-based active learning with linear models. (arXiv:2207.09874v3 [stat.ML] UPDATED)
    The proliferation of automated data collection schemes and the advances in sensorics are increasing the amount of data we are able to monitor in real-time. However, given the high annotation costs and the time required by quality inspections, data is often available in an unlabeled form. This is fostering the use of active learning for the development of soft sensors and predictive models. In production, instead of performing random inspections to obtain product information, labels are collected by evaluating the information content of the unlabeled data. Several query strategy frameworks for regression have been proposed in the literature but most of the focus has been dedicated to the static pool-based scenario. In this work, we propose a new strategy for the stream-based scenario, where instances are sequentially offered to the learner, which must instantaneously decide whether to perform the quality check to obtain the label or discard the instance. The approach is inspired by the optimal experimental design theory and the iterative aspect of the decision-making process is tackled by setting a threshold on the informativeness of the unlabeled data points. The proposed approach is evaluated using numerical simulations and the Tennessee Eastman Process simulator. The results confirm that selecting the examples suggested by the proposed algorithm allows for a faster reduction in the prediction error.
    Additive Higher-Order Factorization Machines. (arXiv:2205.14515v2 [stat.CO] UPDATED)
    In the age of big data and interpretable machine learning, approaches need to work at scale and at the same time allow for a clear mathematical understanding of the method's inner workings. While there exist inherently interpretable semi-parametric regression techniques for large-scale applications to account for non-linearity in the data, their model complexity is still often restricted. One of the main limitations are missing interactions in these models, which are not included for the sake of better interpretability, but also due to untenable computational costs. To address this shortcoming, we derive a scalable high-order tensor product spline model using a factorization approach. Our method allows to include all (higher-order) interactions of non-linear feature effects while having computational costs proportional to a model without interactions. We prove both theoretically and empirically that our methods scales notably better than existing approaches, derive meaningful penalization schemes and also discuss further theoretical aspects. We finally investigate predictive and estimation performance both with synthetic and real data.
    Posterior Sampling for Continuing Environments. (arXiv:2211.15931v2 [cs.LG] UPDATED)
    We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach, continuing PSRL, maintains a statistically plausible model of the environment and follows a policy that maximizes expected $\gamma$-discounted return in that model. At each time, with probability $1-\gamma$, the model is replaced by a sample from the posterior distribution over environments. For a choice of discount factor that suitably depends on the horizon $T$, we establish an $\tilde{O}(\tau S \sqrt{A T})$ bound on the Bayesian regret, where $S$ is the number of environment states, $A$ is the number of actions, and $\tau$ denotes the reward averaging time, which is a bound on the duration required to accurately estimate the average reward of any policy. Our work is the first to formalize and rigorously analyze the resampling approach with randomized exploration.
    Sliced Optimal Partial Transport. (arXiv:2212.08049v4 [cs.LG] UPDATED)
    Optimal transport (OT) has become exceedingly popular in machine learning, data science, and computer vision. The core assumption in the OT problem is the equal total amount of mass in source and target measures, which limits its application. Optimal Partial Transport (OPT) is a recently proposed solution to this limitation. Similar to the OT problem, the computation of OPT relies on solving a linear programming problem (often in high dimensions), which can become computationally prohibitive. In this paper, we propose an efficient algorithm for calculating the OPT problem between two non-negative measures in one dimension. Next, following the idea of sliced OT distances, we utilize slicing to define the sliced OPT distance. Finally, we demonstrate the computational and accuracy benefits of the sliced OPT-based method in various numerical experiments. In particular, we show an application of our proposed Sliced-OPT in noisy point cloud registration.
    Predicting CSI Sequences With Attention-Based Neural Networks. (arXiv:2302.00341v1 [stat.ML])
    In this work, we consider the problem of multi-step channel prediction in wireless communication systems. In existing works, autoregressive (AR) models are either replaced or combined with feed-forward neural networks(NNs) or, alternatively, with recurrent neural networks (RNNs). This paper explores the possibility of using sequence-to-sequence (Seq2Seq) and transformer neural network (TNN) models for channel state information (CSI) prediction. Simulation results show that both, Seq2Seq and TNNs, represent an appealing alternative to RNNs and feed-forward NNs in the context of CSI prediction. Additionally, the TNN with a few adaptations can extrapolate better than other models to CSI sequences that are either shorter or longer than the ones the model saw during training.
    Deterministic equivalent and error universality of deep random features learning. (arXiv:2302.00401v1 [stat.ML])
    This manuscript considers the problem of learning a random Gaussian network function using a fully connected network with frozen intermediate layers and trainable readout layer. This problem can be seen as a natural generalization of the widely studied random features model to deeper architectures. First, we prove Gaussian universality of the test error in a ridge regression setting where the learner and target networks share the same intermediate layers, and provide a sharp asymptotic formula for it. Establishing this result requires proving a deterministic equivalent for traces of the deep random features sample covariance matrices which can be of independent interest. Second, we conjecture the asymptotic Gaussian universality of the test error in the more general setting of arbitrary convex losses and generic learner/target architectures. We provide extensive numerical evidence for this conjecture, which requires the derivation of closed-form expressions for the layer-wise post-activation population covariances. In light of our results, we investigate the interplay between architecture design and implicit regularization.
    Variational Causal Inference. (arXiv:2209.05935v2 [stat.ML] UPDATED)
    Estimating an individual's potential outcomes under counterfactual treatments is a challenging task for traditional causal inference and supervised learning approaches when the outcome is high-dimensional (e.g. gene expressions, impulse responses, human faces) and covariates are relatively limited. In this case, to construct one's outcome under a counterfactual treatment, it is crucial to leverage individual information contained in its observed factual outcome on top of the covariates. We propose a deep variational Bayesian framework that rigorously integrates two main sources of information for outcome construction under a counterfactual treatment: one source is the individual features embedded in the high-dimensional factual outcome; the other source is the response distribution of similar subjects (subjects with the same covariates) that factually received this treatment of interest.
    Distribution free optimality intervals for clustering. (arXiv:2107.14442v2 [stat.ML] UPDATED)
    We address the problem of validating the ouput of clustering algorithms. Given data $\mathcal{D}$ and a partition $\mathcal{C}$ of these data into $K$ clusters, when can we say that the clusters obtained are correct or meaningful for the data? This paper introduces a paradigm in which a clustering $\mathcal{C}$ is considered meaningful if it is good with respect to a loss function such as the K-means distortion, and stable, i.e. the only good clustering up to small perturbations. Furthermore, we present a generic method to obtain post-inference guarantees of near-optimality and stability for a clustering $\mathcal{C}$. The method can be instantiated for a variety of clustering criteria (also called loss functions) for which convex relaxations exist. Obtaining the guarantees amounts to solving a convex optimization problem. We demonstrate the practical relevance of this method by obtaining guarantees for the K-means and the Normalized Cut clustering criteria on realistic data sets. We also prove that asymptotic instability implies finite sample instability w.h.p., allowing inferences about the population clusterability from a sample. The guarantees do not depend on any distributional assumptions, but they depend on the data set $\mathcal{D}$ admitting a stable clustering.
    Approximate Bayesian Computation with Path Signatures. (arXiv:2106.12555v2 [stat.ME] UPDATED)
    Simulation models often lack tractable likelihood functions, making likelihood-free inference methods indispensable. Approximate Bayesian computation generates likelihood-free posterior samples by comparing simulated and observed data through some distance measure, but existing approaches are often poorly suited to time series simulators, for example due to an independent and identically distributed data assumption. In this paper, we propose to use path signatures in approximate Bayesian computation to handle the sequential nature of time series. We provide theoretical guarantees on the resultant posteriors and demonstrate competitive Bayesian parameter inference for simulators generating univariate, multivariate, irregularly spaced, and even non-Euclidean sequences.
    Private Online Prediction from Experts: Separations and Faster Rates. (arXiv:2210.13537v2 [cs.LG] UPDATED)
    Online prediction from experts is a fundamental problem in machine learning and several works have studied this problem under privacy constraints. We propose and analyze new algorithms for this problem that improve over the regret bounds of the best existing algorithms for non-adaptive adversaries. For approximate differential privacy, our algorithms achieve regret bounds of $\tilde{O}(\sqrt{T \log d} + \log d/\varepsilon)$ for the stochastic setting and $\tilde O(\sqrt{T \log d} + T^{1/3} \log d/\varepsilon)$ for oblivious adversaries (where $d$ is the number of experts). For pure DP, our algorithms are the first to obtain sub-linear regret for oblivious adversaries in the high-dimensional regime $d \ge T$. Moreover, we prove new lower bounds for adaptive adversaries. Our results imply that unlike the non-private setting, there is a strong separation between the optimal regret for adaptive and non-adaptive adversaries for this problem. Our lower bounds also show a separation between pure and approximate differential privacy for adaptive adversaries where the latter is necessary to achieve the non-private $O(\sqrt{T})$ regret.
    How to select predictive models for causal inference?. (arXiv:2302.00370v1 [stat.ML])
    Predictive models -- as with machine learning -- can underpin causal inference, to estimate the effects of an intervention at the population or individual level. This opens the door to a plethora of models, useful to match the increasing complexity of health data, but also the Pandora box of model selection: which of these models yield the most valid causal estimates? Classic machine-learning cross-validation procedures are not directly applicable. Indeed, an appropriate selection procedure for causal inference should equally weight both outcome errors for each individual, treated or not treated, whereas one outcome may be seldom observed for a sub-population. We study how more elaborate risks benefit causal model selection. We show theoretically that simple risks are brittle to weak overlap between treated and non-treated individuals as well as to heterogeneous errors between populations. Rather a more elaborate metric, the R-risk appears as a proxy of the oracle error on causal estimates, observable at the cost of an overlap re-weighting. As the R-risk is defined not only from model predictions but also by using the conditional mean outcome and the treatment probability, using it for model selection requires adapting cross validation. Extensive experiments show that the resulting procedure gives the best causal model selection.
    Learning Equilibria in Matching Markets from Bandit Feedback. (arXiv:2108.08843v2 [cs.LG] UPDATED)
    Large-scale, two-sided matching platforms must find market outcomes that align with user preferences while simultaneously learning these preferences from data. Classical notions of stability (Gale and Shapley, 1962; Shapley and Shubik, 1971) are unfortunately of limited value in the learning setting, given that preferences are inherently uncertain and destabilizing while they are being learned. To bridge this gap, we develop a framework and algorithms for learning stable market outcomes under uncertainty. Our primary setting is matching with transferable utilities, where the platform both matches agents and sets monetary transfers between them. We design an incentive-aware learning objective that captures the distance of a market outcome from equilibrium. Using this objective, we analyze the complexity of learning as a function of preference structure, casting learning as a stochastic multi-armed bandit problem. Algorithmically, we show that "optimism in the face of uncertainty," the principle underlying many bandit algorithms, applies to a primal-dual formulation of matching with transfers and leads to near-optimal regret bounds. Our work takes a first step toward elucidating when and how stable matchings arise in large, data-driven marketplaces.
    Incorporating Sum Constraints into Multitask Gaussian Processes. (arXiv:2202.01793v3 [stat.ML] UPDATED)
    Machine learning models can be improved by adapting them to respect existing background knowledge. In this paper we consider multitask Gaussian processes, with background knowledge in the form of constraints that require a specific sum of the outputs to be constant. This is achieved by conditioning the prior distribution on the constraint fulfillment. The approach allows for both linear and nonlinear constraints. We demonstrate that the constraints are fulfilled with high precision and that the construction can improve the overall prediction accuracy as compared to the standard Gaussian process.
    Automatically Marginalized MCMC in Probabilistic Programming. (arXiv:2302.00564v1 [cs.LG])
    Hamiltonian Monte Carlo (HMC) is a powerful algorithm to sample latent variables from Bayesian models. The advent of probabilistic programming languages (PPLs) frees users from writing inference algorithms and lets users focus on modeling. However, many models are difficult for HMC to solve directly, which often require tricks like model reparameterization. We are motivated by the fact that many of those models could be simplified by marginalization. We propose to use automatic marginalization as part of the sampling process using HMC in a graphical model extracted from a PPL, which substantially improves sampling from real-world hierarchical models.
    Offline Estimation of Controlled Markov Chains: Minimaxity and Sample Complexity. (arXiv:2211.07092v3 [stat.ML] UPDATED)
    In this work, we study a natural nonparametric estimator of the transition probability matrices of a finite controlled Markov chain. We consider an offline setting with a fixed dataset, collected using a so-called logging policy. We develop sample complexity bounds for the estimator and establish conditions for minimaxity. Our statistical bounds depend on the logging policy through its mixing properties. We show that achieving a particular statistical risk bound involves a subtle and interesting trade-off between the strength of the mixing properties and the number of samples. We demonstrate the validity of our results under various examples, such as ergodic Markov chains, weakly ergodic inhomogeneous Markov chains, and controlled Markov chains with non-stationary Markov, episodic, and greedy controls. Lastly, we use these sample complexity bounds to establish concomitant ones for offline evaluation of stationary Markov control policies.
    A Group-Equivariant Autoencoder for Identifying Spontaneously Broken Symmetries. (arXiv:2202.06319v2 [cond-mat.stat-mech] UPDATED)
    We introduce the group-equivariant autoencoder (GE-autoencoder) -- a deep neural network (DNN) method that locates phase boundaries by determining which symmetries of the Hamiltonian have spontaneously broken at each temperature. We use group theory to deduce which symmetries of the system remain intact in all phases, and then use this information to constrain the parameters of the GE-autoencoder such that the encoder learns an order parameter invariant to these ``never-broken'' symmetries. This procedure produces a dramatic reduction in the number of free parameters such that the GE-autoencoder size is independent of the system size. We include symmetry regularization terms in the loss function of the GE-autoencoder so that the learned order parameter is also equivariant to the remaining symmetries of the system. By examining the group representation by which the learned order parameter transforms, we are then able to extract information about the associated spontaneous symmetry breaking. We test the GE-autoencoder on the 2D classical ferromagnetic and antiferromagnetic Ising models, finding that the GE-autoencoder (1) accurately determines which symmetries have spontaneously broken at each temperature; (2) estimates the critical temperature in the thermodynamic limit with greater accuracy, robustness, and time-efficiency than a symmetry-agnostic baseline autoencoder; and (3) detects the presence of an external symmetry-breaking magnetic field with greater sensitivity than the baseline method. Finally, we describe various key implementation details, including a new method for extracting the critical temperature estimate from trained autoencoders and calculations of the DNN initialization and learning rate settings required for fair model comparisons.
    On the Within-Group Discrimination of Screening Classifiers. (arXiv:2302.00025v1 [cs.LG])
    Screening classifiers are increasingly used to identify qualified candidates in a variety of selection processes. In this context, it has been recently shown that, if a classifier is calibrated, one can identify the smallest set of candidates which contains, in expectation, a desired number of qualified candidates using a threshold decision rule. This lends support to focusing on calibration as the only requirement for screening classifiers. In this paper, we argue that screening policies that use calibrated classifiers may suffer from an understudied type of within-group discrimination -- they may discriminate against qualified members within demographic groups of interest. Further, we argue that this type of discrimination can be avoided if classifiers satisfy within-group monotonicity, a natural monotonicity property within each of the groups. Then, we introduce an efficient post-processing algorithm based on dynamic programming to minimally modify a given calibrated classifier so that its probability estimates satisfy within-group monotonicity. We validate our algorithm using US Census survey data and show that within-group monotonicity can be often achieved at a small cost in terms of prediction granularity and shortlist size.
    A Fast, Well-Founded Approximation to the Empirical Neural Tangent Kernel. (arXiv:2206.12543v2 [stat.ML] UPDATED)
    Empirical neural tangent kernels (eNTKs) can provide a good understanding of a given network's representation: they are often far less expensive to compute and applicable more broadly than infinite width NTKs. For networks with O output units (e.g. an O-class classifier), however, the eNTK on N inputs is of size $NO \times NO$, taking $O((NO)^2)$ memory and up to $O((NO)^3)$ computation. Most existing applications have therefore used one of a handful of approximations yielding $N \times N$ kernel matrices, saving orders of magnitude of computation, but with limited to no justification. We prove that one such approximation, which we call "sum of logits", converges to the true eNTK at initialization for any network with a wide final "readout" layer. Our experiments demonstrate the quality of this approximation for various uses across a range of settings.
    $\texttt{DoCoFL}$: Downlink Compression for Cross-Device Federated Learning. (arXiv:2302.00543v1 [cs.LG])
    Many compression techniques have been proposed to reduce the communication overhead of Federated Learning training procedures. However, these are typically designed for compressing model updates, which are expected to decay throughout training. As a result, such methods are inapplicable to downlink (i.e., from the parameter server to clients) compression in the cross-device setting, where heterogeneous clients $\textit{may appear only once}$ during training and thus must download the model parameters. In this paper, we propose a new framework ($\texttt{DoCoFL}$) for downlink compression in the cross-device federated learning setting. Importantly, $\texttt{DoCoFL}$ can be seamlessly combined with many uplink compression schemes, rendering it suitable for bi-directional compression. Through extensive evaluation, we demonstrate that $\texttt{DoCoFL}$ offers significant bi-directional bandwidth reduction while achieving competitive accuracy to that of $\texttt{FedAvg}$ without compression.
    Width and Depth Limits Commute in Residual Networks. (arXiv:2302.00453v1 [stat.ML])
    We show that taking the width and depth to infinity in a deep neural network with skip connections, when branches are scaled by $1/\sqrt{depth}$ (the only nontrivial scaling), result in the same covariance structure no matter how that limit is taken. This explains why the standard infinite-width-then-depth approach provides practical insights even for networks with depth of the same order as width. We also demonstrate that the pre-activations, in this case, have Gaussian distributions which has direct applications in Bayesian deep learning. We conduct extensive simulations that show an excellent match with our theoretical findings.
    Delayed Feedback in Kernel Bandits. (arXiv:2302.00392v1 [stat.ML])
    Black box optimisation of an unknown function from expensive and noisy evaluations is a ubiquitous problem in machine learning, academic research and industrial production. An abstraction of the problem can be formulated as a kernel based bandit problem (also known as Bayesian optimisation), where a learner aims at optimising a kernelized function through sequential noisy observations. The existing work predominantly assumes feedback is immediately available; an assumption which fails in many real world situations, including recommendation systems, clinical trials and hyperparameter tuning. We consider a kernel bandit problem under stochastically delayed feedback, and propose an algorithm with $\tilde{\mathcal{O}}(\sqrt{\Gamma_k(T)T}+\mathbb{E}[\tau])$ regret, where $T$ is the number of time steps, $\Gamma_k(T)$ is the maximum information gain of the kernel with $T$ observations, and $\tau$ is the delay random variable. This represents a significant improvement over the state of the art regret bound of $\tilde{\mathcal{O}}(\Gamma_k(T)\sqrt{T}+\mathbb{E}[\tau]\Gamma_k(T))$ reported in Verma et al. (2022). In particular, for very non-smooth kernels, the information gain grows almost linearly in time, trivializing the existing results. We also validate our theoretical results with simulations.
    Distributed sequential federated learning. (arXiv:2302.00107v1 [stat.ML])
    The analysis of data stored in multiple sites has become more popular, raising new concerns about the security of data storage and communication. Federated learning, which does not require centralizing data, is a common approach to preventing heavy data transportation, securing valued data, and protecting personal information protection. Therefore, determining how to aggregate the information obtained from the analysis of data in separate local sites has become an important statistical issue. The commonly used averaging methods may not be suitable due to data nonhomogeneity and incomparable results among individual sites, and applying them may result in the loss of information obtained from the individual analyses. Using a sequential method in federated learning with distributed computing can facilitate the integration and accelerate the analysis process. We develop a data-driven method for efficiently and effectively aggregating valued information by analyzing local data without encountering potential issues such as information security and heavy transportation due to data communication. In addition, the proposed method can preserve the properties of classical sequential adaptive design, such as data-driven sample size and estimation precision when applied to generalized linear models. We use numerical studies of simulated data and an application to COVID-19 data collected from 32 hospitals in Mexico, to illustrate the proposed method.  ( 2 min )
    Whats Missing? Learning Hidden Markov Models When the Locations of Missing Observations are Unknown. (arXiv:2203.06527v2 [stat.ML] UPDATED)
    The Hidden Markov Model (HMM) is one of the most widely used statistical models for sequential data analysis, and it has been successfully applied in a large variety of domains. One of the key reasons for this versatility is the ability of HMMs to deal with missing data. However, standard HMM learning algorithms rely crucially on the assumption that the positions of the missing observations within the observation sequence are known. In some situations where such assumptions are not feasible, a number of special algorithms have been developed. Currently, these algorithms rely strongly on specific structural assumptions of the underlying chain, such as acyclicity, and are not applicable in the general case. In particular, there are numerous domains within medicine and computational biology, where the missing observation locations are unknown and acyclicity assumptions do not hold, thus presenting a barrier for the application of HMMs in those fields. In this paper we consider a general problem of learning HMMs from data with unknown missing observation locations (i.e., only the order of the non-missing observations are known). We introduce a generative model of the location omissions and propose two learning methods for this model, a (semi) analytic approach, and a Gibbs sampler. We evaluate and compare the algorithms in a variety of scenarios, measuring their reconstruction precision and robustness under model misspecification.  ( 2 min )
    Diffusion-based Image Translation using Disentangled Style and Content Representation. (arXiv:2209.15264v2 [cs.CV] UPDATED)
    Diffusion-based image translation guided by semantic texts or a single target image has enabled flexible style transfer which is not limited to the specific domains. Unfortunately, due to the stochastic nature of diffusion models, it is often difficult to maintain the original content of the image during the reverse diffusion. To address this, here we present a novel diffusion-based unsupervised image translation method using disentangled style and content representation. Specifically, inspired by the splicing Vision Transformer, we extract intermediate keys of multihead self attention layer from ViT model and used them as the content preservation loss. Then, an image guided style transfer is performed by matching the [CLS] classification token from the denoised samples and target image, whereas additional CLIP loss is used for the text-driven style transfer. To further accelerate the semantic change during the reverse diffusion, we also propose a novel semantic divergence loss and resampling strategy. Our experimental results show that the proposed method outperforms state-of-the-art baseline models in both text-guided and image-guided translation tasks.  ( 2 min )
    Filtering Context Mitigates Scarcity and Selection Bias in Political Ideology Prediction. (arXiv:2302.00239v1 [cs.LG])
    We propose a novel supervised learning approach for political ideology prediction (PIP) that is capable of predicting out-of-distribution inputs. This problem is motivated by the fact that manual data-labeling is expensive, while self-reported labels are often scarce and exhibit significant selection bias. We propose a novel statistical model that decomposes the document embeddings into a linear superposition of two vectors; a latent neutral \emph{context} vector independent of ideology, and a latent \emph{position} vector aligned with ideology. We train an end-to-end model that has intermediate contextual and positional vectors as outputs. At deployment time, our model predicts labels for input documents by exclusively leveraging the predicted positional vectors. On two benchmark datasets we show that our model is capable of outputting predictions even when trained with as little as 5\% biased data, and is significantly more accurate than the state-of-the-art. Through crowd-sourcing we validate the neutrality of contextual vectors, and show that context filtering results in ideological concentration, allowing for prediction on out-of-distribution examples.  ( 2 min )
    Adaptive sparseness for correntropy-based robust regression via automatic relevance determination. (arXiv:2302.00082v1 [cs.LG])
    Sparseness and robustness are two important properties for many machine learning scenarios. In the present study, regarding the maximum correntropy criterion (MCC) based robust regression algorithm, we investigate to integrate the MCC method with the automatic relevance determination (ARD) technique in a Bayesian framework, so that MCC-based robust regression could be implemented with adaptive sparseness. To be specific, we use an inherent noise assumption from the MCC to derive an explicit likelihood function, and realize the maximum a posteriori (MAP) estimation with the ARD prior by variational Bayesian inference. Compared to the existing robust and sparse L1-regularized MCC regression, the proposed MCC-ARD regression can eradicate the troublesome tuning for the regularization hyper-parameter which controls the regularization strength. Further, MCC-ARD achieves superior prediction performance and feature selection capability than L1-regularized MCC, as demonstrated by a noisy and high-dimensional simulation study.  ( 2 min )
    The Parametric Stability of Well-separated Spherical Gaussian Mixtures. (arXiv:2302.00242v1 [stat.ML])
    We quantify the parameter stability of a spherical Gaussian Mixture Model (sGMM) under small perturbations in distribution space. Namely, we derive the first explicit bound to show that for a mixture of spherical Gaussian $P$ (sGMM) in a pre-defined model class, all other sGMM close to $P$ in this model class in total variation distance has a small parameter distance to $P$. Further, this upper bound only depends on $P$. The motivation for this work lies in providing guarantees for fitting Gaussian mixtures; with this aim in mind, all the constants involved are well defined and distribution free conditions for fitting mixtures of spherical Gaussians. Our results tighten considerably the existing computable bounds, and asymptotically match the known sharp thresholds for this problem.  ( 2 min )
    Tensor networks for unsupervised machine learning. (arXiv:2106.12974v2 [cond-mat.stat-mech] UPDATED)
    Modeling the joint distribution of high-dimensional data is a central task in unsupervised machine learning. In recent years, many interests have been attracted to developing learning models based on tensor networks, which have the advantages of a principle understanding of the expressive power using entanglement properties, and as a bridge connecting classical computation and quantum computation. Despite the great potential, however, existing tensor network models for unsupervised machine learning only work as a proof of principle, as their performance is much worse than the standard models such as restricted Boltzmann machines and neural networks. In this Letter, we present autoregressive matrix product states (AMPS), a tensor network model combining matrix product states from quantum many-body physics and autoregressive modeling from machine learning. Our model enjoys the exact calculation of normalized probability and unbiased sampling. We demonstrate the performance of our model using two applications, generative modeling on synthetic and real-world data, and reinforcement learning in statistical physics. Using extensive numerical experiments, we show that the proposed model significantly outperforms the existing tensor network models and the restricted Boltzmann machines, and is competitive with state-of-the-art neural network models.  ( 2 min )
    Revisiting Bellman Errors for Offline Model Selection. (arXiv:2302.00141v1 [cs.LG])
    Offline model selection (OMS), that is, choosing the best policy from a set of many policies given only logged data, is crucial for applying offline RL in real-world settings. One idea that has been extensively explored is to select policies based on the mean squared Bellman error (MSBE) of the associated Q-functions. However, previous work has struggled to obtain adequate OMS performance with Bellman errors, leading many researchers to abandon the idea. Through theoretical and empirical analyses, we elucidate why previous work has seen pessimistic results with Bellman errors and identify conditions under which OMS algorithms based on Bellman errors will perform well. Moreover, we develop a new estimator of the MSBE that is more accurate than prior methods and obtains impressive OMS performance on diverse discrete control tasks, including Atari games. We open-source our data and code to enable researchers to conduct OMS experiments more easily.  ( 2 min )
    Local transfer learning from one data space to another. (arXiv:2302.00160v1 [cs.LG])
    A fundamental problem in manifold learning is to approximate a functional relationship in a data chosen randomly from a probability distribution supported on a low dimensional sub-manifold of a high dimensional ambient Euclidean space. The manifold is essentially defined by the data set itself and, typically, designed so that the data is dense on the manifold in some sense. The notion of a data space is an abstraction of a manifold encapsulating the essential properties that allow for function approximation. The problem of transfer learning (meta-learning) is to use the learning of a function on one data set to learn a similar function on a new data set. In terms of function approximation, this means lifting a function on one data space (the base data space) to another (the target data space). This viewpoint enables us to connect some inverse problems in applied mathematics (such as inverse Radon transform) with transfer learning. In this paper we examine the question of such lifting when the data is assumed to be known only on a part of the base data space. We are interested in determining subsets of the target data space on which the lifting can be defined, and how the local smoothness of the function and its lifting are related.  ( 2 min )
    A Nearly-Optimal Bound for Fast Regression with $\ell_\infty$ Guarantee. (arXiv:2302.00248v1 [cs.DS])
    Given a matrix $A\in \mathbb{R}^{n\times d}$ and a vector $b\in \mathbb{R}^n$, we consider the regression problem with $\ell_\infty$ guarantees: finding a vector $x'\in \mathbb{R}^d$ such that $ \|x'-x^*\|_\infty \leq \frac{\epsilon}{\sqrt{d}}\cdot \|Ax^*-b\|_2\cdot \|A^\dagger\|$ where $x^*=\arg\min_{x\in \mathbb{R}^d}\|Ax-b\|_2$. One popular approach for solving such $\ell_2$ regression problem is via sketching: picking a structured random matrix $S\in \mathbb{R}^{m\times n}$ with $m\ll n$ and $SA$ can be quickly computed, solve the ``sketched'' regression problem $\arg\min_{x\in \mathbb{R}^d} \|SAx-Sb\|_2$. In this paper, we show that in order to obtain such $\ell_\infty$ guarantee for $\ell_2$ regression, one has to use sketching matrices that are dense. To the best of our knowledge, this is the first user case in which dense sketching matrices are necessary. On the algorithmic side, we prove that there exists a distribution of dense sketching matrices with $m=\epsilon^{-2}d\log^3(n/\delta)$ such that solving the sketched regression problem gives the $\ell_\infty$ guarantee, with probability at least $1-\delta$. Moreover, the matrix $SA$ can be computed in time $O(nd\log n)$. Our row count is nearly-optimal up to logarithmic factors, and significantly improves the result in [Price, Song and Woodruff, ICALP'17], in which a super-linear in $d$ rows, $m=\Omega(\epsilon^{-2}d^{1+\gamma})$ for $\gamma=\Theta(\sqrt{\frac{\log\log n}{\log d}})$ is required. We also develop a novel analytical framework for $\ell_\infty$ guarantee regression that utilizes the Oblivious Coordinate-wise Embedding (OCE) property introduced in [Song and Yu, ICML'21]. Our analysis is arguably much simpler and more general than [Price, Song and Woodruff, ICALP'17], and it extends to dense sketches for tensor product of vectors.  ( 2 min )
    Implicit Regularization Leads to Benign Overfitting for Sparse Linear Regression. (arXiv:2302.00257v1 [cs.LG])
    In deep learning, often the training process finds an interpolator (a solution with 0 training loss), but the test loss is still low. This phenomenon, known as benign overfitting, is a major mystery that received a lot of recent attention. One common mechanism for benign overfitting is implicit regularization, where the training process leads to additional properties for the interpolator, often characterized by minimizing certain norms. However, even for a simple sparse linear regression problem $y = \beta^{*\top} x +\xi$ with sparse $\beta^*$, neither minimum $\ell_1$ or $\ell_2$ norm interpolator gives the optimal test loss. In this work, we give a different parametrization of the model which leads to a new implicit regularization effect that combines the benefit of $\ell_1$ and $\ell_2$ interpolators. We show that training our new model via gradient descent leads to an interpolator with near-optimal test loss. Our result is based on careful analysis of the training dynamics and provides another example of implicit regularization effect that goes beyond norm minimization.  ( 2 min )
    Quickest Change Detection for Unnormalized Statistical Models. (arXiv:2302.00250v1 [stat.ML])
    Classical quickest change detection algorithms require modeling pre-change and post-change distributions. Such an approach may not be feasible for various machine learning models because of the complexity of computing the explicit distributions. Additionally, these methods may suffer from a lack of robustness to model mismatch and noise. This paper develops a new variant of the classical Cumulative Sum (CUSUM) algorithm for the quickest change detection. This variant is based on Fisher divergence and the Hyv\"arinen score and is called the Score-based CUSUM (SCUSUM) algorithm. The SCUSUM algorithm allows the applications of change detection for unnormalized statistical models, i.e., models for which the probability density function contains an unknown normalization constant. The asymptotic optimality of the proposed algorithm is investigated by deriving expressions for average detection delay and the mean running time to a false alarm. Numerical results are provided to demonstrate the performance of the proposed algorithm.  ( 2 min )
    Accelerated First-Order Optimization under Nonlinear Constraints. (arXiv:2302.00316v1 [math.OC])
    We exploit analogies between first-order algorithms for constrained optimization and non-smooth dynamical systems to design a new class of accelerated first-order algorithms for constrained optimization. Unlike Frank-Wolfe or projected gradients, these algorithms avoid optimization over the entire feasible set at each iteration. We prove convergence to stationary points even in a nonconvex setting and we derive rates for the convex setting. An important property of these algorithms is that constraints are expressed in terms of velocities instead of positions, which naturally leads to sparse, local and convex approximations of the feasible set (even if the feasible set is nonconvex). Thus, the complexity tends to grow mildly in the number of decision variables and in the number of constraints, which makes the algorithms suitable for machine learning applications. We apply our algorithms to a compressed sensing and a sparse regression problem, showing that we can treat nonconvex $\ell^p$ constraints ($p<1$) efficiently, while recovering state-of-the-art performance for $p=1$.  ( 2 min )
    Gradient Descent in Neural Networks as Sequential Learning in RKBS. (arXiv:2302.00205v1 [stat.ML])
    The study of Neural Tangent Kernels (NTKs) has provided much needed insight into convergence and generalization properties of neural networks in the over-parametrized (wide) limit by approximating the network using a first-order Taylor expansion with respect to its weights in the neighborhood of their initialization values. This allows neural network training to be analyzed from the perspective of reproducing kernel Hilbert spaces (RKHS), which is informative in the over-parametrized regime, but a poor approximation for narrower networks as the weights change more during training. Our goal is to extend beyond the limits of NTK toward a more general theory. We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights as an inner product of two feature maps, respectively from data and weight-step space, to feature space, allowing neural network training to be analyzed from the perspective of reproducing kernel {\em Banach} space (RKBS). We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning in RKBS. Using this, we present novel bound on uniform convergence where the iterations count and learning rate play a central role, giving new theoretical insight into neural network training.  ( 2 min )
    Training Normalizing Flows with the Precision-Recall Divergence. (arXiv:2302.00628v1 [cs.LG])
    Generative models can have distinct mode of failures like mode dropping and low quality samples, which cannot be captured by a single scalar metric. To address this, recent works propose evaluating generative models using precision and recall, where precision measures quality of samples and recall measures the coverage of the target distribution. Although a variety of discrepancy measures between the target and estimated distribution are used to train generative models, it is unclear what precision-recall trade-offs are achieved by various choices of the discrepancy measures. In this paper, we show that achieving a specified precision-recall trade-off corresponds to minimising -divergences from a family we call the {\em PR-divergences }. Conversely, any -divergence can be written as a linear combination of PR-divergences and therefore correspond to minimising a weighted precision-recall trade-off. Further, we propose a novel generative model that is able to train a normalizing flow to minimise any -divergence, and in particular, achieve a given precision-recall trade-off.  ( 2 min )
    Deep learning for $\psi$-weakly dependent processes. (arXiv:2302.00333v1 [stat.ML])
    In this paper, we perform deep neural networks for learning $\psi$-weakly dependent processes. Such weak-dependence property includes a class of weak dependence conditions such as mixing, association,$\cdots$ and the setting considered here covers many commonly used situations such as: regression estimation, time series prediction, time series classification,$\cdots$ The consistency of the empirical risk minimization algorithm in the class of deep neural networks predictors is established. We achieve the generalization bound and obtain a learning rate, which is less than $\mathcal{O}(n^{-1/\alpha})$, for all $\alpha > 2 $. Applications to binary time series classification and prediction in affine causal models with exogenous covariates are carried out. Some simulation results are provided, as well as an application to the US recession data.  ( 2 min )
    Robust online active learning. (arXiv:2302.00422v1 [stat.ML])
    In many industrial applications, obtaining labeled observations is not straightforward as it often requires the intervention of human experts or the use of expensive testing equipment. In these circumstances, active learning can be highly beneficial in suggesting the most informative data points to be used when fitting a model. Reducing the number of observations needed for model development alleviates both the computational burden required for training and the operational expenses related to labeling. Online active learning, in particular, is useful in high-volume production processes where the decision about the acquisition of the label for a data point needs to be taken within an extremely short time frame. However, despite the recent efforts to develop online active learning strategies, the behavior of these methods in the presence of outliers has not been thoroughly examined. In this work, we investigate the performance of online active linear regression in contaminated data streams. Our study shows that the currently available query strategies are prone to sample outliers, whose inclusion in the training set eventually degrades the predictive performance of the models. To address this issue, we propose a solution that bounds the search area of a conditional D-optimal algorithm and uses a robust estimator. Our approach strikes a balance between exploring unseen regions of the input space and protecting against outliers. Through numerical simulations, we show that the proposed method is effective in improving the performance of online active learning in the presence of outliers, thus expanding the potential applications of this powerful tool.  ( 2 min )
    The geometry of hidden representations of large transformer models. (arXiv:2302.00294v1 [cs.LG])
    Large transformers are powerful architectures for self-supervised analysis of data of various nature, ranging from protein sequences to text to images. In these models, the data representation in the hidden layers live in the same space, and the semantic structure of the dataset emerges by a sequence of functionally identical transformations between one representation and the next. We here characterize the geometric and statistical properties of these representations, focusing on the evolution of such proprieties across the layers. By analyzing geometric properties such as the intrinsic dimension (ID) and the neighbor composition we find that the representations evolve in a strikingly similar manner in transformers trained on protein language tasks and image reconstruction tasks. In the first layers, the data manifold expands, becoming high-dimensional, and then it contracts significantly in the intermediate layers. In the last part of the model, the ID remains approximately constant or forms a second shallow peak. We show that the semantic complexity of the dataset emerges at the end of the first peak. This phenomenon can be observed across many models trained on diverse datasets. Based on these observations, we suggest using the ID profile as an unsupervised proxy to identify the layers which are more suitable for downstream learning tasks.  ( 2 min )

  • Open

    Creating "Her" using GPT-3 & TTS trained on voice from movie
    submitted by /u/justLV [link] [comments]  ( 40 min )
    A look at Rewind.AI - the search engine for life?
    submitted by /u/arnolds112 [link] [comments]  ( 40 min )
    How to make an AI learn osu!? (for school project year)
    Hello, I am currently in Highschool and we have a Project Year in computer science. I recently got really interested in AI and especially AI's that are able to learn how to play a specific game. You may or may not know Vedal987 but I am taking huge inspiration from him on how his AI "Neuro-sama" is able to play osu! I want to do something similar with the same game but, I really am a beginner with AI's of that kind so I wanted to ask if you have any learning materials for me or guides on how to start and all the stuff that I could need to actually execute my plan (and if it even is possible within a year) I am an expert in Python (also good in Java and Javascript), I can code and stuff but have no clue how to start out since I've never done this. Thanks in advance :) (feel free to ask me if you dont understand something, i am bad in expressing myself) submitted by /u/iLeg1999 [link] [comments]  ( 41 min )
    We built a mobile app powered by GPT, specialising in daily task assistance 🚀
    Hey folks! 👋 I am Ed, one of the creators of Toucan. Super excited to share what we've been working on 👇 Toucan is a mobile app that helps you get the most out of GPT-3 “on the go” – with a slick mobile UX, powerful wrap-around features and fine-tuned models. If you're keen to learn a bit more about our journey, you can read on below, of if you just want to throw your best queries at GPT, you can download Toucan on the App Store, here: https://apps.apple.com/us/app/toucan-ai-chatbot-assistant/id1665298806 🧩 Our story We started Toucan to make engaging with cutting-edge AI on mobile devices a vastly better experience. No more browser hopping, endless auth challenges, conversation loss or clunky UI interactions. We’ve also built handy supporting functionalities around the AI layer (s…  ( 42 min )
    1-click deploy for your GPT-3 App
    Link - https://github.com/ClerkieAI/berri_ai We made a package that makes it easy for you to quickly deploy your LLM Agent from Google Colab to production (Web App and API Endpoint). How it works? Just install the package, import the function, and run deploy. At the end of the deploy (~10-15mins), you will get: A web app to interact with your agent 👉 https://agent-repo-35aa2cf3-a0a1-4cf8-834f-302e5b7fe07e-4524... An endpoint you can query 👉 https://agent-repo-35aa2cf3-a0a1-4cf8-834f-302e5b7fe07e-4524... is obama?" Want a more detailed walkthrough? Check out our loom - https://www.loom.com/share/fd4375b4a77f4ea7802369cb06a16d43 We’re still early so would love your feedback and opinions. Feel free to try us out for free – and if you need help building an agent / want a specific integration, just let us know! https://i.redd.it/xu6a92464ufa1.gif submitted by /u/VideoTo [link] [comments]  ( 41 min )
    AI Dream 124 - Great Relaxing Psychedelic AI Video
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    AI tool for typography?
    I'm wondering if such a thing exists. Say I have the word "Eyeball" or anything at all. I want to find a font that makes the word "Eyeball" look good, while adding parameters such as - professional, business like, etc - possibly aiding in its suggestions. I already have been using Chat GPT to create concepts with font suggestions, but it would be neat if there was a dedicated tool for this. Is there currently any AI out there that can do that? submitted by /u/Pinkisacoloryes [link] [comments]  ( 41 min )
    is there an ai for helping YOU learn?
    Ive heard of all types of ai but is there one every conceived of helping you learn fast and quicker? submitted by /u/mrmaskfawkes [link] [comments]  ( 41 min )
    Best ChatGPT Alternatives for 2023
    submitted by /u/visimens-technology [link] [comments]  ( 40 min )
    Join the AI Art Revolution! Help Create a Masterpiece and Explore the Boundaries of Technology and Creativity.
    Hey reddit, I'm a student that is taking part in a contest focusing on banned works of art, and the relationship between art and society. After trying my hand at Chat GPT and DALLE2 I've spurred an idea for the topic of this contest. A couple details, the contest is in the form of a comic book page, it's surrounding the themes of obscenity, libel, codes, copyrights, etc. I was considering using AI as the subject matter considering the recent controversies surrounding this new technology. Questions such as whose art is it? The person using the technology, or the person owning it? Is it copyrightable considering it's fed data based on previous artists work? I'm relatively new to this space other than witnessing peoples uses on various social media. So given that, my idea was to have Chat GPT generate a fictional story surrounding the relationship of society and art, tweaking until I have something that can be applied into a comic book form, then feeding parameters into DALLE2 to create comic book panels. Given the controversial nature of utilizing AI for school related essays, homework, etc I was curious what this community thinks of this idea? I felt like it was one two punch, utilizing new technology to create artwork that's relevant to the topic at hand and also has the potential to be "banned" so to speak from participating in the contest. submitted by /u/KylesButler [link] [comments]  ( 42 min )
    AI is terrifying if you actually think about what we can do with it ... it made me question what it means to even be human
    submitted by /u/andytk33 [link] [comments]  ( 43 min )
    I Made a List of The 5 Best AI Detection Tools
    submitted by /u/HODLTID [link] [comments]  ( 40 min )
    📌[Searchcolab] League of Legends characters at 80's Dark Fantasy Movie. SafeTensor Link in comments.
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    📌[Searchcolab] What is the reason behind the noticeable similarity in quality between the voice cloning results generated by Elevenlabs and Microsoft VALL-E? Unofficial implementation of Microsoft VALL-E present on Searchcolab.
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 41 min )
    A.I Eminem - 'Raw Sh*t'
    submitted by /u/DANGERD0OM [link] [comments]  ( 40 min )
    The Moravec Paradox is Not Paradoxical: Intelligence from the Primordial Soup of Signs
    submitted by /u/SonntagMorgen [link] [comments]  ( 40 min )
    🌌 Why A.I. is Increasingly a Game Changer in Astronomy and Cosmology
    submitted by /u/BackgroundResult [link] [comments]  ( 40 min )
    Chinese Company Gearing Up to Release Powerful ChatGPT Competitor
    submitted by /u/Mental_Character7367 [link] [comments]  ( 44 min )
    Join the experiment: exploring a new approach to emotional support with the AI Friend/Counselor Bot
    I've been working on a project that uses the OpenAI API to create a friend/counselor bot. The goal of this bot is to provide emotional support and have meaningful conversations about a variety of topics including work, relationships, emotions, health, and more. If you're interested, you can check out the demo of the bot at https://friendbot.ai. I'm looking for alpha testers to help me improve the features such as handling long interactions, sematics etc. Participation in the experiment is anonymous and no data about interactions are stored on the servers. submitted by /u/M-K111 [link] [comments]  ( 41 min )
    AI writes video about itself. The Rise of AI Art: A Creative Revolution.
    submitted by /u/anekii [link] [comments]  ( 40 min )
    Write Faster SQL With AI
    submitted by /u/EloquentPickle [link] [comments]  ( 40 min )
    Pokémon as an 80’s Dark Fantasy Film (AI Generated Art)
    submitted by /u/HooverHooverHoober [link] [comments]  ( 40 min )
    Web 3.0:Era of condensed knowledge
    Yandex(Russian search engine),Google,Bing and Baidu race to release chatbots as replacement for traditional search. Here is what I think will happen: Most websites will die out. Why?You don't need 10000 recipe websites.You need culinary Wikipedia.And a chatbot to query it.99% percent of recipes are not unique.Culinary theory is pretty established.They can be condensed into branching stricture.e.g. Pasta with tomatoes.Pasta with cheese. Plumbing.Pretty much the same. Programming.All you need is documentation pretty much.You can even analyse open source projects and generate documentation on the fly.Tutorials and best practices can be condensed. News.I was thinking about extracting news from data collected from ambient computing.E.g. Google develops new feature.News article is generated from Todo lists,meeting notes,code.But it's too far out for now. And list goes on and on. You need 100 Wikipedia to contain all human knowledge.And even if websites will refuse to cooperate it doesn't matter.There is plenty of copies of the web. submitted by /u/nikitastaf1996 [link] [comments]  ( 44 min )
    AI's Effects on/Associations with cognitive processes, memory, and other importance in psychology
    Hello! Does anyone know of a decent scholarly article on something relevant to this that they could share with me? ​ Thanks! :) submitted by /u/EvilPeppermintHelix [link] [comments]  ( 40 min )
    How do you find new ai-based software/programs?
    There are numerous amazing AI-powered software/programs available, such as Wave2Lip on GitHub (which enables you to lip sync a video to any voice audio file), CascadeOur (which generates AI animations), or Google Film (which creates a seamless slow-motion effect from two distant frames). These programs possess tremendous potential, but how can I discover new ones? I'm sure there are many more out there that I am not aware of. Can you suggest any tips on how I can find them? submitted by /u/PM_ME_LIFE_MEANING [link] [comments]  ( 41 min )
  • Open

    Lagrange multiplier setup: Now what?
    Suppose you need to optimize, i.e. maximize or minimize, a function f(x). If this is a practical problem and not a textbook exercise, you probably need to optimize f(x) subject to some constraint on x, say g(x) = 0. Hmm. Optimize one function subject to a constraint given by another function. Oh yeah, Lagrange multipliers! […] Lagrange multiplier setup: Now what? first appeared on John D. Cook.  ( 6 min )
  • Open

    Predict football punt and kickoff return yards with fat-tailed distribution using GluonTS
    Today, the NFL is continuing their journey to increase the number of statistics provided by the Next Gen Stats Platform to all 32 teams and fans alike. With advanced analytics derived from machine learning (ML), the NFL is creating new ways to quantify football, and to provide fans with the tools needed to increase their […]  ( 10 min )
    Analyze and visualize multi-camera events using Amazon SageMaker Studio Lab
    The National Football League (NFL) is one of the most popular sports leagues in the United States and is the most valuable sports league in the world. The NFL, BioCore, and AWS are committed to advancing human understanding around the diagnosis, prevention, and treatment of sports-related injuries to make the game of football safer. More […]  ( 10 min )
  • Open

    Google Research, 2022 & beyond: ML & computer systems
    Posted by Phitchaya Mangpo Phothilimthana, Staff Research Scientist, and Adam Paszke, Staff Research Scientist, Google Research (This is Part 3 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.) Great machine learning (ML) research requires great systems. With the increasing sophistication of the algorithms and hardware in use today and with the scale at which they run, the complexity of the software necessary to carry out day-to-day tasks only increases. In this post, we provide an overview of the numerous advances made across Google this past year in systems for ML that enable us to support the serving and training of complex models while easing the complexity of implementation for end users. This b…  ( 97 min )
    Open Source Vizier: Towards reliable and flexible hyperparameter and blackbox optimization
    Posted by Xingyou (Richard) Song, Research Scientist, and Chansoo Lee, Software Engineer, Google Research, Brain Team Google Vizier is the de-facto system for blackbox optimization over objective functions and hyperparameters across Google, having serviced some of Google’s largest research efforts and optimized a wide range of products (e.g., Search, Ads, YouTube). For research, it has not only reduced language model latency for users, designed computer architectures, accelerated hardware, assisted protein discovery, and enhanced robotics, but also provided a reliable backend interface for users to search for neural architectures and evolve reinforcement learning algorithms. To operate at the scale of optimizing thousands of users’ critical systems and tuning millions of machine learning…  ( 91 min )
  • Open

    [D] I'm at a crossroads: Bayesian methods VS Reinforcement Learning, which to choose?
    I know it looks like a very absurd comparison. The reason why I ask this is because I'm taking a masters of ML(that wasn't exactly cheap) and while the basic concepts are covered in the mandatory path, now I have to choose an elective, in this case between "Bayesian Methods" and "Reinforcement Learning". So what I'd like to ask you is which of both concepts/techniques are more used or relevant to work in the Data Science industry, meaning which one is going to help me more in that job or make a more positive impression in my resume. Thx in advance! submitted by /u/fuscarili [link] [comments]  ( 46 min )
    [P] Domestic Violence Dataset
    Hi, I am working on project and for that I need a Twitter Domestic Violence Dataset. Basically I need a dataset with domestic violence tweets against woman. I have searched Kaggle and other websites but found no luck. Plus, I tried using Snscrape, but I need some phrases ideas related to domestic violence so I can get some tweets using that. I tried "Domestic Violence" , "My husband tried to kill me" and looking for more. Help is appreciated. submitted by /u/Naive-Aioli4849 [link] [comments]  ( 43 min )
    [D] Workflow chair for AI conference
    Hi! Does anyone here have experience working as a workflow chair for major conferences? What are the duties and how much does it pay? (I heard that it's a paid role) submitted by /u/Expensive-Track [link] [comments]  ( 42 min )
    [p] I built an open source platform to deploy computationally intensive Python functions as serverless jobs, with no timeouts
    Hi friends! I ran into this problem enough times at my last few jobs that I built a tool to solve it. I spent many hours building Docker containers for my Python functions, as many of the data science modules required building C libraries (since they significantly speed up compute-intensive routines, such as math calculations). Deploying the containers to AWS Lambda or Fargate (if the processes required more CPU or memory or were >15 minutes) and wiring functions to talk to each other using queues, databases, and blob storage made iterating on the actual code, which wasn't even that complex most of the time, slow. I made cakework https://github.com/usecakework/cakework, a platform that lets you spin up your Python functions as serverless, production-scale backends with a single command. Using the client SDK, you submit requests, check status, and get results. You can also specify the amount of CPU (up to 16 cores) and memory (up to 128GB) for each individual request, which is helpful when your data size and complexity varies across different requests. A common pattern that I built cakework for is doing file processing for ML: - ingest data from some source daily, or in response to an external event (data written to blob storage) - run my function (often using pandas/numpy/scipy) - write results to storage, update database - track failures and re-run/fix It's open source <3. Here are some fun examples to get you started: https://docs.cakework.com/examples Would love to hear your thoughts! submitted by /u/seattleite849 [link] [comments]  ( 47 min )
    [P] Time series outlier / anomaly detection
    I have traffic speed time series data for each day of the week over several months, with data samples about every 30 seconds. I'd like to find periods of time (subsequences) where the speed is much slower than usual. Any recommendations for algorithms that would be well suited to this problem? Thanks submitted by /u/dudester_el [link] [comments]  ( 44 min )
    [D] Querying with multiple vectors during embedding nearest neighbor search?
    Are there tools or techniques that permit you to joint query using more than one query vector? Use case: iterative ANN search refinement, where I start with a seed vector, select matches, and re-query with more examples to improve the search results. I tried doing this with FAISS, but it performs a "batch query" that returns a separate set of results for each query vector (not a joint query). submitted by /u/mostlyhydrogen [link] [comments]  ( 42 min )
    [D] Do high leverage points affect Neural Net and Tree-based model?
    I know they can affect linear regression badly but given the fact that neural net and tree-based models can approximate non-linear complex functions, I don't think the high leverage points would be a problem. Just curious about your opinion whether my thinking makes sense submitted by /u/Temporary_Cap_2855 [link] [comments]  ( 43 min )
    [D] ImageNet normalization vs [-1, 1] normalization
    For ImageNet classification, there are two common ways of normalizing the input images: - Normalize to [-1, 1] using an affine transformation (2*(x/255) - 1). - Normalize using ImageNet mean = (0.485, 0.456, 0.406) and std = (0.229, 0.224, 0.225). I observe that the first one is more common in TensorFlow codebases (including Jax models with TensorFlow data processing, e.g. the official Vision Transformers code), whereas the second is ubiquitous in PyTorch codebases. I tried to find empirical comparisons of the two, but there doesn't seem to be any. Which one is better in your opinion? I guess the performance shouldn't be too different, but still it's interesting to hear your experience. submitted by /u/netw0rkf10w [link] [comments]  ( 44 min )
    [D] PC takes a long time to execute code, possibility to use a cloud/external device?
    Hello people, I am currently attending a Data Science course and to finish I have to write a paper about a project that I am currently working on. I write the code in VSCode and I use .ipynb notebooks. So I am basically training a few ML models after a long data preprocessing which worked out fine. But as soon as I run my hyperparameter tuning code, my PC takes a lot of time. Right now I am running hyperparameter tuning for RandomForest and it already runs for 21 hours. Is there any possibility for me to run my code somewhere else? I read abour Heroku, but that seems to be too much than what I am looking for. I am getting a bit nervous, because I want to get this paper done. The worst case is that I have to buy a new PC. Thank you so much! submitted by /u/Emergency-Dig-5262 [link] [comments]  ( 45 min )
    [N] Microsoft integrates GPT 3.5 into Teams
    Official blog post: https://www.microsoft.com/en-us/microsoft-365/blog/2023/02/01/microsoft-teams-premium-cut-costs-and-add-ai-powered-productivity/ Given the amount of money they pumped into OpenAI, it's not surprising that you'd see it integrated into their products. I do wonder how this will work in highly regulated fields (finance, law, medicine, education). submitted by /u/bikeskata [link] [comments]  ( 47 min )
    [D] Why do LLMs like InstructGPT and LLM use RL to instead of supervised learning to learn from the user-ranked examples?
    Aligned LLMs such as InstructGPT and ChatGPT are trained via supervised fine-tuning after the initial self-supervised pretraining. Then, the researchers train a reward model on responses ranked by humans. When I understand correctly, they let the LLM generate responses that humans have to rank on a scale from 1-5. Then, they train a reward model (I suppose in supervised fashion?) on these ranked outputs. Once that's done, they use reinforcement learning (RL) with proximal policy optimization (PPO) to update the LLM. My question is why they use RL with PPO for this last step? Why don't they fine-tune the LLM using regular supervised learning, whereas the human-ranked outputs represent the labels. Since these are labels in the range 1-5, this could be a ranking or ordinal regression loss for supervised learning. submitted by /u/alpha-meta [link] [comments]  ( 47 min )
    [D] Inconsistent Featurespace in Data
    Hi colleagues! I am working on a model for which I have a dataset consisting of 2 data sources. Problem is that one datastream starts in 2017 and the other only in 2022. Feature spaces from those 2 data streams are different. I am wondering if there is a methodology to follow which allows me to use both data streams for training even though one starts way later than the other. Or am I forced to drop the newer one? (just 2022 data from two sources is too small for me to train on) Thank you! submitted by /u/pahalie [link] [comments]  ( 43 min )
    [D] Commercial Use of a Model that has been trained using Human3.6M
    I wanted to use the Learnable Trainangulation model in a commercial project. The source code itself is under MIT licensing. However, the dataset they have used is Human3.6M, which states that the license is "FREE OF CHARGE FOR ACADEMIC USE ONLY". Yet, recent court rulings (in the US) state that models can use copyrighted data during training, and the results are no longer bound by that copyright (e.g. Google Books). Does the same apply here? submitted by /u/mfarahmand98 [link] [comments]  ( 42 min )
    [D] Global Optimum of K-Means Cost Function
    I've recently started reading up on classical ML and I got a question about K-Means. More concretely, I am confused about the uniqueness of the global optimal solution of K-Means's cost function. Let's state the problem formally below, extracted from Bishop's Pattern Recognition and Machine Learning book, exercise 9.1. Consider the 𝐾-means algorithm discussed in Section 9.1. Show that as a consequence of there being a finite number of possible assignments for the set of discrete indicator variables 𝑟𝑛𝑘, and that for each such assignment there is a unique optimum for the 𝝁𝑘, the K-means algorithm must converge after a finite number of iterations. I made an answer [here](https://stats.stackexchange.com/questions/603327/question-on-the-proof-of-convergence-of-k-means) detailing the proof of why it does converge in Lloyd's algorithm, but I think I still do not understand why Lloyd's do not converge to a global minimum, which mathematical theorem/understanding am I missing here? I think that optimizing both the assignments and the centroids of K-Means at the same time is non-convex and hence there are many local minimums, we can use brute force to search for the global minimum but of course it is exponential to the number of data points. On the other hand, Lloyd optimizes it (greedily) alternatively, and hence you will find the cost functions' local minima (guaranteed)? submitted by /u/healthymonkey100 [link] [comments]  ( 45 min )
    [P] [R] A simplistic UI to edit images with Stable Diffusion and InstructPix2Pix
    https://preview.redd.it/ut4us5251rfa1.png?width=2000&format=png&auto=webp&s=dbf79c3832b20287203faa97e5c1303472bdbc22 Currently, the UI supports a picture upload and uses InstructPix2Pix to edit it. Also, it uses upscaling models for quality enhancements. More models are coming soon. The goal is to provide a way for non-ML people to use diffusion-based image editing through simplistic app design. Web demo: https://diffground.com/ submitted by /u/radi-cho [link] [comments]  ( 43 min )
    [D]How Will Open Source Alternatives Compete With GPT3?
    To clarify, I'm not talking about ChatGPT here. I've been testing outputs from GPT-3 davinci003 against alternatives in terms of output quality, relevance, and ability to understand "instruct" (versus vanilla autocompletion). I tried these: AI21 Jurassic 178B NeoX 20B GPT J 6B FairSeq 13B As well as: GPT-3 davinci002 GPT-3 davinci001 Of course, I didn't expect the smaller models to be on par with GPT-3, but I was surprised at how much better GPT3 davinci 003 performed compared to AI21's 178B model. AI21's Jurassic 178B seems to be comparable to GPT3 davinci 001. Does this mean that only well-funded corporations will be able to train general-purpose LLMs? It seems to me that just having a large model doesn't do much, it's also about several iterations of training and feedback. How are open source alternatives going to be able to compete? (I'm not in the ML or CS field, just an amateur who enjoys using these models) submitted by /u/noellarkin [link] [comments]  ( 45 min )
    [R] Sentence autoencoder
    Any suggestion on sentence auto-encoder? I want to learn the vector representation of a sentence and reconstruct the sentence itself. I used plane LSTM with self attention in the encoder and decoder architecture (no cross attention) but the results are not that good enough. I can not use cross attention i.e decoder will not have access to all the outputs of encoder but only access to the bottleneck latent vector. BART have pre-trained in similar manner but I don't know if we can pre-train that model to fit my use case. This is just a module of other work, after pre-training the sentence-sentence auto-encoder, I need to add some more module in between them, so I should have encoder and decoder separable, which I think can not be done in BART as well. Any direction would be much appreciated. Thank you submitted by /u/Bishwa12 [link] [comments]  ( 43 min )
    [D] Apple's ane-transformers - experiences?
    I'm using Huggingface's transformers regularly for experimentations, but I plan to deploy some of the models to iOS. I have found ml-ane-transformers repo from Apple, which shows how transformers can be rewritten to have much better performance on Apple's devices. There's an example of DistilBERT implemented in that optimized way. As I plan to deploy transformers to iOS, I started thinking about this. I'm hoping some already have experience about this, so we can discuss: Has anyone tried this themselves? Do they actually see the improvements in performance on iOS? I'm using Huggingface's transformer models in my experiments. How much work do you think there is to rewrite model in this optimized way? It's very difficult to train transformers from scratch (especially if they're big :) ), so I'm fine-tuning on top of pre-trained models on Huggingface. Is it possible to use weights from pretrained Huggingface models with the Apple's reference code? How difficult is it? submitted by /u/alkibijad [link] [comments]  ( 43 min )
  • Open

    Best Reinforcement Learning Papers from the past 1-2 years
    I am searching for the best Reinforcement Learning Papers from the past 1-2 years. My special focus is on Communications, but as we all know a lot of approaches can be applied in different fields. I would appreciate any recommendations :) submitted by /u/GolemX14 [link] [comments]  ( 41 min )
    Multi-Agent Stable Baselines
    I want to extend an implementation that currently uses stable baselines 3 from a single-agent into a multi-agent system. As far as I can tell, stable baselines isn't really suited for this. Does anyone have experience with multi-agent systems in stable baselines or with switching from stable baselines to RLlib? submitted by /u/tessherelurkingnow [link] [comments]  ( 41 min )
    "Distillation Policy Optimization", Ma et al 2023
    https://arxiv.org/abs/2302.00533 submitted by /u/OutOfCharm [link] [comments]  ( 41 min )
    Where do the new weights in PPO come from?
    I am looking at the pseudocode for PPO given in SpinningUp over here - ​ https://preview.redd.it/sx3kx08k2pfa1.png?width=1013&format=png&auto=webp&s=85511fe4cccda77975bd4731558f3936c9f5f522 ​ I understand that $\theta_k$ are the old weights. I am a little lost regarding the weight $\theta$. Is this the new weight? ​ Here is what I guess is happening (I could be completely wrong) - PPO is based on a generic Actor Critic Algorithm Therefore the new weight $\theta$ is learned as per conventional A-C algorithms The old weights $\theta_k$ and new weights $\theta$ are stored and then used in the clip objective The clip objective uses these 2 weights to come up with $\theta_{k+1}$ submitted by /u/Academic-Rent7800 [link] [comments]  ( 43 min )
    Noob question: why is this trivial problem not accordingly trivial to train? (PPO)
    I'm new to ML/RL and trying to build some intuition on how the fancy algorithms are made useful in practice. So far I was surprised how pairing SB3 + a gym environment frequently does nothing unless the right parameters are hit. So I've written a trivial mock environment: it "thinks" of a random target floating point number between -1 and 1 and a random state number in the same range. The observation (-2 to 2) is the difference between both. The (continous -1 to 1) action is added to the state The reward is the absolute value of the distance between current state and the target times -1, with a little negative bit added to "make the life painful" and enforce finishing Done is signaled once converged within 0.001, which is achievable within 1 to max. 2 steps depending on the reset state. So my thinking is: this is as easy as it gets - it does not matter how the algorithm explores or what it sees, the input to output relation is always exactly same. The actor simply needs to learn to forward the input to the output, which is seen learnable from every possible action. So it should train instantly with almost random choices of most hyperparameters - reward discount would not really matter, neither would most of the others, right? (I use PPO from SB3.) Yet I almost could not get it to work: it would behave wildly different with different n_env*n_steps sizes, learning and randomly unlearning again even with small learning_rates set. The cases I could find that work manually would take total_timesteps in a range of millions to finish consistently, and even then they would finish mostly in 2-4 steps rather then 1-2 in evaluation. Is such a problem really such sensitive or am I doing it wrong? Is there a way to actually train it within 10000s rather then millions of steps with PPO and if so, how to find from the train metrics? How could it be improved to converge more precisely? For anyone wanting to play around, code is attached below. submitted by /u/EvilButFluffy [link] [comments]  ( 45 min )
  • Open

    Three Cheers: GFN Thursday Celebrates Third Anniversary With 25 New Games
    Cheers to another year of cloud gaming! GeForce NOW celebrates its third anniversary with a look at how far cloud gaming has come, a community celebration and 25 new games supported in February. Members can celebrate all month long, starting with a sweet Dying Light 2 reward and support for nine more games this week, Read article >  ( 7 min )
    NVIDIA A100 Aces Throughput, Latency Results in Key Inference Benchmark for Financial Services Industry
    NVIDIA A100 Tensor Core GPUs running on Supermicro servers have captured leading results for inference in the latest STAC-ML Markets benchmark, a key technology performance gauge for the financial services industry. The results show NVIDIA demonstrating unrivaled throughput — serving up thousands of inferences per second on the most demanding models — and top latency Read article >  ( 6 min )
    Survey Reveals Financial Industry’s Top 4 AI Priorities for 2023
    For several years, NVIDIA has been working with some of the world’s leading financial institutions to develop and execute a wide range of rapidly evolving AI strategies. For the past three years, we’ve asked them to tell us collectively what’s on the top of their minds. Sometimes the results are just what we thought they’d Read article >  ( 6 min )
  • Open

    Efficient Global Planning in Large MDPs via Stochastic Primal-Dual Optimization. (arXiv:2210.12057v2 [cs.LG] UPDATED)
    We propose a new stochastic primal-dual optimization algorithm for planning in a large discounted Markov decision process with a generative model and linear function approximation. Assuming that the feature map approximately satisfies standard realizability and Bellman-closedness conditions and also that the feature vectors of all state-action pairs are representable as convex combinations of a small core set of state-action pairs, we show that our method outputs a near-optimal policy after a polynomial number of queries to the generative model. Our method is computationally efficient and comes with the major advantage that it outputs a single softmax policy that is compactly represented by a low-dimensional parameter vector, and does not need to execute computationally expensive local planning subroutines in runtime.
    Improving Score-based Diffusion Models by Enforcing the Underlying Score Fokker-Planck Equation. (arXiv:2210.04296v3 [cs.LG] UPDATED)
    Score-based generative models learn a family of noise-conditional score functions corresponding to the data density perturbed with increasingly large amounts of noise. These perturbed data densities are tied together by the Fokker-Planck equation (FPE), a partial differential equation (PDE) governing the spatial-temporal evolution of a density undergoing a diffusion process. In this work, we derive a corresponding equation, called the score FPE that characterizes the noise-conditional scores of the perturbed data densities (i.e., their gradients). Surprisingly, despite impressive empirical performance, we observe that scores learned via denoising score matching (DSM) do not satisfy the underlying score FPE. We prove that satisfying the FPE is desirable as it improves the likelihood and the degree of conservativity. Hence, we propose to regularize the DSM objective to enforce satisfaction of the score FPE, and we show the effectiveness of this approach across various datasets.
    Prioritizing Samples in Reinforcement Learning with Reducible Loss. (arXiv:2208.10483v2 [cs.LG] UPDATED)
    Most reinforcement learning algorithms take advantage of an experience replay buffer to repeatedly train on samples the agent has observed in the past. Not all samples carry the same amount of significance and simply assigning equal importance to each of the samples is a na\"ive strategy. In this paper, we propose a method to prioritize samples based on how much we can learn from a sample. We define the learn-ability of a sample as the steady decrease of the training loss associated with this sample over time. We develop an algorithm to prioritize samples with high learn-ability, while assigning lower priority to those that are hard-to-learn, typically caused by noise or stochasticity. We empirically show that our method is more robust than random sampling and also better than just prioritizing with respect to the training loss, i.e. the temporal difference loss, which is used in prioritized experience replay.
    ERA-Solver: Error-Robust Adams Solver for Fast Sampling of Diffusion Probabilistic Models. (arXiv:2301.12935v2 [cs.LG] UPDATED)
    Though denoising diffusion probabilistic models (DDPMs) have achieved remarkable generation results, the low sampling efficiency of DDPMs still limits further applications. Since DDPMs can be formulated as diffusion ordinary differential equations (ODEs), various fast sampling methods can be derived from solving diffusion ODEs. However, we notice that previous sampling methods with fixed analytical form are not robust with the error in the noise estimated from pretrained diffusion models. In this work, we construct an error-robust Adams solver (ERA-Solver), which utilizes the implicit Adams numerical method that consists of a predictor and a corrector. Different from the traditional predictor based on explicit Adams methods, we leverage a Lagrange interpolation function as the predictor, which is further enhanced with an error-robust strategy to adaptively select the Lagrange bases with lower error in the estimated noise. Experiments on Cifar10, LSUN-Church, and LSUN-Bedroom datasets demonstrate that our proposed ERA-Solver achieves 5.14, 9.42, and 9.69 Fenchel Inception Distance (FID) for image generation, with only 10 network evaluations.
    Lattice-Free Sequence Discriminative Training for Phoneme-Based Neural Transducers. (arXiv:2212.04325v2 [eess.AS] UPDATED)
    Recently, RNN-Transducers have achieved remarkable results on various automatic speech recognition tasks. However, lattice-free sequence discriminative training methods, which obtain superior performance in hybrid modes, are rarely investigated in RNN-Transducers. In this work, we propose three lattice-free training objectives, namely lattice-free maximum mutual information, lattice-free segment-level minimum Bayes risk, and lattice-free minimum Bayes risk, which are used for the final posterior output of the phoneme-based neural transducer with a limited context dependency. Compared to criteria using N-best lists, lattice-free methods eliminate the decoding step for hypotheses generation during training, which leads to more efficient training. Experimental results show that lattice-free methods gain up to 6.5% relative improvement in word error rate compared to a sequence-level cross-entropy trained model. Compared to the N-best-list based minimum Bayes risk objectives, lattice-free methods gain 40% - 70% relative training time speedup with a small degradation in performance.
    Accelerating Material Design with the Generative Toolkit for Scientific Discovery. (arXiv:2207.03928v4 [cs.LG] UPDATED)
    With the growing availability of data within various scientific domains, generative models hold enormous potential to accelerate scientific discovery. They harness powerful representations learned from datasets to speed up the formulation of novel hypotheses with the potential to impact material discovery broadly. We present the Generative Toolkit for Scientific Discovery (GT4SD). This extensible open-source library enables scientists, developers, and researchers to train and use state-of-the-art generative models to accelerate scientific discovery focused on material design.
    Learning to reject meets OOD detection: Are all abstentions created equal?. (arXiv:2301.12386v2 [cs.LG] UPDATED)
    Learning to reject (L2R) and out-of-distribution (OOD) detection are two classical problems, each of which involve detecting certain abnormal samples: in L2R, the goal is to detect "hard" samples on which to abstain, while in OOD detection, the goal is to detect "outlier" samples not drawn from the training distribution. Intriguingly, despite being developed in parallel literatures, both problems share a simple baseline: the maximum softmax probability (MSP) score. However, there is limited understanding of precisely how these problems relate. In this paper, we formally relate these problems, and show how they may be jointly solved. We first show that while MSP is theoretically optimal for L2R, it can be theoretically sub-optimal for OOD detection in some important practical settings. We then characterize the Bayes-optimal classifier for a unified formulation that generalizes both L2R and OOD detection. Based on this, we design a plug-in approach for learning to abstain on both inlier and OOD samples, while constraining the total abstention budget. Experiments on benchmark OOD datasets demonstrate that our approach yields competitive classification and OOD detection performance compared to baselines from both literatures.
    Continuous Soft Pseudo-Labeling in ASR. (arXiv:2211.06007v2 [cs.LG] UPDATED)
    Continuous pseudo-labeling (PL) algorithms such as slimIPL have recently emerged as a powerful strategy for semi-supervised learning in speech recognition. In contrast with earlier strategies that alternated between training a model and generating pseudo-labels (PLs) with it, here PLs are generated in end-to-end manner as training proceeds, improving training speed and the accuracy of the final model. PL shares a common theme with teacher-student models such as distillation in that a teacher model generates targets that need to be mimicked by the student model being trained. However, interestingly, PL strategies in general use hard-labels, whereas distillation uses the distribution over labels as the target to mimic. Inspired by distillation we expect that specifying the whole distribution (aka soft-labels) over sequences as the target for unlabeled data, instead of a single best pass pseudo-labeled transcript (hard-labels) should improve PL performance and convergence. Surprisingly and unexpectedly, we find that soft-labels targets can lead to training divergence, with the model collapsing to a degenerate token distribution per frame. We hypothesize that the reason this does not happen with hard-labels is that training loss on hard-labels imposes sequence-level consistency that keeps the model from collapsing to the degenerate solution. In this paper, we show several experiments that support this hypothesis, and experiment with several regularization approaches that can ameliorate the degenerate collapse when using soft-labels. These approaches can bring the accuracy of soft-labels closer to that of hard-labels, and while they are unable to outperform them yet, they serve as a useful framework for further improvements.
    Spectral Maps for Learning on Subgraphs. (arXiv:2205.14938v4 [cs.LG] UPDATED)
    In graph learning, maps between graphs and their subgraphs frequently arise. For instance, when coarsening or rewiring operations are present along the pipeline, one needs to keep track of the corresponding nodes between the original and modified graphs. Classically, these maps are represented as binary node-to-node correspondence matrices and used as-is to transfer node-wise features between the graphs. In this paper, we argue that simply changing this map representation can bring notable benefits to graph learning tasks. Drawing inspiration from recent progress in geometry processing, we introduce a spectral representation for maps that is easy to integrate into existing graph learning models. This spectral representation is a compact and straightforward plug-in replacement and is robust to topological changes of the graphs. Remarkably, the representation exhibits structural properties that make it interpretable, drawing an analogy with recent results on smooth manifolds. We demonstrate the benefits of incorporating spectral maps in graph learning pipelines, addressing scenarios where a node-to-node map is not well defined, or in the absence of exact isomorphism. Our approach bears practical benefits in knowledge distillation and hierarchical learning, where we show comparable or improved performance at a fraction of the computational cost.
    STEEL: Singularity-aware Reinforcement Learning. (arXiv:2301.13152v2 [stat.ML] UPDATED)
    Batch reinforcement learning (RL) aims at finding an optimal policy in a dynamic environment in order to maximize the expected total rewards by leveraging pre-collected data. A fundamental challenge behind this task is the distributional mismatch between the batch data generating process and the distribution induced by target policies. Nearly all existing algorithms rely on the absolutely continuous assumption on the distribution induced by target policies with respect to the data distribution so that the batch data can be used to calibrate target policies via the change of measure. However, the absolute continuity assumption could be violated in practice, especially when the state-action space is large or continuous. In this paper, we propose a new batch RL algorithm without requiring absolute continuity in the setting of an infinite-horizon Markov decision process with continuous states and actions. We call our algorithm STEEL: SingulariTy-awarE rEinforcement Learning. Our algorithm is motivated by a new error analysis on off-policy evaluation, where we use maximum mean discrepancy, together with distributionally robust optimization, to characterize the error of off-policy evaluation caused by the possible singularity and to enable the power of model extrapolation. By leveraging the idea of pessimism and under some mild conditions, we derive a finite-sample regret guarantee for our proposed algorithm without imposing absolute continuity. Compared with existing algorithms, STEEL only requires some minimal data-coverage assumption and thus greatly enhances the applicability and robustness of batch RL. Extensive simulation studies and one real experiment on personalized pricing demonstrate the superior performance of our method when facing possible singularity in batch RL.
    Active Sequential Two-Sample Testing. (arXiv:2301.12616v2 [cs.LG] UPDATED)
    Two-sample testing tests whether the distributions generating two samples are identical. We pose the two-sample testing problem in a new scenario where the sample measurements (or sample features) are inexpensive to access, but their group memberships (or labels) are costly. We devise the first \emph{active sequential two-sample testing framework} that not only sequentially but also \emph{actively queries} sample labels to address the problem. Our test statistic is a likelihood ratio where one likelihood is found by maximization over all class priors, and the other is given by a classification model. The classification model is adaptively updated and then used to guide an active query scheme called bimodal query to label sample features in the regions with high dependency between the feature variables and the label variables. The theoretical contributions in the paper include proof that our framework produces an \emph{anytime-valid} $p$-value; and, under reachable conditions and a mild assumption, the framework asymptotically generates a minimum normalized log-likelihood ratio statistic that a passive query scheme can only achieve when the feature variable and the label variable have the highest dependence. Lastly, we provide a \emph{query-switching (QS)} algorithm to decide when to switch from passive query to active query and adapt bimodal query to increase the testing power of our test. Extensive experiments justify our theoretical contributions and the effectiveness of QS.
    What can be learnt with wide convolutional neural networks?. (arXiv:2208.01003v4 [stat.ML] UPDATED)
    Understanding how convolutional neural networks (CNNs) can efficiently learn high-dimensional functions remains a fundamental challenge. A popular belief is that these models harness the local and hierarchical structure of natural data such as images. Yet, we lack a quantitative understanding of how such structure affects performance, e.g. the rate of decay of the generalisation error with the number of training samples. In this paper, we study deep CNNs in the kernel regime. First, we show that the spectrum of the corresponding kernel inherits the hierarchical structure of the network, and we characterise its asymptotics. Then, we use this result together with generalisation bounds to prove that deep CNNs adapt to the spatial scale of the target function. In particular, we find that if the target function depends on low-dimensional subsets of adjacent input variables, then the rate of decay of the error is controlled by the effective dimensionality of these subsets. Conversely, if the target function depends on the full set of input variables, then the error rate is inversely proportional to the input dimension. We conclude by computing the rate when a deep CNN is trained on the output of another deep CNN with randomly-initialised parameters. Interestingly, we find that, despite their hierarchical structure, the functions generated by deep CNNs are too rich to be efficiently learnable in high dimension.
    Causal Estimation for Text Data with (Apparent) Overlap Violations. (arXiv:2210.00079v2 [stat.ML] UPDATED)
    Consider the problem of estimating the causal effect of some attribute of a text document; for example: what effect does writing a polite vs. rude email have on response time? To estimate a causal effect from observational data, we need to adjust for confounding aspects of the text that affect both the treatment and outcome -- e.g., the topic or writing level of the text. These confounding aspects are unknown a priori, so it seems natural to adjust for the entirety of the text (e.g., using a transformer). However, causal identification and estimation procedures rely on the assumption of overlap: for all levels of the adjustment variables, there is randomness leftover so that every unit could have (not) received treatment. Since the treatment here is itself an attribute of the text, it is perfectly determined, and overlap is apparently violated. The purpose of this paper is to show how to handle causal identification and obtain robust causal estimation in the presence of apparent overlap violations. In brief, the idea is to use supervised representation learning to produce a data representation that preserves confounding information while eliminating information that is only predictive of the treatment. This representation then suffices for adjustment and can satisfy overlap. Adapting results on non-parametric estimation, we find that this procedure is robust to conditional outcome misestimation, yielding a low-bias estimator with valid uncertainty quantification under weak conditions. Empirical results show strong improvements in bias and uncertainty quantification relative to the natural baseline.
    Fairness-aware Vision Transformer via Debiased Self-Attention. (arXiv:2301.13803v1 [cs.CV])
    Vision Transformer (ViT) has recently gained significant interest in solving computer vision (CV) problems due to its capability of extracting informative features and modeling long-range dependencies through the self-attention mechanism. To fully realize the advantages of ViT in real-world applications, recent works have explored the trustworthiness of ViT, including its robustness and explainability. However, another desiderata, fairness has not yet been adequately addressed in the literature. We establish that the existing fairness-aware algorithms (primarily designed for CNNs) do not perform well on ViT. This necessitates the need for developing our novel framework via Debiased Self-Attention (DSA). DSA is a fairness-through-blindness approach that enforces ViT to eliminate spurious features correlated with the sensitive attributes for bias mitigation. Notably, adversarial examples are leveraged to locate and mask the spurious features in the input image patches. In addition, DSA utilizes an attention weights alignment regularizer in the training objective to encourage learning informative features for target prediction. Importantly, our DSA framework leads to improved fairness guarantees over prior works on multiple prediction tasks without compromising target prediction performance
    Gaussian Noise is Nearly Instance Optimal for Private Unbiased Mean Estimation. (arXiv:2301.13850v1 [math.ST])
    We investigate unbiased high-dimensional mean estimators in differential privacy. We consider differentially private mechanisms whose expected output equals the mean of the input dataset, for every dataset drawn from a fixed convex domain $K$ in $\mathbb{R}^d$. In the setting of concentrated differential privacy, we show that, for every input such an unbiased mean estimator introduces approximately at least as much error as a mechanism that adds Gaussian noise with a carefully chosen covariance. This is true when the error is measured with respect to $\ell_p$ error for any $p \ge 2$. We extend this result to local differential privacy, and to approximate differential privacy, but for the latter the error lower bound holds either for a dataset or for a neighboring dataset. We also extend our results to mechanisms that take i.i.d.~samples from a distribution over $K$ and are unbiased with respect to the mean of the distribution.
    FedPass: Privacy-Preserving Vertical Federated Deep Learning with Adaptive Obfuscation. (arXiv:2301.12623v2 [cs.DC] UPDATED)
    Vertical federated learning (VFL) allows an active party with labeled feature to leverage auxiliary features from the passive parties to improve model performance. Concerns about the private feature and label leakage in both the training and inference phases of VFL have drawn wide research attention. In this paper, we propose a general privacy-preserving vertical federated deep learning framework called FedPass, which leverages adaptive obfuscation to protect the feature and label simultaneously. Strong privacy-preserving capabilities about private features and labels are theoretically proved (in Theorems 1 and 2). Extensive experimental result s with different datasets and network architectures also justify the superiority of FedPass against existing methods in light of its near-optimal trade-off between privacy and model performance.
    Personalized Decentralized Bilevel Optimization over Random Directed Networks. (arXiv:2210.02129v2 [stat.ML] UPDATED)
    Personalization and decentralization are two major lines of studies to realize practical federated learning in the real world. The aim of this study is to establish a general and unified approach that can solve these two problems simultaneously. In this work, we first propose a bilevel problem that can adapt to various personalization scenarios by allowing an arbitrary choice of two parameters: a client-wise outer-parameter representing heterogeneity, and a shared inner-parameter representing homogeneity across client data distributions. We then present an algorithm that can solve this bilevel problem in a decentralized manner by estimating gradients of clients' outer-costs with respect to their outer-parameters. We show that the proposed algorithm can be extended to handle a random directed network, which is one of the most robust decentralized communication classes. The proposed method achieves state-of-the-art performance on a personalization benchmark across various communication settings.
    On the Global Convergence of Fitted Q-Iteration with Two-layer Neural Network Parametrization. (arXiv:2211.07675v2 [cs.LG] UPDATED)
    Deep Q-learning based algorithms have been applied successfully in many decision making problems, while their theoretical foundations are not as well understood. In this paper, we study a Fitted Q-Iteration with two-layer ReLU neural network parameterization, and find the sample complexity guarantees for the algorithm. Our approach estimates the Q-function in each iteration using a convex optimization problem. We show that this approach achieves a sample complexity of $\tilde{\mathcal{O}}(1/\epsilon^{2})$, which is order-optimal. This result holds for a countable state-spaces and does not require any assumptions such as a linear or low rank structure on the MDP.
    ChatGPT or Human? Detect and Explain. Explaining Decisions of Machine Learning Model for Detecting Short ChatGPT-generated Text. (arXiv:2301.13852v1 [cs.CL])
    ChatGPT has the ability to generate grammatically flawless and seemingly-human replies to different types of questions from various domains. The number of its users and of its applications is growing at an unprecedented rate. Unfortunately, use and abuse come hand in hand. In this paper, we study whether a machine learning model can be effectively trained to accurately distinguish between original human and seemingly human (that is, ChatGPT-generated) text, especially when this text is short. Furthermore, we employ an explainable artificial intelligence framework to gain insight into the reasoning behind the model trained to differentiate between ChatGPT-generated and human-generated text. The goal is to analyze model's decisions and determine if any specific patterns or characteristics can be identified. Our study focuses on short online reviews, conducting two experiments comparing human-generated and ChatGPT-generated text. The first experiment involves ChatGPT text generated from custom queries, while the second experiment involves text generated by rephrasing original human-generated reviews. We fine-tune a Transformer-based model and use it to make predictions, which are then explained using SHAP. We compare our model with a perplexity score-based approach and find that disambiguation between human and ChatGPT-generated reviews is more challenging for the ML model when using rephrased text. However, our proposed approach still achieves an accuracy of 79%. Using explainability, we observe that ChatGPT's writing is polite, without specific details, using fancy and atypical vocabulary, impersonal, and typically it does not express feelings.
    Learning useful representations for shifting tasks and distributions. (arXiv:2212.07346v2 [cs.LG] UPDATED)
    Does the dominant approach to learn representations (as a side effect of optimizing an expected cost for a single training distribution) remain a good approach when we are dealing with multiple distributions? Our thesis is that such scenarios are better served by representations that are richer than those obtained with a single optimization episode. We support this thesis with simple theoretical arguments and with experiments utilizing an apparently na\"{\i}ve ensembling technique: concatenating the representations obtained from multiple training episodes using the same data, model, algorithm, and hyper-parameters, but different random seeds. These independently trained networks perform similarly. Yet, in a number of scenarios involving new distributions, the concatenated representation performs substantially better than an equivalently sized network trained with a single training run. This proves that the representations constructed by multiple training episodes are in fact different. Although their concatenation carries little additional information about the training task under the training distribution, it becomes substantially more informative when tasks or distributions change. Meanwhile, a single training episode is unlikely to yield such a redundant representation because the optimization process has no reason to accumulate features that do not incrementally improve the training performance.
    Adaptively Weighted Data Augmentation Consistency Regularization for Robust Optimization under Concept Shift. (arXiv:2210.01891v2 [cs.CV] UPDATED)
    Concept shift is a prevailing problem in natural tasks like medical image segmentation where samples usually come from different subpopulations with variant correlations between features and labels. One common type of concept shift in medical image segmentation is the "information imbalance" between label-sparse samples with few (if any) segmentation labels and label-dense samples with plentiful labeled pixels. Existing distributionally robust algorithms have focused on adaptively truncating/down-weighting the "less informative" (i.e., label-sparse in our context) samples. To exploit data features of label-sparse samples more efficiently, we propose an adaptively weighted online optimization algorithm -- AdaWAC -- to incorporate data augmentation consistency regularization in sample reweighting. Our method introduces a set of trainable weights to balance the supervised loss and unsupervised consistency regularization of each sample separately. At the saddle point of the underlying objective, the weights assign label-dense samples to the supervised loss and label-sparse samples to the unsupervised consistency regularization. We provide a convergence guarantee by recasting the optimization as online mirror descent on a saddle point problem. Our empirical results demonstrate that AdaWAC not only enhances the segmentation performance and sample efficiency but also improves the robustness to concept shift on various medical image segmentation tasks with different UNet-style backbones.
    Hierarchically Clustered PCA, LLE, and CCA via a Convex Clustering Penalty. (arXiv:2211.16553v2 [cs.LG] UPDATED)
    We introduce an unsupervised learning approach that combines the truncated singular value decomposition with convex clustering to estimate within-cluster directions of maximum variance/covariance (in the variables) while simultaneously hierarchically clustering (on observations). In contrast to previous work on joint clustering and embedding, our approach has a straightforward formulation, is readily scalable via distributed optimization, and admits a direct interpretation as hierarchically clustered principal component analysis (PCA), hierarchically clustered locally linear embedding (LLE), or hierarchically clustered canonical correlation analysis (CCA). Through numerical experiments and real-world examples relevant to precision medicine, we show that our approach outperforms traditional and contemporary clustering methods on both underdetermined problems ($p \gg N$ with tens of observations) and on large datasets (e.g., $N=100,000$) while yielding interpretable dendrograms of hierarchical per-cluster principal components or canonical variates.
    Weak Proxies are Sufficient and Preferable for Fairness with Missing Sensitive Attributes. (arXiv:2210.03175v2 [cs.LG] UPDATED)
    Evaluating fairness can be challenging in practice because the sensitive attributes of data are often inaccessible due to privacy constraints. The go-to approach that the industry frequently adopts is using off-the-shelf proxy models to predict the missing sensitive attributes, e.g. Meta [Alao et al., 2021] and Twitter [Belli et al., 2022]. Despite its popularity, there are three important questions unanswered: (1) Is directly using proxies efficacious in measuring fairness? (2) If not, is it possible to accurately evaluate fairness using proxies only? (3) Given the ethical controversy over inferring user private information, is it possible to only use weak (i.e. inaccurate) proxies in order to protect privacy? Our theoretical analyses show that directly using proxy models can give a false sense of (un)fairness. Second, we develop an algorithm that is able to measure fairness (provably) accurately with only three properly identified proxies. Third, we show that our algorithm allows the use of only weak proxies (e.g. with only 68.85%accuracy on COMPAS), adding an extra layer of protection on user privacy. Experiments validate our theoretical analyses and show our algorithm can effectively measure and mitigate bias. Our results imply a set of practical guidelines for practitioners on how to use proxies properly. Code is available at github.com/UCSC-REAL/fair-eval.
    Deep Reinforcement Learning for Cryptocurrency Trading: Practical Approach to Address Backtest Overfitting. (arXiv:2209.05559v6 [q-fin.ST] UPDATED)
    Designing profitable and reliable trading strategies is challenging in the highly volatile cryptocurrency market. Existing works applied deep reinforcement learning methods and optimistically reported increased profits in backtesting, which may suffer from the false positive issue due to overfitting. In this paper, we propose a practical approach to address backtest overfitting for cryptocurrency trading using deep reinforcement learning. First, we formulate the detection of backtest overfitting as a hypothesis test. Then, we train the DRL agents, estimate the probability of overfitting, and reject the overfitted agents, increasing the chance of good trading performance. Finally, on 10 cryptocurrencies over a testing period from 05/01/2022 to 06/27/2022 (during which the crypto market crashed two times), we show that the less overfitted deep reinforcement learning agents have a higher return than that of more overfitted agents, an equal weight strategy, and the S&P DBM Index (market benchmark), offering confidence in possible deployment to a real market.
    Unifying Generative Models with GFlowNets and Beyond. (arXiv:2209.02606v2 [cs.LG] UPDATED)
    There are many frameworks for deep generative modeling, each often presented with their own specific training algorithms and inference methods. Here, we demonstrate the connections between existing deep generative models and the recently introduced GFlowNet framework, a probabilistic inference machine which treats sampling as a decision-making process. This analysis sheds light on their overlapping traits and provides a unifying viewpoint through the lens of learning with Markovian trajectories. Our framework provides a means for unifying training and inference algorithms, and provides a route to shine a unifying light over many generative models. Beyond this, we provide a practical and experimentally verified recipe for improving generative modeling with insights from the GFlowNet perspective.
    Can Persistent Homology provide an efficient alternative for Evaluation of Knowledge Graph Completion Methods?. (arXiv:2301.12929v2 [cs.LG] UPDATED)
    In this paper we present a novel method, $\textit{Knowledge Persistence}$ ($\mathcal{KP}$), for faster evaluation of Knowledge Graph (KG) completion approaches. Current ranking-based evaluation is quadratic in the size of the KG, leading to long evaluation times and consequently a high carbon footprint. $\mathcal{KP}$ addresses this by representing the topology of the KG completion methods through the lens of topological data analysis, concretely using persistent homology. The characteristics of persistent homology allow $\mathcal{KP}$ to evaluate the quality of the KG completion looking only at a fraction of the data. Experimental results on standard datasets show that the proposed metric is highly correlated with ranking metrics (Hits@N, MR, MRR). Performance evaluation shows that $\mathcal{KP}$ is computationally efficient: In some cases, the evaluation time (validation+test) of a KG completion method has been reduced from 18 hours (using Hits@10) to 27 seconds (using $\mathcal{KP}$), and on average (across methods & data) reduces the evaluation time (validation+test) by $\approx$ $\textbf{99.96}\%$.
    Bias in Machine Learning Models Can Be Significantly Mitigated by Careful Training: Evidence from Neuroimaging Studies. (arXiv:2205.13421v2 [cs.LG] UPDATED)
    Despite the great promise that machine learning has offered in many fields of medicine, it has also raised concerns about potential biases and poor generalization across genders, age distributions, races and ethnicities, hospitals, and data acquisition equipment and protocols. In the current study, and in the context of three brain diseases, we provide evidence which suggests that when properly trained, machine learning models can generalize well across diverse conditions and do not necessarily suffer from bias. Specifically, by using multi-study magnetic resonance imaging consortia for diagnosing Alzheimer's disease, schizophrenia, and autism spectrum disorder, we find that well-trained models have a high area-under-the-curve (AUC) on subjects across different subgroups pertaining to attributes such as gender, age, racial groups, and different clinical studies and are unbiased under multiple fairness metrics such as demographic parity difference, equalized odds difference, equal opportunity difference etc. We find that models that incorporate multi-source data from demographic, clinical, genetic factors and cognitive scores are also unbiased. These models have better predictive AUC across subgroups than those trained only with imaging features but there are also situations when these additional features do not help.
    Revisiting Hyperparameter Tuning with Differential Privacy. (arXiv:2211.01852v2 [cs.LG] UPDATED)
    Hyperparameter tuning is a common practice in the application of machine learning but is a typically ignored aspect in the literature on privacy-preserving machine learning due to its negative effect on the overall privacy parameter. In this paper, we aim to tackle this fundamental yet challenging problem by providing an effective hyperparameter tuning framework with differential privacy. The proposed method allows us to adopt a broader hyperparameter search space and even to perform a grid search over the whole space, since its privacy loss parameter is independent of the number of hyperparameter candidates. Interestingly, it instead correlates with the utility gained from hyperparameter searching, revealing an explicit and mandatory trade-off between privacy and utility. Theoretically, we show that its additional privacy loss bound incurred by hyperparameter tuning is upper-bounded by the squared root of the gained utility. However, we note that the additional privacy loss bound would empirically scale like a squared root of the logarithm of the utility term, benefiting from the design of doubling step.
    Optimal Solutions for Joint Beamforming and Antenna Selection: From Branch and Bound to Graph Neural Imitation Learning. (arXiv:2206.05576v2 [eess.SP] UPDATED)
    This work revisits the joint beamforming (BF) and antenna selection (AS) problem, as well as its robust beamforming (RBF) version under imperfect channel state information (CSI). Such problems arise due to various reasons, e.g., the costly nature of the radio frequency (RF) chains and energy/resource-saving considerations. The joint (R)BF\&AS problem is a mixed integer and nonlinear program, and thus finding {\it optimal solutions} is often costly, if not outright impossible. The vast majority of the prior works tackled these problems using techniques such as continuous approximations, greedy methods, and supervised machine learning -- yet these approaches do not ensure optimality or even feasibility of the solutions. The main contribution of this work is threefold. First, an effective {\it branch and bound} (B\&B) framework for solving the problems of interest is proposed. Leveraging existing BF and RBF solvers, it is shown that the B\&B framework guarantees global optimality of the considered problems. Second, to expedite the potentially costly B\&B algorithm, a machine learning (ML)-based scheme is proposed to help skip intermediate states of the B\&B search tree. The learning model features a {\it graph neural network} (GNN)-based design that is resilient to a commonly encountered challenge in wireless communications, namely, the change of problem size (e.g., the number of users) across the training and test stages. Third, comprehensive performance characterizations are presented, showing that the GNN-based method retains the global optimality of B\&B with provably reduced complexity, under reasonable conditions. Numerical simulations also show that the ML-based acceleration can often achieve an order-of-magnitude speedup relative to B\&B.
    Limitations of Information-Theoretic Generalization Bounds for Gradient Descent Methods in Stochastic Convex Optimization. (arXiv:2212.13556v2 [cs.LG] UPDATED)
    To date, no "information-theoretic" frameworks for reasoning about generalization error have been shown to establish minimax rates for gradient descent in the setting of stochastic convex optimization. In this work, we consider the prospect of establishing such rates via several existing information-theoretic frameworks: input-output mutual information bounds, conditional mutual information bounds and variants, PAC-Bayes bounds, and recent conditional variants thereof. We prove that none of these bounds are able to establish minimax rates. We then consider a common tactic employed in studying gradient methods, whereby the final iterate is corrupted by Gaussian noise, producing a noisy "surrogate" algorithm. We prove that minimax rates cannot be established via the analysis of such surrogates. Our results suggest that new ideas are required to analyze gradient descent using information-theoretic techniques.
    Learning from many trajectories. (arXiv:2203.17193v2 [cs.LG] UPDATED)
    We initiate a study of supervised learning from many independent sequences ("trajectories") of non-independent covariates, reflecting tasks in sequence modeling, control, and reinforcement learning. Conceptually, our multi-trajectory setup sits between two traditional settings in statistical learning theory: learning from independent examples and learning from a single auto-correlated sequence. Our conditions for efficient learning generalize the former setting--trajectories must be non-degenerate in ways that extend standard requirements for independent examples. Notably, we do not require that trajectories be ergodic, long, nor strictly stable. For linear least-squares regression, given $n$-dimensional examples produced by $m$ trajectories, each of length $T$, we observe a notable change in statistical efficiency as the number of trajectories increases from a few (namely $m \lesssim n$) to many (namely $m \gtrsim n$). Specifically, we establish that the worst-case error rate of this problem is $\Theta(n / m T)$ whenever $m \gtrsim n$. Meanwhile, when $m \lesssim n$, we establish a (sharp) lower bound of $\Omega(n^2 / m^2 T)$ on the worst-case error rate, realized by a simple, marginally unstable linear dynamical system. A key upshot is that, in domains where trajectories regularly reset, the error rate eventually behaves as if all of the examples were independent, drawn from their marginals. As a corollary of our analysis, we also improve guarantees for the linear system identification problem.
    Strategyproof Decision-Making in Panel Data Settings and Beyond. (arXiv:2211.14236v2 [econ.EM] UPDATED)
    We propose a framework for decision-making in the presence of strategic agents with panel data, a standard setting in econometrics and statistics where one gets noisy, repeated measurements of multiple units. We consider a setup where there is a pre-intervention period, when the principal observes the outcomes of each unit, after which the principal uses these observations to assign a treatment to each unit. Our model can be thought of as a generalization of the synthetic controls and synthetic interventions frameworks, where units (or agents) may strategically manipulate pre-intervention outcomes to receive a more desirable intervention. We identify necessary and sufficient conditions under which a strategyproof mechanism that assigns interventions in the post-intervention period exists. Under a latent factor model assumption, we show that whenever a strategyproof mechanism exists, there is one with a simple closed form. In the setting where there is a single treatment and control (i.e., no other interventions), we establish that there is always a strategyproof mechanism, and provide an algorithm for learning such a mechanism. For the setting of multiple interventions, we provide an algorithm for learning a strategyproof mechanism, if there exists a sufficiently large gap in rewards between the different interventions. Finally, we empirically evaluate our model using real-world panel data collected from product sales over 18 months. We find that our methods compare favorably to baselines which do not take strategic interactions into consideration -- even in the presence of model misspecification. Along the way, we prove impossibility results for multi-class strategic classification, which may be of independent interest.
    Robust Reinforcement Learning in Continuous Control Tasks with Uncertainty Set Regularization. (arXiv:2207.02016v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) is recognized as lacking generalization and robustness under environmental perturbations, which excessively restricts its application for real-world robotics. Prior work claimed that adding regularization to the value function is equivalent to learning a robust policy with uncertain transitions. Although the regularization-robustness transformation is appealing for its simplicity and efficiency, it is still lacking in continuous control tasks. In this paper, we propose a new regularizer named $\textbf{U}$ncertainty $\textbf{S}$et $\textbf{R}$egularizer (USR), by formulating the uncertainty set on the parameter space of the transition function. In particular, USR is flexible enough to be plugged into any existing RL framework. To deal with unknown uncertainty sets, we further propose a novel adversarial approach to generate them based on the value function. We evaluate USR on the Real-world Reinforcement Learning (RWRL) benchmark, demonstrating improvements in the robust performance for perturbed testing environments.
    Can we achieve robustness from data alone?. (arXiv:2207.11727v2 [cs.LG] UPDATED)
    We introduce a meta-learning algorithm for adversarially robust classification. The proposed method tries to be as model agnostic as possible and optimizes a dataset prior to its deployment in a machine learning system, aiming to effectively erase its non-robust features. Once the dataset has been created, in principle no specialized algorithm (besides standard gradient descent) is needed to train a robust model. We formulate the data optimization procedure as a bi-level optimization problem on kernel regression, with a class of kernels that describe infinitely wide neural nets (Neural Tangent Kernels). We present extensive experiments on standard computer vision benchmarks using a variety of different models, demonstrating the effectiveness of our method, while also pointing out its current shortcomings. In parallel, we revisit prior work that also focused on the problem of data optimization for robust classification \citep{Ily+19}, and show that being robust to adversarial attacks after standard (gradient descent) training on a suitable dataset is more challenging than previously thought.
    Meta-Learning via Classifier(-free) Diffusion Guidance. (arXiv:2210.08942v2 [cs.LG] UPDATED)
    We introduce meta-learning algorithms that perform zero-shot weight-space adaptation of neural network models to unseen tasks. Our methods repurpose the popular generative image synthesis techniques of natural language guidance and diffusion models to generate neural network weights adapted for tasks. We first train an unconditional generative hypernetwork model to produce neural network weights; then we train a second "guidance" model that, given a natural language task description, traverses the hypernetwork latent space to find high-performance task-adapted weights in a zero-shot manner. We explore two alternative approaches for latent space guidance: "HyperCLIP"-based classifier guidance and a conditional Hypernetwork Latent Diffusion Model ("HyperLDM"), which we show to benefit from the classifier-free guidance technique common in image generation. Finally, we demonstrate that our approaches outperform existing multi-task and meta-learning methods in a series of zero-shot learning experiments on our Meta-VQA dataset.
    Fine-tuning or top-tuning? Transfer learning with pretrained features and fast kernel methods. (arXiv:2209.07932v2 [cs.LG] UPDATED)
    The impressive performances of deep learning architectures is associated to massive increase of models complexity. Millions of parameters need be tuned, with training and inference time scaling accordingly. But is massive fine-tuning necessary? In this paper, focusing on image classification, we consider a simple transfer learning approach exploiting pretrained convolutional features as input for a fast kernel method. We refer to this approach as top-tuning, since only the kernel classifier is trained. By performing more than 2500 training processes we show that this top-tuning approach provides comparable accuracy w.r.t. fine-tuning, with a training time that is between one and two orders of magnitude smaller. These results suggest that top-tuning provides a useful alternative to fine-tuning in small/medium datasets, especially when training efficiency is crucial.
    Py-Feat: Python Facial Expression Analysis Toolbox. (arXiv:2104.03509v3 [cs.CV] UPDATED)
    Studying facial expressions is a notoriously difficult endeavor. Recent advances in the field of affective computing have yielded impressive progress in automatically detecting facial expressions from pictures and videos. However, much of this work has yet to be widely disseminated in social science domains such as psychology. Current state of the art models require considerable domain expertise that is not traditionally incorporated into social science training programs. Furthermore, there is a notable absence of user-friendly and open-source software that provides a comprehensive set of tools and functions that support facial expression research. In this paper, we introduce Py-Feat, an open-source Python toolbox that provides support for detecting, preprocessing, analyzing, and visualizing facial expression data. Py-Feat makes it easy for domain experts to disseminate and benchmark computer vision models and also for end users to quickly process, analyze, and visualize face expression data. We hope this platform will facilitate increased use of facial expression data in human behavior research.
    A Dynamic Programming Algorithm for Finding an Optimal Sequence of Informative Measurements. (arXiv:2109.11808v4 [cs.LG] UPDATED)
    An informative measurement is the most efficient way to gain information about an unknown state. We present a first-principles derivation of a general-purpose dynamic programming algorithm that returns an optimal sequence of informative measurements by sequentially maximizing the entropy of possible measurement outcomes. This algorithm can be used by an autonomous agent or robot to decide where best to measure next, planning a path corresponding to an optimal sequence of informative measurements. The algorithm is applicable to states and controls that are either continuous or discrete, and agent dynamics that is either stochastic or deterministic; including Markov decision processes and Gaussian processes. Recent results from the fields of approximate dynamic programming and reinforcement learning, including on-line approximations such as rollout and Monte Carlo tree search, allow the measurement task to be solved in real time. The resulting solutions include non-myopic paths and measurement sequences that can generally outperform, sometimes substantially, commonly used greedy approaches. This is demonstrated for a global search task, where on-line planning for a sequence of local searches is found to reduce the number of measurements in the search by approximately half. A variant of the algorithm is derived for Gaussian processes for active sensing.
    PINCH: An Adversarial Extraction Attack Framework for Deep Learning Models. (arXiv:2209.06300v2 [cs.CR] UPDATED)
    Adversarial extraction attacks constitute an insidious threat against Deep Learning (DL) models in-which an adversary aims to steal the architecture, parameters, and hyper-parameters of a targeted DL model. Existing extraction attack literature have observed varying levels of attack success for different DL models and datasets, yet the underlying cause(s) behind their susceptibility often remain unclear, and would help facilitate creating secure DL systems. In this paper we present PINCH: an efficient and automated extraction attack framework capable of designing, deploying, and analyzing extraction attack scenarios across heterogeneous hardware platforms. Using PINCH, we perform extensive experimental evaluation of extraction attacks against 21 model architectures to explore new extraction attack scenarios and further attack staging. Our findings show (1) key extraction characteristics whereby particular model configurations exhibit strong resilience against specific attacks, (2) even partial extraction success enables further staging for other adversarial attacks, and (3) equivalent stolen models uncover differences in expressive power, yet exhibit similar captured knowledge.
    Benchmarking Large Language Models for News Summarization. (arXiv:2301.13848v1 [cs.CL])
    Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences such as the amount of paraphrasing, we find that LMM summaries are judged to be on par with human written summaries.
    Reinforcement learning and decision making via single-photon quantum walks. (arXiv:2301.13669v1 [quant-ph])
    Variational quantum algorithms represent a promising approach to quantum machine learning where classical neural networks are replaced by parametrized quantum circuits. Here, we present a variational approach to quantize projective simulation (PS), a reinforcement learning model aimed at interpretable artificial intelligence. Decision making in PS is modeled as a random walk on a graph describing the agent's memory. To implement the quantized model, we consider quantum walks of single photons in a lattice of tunable Mach-Zehnder interferometers. We propose variational algorithms tailored to reinforcement learning tasks, and we show, using an example from transfer learning, that the quantized PS learning model can outperform its classical counterpart. Finally, we discuss the role of quantum interference for training and decision making, paving the way for realizations of interpretable quantum learning agents.
    What Can the Neural Tangent Kernel Tell Us About Adversarial Robustness?. (arXiv:2210.05577v2 [cs.LG] UPDATED)
    The adversarial vulnerability of neural nets, and subsequent techniques to create robust models have attracted significant attention; yet we still lack a full understanding of this phenomenon. Here, we study adversarial examples of trained neural networks through analytical tools afforded by recent theory advances connecting neural networks and kernel methods, namely the Neural Tangent Kernel (NTK), following a growing body of work that leverages the NTK approximation to successfully analyze important deep learning phenomena and design algorithms for new applications. We show how NTKs allow to generate adversarial examples in a ``training-free'' fashion, and demonstrate that they transfer to fool their finite-width neural net counterparts in the ``lazy'' regime. We leverage this connection to provide an alternative view on robust and non-robust features, which have been suggested to underlie the adversarial brittleness of neural nets. Specifically, we define and study features induced by the eigendecomposition of the kernel to better understand the role of robust and non-robust features, the reliance on both for standard classification and the robustness-accuracy trade-off. We find that such features are surprisingly consistent across architectures, and that robust features tend to correspond to the largest eigenvalues of the model, and thus are learned early during training. Our framework allows us to identify and visualize non-robust yet useful features. Finally, we shed light on the robustness mechanism underlying adversarial training of neural nets used in practice: quantifying the evolution of the associated empirical NTK, we demonstrate that its dynamics falls much earlier into the ``lazy'' regime and manifests a much stronger form of the well known bias to prioritize learning features within the top eigenspaces of the kernel, compared to standard training.
    Bayesian Learning for Neural Networks: an algorithmic survey. (arXiv:2211.11865v4 [stat.ML] UPDATED)
    The last decade witnessed a growing interest in Bayesian learning. Yet, the technicality of the topic and the multitude of ingredients involved therein, besides the complexity of turning theory into practical implementations, limit the use of the Bayesian learning paradigm, preventing its widespread adoption across different fields and applications. This self-contained survey engages and introduces readers to the principles and algorithms of Bayesian Learning for Neural Networks. It provides an introduction to the topic from an accessible, practical-algorithmic perspective. Upon providing a general introduction to Bayesian Neural Networks, we discuss and present both standard and recent approaches for Bayesian inference, with an emphasis on solutions relying on Variational Inference and the use of Natural gradients. We also discuss the use of manifold optimization as a state-of-the-art approach to Bayesian learning. We examine the characteristic properties of all the discussed methods, and provide pseudo-codes for their implementation, paying attention to practical aspects, such as the computation of the gradients.
    Cutting Plane Selection with Analytic Centers and Multiregression. (arXiv:2212.07231v3 [math.OC] UPDATED)
    Cutting planes are a crucial component of state-of-the-art mixed-integer programming solvers, with the choice of which subset of cuts to add being vital for solver performance. We propose new distance-based measures to qualify the value of a cut by quantifying the extent to which it separates relevant parts of the relaxed feasible set. For this purpose, we use the analytic centers of the relaxation polytope or of its optimal face, as well as alternative optimal solutions of the linear programming relaxation. We assess the impact of the choice of distance measure on root node performance and throughout the whole branch-and-bound tree, comparing our measures against those prevalent in the literature. Finally, by a multi-output regression, we predict the relative performance of each measure, using static features readily available before the separation process. Our results indicate that analytic center-based methods help to significantly reduce the number of branch-and-bound nodes needed to explore the search space and that our multiregression approach can further improve on any individual method.
    Direct-Effect Risk Minimization for Domain Generalization. (arXiv:2211.14594v2 [cs.LG] UPDATED)
    We study the problem of out-of-distribution (o.o.d.) generalization where spurious correlations of attributes vary across training and test domains. This is known as the problem of correlation shift and has posed concerns on the reliability of machine learning. In this work, we introduce the concepts of direct and indirect effects from causal inference to the domain generalization problem. We argue that models that learn direct effects minimize the worst-case risk across correlation-shifted domains. To eliminate the indirect effects, our algorithm consists of two stages: in the first stage, we learn an indirect-effect representation by minimizing the prediction error of domain labels using the representation and the class label; in the second stage, we remove the indirect effects learned in the first stage by matching each data with another data of similar indirect-effect representation but of different class label. We also propose a new model selection method by matching the validation set in the same way, which is shown to improve the generalization performance of existing models on correlation-shifted datasets. Experiments on 5 correlation-shifted datasets and the DomainBed benchmark verify the effectiveness of our approach.
    Difformer: Empowering Diffusion Models on the Embedding Space for Text Generation. (arXiv:2212.09412v2 [cs.CL] UPDATED)
    Diffusion models have achieved state-of-the-art synthesis quality on both visual and audio tasks, and recent works further adapt them to textual data by diffusing on the embedding space. In this paper, we conduct systematic studies and analyze the challenges between the continuous data space and the embedding space which have not been carefully explored. Firstly, the data distribution is learnable for embeddings, which may lead to the collapse of the loss function. Secondly, as the norm of embeddings varies between popular and rare words, adding the same noise scale will lead to sub-optimal results. In addition, we find the normal level of noise causes insufficient training of the model. To address the above challenges, we propose Difformer, an embedding diffusion model based on Transformer, which consists of three essential modules including an additional anchor loss function, a layer normalization module for embeddings, and a noise factor to the Gaussian noise. Experiments on two seminal text generation tasks including machine translation and text summarization show the superiority of Difformer over compared embedding diffusion baselines.
    Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences. (arXiv:2210.11794v2 [cs.LG] UPDATED)
    Efficient Transformers have been developed for long sequence modeling, due to their subquadratic memory and time complexity. Sparse Transformer is a popular approach to improving the efficiency of Transformers by restricting self-attention to locations specified by the predefined sparse patterns. However, leveraging sparsity may sacrifice expressiveness compared to full-attention, when important token correlations are multiple hops away. To combine advantages of both the efficiency of sparse transformer and the expressiveness of full-attention Transformer, we propose \textit{Diffuser}, a new state-of-the-art efficient Transformer. Diffuser incorporates all token interactions within one attention layer while maintaining low computation and memory costs. The key idea is to expand the receptive field of sparse attention using Attention Diffusion, which computes multi-hop token correlations based on all paths between corresponding disconnected tokens, besides attention among neighboring tokens. Theoretically, we show the expressiveness of Diffuser as a universal sequence approximator for sequence-to-sequence modeling, and investigate its ability to approximate full-attention by analyzing the graph expander property from the spectral perspective. Experimentally, we investigate the effectiveness of Diffuser with extensive evaluations, including language modeling, image modeling, and Long Range Arena (LRA). Evaluation results show that Diffuser achieves improvements by an average of 0.94% on text classification tasks and 2.30% on LRA, with 1.67$\times$ memory savings compared to state-of-the-art benchmarks, which demonstrates superior performance of Diffuser in both expressiveness and efficiency aspects.
    Forecasting COVID- 19 cases using Statistical Models and Ontology-based Semantic Modelling: A real time data analytics approach. (arXiv:2206.02795v2 [q-bio.PE] UPDATED)
    SARS-COV-19 is the most prominent issue which many countries face today. The frequent changes in infections, recovered and deaths represents the dynamic nature of this pandemic. It is very crucial to predict the spreading rate of this virus for accurate decision making against fighting with the situation of getting infected through the virus, tracking and controlling the virus transmission in the community. We develop a prediction model using statistical time series models such as SARIMA and FBProphet to monitor the daily active, recovered and death cases of COVID-19 accurately. Then with the help of various details across each individual patient (like height, weight, gender etc.), we designed a set of rules using Semantic Web Rule Language and some mathematical models for dealing with COVID19 infected cases on an individual basis. After combining all the models, a COVID-19 Ontology is developed and performs various queries using SPARQL query on designed Ontology which accumulate the risk factors, provide appropriate diagnosis, precautions and preventive suggestions for COVID Patients. After comparing the performance of SARIMA and FBProphet, it is observed that the SARIMA model performs better in forecasting of COVID cases. On individual basis COVID case prediction, approx. 497 individual samples have been tested and classified into five different levels of COVID classes such as Having COVID, No COVID, High Risk COVID case, Medium to High Risk case, and Control needed case.
    Recipro-CAM: Gradient-free reciprocal class activation map. (arXiv:2209.14074v2 [cs.CV] UPDATED)
    Convolutional neural network (CNN) becomes one of the most popular and prominent deep learning architectures for computer vision, but its black box feature hides the internal prediction process. For this reason, AI practitioners have shed light on explainable AI to provide the interpretability of the model behavior. In particular, class activation map (CAM) and Grad-CAM based methods have shown promise results, but they have architectural limitation or gradient computing burden. To resolve these, Score-CAM has been suggested as a gradient-free method, however, it requires more execution time compared to CAM or Grad-CAM based methods. Therefore, we propose a lightweight architecture and gradient free Reciprocal CAM (Recipro-CAM) by spatially masking the extracted feature maps to exploit the correlation between activation maps and network outputs. With the proposed method, we achieved the gains of 1.78 - 3.72% in the ResNet family compared to Score-CAM in Average Drop- Coherence-Complexity (ADCC) metric, excluding the VGG-16 (1.39% drop). In addition, Recipro-CAM exhibits a saliency map generation rate similar to Grad-CAM and approximately 148 times faster than Score-CAM. The source code of Recipro-CAM is available at our data analysis framework.
    Discovery of Single Independent Latent Variable. (arXiv:2110.05887v2 [stat.ML] UPDATED)
    Latent variable discovery is a central problem in data analysis with a broad range of applications in applied science. In this work, we consider data given as an invertible mixture of two statistically independent components, and assume that one of the components is observed while the other is hidden. Our goal is to recover the hidden component. For this purpose, we propose an autoencoder equipped with a discriminator. Unlike the standard nonlinear ICA problem, which was shown to be non-identifiable, in the special case of ICA we consider here, we show that our approach can recover the component of interest up to entropy-preserving transformation. We demonstrate the performance of the proposed approach on several datasets, including image synthesis, voice cloning, and fetal ECG extraction.
    Learning in POMDPs is Sample-Efficient with Hindsight Observability. (arXiv:2301.13857v1 [cs.LG])
    POMDPs capture a broad class of decision making problems, but hardness results suggest that learning is intractable even in simple settings due to the inherent partial observability. However, in many realistic problems, more information is either revealed or can be computed during some point of the learning process. Motivated by diverse applications ranging from robotics to data center scheduling, we formulate a \setting (\setshort) as a POMDP where the latent states are revealed to the learner in hindsight and only during training. We introduce new algorithms for the tabular and function approximation settings that are provably sample-efficient with hindsight observability, even in POMDPs that would otherwise be statistically intractable. We give a lower bound showing that the tabular algorithm is optimal in its dependence on latent state and observation cardinalities.
    Policy Gradient for s-Rectangular Robust Markov Decision Processes. (arXiv:2301.13589v1 [cs.LG])
    We present a novel robust policy gradient method (RPG) for s-rectangular robust Markov Decision Processes (MDPs). We are the first to derive the adversarial kernel in a closed form and demonstrate that it is a one-rank perturbation of the nominal kernel. This allows us to derive an RPG that is similar to the one used in non-robust MDPs, except with a robust Q-value function and an additional correction term. Both robust Q-values and correction terms are efficiently computable, thus the time complexity of our method matches that of non-robust MDPs, which is significantly faster compared to existing black box methods.
    Causal Graph Discovery from Self and Mutually Exciting Time Series. (arXiv:2106.02600v3 [cs.LG] UPDATED)
    We present a generalized linear structural causal model, coupled with a novel data-adaptive linear regularization, to recover causal directed acyclic graphs (DAGs) from time series. By leveraging a recently developed stochastic monotone Variational Inequality (VI) formulation, we cast the causal discovery problem as a general convex optimization. Furthermore, we develop a non-asymptotic recovery guarantee and quantifiable uncertainty by solving a linear program to establish confidence intervals for a wide range of non-linear monotone link functions. We validate our theoretical results and show the competitive performance of our method via extensive numerical experiments. Most importantly, we demonstrate the effectiveness of our approach in recovering highly interpretable causal DAGs over Sepsis Associated Derangements (SADs) while achieving comparable prediction performance to powerful ``black-box'' models such as XGBoost. Thus, the future adoption of our proposed method to conduct continuous surveillance of high-risk patients by clinicians is much more likely.
    Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks. (arXiv:2301.13799v1 [cs.LG])
    From natural language processing to genome sequencing, large-scale machine learning models are bringing advances to a broad range of fields. Many of these models are too large to be trained on a single machine, and instead must be distributed across multiple devices. This has motivated the research of new compute and network systems capable of handling such tasks. In particular, recent work has focused on developing management schemes which decide how to allocate distributed resources such that some overall objective, such as minimising the job completion time (JCT), is optimised. However, such studies omit explicit consideration of how much a job should be distributed, usually assuming that maximum distribution is desirable. In this work, we show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate. To address this, we propose PAC-ML (partitioning for asynchronous computing with machine learning). PAC-ML leverages a graph neural network and reinforcement learning to learn how much to partition computation graphs such that the number of jobs which meet arbitrary user-defined JCT requirements is maximised. In experiments with five real deep learning computation graphs on a recently proposed optical architecture across four user-defined JCT requirement distributions, we demonstrate PAC-ML achieving up to 56.2% lower blocking rates in dynamic job arrival settings than the canonical maximum parallelisation strategy used by most prior works.
    An Empirical Study of Quantum Dynamics as a Ground State Problem with Neural Quantum States. (arXiv:2206.09241v2 [quant-ph] UPDATED)
    We consider the Feynman-Kitaev formalism applied to a spin chain described by the transverse field Ising model. This formalism consists of building a Hamiltonian whose ground state encodes the time evolution of the spin chain at discrete time steps. To find this ground state, variational wave functions parameterised by artificial neural networks -- also known as neural quantum states (NQSs) -- are used. Our work focuses on assessing, in the context of the Feynman-Kitaev formalism, two properties of NQSs: expressivity (the possibility that variational parameters can be set to values such that the NQS is faithful to the true ground state of the system) and trainability (the process of reaching said values). We find that the considered NQSs are capable of accurately approximating the true ground state of the system, i.e., they are expressive enough ans\"atze. However, extensive hyperparameter tuning experiments show that, empirically, reaching the set of values for the variational parameters that correctly describe the ground state becomes ever more difficult as the number of time steps increase because the true ground state becomes more entangled, and the probability distribution starts to spread across the Hilbert space canonical basis.
    Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces I: the compact case. (arXiv:2208.14960v2 [stat.ME] UPDATED)
    Gaussian processes are arguably the most important model class in spatial statistics. They encode prior information about the modeled function and can be used for exact or approximate Bayesian inference. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners.
    Don't Explain Noise: Robust Counterfactuals for Randomized Ensembles. (arXiv:2205.14116v2 [cs.LG] UPDATED)
    Counterfactual explanations describe how to modify a feature vector in order to flip the outcome of a trained classifier. Obtaining robust counterfactual explanations is essential to provide valid algorithmic recourse and meaningful explanations. We study the robustness of explanations of randomized ensembles, which are always subject to algorithmic uncertainty even when the training data is fixed. We formalize the generation of robust counterfactual explanations as a probabilistic problem and show the link between the robustness of ensemble models and the robustness of base learners. We develop a practical method with good empirical performance and support it with theoretical guarantees for ensembles of convex base learners. Our results show that existing methods give surprisingly low robustness: the validity of naive counterfactuals is below $50\%$ on most data sets and can fall to $20\%$ on problems with many features. In contrast, our method achieves high robustness with only a small increase in the distance from counterfactual explanations to their initial observations.
    PADL: Language-Directed Physics-Based Character Control. (arXiv:2301.13868v1 [cs.LG])
    Developing systems that can synthesize natural and life-like motions for simulated characters has long been a focus for computer animation. But in order for these systems to be useful for downstream applications, they need not only produce high-quality motions, but must also provide an accessible and versatile interface through which users can direct a character's behaviors. Natural language provides a simple-to-use and expressive medium for specifying a user's intent. Recent breakthroughs in natural language processing (NLP) have demonstrated effective use of language-based interfaces for applications such as image generation and program synthesis. In this work, we present PADL, which leverages recent innovations in NLP in order to take steps towards developing language-directed controllers for physics-based character animation. PADL allows users to issue natural language commands for specifying both high-level tasks and low-level skills that a character should perform. We present an adversarial imitation learning approach for training policies to map high-level language commands to low-level controls that enable a character to perform the desired task and skill specified by a user's commands. Furthermore, we propose a multi-task aggregation method that leverages a language-based multiple-choice question-answering approach to determine high-level task objectives from language commands. We show that our framework can be applied to effectively direct a simulated humanoid character to perform a diverse array of complex motor skills.
    Generating Synthetic Mixed-type Longitudinal Electronic Health Records for Artificial Intelligent Applications. (arXiv:2112.12047v2 [cs.LG] UPDATED)
    The recent availability of electronic health records (EHRs) have provided enormous opportunities to develop artificial intelligence (AI) algorithms. However, patient privacy has become a major concern that limits data sharing across hospital settings and subsequently hinders the advances in AI. Synthetic data, which benefits from the development and proliferation of generative models, has served as a promising substitute for real patient EHR data. However, the current generative models are limited as they only generate single type of clinical data for a synthetic patient, i.e., either continuous-valued or discrete-valued. To mimic the nature of clinical decision-making which encompasses various data types/sources, in this study, we propose a generative adversarial network (GAN) entitled EHR-M-GAN which simultaneously synthesizes mixed-type timeseries EHR data. EHR-M-GAN is capable of capturing the multidimensional, heterogeneous, and correlated temporal dynamics in patient trajectories. We have validated EHR-M-GAN on three publicly-available intensive care unit databases with records from a total of 141,488 unique patients, and performed privacy risk evaluation of the proposed model. EHR-M-GAN has demonstrated its superiority over state-of-the-art benchmarks for synthesizing clinical timeseries with high fidelity, while addressing the limitations regarding data types and dimensionality in the current generative models. Notably, prediction models for outcomes of intensive care performed significantly better when training data was augmented with the addition of EHR-M-GAN-generated timeseries. EHR-M-GAN may have use in developing AI algorithms in resource-limited settings, lowering the barrier for data acquisition while preserving patient privacy.
    Disentangling Model Multiplicity in Deep Learning. (arXiv:2206.08890v2 [cs.LG] UPDATED)
    Model multiplicity is a well-known but poorly understood phenomenon that undermines the generalisation guarantees of machine learning models. It appears when two models with similar training-time performance differ in their predictions and real-world performance characteristics. This observed 'predictive' multiplicity (PM) also implies elusive differences in the internals of the models, their 'representational' multiplicity (RM). We introduce a conceptual and experimental setup for analysing RM by measuring activation similarity via singular vector canonical correlation analysis (SVCCA). We show that certain differences in training methods systematically result in larger RM than others and evaluate RM and PM over a finite sample as predictors for generalizability. We further correlate RM with PM measured by the variance in i.i.d. and out-of-distribution test predictions in four standard image data sets. Finally, instead of attempting to eliminate RM, we call for its systematic measurement and maximal exposure.
    FedBC: Calibrating Global and Local Models via Federated Learning Beyond Consensus. (arXiv:2206.10815v3 [cs.LG] UPDATED)
    In this work, we quantitatively calibrate the performance of global and local models in federated learning through a multi-criterion optimization-based framework, which we cast as a constrained program. The objective of a device is its local objective, which it seeks to minimize while satisfying nonlinear constraints that quantify the proximity between the local and the global model. By considering the Lagrangian relaxation of this problem, we develop a novel primal-dual method called Federated Learning Beyond Consensus (\texttt{FedBC}). Theoretically, we establish that \texttt{FedBC} converges to a first-order stationary point at rates that matches the state of the art, up to an additional error term that depends on a tolerance parameter introduced to scalarize the multi-criterion formulation. Finally, we demonstrate that \texttt{FedBC} balances the global and local model test accuracy metrics across a suite of datasets (Synthetic, MNIST, CIFAR-10, Shakespeare), achieving competitive performance with state-of-the-art.
    Toward Efficient Gradient-Based Value Estimation. (arXiv:2301.13757v1 [cs.LG])
    Gradient-based methods for value estimation in reinforcement learning have favorable stability properties, but they are typically much slower than Temporal Difference (TD) learning methods. We study the root causes of this slowness and show that Mean Square Bellman Error (MSBE) is an ill-conditioned loss function in the sense that its Hessian has large condition-number. To resolve the adverse effect of poor conditioning of MSBE on gradient based methods, we propose a low complexity batch-free proximal method that approximately follows the Gauss-Newton direction and is asymptotically robust to parameterization. Our main algorithm, called RANS, is efficient in the sense that it is significantly faster than the residual gradient methods while having almost the same computational complexity, and is competitive with TD on the classic problems that we tested.
    Salient Conditional Diffusion for Defending Against Backdoor Attacks. (arXiv:2301.13862v1 [cs.LG])
    We propose a novel algorithm, Salient Conditional Diffusion (Sancdifi), a state-of-the-art defense against backdoor attacks. Sancdifi uses a denoising diffusion probabilistic model (DDPM) to degrade an image with noise and then recover said image using the learned reverse diffusion. Critically, we compute saliency map-based masks to condition our diffusion, allowing for stronger diffusion on the most salient pixels by the DDPM. As a result, Sancdifi is highly effective at diffusing out triggers in data poisoned by backdoor attacks. At the same time, it reliably recovers salient features when applied to clean data. This performance is achieved without requiring access to the model parameters of the Trojan network, meaning Sancdifi operates as a black-box defense.
    FreeMatch: Self-adaptive Thresholding for Semi-supervised Learning. (arXiv:2205.07246v3 [cs.LG] UPDATED)
    Semi-supervised Learning (SSL) has witnessed great success owing to the impressive performances brought by various methods based on pseudo labeling and consistency regularization. However, we argue that existing methods might fail to utilize the unlabeled data more effectively since they either use a pre-defined / fixed threshold or an ad-hoc threshold adjusting scheme, resulting in inferior performance and slow convergence. We first analyze a motivating example to obtain intuitions on the relationship between the desirable threshold and model's learning status. Based on the analysis, we hence propose FreeMatch to adjust the confidence threshold in a self-adaptive manner according to the model's learning status. We further introduce a self-adaptive class fairness regularization penalty to encourage the model for diverse predictions during the early training stage. Extensive experiments indicate the superiority of FreeMatch especially when the labeled data are extremely rare. FreeMatch achieves 5.78%, 13.59%, and 1.28% error rate reduction over the latest state-of-the-art method FlexMatch on CIFAR-10 with 1 label per class, STL-10 with 4 labels per class, and ImageNet with 100 labels per class, respectively. Moreover, FreeMatch can also boost the performance of imbalanced SSL. The codes can be found at https://github.com/microsoft/Semi-supervised-learning.
    Normalizing Flows for Interventional Density Estimation. (arXiv:2209.06203v3 [cs.LG] UPDATED)
    Existing machine learning methods for causal inference usually estimate quantities expressed via the mean of potential outcomes (e.g., average treatment effect). However, such quantities do not capture the full information about the distribution of potential outcomes. In this work, we estimate the density of potential outcomes after interventions from observational data. For this, we propose a novel, fully-parametric deep learning method called Interventional Normalizing Flows. Specifically, we combine two normalizing flows, namely (i) a teacher flow for estimating nuisance parameters and (ii) a student flow for a parametric estimation of the density of potential outcomes. We further develop a tractable optimization objective based on a one-step bias correction for an efficient and doubly robust estimation of the student flow parameters. As a result our Interventional Normalizing Flows offer a properly normalized density estimator. Across various experiments, we demonstrate that our Interventional Normalizing Flows are expressive and highly effective, and scale well with both sample size and high-dimensional confounding. To the best of our knowledge, our Interventional Normalizing Flows are the first fully-parametric, deep learning method for density estimation of potential outcomes.
    A relaxed proximal gradient descent algorithm for convergent plug-and-play with proximal denoiser. (arXiv:2301.13731v1 [stat.ML])
    This paper presents a new convergent Plug-and-Play (PnP) algorithm. PnP methods are efficient iterative algorithms for solving image inverse problems formulated as the minimization of the sum of a data-fidelity term and a regularization term. PnP methods perform regularization by plugging a pre-trained denoiser in a proximal algorithm, such as Proximal Gradient Descent (PGD). To ensure convergence of PnP schemes, many works study specific parametrizations of deep denoisers. However, existing results require either unverifiable or suboptimal hypotheses on the denoiser, or assume restrictive conditions on the parameters of the inverse problem. Observing that these limitations can be due to the proximal algorithm in use, we study a relaxed version of the PGD algorithm for minimizing the sum of a convex function and a weakly convex one. When plugged with a relaxed proximal denoiser, we show that the proposed PnP-$\alpha$PGD algorithm converges for a wider range of regularization parameters, thus allowing more accurate image restoration.
    OPT-GAN: A Broad-Spectrum Global Optimizer for Black-box Problems by Learning Distribution. (arXiv:2102.03888v6 [cs.LG] UPDATED)
    Black-box optimization (BBO) algorithms are concerned with finding the best solutions for problems with missing analytical details. Most classical methods for such problems are based on strong and fixed a priori assumptions, such as Gaussianity. However, the complex real-world problems, especially when the global optimum is desired, could be very far from the a priori assumptions because of their diversities, causing unexpected obstacles. In this study, we propose a generative adversarial net-based broad-spectrum global optimizer (OPT-GAN) which estimates the distribution of optimum gradually, with strategies to balance exploration-exploitation trade-off. It has potential to better adapt to the regularity and structure of diversified landscapes than other methods with fixed prior, e.g., Gaussian assumption or separability. Experiments on diverse BBO benchmarks and high dimensional real world applications exhibit that OPT-GAN outperforms other traditional and neural net-based BBO algorithms.
    Subgroup Fairness in Two-Sided Markets. (arXiv:2106.02702v2 [cs.AI] UPDATED)
    It is well known that two-sided markets are unfair in a number of ways. For instance, female workers at Uber earn less than their male colleagues per mile driven. Similar observations have been made for other minority subgroups in other two-sided markets. Here, we suggest a novel market-clearing mechanism for two-sided markets, which promotes equalisation of the pay per hour worked across multiple subgroups, as well as within each subgroup. In the process, we introduce a novel notion of subgroup fairness (which we call Inter-fairness), which can be combined with other notions of fairness within each subgroup (called Intra-fairness), and the utility for the customers (Customer-Care) in the objective of the market-clearing problem. While the novel non-linear terms in the objective complicate market clearing by making the problem non-convex, we show that a certain non-convex augmented Lagrangian relaxation can be approximated to any precision in time polynomial in the number of market participants using semi-definite programming. This makes it possible to implement the market-clearing mechanism efficiently. On the example of driver-ride assignment in an Uber-like system, we demonstrate the efficacy and scalability of the approach, and trade-offs between Inter- and Intra-fairness.
    Bayesian Calibration of Imperfect Computer Models using Physics-Informed Priors. (arXiv:2201.06463v4 [stat.ML] UPDATED)
    We introduce a computational efficient data-driven framework suitable for quantifying the uncertainty in physical parameters and model formulation of computer models, represented by differential equations. We construct physics-informed priors, which are multi-output GP priors that encode the model's structure in the covariance function. This is extended into a fully Bayesian framework that quantifies the uncertainty of physical parameters and model predictions. Since physical models often are imperfect descriptions of the real process, we allow the model to deviate from the observed data by considering a discrepancy function. For inference, Hamiltonian Monte Carlo is used. Further, approximations for big data are developed that reduce the computational complexity from $\mathcal{O}(N^3)$ to $\mathcal{O}(N\cdot m^2),$ where $m \ll N.$ Our approach is demonstrated in simulation and real data case studies where the physics are described by time-dependent ODEs describe (cardiovascular models) and space-time dependent PDEs (heat equation). In the studies, it is shown that our modelling framework can recover the true parameters of the physical models in cases where 1) the reality is more complex than our modelling choice and 2) the data acquisition process is biased while also producing accurate predictions. Furthermore, it is demonstrated that our approach is computationally faster than traditional Bayesian calibration methods.
    SecGNN: Privacy-Preserving Graph Neural Network Training and Inference as a Cloud Service. (arXiv:2202.07835v2 [cs.CR] UPDATED)
    Graphs are widely used to model the complex relationships among entities. As a powerful tool for graph analytics, graph neural networks (GNNs) have recently gained wide attention due to its end-to-end processing capabilities. With the proliferation of cloud computing, it is increasingly popular to deploy the services of complex and resource-intensive model training and inference in the cloud due to its prominent benefits. However, GNN training and inference services, if deployed in the cloud, will raise critical privacy concerns about the information-rich and proprietary graph data (and the resulting model). While there has been some work on secure neural network training and inference, they all focus on convolutional neural networks handling images and text rather than complex graph data with rich structural information. In this paper, we design, implement, and evaluate SecGNN, the first system supporting privacy-preserving GNN training and inference services in the cloud. SecGNN is built from a synergy of insights on lightweight cryptography and machine learning techniques. We deeply examine the procedure of GNN training and inference, and devise a series of corresponding secure customized protocols to support the holistic computation. Extensive experiments demonstrate that SecGNN achieves comparable plaintext training and inference accuracy, with promising performance.
    Optimal precision for GANs. (arXiv:2207.10541v2 [cs.LG] UPDATED)
    Many deep generative models are defined as a push-forward of a Gaussian measure by a continuous generator, such as Generative Adversarial Networks (GANs) or Variational Auto-Encoders (VAEs). This work explores the latent space of such deep generative models. A key issue with these models is their tendency to output samples outside of the support of the target distribution when learning disconnected distributions. We investigate the relationship between the performance of these models and the geometry of their latent space. Building on recent developments in geometric measure theory, we prove a sufficient condition for optimality in the case where the dimension of the latent space is larger than the number of modes. Through experiments on GANs, we demonstrate the validity of our theoretical results and gain new insights into the latent space geometry of these models. Additionally, we propose a truncation method that enforces a simplicial cluster structure in the latent space and improves the performance of GANs.
    Fairness and Accuracy under Domain Generalization. (arXiv:2301.13323v1 [cs.LG])
    As machine learning (ML) algorithms are increasingly used in high-stakes applications, concerns have arisen that they may be biased against certain social groups. Although many approaches have been proposed to make ML models fair, they typically rely on the assumption that data distributions in training and deployment are identical. Unfortunately, this is commonly violated in practice and a model that is fair during training may lead to an unexpected outcome during its deployment. Although the problem of designing robust ML models under dataset shifts has been widely studied, most existing works focus only on the transfer of accuracy. In this paper, we study the transfer of both fairness and accuracy under domain generalization where the data at test time may be sampled from never-before-seen domains. We first develop theoretical bounds on the unfairness and expected loss at deployment, and then derive sufficient conditions under which fairness and accuracy can be perfectly transferred via invariant representation learning. Guided by this, we design a learning algorithm such that fair ML models learned with training data still have high fairness and accuracy when deployment environments change. Experiments on real-world data validate the proposed algorithm. Model implementation is available at https://github.com/pth1993/FATDM.  ( 2 min )
    Transport with Support: Data-Conditional Diffusion Bridges. (arXiv:2301.13636v1 [cs.LG])
    The dynamic Schr\"odinger bridge problem provides an appealing setting for solving optimal transport problems by learning non-linear diffusion processes using efficient iterative solvers. Recent works have demonstrated state-of-the-art results (eg. in modelling single-cell embryo RNA sequences or sampling from complex posteriors) but are limited to learning bridges with only initial and terminal constraints. Our work extends this paradigm by proposing the Iterative Smoothing Bridge (ISB). We integrate Bayesian filtering and optimal control into learning the diffusion process, enabling constrained stochastic processes governed by sparse observations at intermediate stages and terminal constraints. We assess the effectiveness of our method on synthetic and real-world data and show that the ISB generalises well to high-dimensional data, is computationally efficient, and provides accurate estimates of the marginals at intermediate and terminal times.
    Structure Learning and Parameter Estimation for Graphical Models via Penalized Maximum Likelihood Methods. (arXiv:2301.13269v1 [stat.ML])
    Probabilistic graphical models (PGMs) provide a compact and flexible framework to model very complex real-life phenomena. They combine the probability theory which deals with uncertainty and logical structure represented by a graph which allows one to cope with the computational complexity and also interpret and communicate the obtained knowledge. In the thesis, we consider two different types of PGMs: Bayesian networks (BNs) which are static, and continuous time Bayesian networks which, as the name suggests, have a temporal component. We are interested in recovering their true structure, which is the first step in learning any PGM. This is a challenging task, which is interesting in itself from the causal point of view, for the purposes of interpretation of the model and the decision-making process. All approaches for structure learning in the thesis are united by the same idea of maximum likelihood estimation with the LASSO penalty. The problem of structure learning is reduced to the problem of finding non-zero coefficients in the LASSO estimator for a generalized linear model. In the case of CTBNs, we consider the problem both for complete and incomplete data. We support the theoretical results with experiments.  ( 2 min )
    BRAIxDet: Learning to Detect Malignant Breast Lesion with Incomplete Annotations. (arXiv:2301.13418v1 [cs.CV])
    Methods to detect malignant lesions from screening mammograms are usually trained with fully annotated datasets, where images are labelled with the localisation and classification of cancerous lesions. However, real-world screening mammogram datasets commonly have a subset that is fully annotated and another subset that is weakly annotated with just the global classification (i.e., without lesion localisation). Given the large size of such datasets, researchers usually face a dilemma with the weakly annotated subset: to not use it or to fully annotate it. The first option will reduce detection accuracy because it does not use the whole dataset, and the second option is too expensive given that the annotation needs to be done by expert radiologists. In this paper, we propose a middle-ground solution for the dilemma, which is to formulate the training as a weakly- and semi-supervised learning problem that we refer to as malignant breast lesion detection with incomplete annotations. To address this problem, our new method comprises two stages, namely: 1) pre-training a multi-view mammogram classifier with weak supervision from the whole dataset, and 2) extending the trained classifier to become a multi-view detector that is trained with semi-supervised student-teacher learning, where the training set contains fully and weakly-annotated mammograms. We provide extensive detection results on two real-world screening mammogram datasets containing incomplete annotations, and show that our proposed approach achieves state-of-the-art results in the detection of malignant breast lesions with incomplete annotations.
    Towards Learned Emulation of Interannual Water Isotopologue Variations in General Circulation Models. (arXiv:2301.13462v1 [physics.ao-ph])
    Simulating abundances of stable water isotopologues, i.e. molecules differing in their isotopic composition, within climate models allows for comparisons with proxy data and, thus, for testing hypotheses about past climate and validating climate models under varying climatic conditions. However, many models are run without explicitly simulating water isotopologues. We investigate the possibility to replace the explicit physics-based simulation of oxygen isotopic composition in precipitation using machine learning methods. These methods estimate isotopic composition at each time step for given fields of surface temperature and precipitation amount. We implement convolutional neural networks (CNNs) based on the successful UNet architecture and test whether a spherical network architecture outperforms the naive approach of treating Earth's latitude-longitude grid as a flat image. Conducting a case study on a last millennium run with the iHadCM3 climate model, we find that roughly 40\% of the temporal variance in the isotopic composition is explained by the emulations on interannual and monthly timescale, with spatially varying emulation quality. A modified version of the standard UNet architecture for flat images yields results that are equally good as the predictions by the spherical CNN. We test generalization to last millennium runs of other climate models and find that while the tested deep learning methods yield the best results on iHadCM3 data, the performance drops when predicting on other models and is comparable to simple pixel-wise linear regression. An extended choice of predictor variables and improving the robustness of learned climate--oxygen isotope relationships should be explored in future work.
    STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining. (arXiv:2207.05022v3 [cs.LG] UPDATED)
    Natural Language Processing (NLP) inference is seeing increasing adoption by mobile applications, where on-device inference is desirable for crucially preserving user data privacy and avoiding network roundtrips. Yet, the unprecedented size of an NLP model stresses both latency and memory, creating a tension between the two key resources of a mobile device. To meet a target latency, holding the whole model in memory launches execution as soon as possible but increases one app's memory footprints by several times, limiting its benefits to only a few inferences before being recycled by mobile memory management. On the other hand, loading the model from storage on demand incurs IO as long as a few seconds, far exceeding the delay range satisfying to a user; pipelining layerwise model loading and execution does not hide IO either, due to the high skewness between IO and computation delays. To this end, we propose Speedy Transformer Inference (STI). Built on the key idea of maximizing IO/compute resource utilization on the most important parts of a model, STI reconciles the latency v.s. memory tension via two novel techniques. First, model sharding. STI manages model parameters as independently tunable shards, and profiles their importance to accuracy. Second, elastic pipeline planning with a preload buffer. STI instantiates an IO/compute pipeline and uses a small buffer for preload shards to bootstrap execution without stalling at early stages; it judiciously selects, tunes, and assembles shards per their importance for resource-elastic execution, maximizing inference accuracy. Atop two commodity SoCs, we build STI and evaluate it against a wide range of NLP tasks, under a practical range of target latencies, and on both CPU and GPU. We demonstrate that STI delivers high accuracies with 1-2 orders of magnitude lower memory, outperforming competitive baselines.
    Sequential Kernelized Independence Testing. (arXiv:2212.07383v2 [stat.ML] UPDATED)
    Independence testing is a fundamental and classical statistical problem that has been extensively studied in the batch setting when one fixes the sample size before collecting data. However, practitioners often prefer procedures that adapt to the complexity of a problem at hand instead of setting sample size in advance. Ideally, such procedures should (a) allow stopping earlier on easy tasks (and later on harder tasks), hence making better use of available resources, and (b) continuously monitor the data and efficiently incorporate statistical evidence after collecting new data, while controlling the false alarm rate. It is well known that classical batch tests are not tailored for streaming data settings: valid inference after data peeking requires correcting for multiple testing but such corrections generally result in low power. Following the principle of testing by betting, we design sequential kernelized independence tests (SKITs) that overcome such shortcomings. We exemplify our broad framework using bets inspired by kernelized dependence measures, e.g, the Hilbert-Schmidt independence criterion. Our test is valid under non-i.i.d. time-varying settings, for which there exist no batch tests. We demonstrate the power of our approaches on both simulated and real data.
    Automated speech- and text-based classification of neuropsychiatric conditions in a multidiagnostic setting. (arXiv:2301.06916v2 [cs.CL] UPDATED)
    Speech patterns have been identified as potential diagnostic markers for neuropsychiatric conditions. However, most studies only compare a single clinical group to healthy controls, whereas clinical practice often requires differentiating between multiple potential diagnoses (multiclass settings). To address this, we assembled a dataset of repeated recordings from 420 participants (67 with major depressive disorder, 106 with schizophrenia and 46 with autism, as well as matched controls), and tested the performance of a range of conventional machine learning models and advanced Transformer models on both binary and multiclass classification, based on voice and text features. While binary models performed comparably to previous research (F1 scores between 0.54-0.75 for autism spectrum disorder, ASD; 0.67-0.92 for major depressive disorder, MDD; and 0.71-0.83 for schizophrenia); when differentiating between multiple diagnostic groups performance decreased markedly (F1 scores between 0.35-0.44 for ASD, 0.57-0.75 for MDD, 0.15-0.66 for schizophrenia, and 0.38-0.52 macro F1). Combining voice and text-based models yielded increased performance, suggesting that they capture complementary diagnostic information. Our results indicate that models trained on binary classification may learn to rely on markers of generic differences between clinical and non-clinical populations, or markers of clinical features that overlap across conditions, rather than identifying markers specific to individual conditions. We provide recommendations for future research in the field, suggesting increased focus on developing larger transdiagnostic datasets that include more fine-grained clinical features, and that can support the development of models that better capture the complexity of neuropsychiatric conditions and naturalistic diagnostic assessment.
    Low Complexity Adaptive Machine Learning Approaches for End-to-End Latency Prediction. (arXiv:2301.13536v1 [cs.NI])
    Software Defined Networks have opened the door to statistical and AI-based techniques to improve efficiency of networking. Especially to ensure a certain Quality of Service (QoS) for specific applications by routing packets with awareness on content nature (VoIP, video, files, etc.) and its needs (latency, bandwidth, etc.) to use efficiently resources of a network. Monitoring and predicting various Key Performance Indicators (KPIs) at any level may handle such problems while preserving network bandwidth. The question addressed in this work is the design of efficient, low-cost adaptive algorithms for KPI estimation, monitoring and prediction. We focus on end-to-end latency prediction, for which we illustrate our approaches and results on data obtained from a public generator provided after the recent international challenge on GNN [12]. In this paper, we improve our previously proposed low-cost estimators [6] by adding the adaptive dimension, and show that the performances are minimally modified while gaining the ability to track varying networks.
    Inference Time Evidences of Adversarial Attacks for Forensic on Transformers. (arXiv:2301.13356v1 [cs.CV])
    Vision Transformers (ViTs) are becoming a very popular paradigm for vision tasks as they achieve state-of-the-art performance on image classification. However, although early works implied that this network structure had increased robustness against adversarial attacks, some works argue ViTs are still vulnerable. This paper presents our first attempt toward detecting adversarial attacks during inference time using the network's input and outputs as well as latent features. We design four quantifications (or derivatives) of input, output, and latent vectors of ViT-based models that provide a signature of the inference, which could be beneficial for the attack detection, and empirically study their behavior over clean samples and adversarial samples. The results demonstrate that the quantifications from input (images) and output (posterior probabilities) are promising for distinguishing clean and adversarial samples, while latent vectors offer less discriminative power, though they give some insights on how adversarial perturbations work.
    Optimal Transport Perturbations for Safe Reinforcement Learning with Robustness Guarantees. (arXiv:2301.13375v1 [cs.LG])
    Robustness and safety are critical for the trustworthy deployment of deep reinforcement learning in real-world decision making applications. In particular, we require algorithms that can guarantee robust, safe performance in the presence of general environment disturbances, while making limited assumptions on the data collection process during training. In this work, we propose a safe reinforcement learning framework with robustness guarantees through the use of an optimal transport cost uncertainty set. We provide an efficient, theoretically supported implementation based on Optimal Transport Perturbations, which can be applied in a completely offline fashion using only data collected in a nominal training environment. We demonstrate the robust, safe performance of our approach on a variety of continuous control tasks with safety constraints in the Real-World Reinforcement Learning Suite.
    Unconstrained Dynamic Regret via Sparse Coding. (arXiv:2301.13349v1 [cs.LG])
    Motivated by time series forecasting, we study Online Linear Optimization (OLO) under the coupling of two problem structures: the domain is unbounded, and the performance of an algorithm is measured by its dynamic regret. Handling either of them requires the regret bound to depend on certain complexity measure of the comparator sequence -- specifically, the comparator norm in unconstrained OLO, and the path length in dynamic regret. In contrast to a recent work (Jacobsen & Cutkosky, 2022) that adapts to the combination of these two complexity measures, we propose an alternative complexity measure by recasting the problem into sparse coding. Adaptivity can be achieved by a simple modular framework, which naturally exploits more intricate prior knowledge of the environment. Along the way, we also present a new gradient adaptive algorithm for static unconstrained OLO, designed using novel continuous time machinery. This could be of independent interest.
    Sequential Strategic Screening. (arXiv:2301.13397v1 [cs.LG])
    We initiate the study of strategic behavior in screening processes with multiple classifiers. We focus on two contrasting settings: a conjunctive setting in which an individual must satisfy all classifiers simultaneously, and a sequential setting in which an individual to succeed must satisfy classifiers one at a time. In other words, we introduce the combination of strategic classification with screening processes. We show that sequential screening pipelines exhibit new and surprising behavior where individuals can exploit the sequential ordering of the tests to zig-zag between classifiers without having to simultaneously satisfy all of them. We demonstrate an individual can obtain a positive outcome using a limited manipulation budget even when far from the intersection of the positive regions of every classifier. Finally, we consider a learner whose goal is to design a sequential screening process that is robust to such manipulations, and provide a construction for the learner that optimizes a natural objective.
    Learning Against Distributional Uncertainty: On the Trade-off Between Robustness and Specificity. (arXiv:2301.13565v1 [cs.LG])
    Trustworthy machine learning aims at combating distributional uncertainties in training data distributions compared to population distributions. Typical treatment frameworks include the Bayesian approach, (min-max) distributionally robust optimization (DRO), and regularization. However, two issues have to be raised: 1) All these methods are biased estimators of the true optimal cost; 2) the prior distribution in the Bayesian method, the radius of the distributional ball in the DRO method, and the regularizer in the regularization method are difficult to specify. This paper studies a new framework that unifies the three approaches and that addresses the two challenges mentioned above. The asymptotic properties (e.g., consistency and asymptotic normalities), non-asymptotic properties (e.g., unbiasedness and generalization error bound), and a Monte--Carlo-based solution method of the proposed model are studied. The new model reveals the trade-off between the robustness to the unseen data and the specificity to the training data.
    An Comparative Analysis of Different Pitch and Metrical Grid Encoding Methods in the Task of Sequential Music Generation. (arXiv:2301.13383v1 [cs.SD])
    Pitch and meter are two fundamental music features for symbolic music generation tasks, where researchers usually choose different encoding methods depending on specific goals. However, the advantages and drawbacks of different encoding methods have not been frequently discussed. This paper presents a integrated analysis of the influence of two low-level feature, pitch and meter, on the performance of a token-based sequential music generation model. First, the commonly used MIDI number encoding and a less used class-octave encoding are compared. Second, an dense intra-bar metric grid is imposed to the encoded sequence as auxiliary features. Different complexity and resolutions of the metric grid are compared. For complexity, the single token approach and the multiple token approach are compared; for grid resolution, 0 (ablation), 1 (bar-level), 4 (downbeat-level) 12, (8th-triplet-level) up to 64 (64th-note-grid-level) are compared; for duration resolution, 4, 8, 12 and 16 subdivisions per beat are compared. All different encodings are tested on separately trained Transformer-XL models for a melody generation task. Regarding distribution similarity of several objective evaluation metrics to the test dataset, results suggest that the class-octave encoding significantly outperforms the taken-for-granted MIDI encoding on pitch-related metrics; finer grids and multiple-token grids improve the rhythmic quality, but also suffer from over-fitting at early training stage. Results display a general phenomenon of over-fitting from two aspects, the pitch embedding space and the test loss of the single-token grid encoding. From a practical perspective, we both demonstrate the feasibility and raise the concern of easy over-fitting problem of using smaller networks and lower embedding dimensions on the generation task. The findings can also contribute to futural models in terms of feature engineering.
    Classified as unknown: A novel Bayesian neural network. (arXiv:2301.13401v1 [cs.LG])
    We establish estimations for the parameters of the output distribution for the softmax activation function using the probit function. As an application, we develop a new efficient Bayesian learning algorithm for fully connected neural networks, where training and predictions are performed within the Bayesian inference framework in closed-form. This approach allows sequential learning and requires no computationally expensive gradient calculation and Monte Carlo sampling. Our work generalizes the Bayesian algorithm for a single perceptron for binary classification in \cite{H} to multi-layer perceptrons for multi-class classification.
    The Flan Collection: Designing Data and Methods for Effective Instruction Tuning. (arXiv:2301.13688v1 [cs.AI])
    We study the design decisions of publicly available instruction tuning methods, and break down the development of Flan 2022 (Chung et al., 2022). Through careful ablation studies on the Flan Collection of tasks and methods, we tease apart the effect of design decisions which enable Flan-T5 to outperform prior work by 3-17%+ across evaluation settings. We find task balancing and enrichment techniques are overlooked but critical to effective instruction tuning, and in particular, training with mixed prompt settings (zero-shot, few-shot, and chain-of-thought) actually yields stronger (2%+) performance in all settings. In further experiments, we show Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks, motivating instruction-tuned models as more computationally-efficient starting checkpoints for new tasks. Finally, to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available at https://github.com/google-research/FLAN/tree/main/flan/v2.
    Fast Resolution Agnostic Neural Techniques to Solve Partial Differential Equations. (arXiv:2301.13331v1 [cs.AI])
    Numerical approximations of partial differential equations (PDEs) are routinely employed to formulate the solution of physics, engineering and mathematical problems involving functions of several variables, such as the propagation of heat or sound, fluid flow, elasticity, electrostatics, electrodynamics, and more. While this has led to solving many complex phenomena, there are still significant limitations. Conventional approaches such as Finite Element Methods (FEMs) and Finite Differential Methods (FDMs) require considerable time and are computationally expensive. In contrast, machine learning-based methods such as neural networks are faster once trained, but tend to be restricted to a specific discretization. This article aims to provide a comprehensive summary of conventional methods and recent machine learning-based methods to approximate PDEs numerically. Furthermore, we highlight several key architectures centered around the neural operator, a novel and fast approach (1000x) to learning the solution operator of a PDE. We will note how these new computational approaches can bring immense advantages in tackling many problems in fundamental and applied physics.
    DNN Explanation for Safety Analysis: an Empirical Evaluation of Clustering-based Approaches. (arXiv:2301.13506v1 [cs.SE])
    The adoption of deep neural networks (DNNs) in safety-critical contexts is often prevented by the lack of effective means to explain their results, especially when they are erroneous. In our previous work, we proposed a white-box approach (HUDD) and a black-box approach (SAFE) to automatically characterize DNN failures. They both identify clusters of similar images from a potentially large set of images leading to DNN failures. However, the analysis pipelines for HUDD and SAFE were instantiated in specific ways according to common practices, deferring the analysis of other pipelines to future work. In this paper, we report on an empirical evaluation of 99 different pipelines for root cause analysis of DNN failures. They combine transfer learning, autoencoders, heatmaps of neuron relevance, dimensionality reduction techniques, and different clustering algorithms. Our results show that the best pipeline combines transfer learning, DBSCAN, and UMAP. It leads to clusters almost exclusively capturing images of the same failure scenario, thus facilitating root cause analysis. Further, it generates distinct clusters for each root cause of failure, thus enabling engineers to detect all the unsafe scenarios. Interestingly, these results hold even for failure scenarios that are only observed in a small percentage of the failing images.
    Holistic Graph-based Motion Prediction. (arXiv:2301.13545v1 [cs.RO])
    Motion prediction for automated vehicles in complex environments is a difficult task that is to be mastered when automated vehicles are to be used in arbitrary situations. Many factors influence the future motion of traffic participants starting with traffic rules and reaching from the interaction between each other to personal habits of human drivers. Therefore we present a novel approach for a graph-based prediction based on a heterogeneous holistic graph representation that combines temporal information, properties and relations between traffic participants as well as relations with static elements like the road network. The information are encoded through different types of nodes and edges that both are enriched with arbitrary features. We evaluated the approach on the INTERACTION and the Argoverse dataset and conducted an informative ablation study to demonstrate the benefit of different types of information for the motion prediction quality.
    Contrast and Clustering: Learning Neighborhood Pair Representation for Source-free Domain Adaptation. (arXiv:2301.13428v1 [cs.CV])
    Domain adaptation has attracted a great deal of attention in the machine learning community, but it requires access to source data, which often raises concerns about data privacy. We are thus motivated to address these issues and propose a simple yet efficient method. This work treats domain adaptation as an unsupervised clustering problem and trains the target model without access to the source data. Specifically, we propose a loss function called contrast and clustering (CaC), where a positive pair term pulls neighbors belonging to the same class together in the feature space to form clusters, while a negative pair term pushes samples of different classes apart. In addition, extended neighbors are taken into account by querying the nearest neighbor indexes in the memory bank to mine for more valuable negative pairs. Extensive experiments on three common benchmarks, VisDA, Office-Home and Office-31, demonstrate that our method achieves state-of-the-art performance. The code will be made publicly available at https://github.com/yukilulu/CaC.
    DiffSTG: Probabilistic Spatio-Temporal Graph Forecasting with Denoising Diffusion Models. (arXiv:2301.13629v1 [cs.LG])
    Spatio-temporal graph neural networks (STGNN) have emerged as the dominant model for spatio-temporal graph (STG) forecasting. Despite their success, they fail to model intrinsic uncertainties within STG data, which cripples their practicality in downstream tasks for decision-making. To this end, this paper focuses on probabilistic STG forecasting, which is challenging due to the difficulty in modeling uncertainties and complex ST dependencies. In this study, we present the first attempt to generalize the popular denoising diffusion probabilistic models to STGs, leading to a novel non-autoregressive framework called DiffSTG, along with the first denoising network UGnet for STG in the framework. Our approach combines the spatio-temporal learning capabilities of STGNNs with the uncertainty measurements of diffusion models. Extensive experiments validate that DiffSTG reduces the Continuous Ranked Probability Score (CRPS) by 4%-14%, and Root Mean Squared Error (RMSE) by 2%-7% over existing methods on three real-world datasets.
    Support Exploration Algorithm for Sparse Support Recovery. (arXiv:2301.13584v1 [cs.LG])
    We introduce a new algorithm promoting sparsity called {\it Support Exploration Algorithm (SEA)} and analyze it in the context of support recovery/model selection problems.The algorithm can be interpreted as an instance of the {\it straight-through estimator (STE)} applied to the resolution of a sparse linear inverse problem. SEA uses a non-sparse exploratory vector and makes it evolve in the input space to select the sparse support. We put to evidence an oracle update rule for the exploratory vector and consider the STE update. The theoretical analysis establishes general sufficient conditions of support recovery. The general conditions are specialized to the case where the matrix $A$ performing the linear measurements satisfies the {\it Restricted Isometry Property (RIP)}.Experiments show that SEA can efficiently improve the results of any algorithm. Because of its exploratory nature, SEA also performs remarkably well when the columns of $A$ are strongly coherent.
    Stabilize Deep ResNet with A Sharp Scaling Factor $\tau$. (arXiv:1903.07120v5 [cs.LG] UPDATED)
    We study the stability and convergence of training deep ResNets with gradient descent. Specifically, we show that the parametric branch in the residual block should be scaled down by a factor $\tau =O(1/\sqrt{L})$ to guarantee stable forward/backward process, where $L$ is the number of residual blocks. Moreover, we establish a converse result that the forward process is unbounded when $\tau>L^{-\frac{1}{2}+c}$, for any positive constant $c$. The above two results together establish a sharp value of the scaling factor in determining the stability of deep ResNet. Based on the stability result, we further show that gradient descent finds the global minima if the ResNet is properly over-parameterized, which significantly improves over the previous work with a much larger range of $\tau$ that admits global convergence. Moreover, we show that the convergence rate is independent of the depth, theoretically justifying the advantage of ResNet over vanilla feedforward network. Empirically, with such a factor $\tau$, one can train deep ResNet without normalization layer. Moreover, for ResNets with normalization layer, adding such a factor $\tau$ also stabilizes the training and obtains significant performance gain for deep ResNet.
    Superhuman Fairness. (arXiv:2301.13420v1 [cs.LG])
    The fairness of machine learning-based decisions has become an increasingly important focus in the design of supervised machine learning methods. Most fairness approaches optimize a specified trade-off between performance measure(s) (e.g., accuracy, log loss, or AUC) and fairness metric(s) (e.g., demographic parity, equalized odds). This begs the question: are the right performance-fairness trade-offs being specified? We instead re-cast fair machine learning as an imitation learning task by introducing superhuman fairness, which seeks to simultaneously outperform human decisions on multiple predictive performance and fairness measures. We demonstrate the benefits of this approach given suboptimal decisions.
    Exploring QSAR Models for Activity-Cliff Prediction. (arXiv:2301.13644v1 [cs.LG])
    Pairs of similar compounds that only differ by a small structural modification but exhibit a large difference in their binding affinity for a given target are known as activity cliffs (ACs). It has been hypothesised that quantitative structure-activity relationship (QSAR) models struggle to predict ACs and that ACs thus form a major source of prediction error. However, a study to explore the AC-prediction power of modern QSAR methods and its relationship to general QSAR-prediction performance is lacking. We systematically construct nine distinct QSAR models by combining three molecular representation methods (extended-connectivity fingerprints, physicochemical-descriptor vectors and graph isomorphism networks) with three regression techniques (random forests, k-nearest neighbours and multilayer perceptrons); we then use each resulting model to classify pairs of similar compounds as ACs or non-ACs and to predict the activities of individual molecules in three case studies: dopamine receptor D2, factor Xa, and SARS-CoV-2 main protease. We observe low AC-sensitivity amongst the tested models when the activities of both compounds are unknown, but a substantial increase in AC-sensitivity when the actual activity of one of the compounds is given. Graph isomorphism features are found to be competitive with or superior to classical molecular representations for AC-classification and can thus be employed as baseline AC-prediction models or simple compound-optimisation tools. For general QSAR-prediction, however, extended-connectivity fingerprints still consistently deliver the best performance. Our results provide strong support for the hypothesis that indeed QSAR methods frequently fail to predict ACs. We propose twin-network training for deep learning models as a potential future pathway to increase AC-sensitivity and thus overall QSAR performance.
    Collision-aware In-hand 6D Object Pose Estimation using Multiple Vision-based Tactile Sensors. (arXiv:2301.13667v1 [cs.RO])
    In this paper, we address the problem of estimating the in-hand 6D pose of an object in contact with multiple vision-based tactile sensors. We reason on the possible spatial configurations of the sensors along the object surface. Specifically, we filter contact hypotheses using geometric reasoning and a Convolutional Neural Network (CNN), trained on simulated object-agnostic images, to promote those that better comply with the actual tactile images from the sensors. We use the selected sensors configurations to optimize over the space of 6D poses using a Gradient Descent-based approach. We finally rank the obtained poses by penalizing those that are in collision with the sensors. We carry out experiments in simulation using the DIGIT vision-based sensor with several objects, from the standard YCB model set. The results demonstrate that our approach estimates object poses that are compatible with actual object-sensor contacts in $87.5\%$ of cases while reaching an average positional error in the order of $2$ centimeters. Our analysis also includes qualitative results of experiments with a real DIGIT sensor.
    The passive symmetries of machine learning. (arXiv:2301.13724v1 [stat.ML])
    Any representation of data involves arbitrary investigator choices. Because those choices are external to the data-generating process, each choice leads to an exact symmetry, corresponding to the group of transformations that takes one possible representation to another. These are the passive symmetries; they include coordinate freedom, gauge symmetry and units covariance, all of which have led to important results in physics. Our goal is to understand the implications of passive symmetries for machine learning: Which passive symmetries play a role (e.g., permutation symmetry in graph neural networks)? What are dos and don'ts in machine learning practice? We assay conditions under which passive symmetries can be implemented as group equivariances. We also discuss links to causal modeling, and argue that the implementation of passive symmetries is particularly valuable when the goal of the learning problem is to generalize out of sample. While this paper is purely conceptual, we believe that it can have a significant impact on helping machine learning make the transition that took place for modern physics in the first half of the Twentieth century.
    Learning, Fast and Slow: A Goal-Directed Memory-Based Approach for Dynamic Environments. (arXiv:2301.13758v1 [cs.AI])
    Model-based next state prediction and state value prediction are slow to converge. To address these challenges, we do the following: i) Instead of a neural network, we do model-based planning using a parallel memory retrieval system (which we term the slow mechanism); ii) Instead of learning state values, we guide the agent's actions using goal-directed exploration, by using a neural network to choose the next action given the current state and the goal state (which we term the fast mechanism). The goal-directed exploration is trained online using hippocampal replay of visited states and future imagined states every single time step, leading to fast and efficient training. Empirical studies show that our proposed method has a 92% solve rate across 100 episodes in a dynamically changing grid world, significantly outperforming state-of-the-art actor critic mechanisms such as PPO (54%), TRPO (50%) and A2C (24%). Ablation studies demonstrate that both mechanisms are crucial. We posit that the future of Reinforcement Learning (RL) will be to model goals and sub-goals for various tasks, and plan it out in a goal-directed memory-based approach.
    DoubleML -- An Object-Oriented Implementation of Double Machine Learning in R. (arXiv:2103.09603v4 [stat.ML] UPDATED)
    The R package DoubleML implements the double/debiased machine learning framework of Chernozhukov et al. (2018). It provides functionalities to estimate parameters in causal models based on machine learning methods. The double machine learning framework consist of three key ingredients: Neyman orthogonality, high-quality machine learning estimation and sample splitting. Estimation of nuisance components can be performed by various state-of-the-art machine learning methods that are available in the mlr3 ecosystem. DoubleML makes it possible to perform inference in a variety of causal models, including partially linear and interactive regression models and their extensions to instrumental variable estimation. The object-oriented implementation of DoubleML enables a high flexibility for the model specification and makes it easily extendable. This paper serves as an introduction to the double machine learning framework and the R package DoubleML. In reproducible code examples with simulated and real data sets, we demonstrate how DoubleML users can perform valid inference based on machine learning methods.
    Review of methods for automatic cerebral microbleeds detection. (arXiv:2301.13549v1 [cs.CV])
    Cerebral microbleeds detection is an important and challenging task. With the gaining popularity of the MRI, the ability to detect cerebral microbleeds also raises. Unfortunately, for radiologists, it is a time-consuming and laborious procedure. For this reason, various solutions to automate this process have been proposed for several years, but none of them is currently used in medical practice. In this context, the need to systematize the existing knowledge and best practices has been recognized as a factor facilitating the imminent synthesis of a real CMBs detection system practically applicable in medicine. To the best of our knowledge, all available publications regarding automatic cerebral microbleeds detection have been gathered, described, and assessed in this paper in order to distinguish the current research state and provide a starting point for future studies.
    A Mathematical Model for Curriculum Learning. (arXiv:2301.13833v1 [cs.LG])
    Curriculum learning (CL) - training using samples that are generated and presented in a meaningful order - was introduced in the machine learning context around a decade ago. While CL has been extensively used and analysed empirically, there has been very little mathematical justification for its advantages. We introduce a CL model for learning the class of k-parities on d bits of a binary string with a neural network trained by stochastic gradient descent (SGD). We show that a wise choice of training examples, involving two or more product distributions, allows to reduce significantly the computational cost of learning this class of functions, compared to learning under the uniform distribution. We conduct experiments to support our analysis. Furthermore, we show that for another class of functions - namely the `Hamming mixtures' - CL strategies involving a bounded number of product distributions are not beneficial, while we conjecture that CL with unbounded many curriculum steps can learn this class efficiently.
    What I Cannot Predict, I Do Not Understand: A Human-Centered Evaluation Framework for Explainability Methods. (arXiv:2112.04417v3 [cs.CV] UPDATED)
    A multitude of explainability methods and associated fidelity performance metrics have been proposed to help better understand how modern AI systems make decisions. However, much of the current work has remained theoretical -- without much consideration for the human end-user. In particular, it is not yet known (1) how useful current explainability methods are in practice for more real-world scenarios and (2) how well associated performance metrics accurately predict how much knowledge individual explanations contribute to a human end-user trying to understand the inner-workings of the system. To fill this gap, we conducted psychophysics experiments at scale to evaluate the ability of human participants to leverage representative attribution methods for understanding the behavior of different image classifiers representing three real-world scenarios: identifying bias in an AI system, characterizing the visual strategy it uses for tasks that are too difficult for an untrained non-expert human observer as well as understanding its failure cases. Our results demonstrate that the degree to which individual attribution methods help human participants better understand an AI system varied widely across these scenarios. This suggests a critical need for the field to move past quantitative improvements of current attribution methods towards the development of complementary approaches that provide qualitatively different sources of information to human end-users.
    Towards a Defense Against Federated Backdoor Attacks Under Continuous Training. (arXiv:2205.11736v4 [cs.LG] UPDATED)
    Backdoor attacks are dangerous and difficult to prevent in federated learning (FL), where training data is sourced from untrusted clients over long periods of time. These difficulties arise because: (a) defenders in FL do not have access to raw training data, and (b) a new phenomenon we identify called backdoor leakage causes models trained continuously to eventually suffer from backdoors due to cumulative errors in defense mechanisms. We propose shadow learning, a framework for defending against backdoor attacks in the FL setting under long-range training. Shadow learning trains two models in parallel: a backbone model and a shadow model. The backbone is trained without any defense mechanism to obtain good performance on the main task. The shadow model combines filtering of malicious clients with early-stopping to control the attack success rate even as the data distribution changes. We theoretically motivate our design and show experimentally that our framework significantly improves upon existing defenses against backdoor attacks.
    Company-as-Tribe: Company Financial Risk Assessment on Tribe-Style Graph with Hierarchical Graph Neural Networks. (arXiv:2301.13492v1 [cs.LG])
    Company financial risk is ubiquitous and early risk assessment for listed companies can avoid considerable losses. Traditional methods mainly focus on the financial statements of companies and lack the complex relationships among them. However, the financial statements are often biased and lagged, making it difficult to identify risks accurately and timely. To address the challenges, we redefine the problem as \textbf{company financial risk assessment on tribe-style graph} by taking each listed company and its shareholders as a tribe and leveraging financial news to build inter-tribe connections. Such tribe-style graphs present different patterns to distinguish risky companies from normal ones. However, most nodes in the tribe-style graph lack attributes, making it difficult to directly adopt existing graph learning methods (e.g., Graph Neural Networks(GNNs)). In this paper, we propose a novel Hierarchical Graph Neural Network (TH-GNN) for Tribe-style graphs via two levels, with the first level to encode the structure pattern of the tribes with contrastive learning, and the second level to diffuse information based on the inter-tribe relations, achieving effective and efficient risk assessment. Extensive experiments on the real-world company dataset show that our method achieves significant improvements on financial risk assessment over previous competing methods. Also, the extensive ablation studies and visualization comprehensively show the effectiveness of our method.
    Image Shortcut Squeezing: Countering Perturbative Availability Poisons with Compression. (arXiv:2301.13838v1 [cs.CR])
    Perturbative availability poisoning (PAP) adds small changes to images to prevent their use for model training. Current research adopts the belief that practical and effective approaches to countering such poisons do not exist. In this paper, we argue that it is time to abandon this belief. We present extensive experiments showing that 12 state-of-the-art PAP methods are vulnerable to Image Shortcut Squeezing (ISS), which is based on simple compression. For example, on average, ISS restores the CIFAR-10 model accuracy to $81.73\%$, surpassing the previous best preprocessing-based countermeasures by $37.97\%$ absolute. ISS also (slightly) outperforms adversarial training and has higher generalizability to unseen perturbation norms and also higher efficiency. Our investigation reveals that the property of PAP perturbations depends on the type of surrogate model used for poison generation, and it explains why a specific ISS compression yields the best performance for a specific type of PAP perturbation. We further test stronger, adaptive poisoning, and show it falls short of being an ideal defense against ISS. Overall, our results demonstrate the importance of considering various (simple) countermeasures to ensure the meaningfulness of analysis carried out during the development of availability poisons.
    LogAI: A Library for Log Analytics and Intelligence. (arXiv:2301.13415v1 [cs.AI])
    Software and System logs record runtime information about processes executing within a system. These logs have become the most critical and ubiquitous forms of observability data that help developers understand system behavior, monitor system health and resolve issues. However, the volume of logs generated can be humongous (of the order of petabytes per day) especially for complex distributed systems, such as cloud, search engine, social media, etc. This has propelled a lot of research on developing AI-based log based analytics and intelligence solutions that can process huge volume of raw logs and generate insights. In order to enable users to perform multiple types of AI-based log analysis tasks in a uniform manner, we introduce LogAI (https://github.com/salesforce/logai), a one-stop open source library for log analytics and intelligence. LogAI supports tasks such as log summarization, log clustering and log anomaly detection. It adopts the OpenTelemetry data model, to enable compatibility with different log management platforms. LogAI provides a unified model interface and provides popular time-series, statistical learning and deep learning models. Alongside this, LogAI also provides an out-of-the-box GUI for users to conduct interactive analysis. With LogAI, we can also easily benchmark popular deep learning algorithms for log anomaly detection without putting in redundant effort to process the logs. We have opensourced LogAI to cater to a wide range of applications benefiting both academic research and industrial prototyping.
    Fourier Sensitivity and Regularization of Computer Vision Models. (arXiv:2301.13514v1 [cs.CV])
    Recent work has empirically shown that deep neural networks latch on to the Fourier statistics of training data and show increased sensitivity to Fourier-basis directions in the input. Understanding and modifying this Fourier-sensitivity of computer vision models may help improve their robustness. Hence, in this paper we study the frequency sensitivity characteristics of deep neural networks using a principled approach. We first propose a basis trick, proving that unitary transformations of the input-gradient of a function can be used to compute its gradient in the basis induced by the transformation. Using this result, we propose a general measure of any differentiable model's Fourier-sensitivity using the unitary Fourier-transform of its input-gradient. When applied to deep neural networks, we find that computer vision models are consistently sensitive to particular frequencies dependent on the dataset, training method and architecture. Based on this measure, we further propose a Fourier-regularization framework to modify the Fourier-sensitivities and frequency bias of models. Using our proposed regularizer-family, we demonstrate that deep neural networks obtain improved classification accuracy on robustness evaluations.
    Multicalibration as Boosting for Regression. (arXiv:2301.13767v1 [cs.LG])
    We study the connection between multicalibration and boosting for squared error regression. First we prove a useful characterization of multicalibration in terms of a ``swap regret'' like condition on squared error. Using this characterization, we give an exceedingly simple algorithm that can be analyzed both as a boosting algorithm for regression and as a multicalibration algorithm for a class H that makes use only of a standard squared error regression oracle for H. We give a weak learning assumption on H that ensures convergence to Bayes optimality without the need to make any realizability assumptions -- giving us an agnostic boosting algorithm for regression. We then show that our weak learning assumption on H is both necessary and sufficient for multicalibration with respect to H to imply Bayes optimality. We also show that if H satisfies our weak learning condition relative to another class C then multicalibration with respect to H implies multicalibration with respect to C. Finally we investigate the empirical performance of our algorithm experimentally using an open source implementation that we make available. Our code repository can be found at https://github.com/Declancharrison/Level-Set-Boosting.
    CMLCompiler: A Unified Compiler for Classical Machine Learning. (arXiv:2301.13441v1 [cs.LG])
    Classical machine learning (CML) occupies nearly half of machine learning pipelines in production applications. Unfortunately, it fails to utilize the state-of-the-practice devices fully and performs poorly. Without a unified framework, the hybrid deployments of deep learning (DL) and CML also suffer from severe performance and portability issues. This paper presents the design of a unified compiler, called CMLCompiler, for CML inference. We propose two unified abstractions: operator representations and extended computational graphs. The CMLCompiler framework performs the conversion and graph optimization based on two unified abstractions, then outputs an optimized computational graph to DL compilers or frameworks. We implement CMLCompiler on TVM. The evaluation shows CMLCompiler's portability and superior performance. It achieves up to 4.38x speedup on CPU, 3.31x speedup on GPU, and 5.09x speedup on IoT devices, compared to the state-of-the-art solutions -- scikit-learn, intel sklearn, and hummingbird. Our performance of CML and DL mixed pipelines achieves up to 3.04x speedup compared with cross-framework implementations.
    Real-Time Outlier Detection with Dynamic Process Limits. (arXiv:2301.13527v1 [cs.LG])
    Anomaly detection methods are part of the systems where rare events may endanger an operation's profitability, safety, and environmental aspects. Although many state-of-the-art anomaly detection methods were developed to date, their deployment is limited to the operation conditions present during the model training. Online anomaly detection brings the capability to adapt to data drifts and change points that may not be represented during model development resulting in prolonged service life. This paper proposes an online anomaly detection algorithm for existing real-time infrastructures where low-latency detection is required and novel patterns in data occur unpredictably. The online inverse cumulative distribution-based approach is introduced to eliminate common problems of offline anomaly detectors, meanwhile providing dynamic process limits to normal operation. The benefit of the proposed method is the ease of use, fast computation, and deployability as shown in two case studies of real microgrid operation data.
    Retiring $\Delta$DP: New Distribution-Level Metrics for Demographic Parity. (arXiv:2301.13443v1 [cs.LG])
    Demographic parity is the most widely recognized measure of group fairness in machine learning, which ensures equal treatment of different demographic groups. Numerous works aim to achieve demographic parity by pursuing the commonly used metric $\Delta DP$. Unfortunately, in this paper, we reveal that the fairness metric $\Delta DP$ can not precisely measure the violation of demographic parity, because it inherently has the following drawbacks: \textit{i)} zero-value $\Delta DP$ does not guarantee zero violation of demographic parity, \textit{ii)} $\Delta DP$ values can vary with different classification thresholds. To this end, we propose two new fairness metrics, \textsf{A}rea \textsf{B}etween \textsf{P}robability density function \textsf{C}urves (\textsf{ABPC}) and \textsf{A}rea \textsf{B}etween \textsf{C}umulative density function \textsf{C}urves (\textsf{ABCC}), to precisely measure the violation of demographic parity in distribution level. The new fairness metrics directly measure the difference between the distributions of the prediction probability for different demographic groups. Thus our proposed new metrics enjoy: \textit{i)} zero-value \textsf{ABCC}/\textsf{ABPC} guarantees zero violation of demographic parity; \textit{ii)} \textsf{ABCC}/\textsf{ABPC} guarantees demographic parity while the classification threshold adjusted. We further re-evaluate the existing fair models with our proposed fairness metrics and observe different fairness behaviors of those models under the new metrics.
    Quantum contextual bandits and recommender systems for quantum data. (arXiv:2301.13524v1 [quant-ph])
    We study a recommender system for quantum data using the linear contextual bandit framework. In each round, a learner receives an observable (the context) and has to recommend from a finite set of unknown quantum states (the actions) which one to measure. The learner has the goal of maximizing the reward in each round, that is the outcome of the measurement on the unknown state. Using this model we formulate the low energy quantum state recommendation problem where the context is a Hamiltonian and the goal is to recommend the state with the lowest energy. For this task, we study two families of contexts: the Ising model and a generalized cluster model. We observe that if we interpret the actions as different phases of the models then the recommendation is done by classifying the correct phase of the given Hamiltonian and the strategy can be interpreted as an online quantum phase classifier.
    Learning Data Representations with Joint Diffusion Models. (arXiv:2301.13622v1 [cs.LG])
    We introduce a joint diffusion model that simultaneously learns meaningful internal representations fit for both generative and predictive tasks. Joint machine learning models that allow synthesizing and classifying data often offer uneven performance between those tasks or are unstable to train. In this work, we depart from a set of empirical observations that indicate the usefulness of internal representations built by contemporary deep diffusion-based generative models in both generative and predictive settings. We then introduce an extension of the vanilla diffusion model with a classifier that allows for stable joint training with shared parametrization between those objectives. The resulting joint diffusion model offers superior performance across various tasks, including generative modeling, semi-supervised classification, and domain adaptation.
    Domain-Generalizable Multiple-Domain Clustering. (arXiv:2301.13530v1 [cs.LG])
    Accurately clustering high-dimensional measurements is vital for adequately analyzing scientific data. Deep learning machinery has remarkably improved clustering capabilities in recent years due to its ability to extract meaningful representations. In this work, we are given unlabeled samples from multiple source domains, and we aim to learn a shared classifier that assigns the examples to various clusters. Evaluation is done by using the classifier for predicting cluster assignments in a previously unseen domain. This setting generalizes the problem of unsupervised domain generalization to the case in which no supervised learning samples are given (completely unsupervised). Towards this goal, we present an end-to-end model and evaluate its capabilities on several multi-domain image datasets. Specifically, we demonstrate that our model is more accurate than schemes that require fine-tuning using samples from the target domain or some level of supervision.
    BALANCE: Bayesian Linear Attribution for Root Cause Localization. (arXiv:2301.13572v1 [cs.LG])
    Root Cause Analysis (RCA) plays an indispensable role in distributed data system maintenance and operations, as it bridges the gap between fault detection and system recovery. Existing works mainly study multidimensional localization or graph-based root cause localization. This paper opens up the possibilities of exploiting the recently developed framework of explainable AI (XAI) for the purpose of RCA. In particular, we propose BALANCE (BAyesian Linear AttributioN for root CausE localization), which formulates the problem of RCA through the lens of attribution in XAI and seeks to explain the anomalies in the target KPIs by the behavior of the candidate root causes. BALANCE consists of three innovative components. First, we propose a Bayesian multicollinear feature selection (BMFS) model to predict the target KPIs given the candidate root causes in a forward manner while promoting sparsity and concurrently paying attention to the correlation between the candidate root causes. Second, we introduce attribution analysis to compute the attribution score for each candidate in a backward manner. Third, we merge the estimated root causes related to each KPI if there are multiple KPIs. We extensively evaluate the proposed BALANCE method on one synthesis dataset as well as three real-world RCA tasks, that is, bad SQL localization, container fault localization, and fault type diagnosis for Exathlon. Results show that BALANCE outperforms the state-of-the-art (SOTA) methods in terms of accuracy with the least amount of running time, and achieves at least $6\%$ notably higher accuracy than SOTA methods for real tasks. BALANCE has been deployed to production to tackle real-world RCA problems, and the online results further advocate its usage for real-time diagnosis in distributed data systems.
    Affinity Uncertainty-based Hard Negative Mining in Graph Contrastive Learning. (arXiv:2301.13340v1 [cs.LG])
    Hard negative mining has shown effective in enhancing self-supervised contrastive learning (CL) on diverse data types, including graph contrastive learning (GCL). Existing hardness-aware CL methods typically treat negative instances that are most similar to the anchor instance as hard negatives, which helps improve the CL performance, especially on image data. However, this approach often fails to identify the hard negatives but leads to many false negatives on graph data. This is mainly due to that the learned graph representations are not sufficiently discriminative due to over-smooth representations and/or non-i.i.d. issues in graph data. To tackle this problem, this paper proposes a novel approach that builds a discriminative model on collective affinity information (i.e, two sets of pairwise affinities between the negative instances and the anchor instance) to mine hard negatives in GCL. In particular, the proposed approach evaluates how confident/uncertain the discriminative model is about the affinity of each negative instance to an anchor instance to determine its hardness weight relative to the anchor instance. This uncertainty information is then incorporated into existing GCL loss functions via a weighting term to enhance their performance. The enhanced GCL is theoretically grounded that the resulting GCL loss is equivalent to a triplet loss with an adaptive margin being exponentially proportional to the learned uncertainty of each negative instance. Extensive experiments on 10 graph datasets show that our approach i) consistently enhances different state-of-the-art GCL methods in both graph and node classification tasks, and ii) significantly improves their robustness against adversarial attacks.
    An investigation of challenges encountered when specifying training data and runtime monitors for safety critical ML applications. (arXiv:2301.13476v1 [cs.SE])
    Context and motivation: The development and operation of critical software that contains machine learning (ML) models requires diligence and established processes. Especially the training data used during the development of ML models have major influences on the later behaviour of the system. Runtime monitors are used to provide guarantees for that behaviour. Question / problem: We see major uncertainty in how to specify training data and runtime monitoring for critical ML models and by this specifying the final functionality of the system. In this interview-based study we investigate the underlying challenges for these difficulties. Principal ideas/results: Based on ten interviews with practitioners who develop ML models for critical applications in the automotive and telecommunication sector, we identified 17 underlying challenges in 6 challenge groups that relate to the challenge of specifying training data and runtime monitoring. Contribution: The article provides a list of the identified underlying challenges related to the difficulties practitioners experience when specifying training data and runtime monitoring for ML models. Furthermore, interconnection between the challenges were found and based on these connections recommendation proposed to overcome the root causes for the challenges.
    Scheduling Inference Workloads on Distributed Edge Clusters with Reinforcement Learning. (arXiv:2301.13618v1 [cs.LG])
    Many real-time applications (e.g., Augmented/Virtual Reality, cognitive assistance) rely on Deep Neural Networks (DNNs) to process inference tasks. Edge computing is considered a key infrastructure to deploy such applications, as moving computation close to the data sources enables us to meet stringent latency and throughput requirements. However, the constrained nature of edge networks poses several additional challenges to the management of inference workloads: edge clusters can not provide unlimited processing power to DNN models, and often a trade-off between network and processing time should be considered when it comes to end-to-end delay requirements. In this paper, we focus on the problem of scheduling inference queries on DNN models in edge networks at short timescales (i.e., few milliseconds). By means of simulations, we analyze several policies in the realistic network settings and workloads of a large ISP, highlighting the need for a dynamic scheduling policy that can adapt to network conditions and workloads. We therefore design ASET, a Reinforcement Learning based scheduling algorithm able to adapt its decisions according to the system conditions. Our results show that ASET effectively provides the best performance compared to static policies when scheduling over a distributed pool of edge resources.
    Faster Predict-and-Optimize with Three-Operator Splitting. (arXiv:2301.13395v1 [cs.LG])
    In many practical settings, a combinatorial problem must be repeatedly solved with similar, but distinct parameters w. Yet, w is not directly observed; only contextual data d that correlates with w is available. It is tempting to use a neural network to predict w given d, but training such a model requires reconciling the discrete nature of combinatorial optimization with the gradient-based frameworks used to train neural networks. One approach to overcoming this issue is to consider a continuous relaxation of the combinatorial problem. While existing such approaches have shown to be highly effective on small problems (10-100 variables) they do not scale well to large problems. In this work, we show how recent results in operator splitting can be used to design such a system which is easy to train and scales effortlessly to problems with thousands of variables.
    Robust Linear Regression: Gradient-descent, Early-stopping, and Beyond. (arXiv:2301.13486v1 [stat.ML])
    In this work we study the robustness to adversarial attacks, of early-stopping strategies on gradient-descent (GD) methods for linear regression. More precisely, we show that early-stopped GD is optimally robust (up to an absolute constant) against Euclidean-norm adversarial attacks. However, we show that this strategy can be arbitrarily sub-optimal in the case of general Mahalanobis attacks. This observation is compatible with recent findings in the case of classification~\cite{Vardi2022GradientMP} that show that GD provably converges to non-robust models. To alleviate this issue, we propose to apply instead a GD scheme on a transformation of the data adapted to the attack. This data transformation amounts to apply feature-depending learning rates and we show that this modified GD is able to handle any Mahalanobis attack, as well as more general attacks under some conditions. Unfortunately, choosing such adapted transformations can be hard for general attacks. To the rescue, we design a simple and tractable estimator whose adversarial risk is optimal up to within a multiplicative constant of 1.1124 in the population regime, and works for any norm.
    Fine Robotic Manipulation without Force/Torque Sensor. (arXiv:2301.13413v1 [cs.RO])
    Force Sensing and Force Control are essential to many industrial applications. Typically, a 6-axis Force/Torque (F/T) sensor is mounted between the robot's wrist and the end-effector in order to measure the forces and torques exerted by the environment onto the robot (the external wrench). Although a typical 6-axis F/T sensor can provide highly accurate measurements, it is expensive and vulnerable to drift and external impacts. Existing methods aiming at estimating the external wrench using only the robot's internal signals are limited in scope: for example, wrench estimation accuracy was mostly validated in free-space motions and simple contacts as opposed to tasks like assembly that require high-precision force control. Here we present a Neural Network based method and argue that by devoting particular attention to the training data structure, it is possible to accurately estimate the external wrench in a wide range of scenarios based solely on internal signals. As an illustration, we demonstrate a pin insertion experiment with 100-micron clearance and a hand-guiding experiment, both performed without external F/T sensors or joint torque sensors. Our result opens the possibility of equipping the existing 2.7 million industrial robots with Force Sensing and Force Control capabilities without any additional hardware.
    Training with Mixed-Precision Floating-Point Assignments. (arXiv:2301.13464v1 [cs.LG])
    When training deep neural networks, keeping all tensors in high precision (e.g., 32-bit or even 16-bit floats) is often wasteful. However, keeping all tensors in low precision (e.g., 8-bit floats) can lead to unacceptable accuracy loss. Hence, it is important to use a precision assignment -- a mapping from all tensors (arising in training) to precision levels (high or low) -- that keeps most of the tensors in low precision and leads to sufficiently accurate models. We provide a technique that explores this memory-accuracy tradeoff by generating precision assignments that (i) use less memory and (ii) lead to more accurate models at the same time, compared to the precision assignments considered by prior work in low-precision floating-point training. Our method typically provides > 2x memory reduction over a baseline precision assignment while preserving training accuracy, and gives further reductions by trading off accuracy. Compared to other baselines which sometimes cause training to diverge, our method provides similar or better memory reduction while avoiding divergence.
    Incorporating Recurrent Reinforcement Learning into Model Predictive Control for Adaptive Control in Autonomous Driving. (arXiv:2301.13313v1 [cs.LG])
    Model Predictive Control (MPC) is attracting tremendous attention in the autonomous driving task as a powerful control technique. The success of an MPC controller strongly depends on an accurate internal dynamics model. However, the static parameters, usually learned by system identification, often fail to adapt to both internal and external perturbations in real-world scenarios. In this paper, we firstly (1) reformulate the problem as a Partially Observed Markov Decision Process (POMDP) that absorbs the uncertainties into observations and maintains Markov property into hidden states; and (2) learn a recurrent policy continually adapting the parameters of the dynamics model via Recurrent Reinforcement Learning (RRL) for optimal and adaptive control; and (3) finally evaluate the proposed algorithm (referred as $\textit{MPC-RRL}$) in CARLA simulator and leading to robust behaviours under a wide range of perturbations.
    Population-wise Labeling of Sulcal Graphs using Multi-graph Matching. (arXiv:2301.13532v1 [stat.ML])
    Population-wise matching of the cortical fold is necessary to identify biomarkers of neurological or psychiatric disorders. The difficulty comes from the massive interindividual variations in the morphology and spatial organization of the folds. This task is challenging at both methodological and conceptual levels. In the widely used registration-based techniques, these variations are considered as noise and the matching of folds is only implicit. Alternative approaches are based on the extraction and explicit identification of the cortical folds. In particular, representing cortical folding patterns as graphs of sulcal basins-termed sulcal graphs-enables to formalize the task as a graph-matching problem. In this paper, we propose to address the problem of sulcal graph matching directly at the population level using multi-graph matching techniques. First, we motivate the relevance of multi-graph matching framework in this context. We then introduce a procedure to generate populations of artificial sulcal graphs, which allows us benchmarking several state of the art multi-graph matching methods. Our results on both artificial and real data demonstrate the effectiveness of multi-graph matching techniques to obtain a population-wise consistent labeling of cortical folds at the sulcal basins level.
    Causality-based CTR Prediction using Graph Neural Networks. (arXiv:2301.12762v1 [cs.IR] CROSS LISTED)
    As a prevalent problem in online advertising, CTR prediction has attracted plentiful attention from both academia and industry. Recent studies have been reported to establish CTR prediction models in the graph neural networks (GNNs) framework. However, most of GNNs-based models handle feature interactions in a complete graph, while ignoring causal relationships among features, which results in a huge drop in the performance on out-of-distribution data. This paper is dedicated to developing a causality-based CTR prediction model in the GNNs framework (Causal-GNN) integrating representations of feature graph, user graph and ad graph in the context of online advertising. In our model, a structured representation learning method (GraphFwFM) is designed to capture high-order representations on feature graph based on causal discovery among field features in gated graph neural networks (GGNNs), and GraphSAGE is employed to obtain graph representations of users and ads. Experiments conducted on three public datasets demonstrate the superiority of Causal-GNN in AUC and Logloss and the effectiveness of GraphFwFM in capturing high-order representations on causal feature graph.
    Proxy-based Zero-Shot Entity Linking by Effective Candidate Retrieval. (arXiv:2301.13318v1 [cs.LG])
    A recent advancement in the domain of biomedical Entity Linking is the development of powerful two-stage algorithms, an initial candidate retrieval stage that generates a shortlist of entities for each mention, followed by a candidate ranking stage. However, the effectiveness of both stages are inextricably dependent on computationally expensive components. Specifically, in candidate retrieval via dense representation retrieval it is important to have hard negative samples, which require repeated forward passes and nearest neighbour searches across the entire entity label set throughout training. In this work, we show that pairing a proxy-based metric learning loss with an adversarial regularizer provides an efficient alternative to hard negative sampling in the candidate retrieval stage. In particular, we show competitive performance on the recall@1 metric, thereby providing the option to leave out the expensive candidate ranking step. Finally, we demonstrate how the model can be used in a zero-shot setting to discover out of knowledge base biomedical entities.
    A Framework for Adapting Offline Algorithms to Solve Combinatorial Multi-Armed Bandit Problems with Bandit Feedback. (arXiv:2301.13326v1 [cs.LG])
    We investigate the problem of stochastic, combinatorial multi-armed bandits where the learner only has access to bandit feedback and the reward function can be non-linear. We provide a general framework for adapting discrete offline approximation algorithms into sublinear $\alpha$-regret methods that only require bandit feedback, achieving $\mathcal{O}\left(T^\frac{2}{3}\log(T)^\frac{1}{3}\right)$ expected cumulative $\alpha$-regret dependence on the horizon $T$. The framework only requires the offline algorithms to be robust to small errors in function evaluation. The adaptation procedure does not even require explicit knowledge of the offline approximation algorithm -- the offline algorithm can be used as black box subroutine. To demonstrate the utility of the proposed framework, the proposed framework is applied to multiple problems in submodular maximization, adapting approximation algorithms for cardinality and for knapsack constraints. The new CMAB algorithms for knapsack constraints outperform a full-bandit method developed for the adversarial setting in experiments with real-world data.
    On the Correctness of Automatic Differentiation for Neural Networks with Machine-Representable Parameters. (arXiv:2301.13370v1 [cs.LG])
    Recent work has shown that automatic differentiation over the reals is almost always correct in a mathematically precise sense. However, actual programs work with machine-representable numbers (e.g., floating-point numbers), not reals. In this paper, we study the correctness of automatic differentiation when the parameter space of a neural network consists solely of machine-representable numbers. For a neural network with bias parameters, we prove that automatic differentiation is correct at all parameters where the network is differentiable. In contrast, it is incorrect at all parameters where the network is non-differentiable, since it never informs non-differentiability. To better understand this non-differentiable set of parameters, we prove a tight bound on its size, which is linear in the number of non-differentiabilities in activation functions, and provide a simple necessary and sufficient condition for a parameter to be in this set. We further prove that automatic differentiation always computes a Clarke subderivative, even on the non-differentiable set. We also extend these results to neural networks possibly without bias parameters.
    Understanding Self-Distillation in the Presence of Label Noise. (arXiv:2301.13304v1 [cs.LG])
    Self-distillation (SD) is the process of first training a \enquote{teacher} model and then using its predictions to train a \enquote{student} model with the \textit{same} architecture. Specifically, the student's objective function is $\big(\xi*\ell(\text{teacher's predictions}, \text{ student's predictions}) + (1-\xi)*\ell(\text{given labels}, \text{ student's predictions})\big)$, where $\ell$ is some loss function and $\xi$ is some parameter $\in [0,1]$. Empirically, SD has been observed to provide performance gains in several settings. In this paper, we theoretically characterize the effect of SD in two supervised learning problems with \textit{noisy labels}. We first analyze SD for regularized linear regression and show that in the high label noise regime, the optimal value of $\xi$ that minimizes the expected error in estimating the ground truth parameter is surprisingly greater than 1. Empirically, we show that $\xi > 1$ works better than $\xi \leq 1$ even with the cross-entropy loss for several classification datasets when 50\% or 30\% of the labels are corrupted. Further, we quantify when optimal SD is better than optimal regularization. Next, we analyze SD in the case of logistic regression for binary classification with random label corruption and quantify the range of label corruption in which the student outperforms the teacher in terms of accuracy. To our knowledge, this is the first result of its kind for the cross-entropy loss.
    Continuous Spatiotemporal Transformers. (arXiv:2301.13338v1 [cs.LG])
    Modeling spatiotemporal dynamical systems is a fundamental challenge in machine learning. Transformer models have been very successful in NLP and computer vision where they provide interpretable representations of data. However, a limitation of transformers in modeling continuous dynamical systems is that they are fundamentally discrete time and space models and thus have no guarantees regarding continuous sampling. To address this challenge, we present the Continuous Spatiotemporal Transformer (CST), a new transformer architecture that is designed for the modeling of continuous systems. This new framework guarantees a continuous and smooth output via optimization in Sobolev space. We benchmark CST against traditional transformers as well as other spatiotemporal dynamics modeling methods and achieve superior performance in a number of tasks on synthetic and real systems, including learning brain dynamics from calcium imaging data.
    Deep Learning for Reference-Free Geolocation for Poplar Trees. (arXiv:2301.13387v1 [q-bio.GN])
    A core task in precision agriculture is the identification of climatic and ecological conditions that are advantageous for a given crop. The most succinct approach is geolocation, which is concerned with locating the native region of a given sample based on its genetic makeup. Here, we investigate genomic geolocation of Populus trichocarpa, or poplar, which has been identified by the US Department of Energy as a fast-rotation biofuel crop to be harvested nationwide. In particular, we approach geolocation from a reference-free perspective, circumventing the need for compute-intensive processes such as variant calling and alignment. Our model, MashNet, predicts latitude and longitude for poplar trees from randomly-sampled, unaligned sequence fragments. We show that our model performs comparably to Locator, a state-of-the-art method based on aligned whole-genome sequence data. MashNet achieves an error of 34.0 km^2 compared to Locator's 22.1 km^2. MashNet allows growers to quickly and efficiently identify natural varieties that will be most productive in their growth environment based on genotype. This paper explores geolocation for precision agriculture while providing a framework and data source for further development by the machine learning community.
    Combinatorial Causal Bandits without Graph Skeleton. (arXiv:2301.13392v1 [cs.LG])
    In combinatorial causal bandits (CCB), the learning agent chooses a subset of variables in each round to intervene and collects feedback from the observed variables to minimize expected regret or sample complexity. Previous works study this problem in both general causal models and binary generalized linear models (BGLMs). However, all of them require prior knowledge of causal graph structure. This paper studies the CCB problem without the graph structure on binary general causal models and BGLMs. We first provide an exponential lower bound of cumulative regrets for the CCB problem on general causal models. To overcome the exponentially large space of parameters, we then consider the CCB problem on BGLMs. We design a regret minimization algorithm for BGLMs even without the graph skeleton and show that it still achieves $O(\sqrt{T}\ln T)$ expected regret. This asymptotic regret is the same as the state-of-art algorithms relying on the graph structure. Moreover, we sacrifice the regret to $O(T^{\frac{2}{3}}\ln T)$ to remove the weight gap covered by the asymptotic notation. At last, we give some discussions and algorithms for pure exploration of the CCB problem without the graph structure.
    Variational sparse inverse Cholesky approximation for latent Gaussian processes via double Kullback-Leibler minimization. (arXiv:2301.13303v1 [stat.ML])
    To achieve scalable and accurate inference for latent Gaussian processes, we propose a variational approximation based on a family of Gaussian distributions whose covariance matrices have sparse inverse Cholesky (SIC) factors. We combine this variational approximation of the posterior with a similar and efficient SIC-restricted Kullback-Leibler-optimal approximation of the prior. We then focus on a particular SIC ordering and nearest-neighbor-based sparsity pattern resulting in highly accurate prior and posterior approximations. For this setting, our variational approximation can be computed via stochastic gradient descent in polylogarithmic time per iteration. We provide numerical comparisons showing that the proposed double-Kullback-Leibler-optimal Gaussian-process approximation (DKLGP) can sometimes be vastly more accurate than alternative approaches such as inducing-point and mean-field approximations at similar computational complexity.
    Scaling laws for single-agent reinforcement learning. (arXiv:2301.13442v1 [cs.LG])
    Recent work has shown that, in generative modeling, cross-entropy loss improves smoothly with model size and training compute, following a power law plus constant scaling law. One challenge in extending these results to reinforcement learning is that the main performance objective of interest, mean episode return, need not vary smoothly. To overcome this, we introduce *intrinsic performance*, a monotonic function of the return defined as the minimum compute required to achieve the given return across a family of models of different sizes. We find that, across a range of environments, intrinsic performance scales as a power law in model size and environment interactions. Consequently, as in generative modeling, the optimal model size scales as a power law in the training compute budget. Furthermore, we study how this relationship varies with the environment and with other properties of the training setup. In particular, using a toy MNIST-based environment, we show that varying the "horizon length" of the task mostly changes the coefficient but not the exponent of this relationship.
    GDOD: Effective Gradient Descent using Orthogonal Decomposition for Multi-Task Learning. (arXiv:2301.13465v1 [cs.LG])
    Multi-task learning (MTL) aims at solving multiple related tasks simultaneously and has experienced rapid growth in recent years. However, MTL models often suffer from performance degeneration with negative transfer due to learning several tasks simultaneously. Some related work attributed the source of the problem is the conflicting gradients. In this case, it is needed to select useful gradient updates for all tasks carefully. To this end, we propose a novel optimization approach for MTL, named GDOD, which manipulates gradients of each task using an orthogonal basis decomposed from the span of all task gradients. GDOD decomposes gradients into task-shared and task-conflict components explicitly and adopts a general update rule for avoiding interference across all task gradients. This allows guiding the update directions depending on the task-shared components. Moreover, we prove the convergence of GDOD theoretically under both convex and non-convex assumptions. Experiment results on several multi-task datasets not only demonstrate the significant improvement of GDOD performed to existing MTL models but also prove that our algorithm outperforms state-of-the-art optimization methods in terms of AUC and Logloss metrics.
    Demystifying Disagreement-on-the-Line in High Dimensions. (arXiv:2301.13371v1 [stat.ML])
    Evaluating the performance of machine learning models under distribution shift is challenging, especially when we only have unlabeled data from the shifted (target) domain, along with labeled data from the original (source) domain. Recent work suggests that the notion of disagreement, the degree to which two models trained with different randomness differ on the same input, is a key to tackle this problem. Experimentally, disagreement and prediction error have been shown to be strongly connected, which has been used to estimate model performance. Experiments have lead to the discovery of the disagreement-on-the-line phenomenon, whereby the classification error under the target domain is often a linear function of the classification error under the source domain; and whenever this property holds, disagreement under the source and target domain follow the same linear relation. In this work, we develop a theoretical foundation for analyzing disagreement in high-dimensional random features regression; and study under what conditions the disagreement-on-the-line phenomenon occurs in our setting. Experiments on CIFAR-10-C, Tiny ImageNet-C, and Camelyon17 are consistent with our theory and support the universality of the theoretical findings.
    Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance. (arXiv:2301.13376v1 [cs.LG])
    We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference. We leverage weight normalization as a means of constraining parameters during training using accumulator bit width bounds that we derive. We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline. We then show that this reduction translates to increased design efficiency for custom FPGA-based accelerators. Finally, we show that our algorithm not only constrains weights to fit into an accumulator of user-defined bit width, but also increases the sparsity and compressibility of the resulting weights. Across all of our benchmark models trained with 8-bit weights and activations, we observe that constraining the hidden layers of quantized neural networks to fit into 16-bit accumulators yields an average 98.2% sparsity with an estimated compression rate of 46.5x all while maintaining 99.2% of the floating-point performance.
    CRISP: Curriculum based Sequential Neural Decoders for Polar Code Family. (arXiv:2210.00313v2 [cs.IT] UPDATED)
    Polar codes are widely used state-of-the-art codes for reliable communication that have recently been included in the 5th generation wireless standards (5G). However, there remains room for the design of polar decoders that are both efficient and reliable in the short blocklength regime. Motivated by recent successes of data-driven channel decoders, we introduce a novel $\textbf{C}$ur$\textbf{RI}$culum based $\textbf{S}$equential neural decoder for $\textbf{P}$olar codes (CRISP). We design a principled curriculum, guided by information-theoretic insights, to train CRISP and show that it outperforms the successive-cancellation (SC) decoder and attains near-optimal reliability performance on the Polar(32,16) and Polar(64,22) codes. The choice of the proposed curriculum is critical in achieving the accuracy gains of CRISP, as we show by comparing against other curricula. More notably, CRISP can be readily extended to Polarization-Adjusted-Convolutional (PAC) codes, where existing SC decoders are significantly less reliable. To the best of our knowledge, CRISP constructs the first data-driven decoder for PAC codes and attains near-optimal performance on the PAC(32,16) code.
    Autobidders with Budget and ROI Constraints: Efficiency, Regret, and Pacing Dynamics. (arXiv:2301.13306v1 [cs.GT])
    We study a game between autobidding algorithms that compete in an online advertising platform. Each autobidder is tasked with maximizing its advertiser's total value over multiple rounds of a repeated auction, subject to budget and/or return-on-investment constraints. We propose a gradient-based learning algorithm that is guaranteed to satisfy all constraints and achieves vanishing individual regret. Our algorithm uses only bandit feedback and can be used with the first- or second-price auction, as well as with any "intermediate" auction format. Our main result is that when these autobidders play against each other, the resulting expected liquid welfare over all rounds is at least half of the expected optimal liquid welfare achieved by any allocation. This holds whether or not the bidding dynamics converges to an equilibrium and regardless of the correlation structure between advertiser valuations.
    GeneFormer: Learned Gene Compression using Transformer-based Context Modeling. (arXiv:2212.08379v3 [cs.LG] UPDATED)
    With the development of gene sequencing technology, an explosive growth of gene data has been witnessed. And the storage of gene data has become an important issue. Traditional gene data compression methods rely on general software like G-zip, which fails to utilize the interrelation of nucleotide sequence. Recently, many researchers begin to investigate deep learning based gene data compression method. In this paper, we propose a transformer-based gene compression method named GeneFormer. Specifically, we first introduce a modified transformer structure to fully explore the nucleotide sequence dependency. Then, we propose fixed-length parallel grouping to accelerate the decoding speed of our autoregressive model. Experimental results on real-world datasets show that our method saves 29.7% bit rate compared with the state-of-the-art method, and the decoding speed is significantly faster than all existing learning-based gene compression methods.
    Time Series Forecasting via Semi-Asymmetric Convolutional Architecture with Global Atrous Sliding Window. (arXiv:2301.13691v1 [cs.AI])
    The proposed method in this paper is designed to address the problem of time series forecasting. Although some exquisitely designed models achieve excellent prediction performances, how to extract more useful information and make accurate predictions is still an open issue. Most of modern models only focus on a short range of information, which are fatal for problems such as time series forecasting which needs to capture long-term information characteristics. As a result, the main concern of this work is to further mine relationship between local and global information contained in time series to produce more precise predictions. In this paper, to satisfactorily realize the purpose, we make three main contributions that are experimentally verified to have performance advantages. Firstly, original time series is transformed into difference sequence which serves as input to the proposed model. And secondly, we introduce the global atrous sliding window into the forecasting model which references the concept of fuzzy time series to associate relevant global information with temporal data within a time period and utilizes central-bidirectional atrous algorithm to capture underlying-related features to ensure validity and consistency of captured data. Thirdly, a variation of widely-used asymmetric convolution which is called semi-asymmetric convolution is devised to more flexibly extract relationships in adjacent elements and corresponding associated global features with adjustable ranges of convolution on vertical and horizontal directions. The proposed model in this paper achieves state-of-the-art on most of time series datasets provided compared with competitive modern models.
    Grounding Language Models to Images for Multimodal Generation. (arXiv:2301.13823v1 [cs.CL])
    We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process and generate arbitrarily interleaved image-and-text data. Our method leverages the abilities of language models learnt from large scale text-only pretraining, such as in-context learning and free-form text generation. We keep the language model frozen, and finetune input and output linear layers to enable cross-modality interactions. This allows our model to process arbitrarily interleaved image-and-text inputs, and generate free-form text interleaved with retrieved images. We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities. Our approach works with any off-the-shelf language model and paves the way towards an effective, general solution for leveraging pretrained language models in visually grounded settings.
    Active Learning-based Domain Adaptive Localized Polynomial Chaos Expansion. (arXiv:2301.13635v1 [cs.LG])
    The paper presents a novel methodology to build surrogate models of complicated functions by an active learning-based sequential decomposition of the input random space and construction of localized polynomial chaos expansions, referred to as domain adaptive localized polynomial chaos expansion (DAL-PCE). The approach utilizes sequential decomposition of the input random space into smaller sub-domains approximated by low-order polynomial expansions. This allows approximation of functions with strong nonlinearties, discontinuities, and/or singularities. Decomposition of the input random space and local approximations alleviates the Gibbs phenomenon for these types of problems and confines error to a very small vicinity near the non-linearity. The global behavior of the surrogate model is therefore significantly better than existing methods as shown in numerical examples. The whole process is driven by an active learning routine that uses the recently proposed $\Theta$ criterion to assess local variance contributions. The proposed approach balances both \emph{exploitation} of the surrogate model and \emph{exploration} of the input random space and thus leads to efficient and accurate approximation of the original mathematical model. The numerical results show the superiority of the DAL-PCE in comparison to (i) a single global polynomial chaos expansion and (ii) the recently proposed stochastic spectral embedding (SSE) method developed as an accurate surrogate model and which is based on a similar domain decomposition process. This method represents general framework upon which further extensions and refinements can be based, and which can be combined with any technique for non-intrusive polynomial chaos expansion construction.
    Enhancing Hyper-To-Real Space Projections Through Euclidean Norm Meta-Heuristic Optimization. (arXiv:2301.13671v1 [cs.LG])
    The continuous computational power growth in the last decades has made solving several optimization problems significant to humankind a tractable task; however, tackling some of them remains a challenge due to the overwhelming amount of candidate solutions to be evaluated, even by using sophisticated algorithms. In such a context, a set of nature-inspired stochastic methods, called meta-heuristic optimization, can provide robust approximate solutions to different kinds of problems with a small computational burden, such as derivative-free real function optimization. Nevertheless, these methods may converge to inadequate solutions if the function landscape is too harsh, e.g., enclosing too many local optima. Previous works addressed this issue by employing a hypercomplex representation of the search space, like quaternions, where the landscape becomes smoother and supposedly easier to optimize. Under this approach, meta-heuristic computations happen in the hypercomplex space, whereas variables are mapped back to the real domain before function evaluation. Despite this latter operation being performed by the Euclidean norm, we have found that after the optimization procedure has finished, it is usually possible to obtain even better solutions by employing the Minkowski $p$-norm instead and fine-tuning $p$ through an auxiliary sub-problem with neglecting additional cost and no hyperparameters. Such behavior was observed in eight well-established benchmarking functions, thus fostering a new research direction for hypercomplex meta-heuristic optimization.
    Complete Neural Networks for Euclidean Graphs. (arXiv:2301.13821v1 [cs.LG])
    We propose a 2-WL-like geometric graph isomorphism test and prove it is complete when applied to Euclidean Graphs in $\mathbb{R}^3$. We then use recent results on multiset embeddings to devise an efficient geometric GNN model with equivalent separation power. We verify empirically that our GNN model is able to separate particularly challenging synthetic examples, and demonstrate its usefulness for a chemical property prediction problem.
    Archetypal Analysis++: Rethinking the Initialization Strategy. (arXiv:2301.13748v1 [cs.LG])
    Archetypal analysis is a matrix factorization method with convexity constraints. Due to local minima, a good initialization is essential. Frequently used initialization methods yield either sub-optimal starting points or are prone to get stuck in poor local minima. In this paper, we propose archetypal analysis++ (AA++), a probabilistic initialization strategy for archetypal analysis that sequentially samples points based on their influence on the objective, similar to $k$-means++. In fact, we argue that $k$-means++ already approximates the proposed initialization method. Furthermore, we suggest to adapt an efficient Monte Carlo approximation of $k$-means++ to AA++. In an extensive empirical evaluation of 13 real-world data sets of varying sizes and dimensionalities and considering two pre-processing strategies, we show that AA++ almost consistently outperforms all baselines, including the most frequently used ones.
    Deep learning-based lung segmentation and automatic regional template in chest X-ray images for pediatric tuberculosis. (arXiv:2301.13786v1 [eess.IV])
    Tuberculosis (TB) is still considered a leading cause of death and a substantial threat to global child health. Both TB infection and disease are curable using antibiotics. However, most children who die of TB are never diagnosed or treated. In clinical practice, experienced physicians assess TB by examining chest X-rays (CXR). Pediatric CXR has specific challenges compared to adult CXR, which makes TB diagnosis in children more difficult. Computer-aided diagnosis systems supported by Artificial Intelligence have shown performance comparable to experienced radiologist TB readings, which could ease mass TB screening and reduce clinical burden. We propose a multi-view deep learning-based solution which, by following a proposed template, aims to automatically regionalize and extract lung and mediastinal regions of interest from pediatric CXR images where key TB findings may be present. Experimental results have shown accurate region extraction, which can be used for further analysis to confirm TB finding presence and severity assessment. Code publicly available at https://github.com/dani-capellan/pTB_LungRegionExtractor.
    An Efficient Solution to s-Rectangular Robust Markov Decision Processes. (arXiv:2301.13642v1 [cs.LG])
    We present an efficient robust value iteration for \texttt{s}-rectangular robust Markov Decision Processes (MDPs) with a time complexity comparable to standard (non-robust) MDPs which is significantly faster than any existing method. We do so by deriving the optimal robust Bellman operator in concrete forms using our $L_p$ water filling lemma. We unveil the exact form of the optimal policies, which turn out to be novel threshold policies with the probability of playing an action proportional to its advantage.
    Improved distinct bone segmentation in upper-body CT through multi-resolution networks. (arXiv:2301.13674v1 [eess.IV])
    Purpose: Automated distinct bone segmentation from CT scans is widely used in planning and navigation workflows. U-Net variants are known to provide excellent results in supervised semantic segmentation. However, in distinct bone segmentation from upper body CTs a large field of view and a computationally taxing 3D architecture are required. This leads to low-resolution results lacking detail or localisation errors due to missing spatial context when using high-resolution inputs. Methods: We propose to solve this problem by using end-to-end trainable segmentation networks that combine several 3D U-Nets working at different resolutions. Our approach, which extends and generalizes HookNet and MRN, captures spatial information at a lower resolution and skips the encoded information to the target network, which operates on smaller high-resolution inputs. We evaluated our proposed architecture against single resolution networks and performed an ablation study on information concatenation and the number of context networks. Results: Our proposed best network achieves a median DSC of 0.86 taken over all 125 segmented bone classes and reduces the confusion among similar-looking bones in different locations. These results outperform our previously published 3D U-Net baseline results on the task and distinct-bone segmentation results reported by other groups. Conclusion: The presented multi-resolution 3D U-Nets address current shortcomings in bone segmentation from upper-body CT scans by allowing for capturing a larger field of view while avoiding the cubic growth of the input pixels and intermediate computations that quickly outgrow the computational capacities in 3D. The approach thus improves the accuracy and efficiency of distinct bone segmentation from upper-body CT.
    Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models. (arXiv:2301.13826v1 [cs.CV])
    Recent text-to-image generative models have demonstrated an unparalleled ability to generate diverse and creative imagery guided by a target text prompt. While revolutionary, current state-of-the-art diffusion models may still fail in generating images that fully convey the semantics in the given text prompt. We analyze the publicly available Stable Diffusion model and assess the existence of catastrophic neglect, where the model fails to generate one or more of the subjects from the input prompt. Moreover, we find that in some cases the model also fails to correctly bind attributes (e.g., colors) to their corresponding subjects. To help mitigate these failure cases, we introduce the concept of Generative Semantic Nursing (GSN), where we seek to intervene in the generative process on the fly during inference time to improve the faithfulness of the generated images. Using an attention-based formulation of GSN, dubbed Attend-and-Excite, we guide the model to refine the cross-attention units to attend to all subject tokens in the text prompt and strengthen - or excite - their activations, encouraging the model to generate all subjects described in the text prompt. We compare our approach to alternative approaches and demonstrate that it conveys the desired concepts more faithfully across a range of text prompts.
    Anti-Exploration by Random Network Distillation. (arXiv:2301.13616v1 [cs.LG])
    Despite the success of Random Network Distillation (RND) in various domains, it was shown as not discriminative enough to be used as an uncertainty estimator for penalizing out-of-distribution actions in offline reinforcement learning. In this paper, we revisit these results and show that, with a naive choice of conditioning for the RND prior, it becomes infeasible for the actor to effectively minimize the anti-exploration bonus and discriminativity is not an issue. We show that this limitation can be avoided with conditioning based on Feature-wise Linear Modulation (FiLM), resulting in a simple and efficient ensemble-free algorithm based on Soft Actor-Critic. We evaluate it on the D4RL benchmark, showing that it is capable of achieving performance comparable to ensemble-based methods and outperforming ensemble-free approaches by a wide margin.
    UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers. (arXiv:2301.13741v1 [cs.CV])
    Real-world data contains a vast amount of multimodal information, among which vision and language are the two most representative modalities. Moreover, increasingly heavier models, e.g., Transformers, have attracted the attention of researchers to model compression. However, how to compress multimodal models, especially vison-language Transformers, is still under-explored. This paper proposes the \textbf{U}nified and \textbf{P}r\textbf{o}gressive \textbf{P}runing (UPop) as a universal vison-language Transformer compression framework, which incorporates 1) unifiedly searching multimodal subnets in a continuous optimization space from the original model, which enables automatic assignment of pruning ratios among compressible modalities and structures; 2) progressively searching and retraining the subnet, which maintains convergence between the search and retrain to attain higher compression ratios. Experiments on multiple generative and discriminative vision-language tasks, including Visual Reasoning, Image Caption, Visual Question Answer, Image-Text Retrieval, Text-Image Retrieval, and Image Classification, demonstrate the effectiveness and versatility of the proposed UPop framework.
    Semi-Supervised Classification with Graph Convolutional Kernel Machines. (arXiv:2301.13764v1 [cs.LG])
    We present a deep Graph Convolutional Kernel Machine (GCKM) for semi-supervised node classification in graphs. First, we introduce an unsupervised kernel machine propagating the node features in a one-hop neighbourhood. Then, we specify a semi-supervised classification kernel machine through the lens of the Fenchel-Young inequality. The deep graph convolutional kernel machine is obtained by stacking multiple shallow kernel machines. After showing that unsupervised and semi-supervised layer corresponds to an eigenvalue problem and a linear system on the aggregated node features, respectively, we derive an efficient end-to-end training algorithm in the dual variables. Numerical experiments demonstrate that our approach is competitive with state-of-the-art graph neural networks for homophilious and heterophilious benchmark datasets. Notably, GCKM achieves superior performance when very few labels are available.
    Dissecting the Effects of SGD Noise in Distinct Regimes of Deep Learning. (arXiv:2301.13703v1 [cs.LG])
    Understanding when the noise in stochastic gradient descent (SGD) affects generalization of deep neural networks remains a challenge, complicated by the fact that networks can operate in distinct training regimes. Here we study how the magnitude of this noise $T$ affects performance as the size of the training set $P$ and the scale of initialization $\alpha$ are varied. For gradient descent, $\alpha$ is a key parameter that controls if the network is `lazy' ($\alpha\gg 1$) or instead learns features ($\alpha\ll 1$). For classification of MNIST and CIFAR10 images, our central results are: (i) obtaining phase diagrams for performance in the $(\alpha,T)$ plane. They show that SGD noise can be detrimental or instead useful depending on the training regime. Moreover, although increasing $T$ or decreasing $\alpha$ both allow the net to escape the lazy regime, these changes can have opposite effects on performance. (ii) Most importantly, we find that key dynamical quantities (including the total variations of weights during training) depend on both $T$ and $P$ as power laws, and the characteristic temperature $T_c$, where the noise of SGD starts affecting performance, is a power law of $P$. These observations indicate that a key effect of SGD noise occurs late in training, by affecting the stopping process whereby all data are fitted. We argue that due to SGD noise, nets must develop a stronger `signal', i.e. larger informative weights, to fit the data, leading to a longer training time. The same effect occurs at larger training set $P$. We confirm this view in the perceptron model, where signal and noise can be precisely measured. Interestingly, exponents characterizing the effect of SGD depend on the density of data near the decision boundary, as we explain.
    Simplex Random Features. (arXiv:2301.13856v1 [stat.ML])
    We present Simplex Random Features (SimRFs), a new random feature (RF) mechanism for unbiased approximation of the softmax and Gaussian kernels by geometrical correlation of random projection vectors. We prove that SimRFs provide the smallest possible mean square error (MSE) on unbiased estimates of these kernels among the class of weight-independent geometrically-coupled positive random feature (PRF) mechanisms, substantially outperforming the previously most accurate Orthogonal Random Features at no observable extra cost. We present a more computationally expensive SimRFs+ variant, which we prove is asymptotically optimal in the broader family of weight-dependent geometrical coupling schemes (which permit correlations between random vector directions and norms). In extensive empirical studies, we show consistent gains provided by SimRFs in settings including pointwise kernel estimation, nonparametric classification and scalable Transformers.
    Alternating Updates for Efficient Transformers. (arXiv:2301.13310v1 [cs.LG])
    It is well established that increasing scale in deep transformer networks leads to improved quality and performance. This increase in scale often comes with an increase in compute cost and inference latency. Consequently, research into methods which help realize the benefits of increased scale without leading to an increase in the compute cost becomes important. We introduce Alternating Updates (AltUp), a simple-to-implement method to increase a model's capacity without the computational burden. AltUp enables the widening of the learned representation without increasing the computation time by working on a subblock of the representation at each layer. Our experiments on various transformer models and language tasks demonstrate the consistent effectiveness of alternating updates on a diverse set of benchmarks. Finally, we present extensions of AltUp to the sequence dimension, and demonstrate how AltUp can be synergistically combined with existing approaches, such as Sparse Mixture-of-Experts models, to obtain efficient models with even higher capacity.
    Differentially Private Distributed Bayesian Linear Regression with MCMC. (arXiv:2301.13778v1 [stat.ML])
    We propose a novel Bayesian inference framework for distributed differentially private linear regression. We consider a distributed setting where multiple parties hold parts of the data and share certain summary statistics of their portions in privacy-preserving noise. We develop a novel generative statistical model for privately shared statistics, which exploits a useful distributional relation between the summary statistics of linear regression. Bayesian estimation of the regression coefficients is conducted mainly using Markov chain Monte Carlo algorithms, while we also provide a fast version to perform Bayesian estimation in one iteration. The proposed methods have computational advantages over their competitors. We provide numerical results on both real and simulated data, which demonstrate that the proposed algorithms provide well-rounded estimation and prediction.
    Improved Algorithms for Multi-period Multi-class Packing Problems with~Bandit~Feedback. (arXiv:2301.13791v1 [stat.ML])
    We consider the linear contextual multi-class multi-period packing problem~(LMMP) where the goal is to pack items such that the total vector of consumption is below a given budget vector and the total value is as large as possible. We consider the setting where the reward and the consumption vector associated with each action is a class-dependent linear function of the context, and the decision-maker receives bandit feedback. LMMP includes linear contextual bandits with knapsacks and online revenue management as special cases. We establish a new more efficient estimator which guarantees a faster convergence rate, and consequently, a lower regret in such problems. We propose a bandit policy that is a closed-form function of said estimated parameters. When the contexts are non-degenerate, the regret of the proposed policy is sublinear in the context dimension, the number of classes, and the time horizon~$T$ when the budget grows at least as $\sqrt{T}$. We also resolve an open problem posed in Agrawal & Devanur (2016), and extend the result to a multi-class setting. Our numerical experiments clearly demonstrate that the performance of our policy is superior to other benchmarks in the literature.
    A Data-Driven Modeling and Control Framework for Physics-Based Building Emulators. (arXiv:2301.13447v1 [eess.SY])
    We present a data-driven modeling and control framework for physics-based building emulators. Our approach comprises: (a) Offline training of differentiable surrogate models that speed up model evaluations, provide cheap gradients, and have good predictive accuracy for the receding horizon in Model Predictive Control (MPC) and (b) Formulating and solving nonlinear building HVAC MPC problems. We extensively verify the modeling and control performance using multiple surrogate models and optimization frameworks for different available test cases in the Building Optimization Testing Framework (BOPTEST). The framework is compatible with other modeling techniques and customizable with different control formulations. The modularity makes the approach future-proof for test cases currently in development for physics-based building emulators and provides a path toward prototyping predictive controllers in large buildings.
    Sport Task: Fine Grained Action Detection and Classification of Table Tennis Strokes from Videos for MediaEval 2022. (arXiv:2301.13576v1 [cs.AI])
    Sports video analysis is a widespread research topic. Its applications are very diverse, like events detection during a match, video summary, or fine-grained movement analysis of athletes. As part of the MediaEval 2022 benchmarking initiative, this task aims at detecting and classifying subtle movements from sport videos. We focus on recordings of table tennis matches. Conducted since 2019, this task provides a classification challenge from untrimmed videos recorded under natural conditions with known temporal boundaries for each stroke. Since 2021, the task also provides a stroke detection challenge from unannotated, untrimmed videos. This year, the training, validation, and test sets are enhanced to ensure that all strokes are represented in each dataset. The dataset is now similar to the one used in [1, 2]. This research is intended to build tools for coaches and athletes who want to further evaluate their sport performances.
    Execution-based Code Generation using Deep Reinforcement Learning. (arXiv:2301.13816v1 [cs.LG])
    The utilization of programming language (PL) models, pretrained on large-scale code corpora, as a means of automating software engineering processes has demonstrated considerable potential in streamlining various code generation tasks such as code completion, code translation, and program synthesis. However, current approaches mainly rely on supervised fine-tuning objectives borrowed from text generation, neglecting specific sequence-level features of code, including but not limited to compilability as well as syntactic and functional correctness. To address this limitation, we propose PPOCoder, a new framework for code generation that combines pretrained PL models with Proximal Policy Optimization (PPO) deep reinforcement learning and employs execution feedback as the external source of knowledge into the model optimization. PPOCoder is transferable across different code generation tasks and PLs. Extensive experiments on three code generation tasks demonstrate the effectiveness of our proposed approach compared to SOTA methods, improving the success rate of compilation and functional correctness over different PLs. Our code can be found at https://github.com/reddy-lab-code-research/PPOCoder .
    Causal-Discovery Performance of ChatGPT in the context of Neuropathic Pain Diagnosis. (arXiv:2301.13819v1 [cs.CL])
    ChatGPT has demonstrated exceptional proficiency in natural language conversation, e.g., it can answer a wide range of questions while no previous large language models can. Thus, we would like to push its limit and explore its ability to answer causal discovery questions by using a medical benchmark (Tu et al. 2019) in causal discovery.
    A Bayesian Generative Adversarial Network (GAN) to Generate Synthetic Time-Series Data, Application in Combined Sewer Flow Prediction. (arXiv:2301.13733v1 [cs.LG])
    Despite various breakthroughs in machine learning and data analysis techniques for improving smart operation and management of urban water infrastructures, some key limitations obstruct this progress. Among these shortcomings, the absence of freely available data due to data privacy or high costs of data gathering and the nonexistence of adequate rare or extreme events in the available data plays a crucial role. Here, Generative Adversarial Networks (GANs) can help overcome these challenges. In machine learning, generative models are a class of methods capable of learning data distribution to generate artificial data. In this study, we developed a GAN model to generate synthetic time series to balance our limited recorded time series data and improve the accuracy of a data-driven model for combined sewer flow prediction. We considered the sewer system of a small town in Germany as the test case. Precipitation and inflow to the storage tanks are used for the Data-Driven model development. The aim is to predict the flow using precipitation data and examine the impact of data augmentation using synthetic data in model performance. Results show that GAN can successfully generate synthetic time series from real data distribution, which helps more accurate peak flow prediction. However, the model without data augmentation works better for dry weather prediction. Therefore, an ensemble model is suggested to combine the advantages of both models.
    Improving Monte Carlo Evaluation with Offline Data. (arXiv:2301.13734v1 [cs.LG])
    Monte Carlo (MC) methods are the most widely used methods to estimate the performance of a policy. Given an interested policy, MC methods give estimates by repeatedly running this policy to collect samples and taking the average of the outcomes. Samples collected during this process are called online samples. To get an accurate estimate, MC methods consume massive online samples. When online samples are expensive, e.g., online recommendations and inventory management, we want to reduce the number of online samples while achieving the same estimate accuracy. To this end, we use off-policy MC methods that evaluate the interested policy by running a different policy called behavior policy. We design a tailored behavior policy such that the variance of the off-policy MC estimator is provably smaller than the ordinary MC estimator. Importantly, this tailored behavior policy can be efficiently learned from existing offline data, i,e., previously logged data, which are much cheaper than online samples. With reduced variance, our off-policy MC method requires fewer online samples to evaluate the performance of a policy compared with the ordinary MC method. Moreover, our off-policy MC estimator is always unbiased.
    A Survey of Explainable AI in Deep Visual Modeling: Methods and Metrics. (arXiv:2301.13445v1 [cs.CV])
    Deep visual models have widespread applications in high-stake domains. Hence, their black-box nature is currently attracting a large interest of the research community. We present the first survey in Explainable AI that focuses on the methods and metrics for interpreting deep visual models. Covering the landmark contributions along the state-of-the-art, we not only provide a taxonomic organization of the existing techniques, but also excavate a range of evaluation metrics and collate them as measures of different properties of model explanations. Along the insightful discussion on the current trends, we also discuss the challenges and future avenues for this research direction.
    Learning Generalized Hybrid Proximity Representation for Image Recognition. (arXiv:2301.13459v1 [cs.CV])
    Recently, deep metric learning techniques received attention, as the learned distance representations are useful to capture the similarity relationship among samples and further improve the performance of various of supervised or unsupervised learning tasks. We propose a novel supervised metric learning method that can learn the distance metrics in both geometric and probabilistic space for image recognition. In contrast to the previous metric learning methods which usually focus on learning the distance metrics in Euclidean space, our proposed method is able to learn better distance representation in a hybrid approach. To achieve this, we proposed a Generalized Hybrid Metric Loss (GHM-Loss) to learn the general hybrid proximity features from the image data by controlling the trade-off between geometric proximity and probabilistic proximity. To evaluate the effectiveness of our method, we first provide theoretical derivations and proofs of the proposed loss function, then we perform extensive experiments on two public datasets to show the advantage of our method compared to other state-of-the-art metric learning methods.
    Convolutional autoencoder for the spatiotemporal latent representation of turbulence. (arXiv:2301.13728v1 [physics.flu-dyn])
    Turbulence is characterised by chaotic dynamics and a high-dimensional state space, which make the phenomenon challenging to predict. However, turbulent flows are often characterised by coherent spatiotemporal structures, such as vortices or large-scale modes, which can help obtain a latent description of turbulent flows. However, current approaches are often limited by either the need to use some form of thresholding on quantities defining the isosurfaces to which the flow structures are associated or the linearity of traditional modal flow decomposition approaches, such as those based on proper orthogonal decomposition. This problem is exacerbated in flows that exhibit extreme events, which are rare and sudden changes in a turbulent state. The goal of this paper is to obtain an efficient and accurate reduced-order latent representation of a turbulent flow that exhibits extreme events. Specifically, we employ a three-dimensional multiscale convolutional autoencoder (CAE) to obtain such latent representation. We apply it to a three-dimensional turbulent flow. We show that the Multiscale CAE is efficient, requiring less than 10% degrees of freedom than proper orthogonal decomposition for compressing the data and is able to accurately reconstruct flow states related to extreme events. The proposed deep learning architecture opens opportunities for nonlinear reduced-order modeling of turbulent flows from data.
    Skill Decision Transformer. (arXiv:2301.13573v1 [cs.LG])
    Recent work has shown that Large Language Models (LLMs) can be incredibly effective for offline reinforcement learning (RL) by representing the traditional RL problem as a sequence modelling problem (Chen et al., 2021; Janner et al., 2021). However many of these methods only optimize for high returns, and may not extract much information from a diverse dataset of trajectories. Generalized Decision Transformers (GDTs) (Furuta et al., 2021) have shown that utilizing future trajectory information, in the form of information statistics, can help extract more information from offline trajectory data. Building upon this, we propose Skill Decision Transformer (Skill DT). Skill DT draws inspiration from hindsight relabelling (Andrychowicz et al., 2017) and skill discovery methods to discover a diverse set of primitive behaviors, or skills. We show that Skill DT can not only perform offline state-marginal matching (SMM), but can discovery descriptive behaviors that can be easily sampled. Furthermore, we show that through purely reward-free optimization, Skill DT is still competitive with supervised offline RL approaches on the D4RL benchmark. The code and videos can be found on our project page: https://github.com/shyamsn97/skill-dt
    On the Initialisation of Wide Low-Rank Feedforward Neural Networks. (arXiv:2301.13710v1 [stat.ML])
    The edge-of-chaos dynamics of wide randomly initialized low-rank feedforward networks are analyzed. Formulae for the optimal weight and bias variances are extended from the full-rank to low-rank setting and are shown to follow from multiplicative scaling. The principle second order effect, the variance of the input-output Jacobian, is derived and shown to increase as the rank to width ratio decreases. These results inform practitioners how to randomly initialize feedforward networks with a reduced number of learnable parameters while in the same ambient dimension, allowing reductions in the computational cost and memory constraints of the associated network.
    Mathematical Capabilities of ChatGPT. (arXiv:2301.13867v1 [cs.LG])
    We investigate the mathematical capabilities of ChatGPT by testing it on publicly available datasets, as well as hand-crafted ones, and measuring its performance against other models trained on a mathematical corpus, such as Minerva. We also test whether ChatGPT can be a useful assistant to professional mathematicians by emulating various use cases that come up in the daily professional activities of mathematicians (question answering, theorem searching). In contrast to formal mathematics, where large databases of formal proofs are available (e.g., the Lean Mathematical Library), current datasets of natural-language mathematics, used to benchmark language models, only cover elementary mathematics. We address this issue by introducing a new dataset: GHOSTS. It is the first natural-language dataset made and curated by working researchers in mathematics that (1) aims to cover graduate-level mathematics and (2) provides a holistic overview of the mathematical capabilities of language models. We benchmark ChatGPT on GHOSTS and evaluate performance against fine-grained criteria. We make this new dataset publicly available to assist a community-driven comparison of ChatGPT with (future) large language models in terms of advanced mathematical comprehension. We conclude that contrary to many positive reports in the media (a potential case of selection bias), ChatGPT's mathematical abilities are significantly below those of an average mathematics graduate student. Our results show that ChatGPT often understands the question but fails to provide correct solutions. Hence, if your goal is to use it to pass a university exam, you would be better off copying from your average peer!
    Large Music Recommendation Studies for Small Teams. (arXiv:2301.13388v1 [cs.HC])
    Running live music recommendation studies without direct industry partnerships can be a prohibitively daunting task, especially for small teams. In order to help future researchers interested in such evaluations, we present a number of struggles we faced in the process of generating our own such evaluation system alongside potential solutions. These problems span the topics of users, data, computation, and application architecture.
    Few-Shot Image-to-Semantics Translation for Policy Transfer in Reinforcement Learning. (arXiv:2301.13343v1 [cs.LG])
    We investigate policy transfer using image-to-semantics translation to mitigate learning difficulties in vision-based robotics control agents. This problem assumes two environments: a simulator environment with semantics, that is, low-dimensional and essential information, as the state space, and a real-world environment with images as the state space. By learning mapping from images to semantics, we can transfer a policy, pre-trained in the simulator, to the real world, thereby eliminating real-world on-policy agent interactions to learn, which are costly and risky. In addition, using image-to-semantics mapping is advantageous in terms of the computational efficiency to train the policy and the interpretability of the obtained policy over other types of sim-to-real transfer strategies. To tackle the main difficulty in learning image-to-semantics mapping, namely the human annotation cost for producing a training dataset, we propose two techniques: pair augmentation with the transition function in the simulator environment and active learning. We observed a reduction in the annotation cost without a decline in the performance of the transfer, and the proposed approach outperformed the existing approach without annotation.
    A Scalable, Interpretable, Verifiable & Differentiable Logic Gate Convolutional Neural Network Architecture From Truth Tables. (arXiv:2208.08609v2 [cs.AI] UPDATED)
    We propose $\mathcal{T}$ruth $\mathcal{T}$able net ($\mathcal{TT}$net), a novel Convolutional Neural Network (CNN) architecture that addresses, by design, the open challenges of interpretability, formal verification, and logic gate conversion. $\mathcal{TT}$net is built using CNNs' filters that are equivalent to tractable truth tables and that we call Learning Truth Table (LTT) blocks. The dual form of LTT blocks allows the truth tables to be easily trained with gradient descent and makes these CNNs easy to interpret, verify and infer. Specifically, $\mathcal{TT}$net is a deep CNN model that can be automatically represented, after post-training transformation, as a sum of Boolean decision trees, or as a sum of Disjunctive/Conjunctive Normal Form (DNF/CNF) formulas, or as a compact Boolean logic circuit. We demonstrate the effectiveness and scalability of $\mathcal{TT}$net on multiple datasets, showing comparable interpretability to decision trees, fast complete/sound formal verification, and scalable logic gate representation, all compared to state-of-the-art methods. We believe this work represents a step towards making CNNs more transparent and trustworthy for real-world critical applications.
    Self-Compressing Neural Networks. (arXiv:2301.13142v2 [cs.LG] UPDATED)
    This work focuses on reducing neural network size, which is a major driver of neural network execution time, power consumption, bandwidth, and memory footprint. A key challenge is to reduce size in a manner that can be exploited readily for efficient training and inference without the need for specialized hardware. We propose Self-Compression: a simple, general method that simultaneously achieves two goals: (1) removing redundant weights, and (2) reducing the number of bits required to represent the remaining weights. This is achieved using a generalized loss function to minimize overall network size. In our experiments we demonstrate floating point accuracy with as few as 3% of the bits and 18% of the weights remaining in the network.
    A Reinforcement Learning Framework for Dynamic Mediation Analysis. (arXiv:2301.13348v1 [stat.ML])
    Mediation analysis learns the causal effect transmitted via mediator variables between treatments and outcomes and receives increasing attention in various scientific domains to elucidate causal relations. Most existing works focus on point-exposure studies where each subject only receives one treatment at a single time point. However, there are a number of applications (e.g., mobile health) where the treatments are sequentially assigned over time and the dynamic mediation effects are of primary interest. Proposing a reinforcement learning (RL) framework, we are the first to evaluate dynamic mediation effects in settings with infinite horizons. We decompose the average treatment effect into an immediate direct effect, an immediate mediation effect, a delayed direct effect, and a delayed mediation effect. Upon the identification of each effect component, we further develop robust and semi-parametrically efficient estimators under the RL framework to infer these causal effects. The superior performance of the proposed method is demonstrated through extensive numerical studies, theoretical results, and an analysis of a mobile health dataset.
    Clustering the Sketch: A Novel Approach to Embedding Table Compression. (arXiv:2210.05974v2 [cs.LG] UPDATED)
    Embedding tables are used by machine learning systems to work with categorical features. These tables can become exceedingly large in modern recommendation systems, necessitating the development of new methods for fitting them in memory, even during training. The best previous methods for table compression are so called "post training" quantization schemes such as "product" and "residual" quantization (Gray & Neuhoff, 1998). These methods replace table rows with references to k-means clustered "codewords". Unfortunately, clustering requires prior knowledge of the table to be compressed, which limits the memory savings to inference time and not training time. Hence, recent work, like the QR method (Shi et al., 2020), has used random references (linear sketching), which can be computed with hash functions before training. Unfortunately, the compression achieved is inferior to that achieved by post-training quantization. The new algorithm, CQR, shows how to get the best of two worlds by combining clustering and sketching: First IDs are randomly assigned to a codebook and codewords are trained (end to end) for an epoch. Next, we expand the codebook and apply clustering to reduce the size again. Finally, we add new random references and continue training. We show experimentally close to those of post-training quantization with the training time memory reductions of sketch-based methods, and we prove that our method always converges to the optimal embedding table for least-squares training.
    The Fair Value of Data Under Heterogeneous Privacy Constraints. (arXiv:2301.13336v1 [cs.LG])
    Modern data aggregation often takes the form of a platform collecting data from a network of users. More than ever, these users are now requesting that the data they provide is protected with a guarantee of privacy. This has led to the study of optimal data acquisition frameworks, where the optimality criterion is typically the maximization of utility for the agent trying to acquire the data. This involves determining how to allocate payments to users for the purchase of their data at various privacy levels. The main goal of this paper is to characterize a fair amount to pay users for their data at a given privacy level. We propose an axiomatic definition of fairness, analogous to the celebrated Shapley value. Two concepts for fairness are introduced. The first treats the platform and users as members of a common coalition and provides a complete description of how to divide the utility among the platform and users. In the second concept, fairness is defined only among users, leading to a potential fairness-constrained mechanism design problem for the platform. We consider explicit examples involving private heterogeneous data and show how these notions of fairness can be applied. To the best of our knowledge, these are the first fairness concepts for data that explicitly consider privacy constraints.
    Efficient and Effective Methods for Mixed Precision Neural Network Quantization for Faster, Energy-efficient Inference. (arXiv:2301.13330v1 [cs.LG])
    For effective and efficient deep neural network inference, it is desirable to achieve state-of-the-art accuracy with the simplest networks requiring the least computation, memory, and power. Quantizing networks to lower precision is a powerful technique for simplifying networks. It is generally desirable to quantize as aggressively as possible without incurring significant accuracy degradation. As each layer of a network may have different sensitivity to quantization, mixed precision quantization methods selectively tune the precision of individual layers of a network to achieve a minimum drop in task performance (e.g., accuracy). To estimate the impact of layer precision choice on task performance two methods are introduced: i) Entropy Approximation Guided Layer selection (EAGL) is fast and uses the entropy of the weight distribution, and ii) Accuracy-aware Layer Precision Selection (ALPS) is straightforward and relies on single epoch fine-tuning after layer precision reduction. Using EAGL and ALPS for layer precision selection, full-precision accuracy is recovered with a mix of 4-bit and 2-bit layers for ResNet-50 and ResNet-101 classification networks, demonstrating improved performance across the entire accuracy-throughput frontier, and equivalent performance for the PSPNet segmentation network in our own commensurate comparison over leading mixed precision layer selection techniques, while requiring orders of magnitude less compute time to reach a solution.
    Automated Sentiment and Hate Speech Analysis of Facebook Data by Employing Multilingual Transformer Models. (arXiv:2301.13668v1 [cs.CL])
    In recent years, there has been a heightened consensus within academia and in the public discourse that Social Media Platforms (SMPs), amplify the spread of hateful and negative sentiment content. Researchers have identified how hateful content, political propaganda, and targeted messaging contributed to real-world harms including insurrections against democratically elected governments, genocide, and breakdown of social cohesion due to heightened negative discourse towards certain communities in parts of the world. To counter these issues, SMPs have created semi-automated systems that can help identify toxic speech. In this paper we analyse the statistical distribution of hateful and negative sentiment contents within a representative Facebook dataset (n= 604,703) scrapped through 648 public Facebook pages which identify themselves as proponents (and followers) of far-right Hindutva actors. These pages were identified manually using keyword searches on Facebook and on CrowdTangleand classified as far-right Hindutva pages based on page names, page descriptions, and discourses shared on these pages. We employ state-of-the-art, open-source XLM-T multilingual transformer-based language models to perform sentiment and hate speech analysis of the textual contents shared on these pages over a period of 5.5 years. The result shows the statistical distributions of the predicted sentiment and the hate speech labels; top actors, and top page categories. We further discuss the benchmark performances and limitations of these pre-trained language models.
    Quantifying and Managing Impacts of Concept Drifts on IoT Traffic Inference in Residential ISP Networks. (arXiv:2301.06695v2 [cs.LG] UPDATED)
    Millions of vulnerable consumer IoT devices in home networks are the enabler for cyber crimes putting user privacy and Internet security at risk. Internet service providers (ISPs) are best poised to play key roles in mitigating risks by automatically inferring active IoT devices per household and notifying users of vulnerable ones. Developing a scalable inference method that can perform robustly across thousands of home networks is a non-trivial task. This paper focuses on the challenges of developing and applying data-driven inference models when labeled data of device behaviors is limited and the distribution of data changes (concept drift) across time and space domains. Our contributions are three-fold: (1) We collect and analyze network traffic of 24 types of consumer IoT devices from 12 real homes over six weeks to highlight the challenge of temporal and spatial concept drifts in network behavior of IoT devices; (2) We analyze the performance of two inference strategies, namely "global inference" (a model trained on a combined set of all labeled data from training homes) and "contextualized inference" (several models each trained on the labeled data from a training home) in the presence of concept drifts; and (3) To manage concept drifts, we develop a method that dynamically applies the ``closest'' model (from a set) to network traffic of unseen homes during the testing phase, yielding better performance in 20% of scenarios.
    A Unified Causal View of Domain Invariant Representation Learning. (arXiv:2208.06987v3 [stat.ML] UPDATED)
    Machine learning methods can be unreliable when deployed in domains that differ from the domains on which they were trained. There are a wide range of proposals for mitigating this problem by learning representations that are ``invariant'' in some sense.However, these methods generally contradict each other, and none of them consistently improve performance on real-world domain shift benchmarks. There are two main questions that must be addressed to understand when, if ever, we should use each method. First, how does each ad hoc notion of ``invariance'' relate to the structure of real-world problems? And, second, when does learning invariant representations actually yield robust models? To address these issues, we introduce a broad formal notion of what it means for a real-world domain shift to admit invariant structure. Then, we characterize the causal structures that are compatible with this notion of invariance.With this in hand, we find conditions under which method-specific invariance notions correspond to real-world invariant structure, and we clarify the relationship between invariant structure and robustness to domain shifts. For both questions, we find that the true underlying causal structure of the data plays a critical role.
    Large Language Models Are Implicitly Topic Models: Explaining and Finding Good Demonstrations for In-Context Learning. (arXiv:2301.11916v1 [cs.CL] CROSS LISTED)
    In recent years, pre-trained large language models have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning. However, existing literature has highlighted the sensitivity of this capability to the selection of few-shot demonstrations. The underlying mechanisms by which this capability arises from regular language model pretraining objectives remain poorly understood. In this study, we aim to examine the in-context learning phenomenon through a Bayesian lens, viewing large language models as topic models that implicitly infer task-related information from demonstrations. On this premise, we propose an algorithm for selecting optimal demonstrations from a set of annotated data and demonstrate a significant 12.5% improvement relative to the random selection baseline, averaged over eight GPT2 and GPT3 models on eight different real-world text classification datasets. Our empirical findings support our hypothesis that large language models implicitly infer a latent concept variable.
    Single-Loop Switching Subgradient Methods for Non-Smooth Weakly Convex Optimization with Non-Smooth Convex Constraints. (arXiv:2301.13314v1 [math.OC])
    In this paper, we consider a general non-convex constrained optimization problem, where the objective function is weakly convex and the constraint function is convex while they can both be non-smooth. This class of problems arises from many applications in machine learning such as fairness-aware supervised learning. To solve this problem, we consider the classical switching subgradient method by Polyak (1965), which is an intuitive and easily implementable first-order method. Before this work, its iteration complexity was only known for convex optimization. We prove its oracle complexity for finding a nearly stationary point when the objective function is non-convex. The analysis is derived separately when the constraint function is deterministic and stochastic. Compared to existing methods, especially the double-loop methods, the switching gradient method can be applied to non-smooth problems and only has a single loop, which saves the effort on tuning the number of inner iterations.
    Conversational Automated Program Repair. (arXiv:2301.13246v1 [cs.SE])
    Automated Program Repair (APR) can help developers automatically generate patches for bugs. Due to the impressive performance obtained using Large Pre-Trained Language Models (LLMs) on many code related tasks, researchers have started to directly use LLMs for APR. However, prior approaches simply repeatedly sample the LLM given the same constructed input/prompt created from the original buggy code, which not only leads to generating the same incorrect patches repeatedly but also miss the critical information in testcases. To address these limitations, we propose conversational APR, a new paradigm for program repair that alternates between patch generation and validation in a conversational manner. In conversational APR, we iteratively build the input to the model by combining previously generated patches with validation feedback. As such, we leverage the long-term context window of LLMs to not only avoid generating previously incorrect patches but also incorporate validation feedback to help the model understand the semantic meaning of the program under test. We evaluate 10 different LLM including the newly developed ChatGPT model to demonstrate the improvement of conversational APR over the prior LLM for APR approach.
    On the Statistical Benefits of Temporal Difference Learning. (arXiv:2301.13289v1 [cs.LG])
    Given a dataset on actions and resulting long-term rewards, a direct estimation approach fits value functions that minimize prediction error on the training data. Temporal difference learning (TD) methods instead fit value functions by minimizing the degree of temporal inconsistency between estimates made at successive time-steps. Focusing on finite state Markov chains, we provide a crisp asymptotic theory of the statistical advantages of this approach. First, we show that an intuitive inverse trajectory pooling coefficient completely characterizes the percent reduction in mean-squared error of value estimates. Depending on problem structure, the reduction could be enormous or nonexistent. Next, we prove that there can be dramatic improvements in estimates of the difference in value-to-go for two states: TD's errors are bounded in terms of a novel measure - the problem's trajectory crossing time - which can be much smaller than the problem's time horizon.
    MILO: Model-Agnostic Subset Selection Framework for Efficient Model Training and Tuning. (arXiv:2301.13287v1 [cs.LG])
    Training deep networks and tuning hyperparameters on large datasets is computationally intensive. One of the primary research directions for efficient training is to reduce training costs by selecting well-generalizable subsets of training data. Compared to simple adaptive random subset selection baselines, existing intelligent subset selection approaches are not competitive due to the time-consuming subset selection step, which involves computing model-dependent gradients and feature embeddings and applies greedy maximization of submodular objectives. Our key insight is that removing the reliance on downstream model parameters enables subset selection as a pre-processing step and enables one to train multiple models at no additional cost. In this work, we propose MILO, a model-agnostic subset selection framework that decouples the subset selection from model training while enabling superior model convergence and performance by using an easy-to-hard curriculum. Our empirical results indicate that MILO can train models $3\times - 10 \times$ faster and tune hyperparameters $20\times - 75 \times$ faster than full-dataset training or tuning without compromising performance.
    Temporal Consistency Loss for Physics-Informed Neural Networks. (arXiv:2301.13262v1 [physics.flu-dyn])
    Physics-informed neural networks (PINNs) have been widely used to solve partial differential equations in a forward and inverse manner using deep neural networks. However, training these networks can be challenging for multiscale problems. While statistical methods can be employed to scale the regression loss on data, it is generally challenging to scale the loss terms for equations. This paper proposes a method for scaling the mean squared loss terms in the objective function used to train PINNs. Instead of using automatic differentiation to calculate the temporal derivative, we use backward Euler discretization. This provides us with a scaling term for the equations. In this work, we consider the two and three-dimensional Navier-Stokes equations and determine the kinematic viscosity using the spatio-temporal data on the velocity and pressure fields. We first consider numerical datasets to test our method. We test the sensitivity of our method to the time step size, the number of timesteps, noise in the data, and spatial resolution. Finally, we use the velocity field obtained using Particle Image Velocimetry (PIV) experiments to generate a reference pressure field. We then test our framework using the velocity and reference pressure field.
    Online Loss Function Learning. (arXiv:2301.13247v1 [cs.LG])
    Loss function learning is a new meta-learning paradigm that aims to automate the essential task of designing a loss function for a machine learning model. Existing techniques for loss function learning have shown promising results, often improving a model's training dynamics and final inference performance. However, a significant limitation of these techniques is that the loss functions are meta-learned in an offline fashion, where the meta-objective only considers the very first few steps of training, which is a significantly shorter time horizon than the one typically used for training deep neural networks. This causes significant bias towards loss functions that perform well at the very start of training but perform poorly at the end of training. To address this issue we propose a new loss function learning technique for adaptively updating the loss function online after each update to the base model parameters. The experimental results show that our proposed method consistently outperforms the cross-entropy loss and offline loss function learning techniques on a diverse range of neural network architectures and datasets.
    SoftTreeMax: Exponential Variance Reduction in Policy Gradient via Tree Search. (arXiv:2301.13236v1 [cs.LG])
    Despite the popularity of policy gradient methods, they are known to suffer from large variance and high sample complexity. To mitigate this, we introduce SoftTreeMax -- a generalization of softmax that takes planning into account. In SoftTreeMax, we extend the traditional logits with the multi-step discounted cumulative reward, topped with the logits of future states. We consider two variants of SoftTreeMax, one for cumulative reward and one for exponentiated reward. For both, we analyze the gradient variance and reveal for the first time the role of a tree expansion policy in mitigating this variance. We prove that the resulting variance decays exponentially with the planning horizon as a function of the expansion policy. Specifically, we show that the closer the resulting state transitions are to uniform, the faster the decay. In a practical implementation, we utilize a parallelized GPU-based simulator for fast and efficient tree search. Our differentiable tree-based policy leverages all gradients at the tree leaves in each environment step instead of the traditional single-sample-based gradient. We then show in simulation how the variance of the gradient is reduced by three orders of magnitude, leading to better sample complexity compared to the standard policy gradient. On Atari, SoftTreeMax demonstrates up to 5x better performance in a faster run time compared to distributed PPO. Lastly, we demonstrate that high reward correlates with lower variance.
    Probabilistic Neural Data Fusion for Learning from an Arbitrary Number of Multi-fidelity Data Sets. (arXiv:2301.13271v1 [cs.LG])
    In many applications in engineering and sciences analysts have simultaneous access to multiple data sources. In such cases, the overall cost of acquiring information can be reduced via data fusion or multi-fidelity (MF) modeling where one leverages inexpensive low-fidelity (LF) sources to reduce the reliance on expensive high-fidelity (HF) data. In this paper, we employ neural networks (NNs) for data fusion in scenarios where data is very scarce and obtained from an arbitrary number of sources with varying levels of fidelity and cost. We introduce a unique NN architecture that converts MF modeling into a nonlinear manifold learning problem. Our NN architecture inversely learns non-trivial (e.g., non-additive and non-hierarchical) biases of the LF sources in an interpretable and visualizable manifold where each data source is encoded via a low-dimensional distribution. This probabilistic manifold quantifies model form uncertainties such that LF sources with small bias are encoded close to the HF source. Additionally, we endow the output of our NN with a parametric distribution not only to quantify aleatoric uncertainties, but also to reformulate the network's loss function based on strictly proper scoring rules which improve robustness and accuracy on unseen HF data. Through a set of analytic and engineering examples, we demonstrate that our approach provides a high predictive power while quantifying various sources uncertainties.
    Retrosynthetic Planning with Dual Value Networks. (arXiv:2301.13755v1 [cs.AI])
    Retrosynthesis, which aims to find a route to synthesize a target molecule from commercially available starting materials, is a critical task in drug discovery and materials design. Recently, the combination of ML-based single-step reaction predictors with multi-step planners has led to promising results. However, the single-step predictors are mostly trained offline to optimize the single-step accuracy, without considering complete routes. Here, we leverage reinforcement learning (RL) to improve the single-step predictor, by using a tree-shaped MDP to optimize complete routes while retaining single-step accuracy. Desirable routes should be both synthesizable and of low cost. We propose an online training algorithm, called Planning with Dual Value Networks (PDVN), in which two value networks predict the synthesizability and cost of molecules, respectively. To maintain the single-step accuracy, we design a two-branch network structure for the single-step predictor. On the widely-used USPTO dataset, our PDVN algorithm improves the search success rate of existing multi-step planners (e.g., increasing the success rate from 85.79% to 98.95% for Retro*, and reducing the number of model calls by half while solving 99.47% molecules for RetroGraph). Furthermore, PDVN finds shorter synthesis routes (e.g., reducing the average route length from 5.76 to 4.83 for Retro*, and from 5.63 to 4.78 for RetroGraph).
    Emergence of Maps in the Memories of Blind Navigation Agents. (arXiv:2301.13261v1 [cs.AI])
    Animal navigation research posits that organisms build and maintain internal spatial representations, or maps, of their environment. We ask if machines -- specifically, artificial intelligence (AI) navigation agents -- also build implicit (or 'mental') maps. A positive answer to this question would (a) explain the surprising phenomenon in recent literature of ostensibly map-free neural-networks achieving strong performance, and (b) strengthen the evidence of mapping as a fundamental mechanism for navigation by intelligent embodied agents, whether they be biological or artificial. Unlike animal navigation, we can judiciously design the agent's perceptual system and control the learning paradigm to nullify alternative navigation mechanisms. Specifically, we train 'blind' agents -- with sensing limited to only egomotion and no other sensing of any kind -- to perform PointGoal navigation ('go to $\Delta$ x, $\Delta$ y') via reinforcement learning. Our agents are composed of navigation-agnostic components (fully-connected and recurrent neural networks), and our experimental setup provides no inductive bias towards mapping. Despite these harsh conditions, we find that blind agents are (1) surprisingly effective navigators in new environments (~95% success); (2) they utilize memory over long horizons (remembering ~1,000 steps of past experience in an episode); (3) this memory enables them to exhibit intelligent behavior (following walls, detecting collisions, taking shortcuts); (4) there is emergence of maps and collision detection neurons in the representations of the environment built by a blind agent as it navigates; and (5) the emergent maps are selective and task dependent (e.g. the agent 'forgets' exploratory detours). Overall, this paper presents no new techniques for the AI audience, but a surprising finding, an insight, and an explanation.
    Interpreting Robustness Proofs of Deep Neural Networks. (arXiv:2301.13845v1 [cs.LG])
    In recent years numerous methods have been developed to formally verify the robustness of deep neural networks (DNNs). Though the proposed techniques are effective in providing mathematical guarantees about the DNNs behavior, it is not clear whether the proofs generated by these methods are human-interpretable. In this paper, we bridge this gap by developing new concepts, algorithms, and representations to generate human understandable interpretations of the proofs. Leveraging the proposed method, we show that the robustness proofs of standard DNNs rely on spurious input features, while the proofs of DNNs trained to be provably robust filter out even the semantically meaningful features. The proofs for the DNNs combining adversarial and provably robust training are the most effective at selectively filtering out spurious features as well as relying on human-understandable input features.
    Contextual Pandora's Box. (arXiv:2205.13114v2 [cs.LG] UPDATED)
    Pandora's Box is a fundamental stochastic optimization problem, where the decision-maker must find a good alternative while minimizing the search cost of exploring the value of each alternative. In the original formulation, it is assumed that accurate distributions are given for the values of all the alternatives, while recent work studies the online variant of Pandora's Box where the distributions are originally unknown. In this work, we study Pandora's Box in the online setting, while incorporating context. At every round, we are presented with a number of alternatives each having a context, an exploration cost and an unknown value drawn from an unknown distribution that may change at every round. Our main result is a no-regret algorithm that performs comparably well to the optimal algorithm which knows all prior distributions exactly. Our algorithm works even in the bandit setting where the algorithm never learns the values of the alternatives that were not explored. The key technique that enables our result is a novel modification of the realizability condition in contextual bandits that connects a context to a sufficient statistic of each alternative's distribution (its "reservation value") rather than its mean.
    Multi-fidelity covariance estimation in the log-Euclidean geometry. (arXiv:2301.13749v1 [stat.CO])
    We introduce a multi-fidelity estimator of covariance matrices that employs the log-Euclidean geometry of the symmetric positive-definite manifold. The estimator fuses samples from a hierarchy of data sources of differing fidelities and costs for variance reduction while guaranteeing definiteness, in contrast with previous approaches. The new estimator makes covariance estimation tractable in applications where simulation or data collection is expensive; to that end, we develop an optimal sample allocation scheme that minimizes the mean-squared error of the estimator given a fixed budget. Guaranteed definiteness is crucial to metric learning, data assimilation, and other downstream tasks. Evaluations of our approach using data from physical applications (heat conduction, fluid dynamics) demonstrate more accurate metric learning and speedups of more than one order of magnitude compared to benchmarks.
    Personalized Subgraph Federated Learning. (arXiv:2206.10206v2 [cs.LG] UPDATED)
    Subgraphs of a larger global graph may be distributed across multiple devices, and only locally accessible due to privacy restrictions, although there may be links between subgraphs. Recently proposed subgraph Federated Learning (FL) methods deal with those missing links across local subgraphs while distributively training Graph Neural Networks (GNNs) on them. However, they have overlooked the inevitable heterogeneity between subgraphs comprising different communities of a global graph, consequently collapsing the incompatible knowledge from local GNN models. To this end, we introduce a new subgraph FL problem, personalized subgraph FL, which focuses on the joint improvement of the interrelated local GNNs rather than learning a single global model, and propose a novel framework, FEDerated Personalized sUBgraph learning (FED-PUB), to tackle it. Since the server cannot access the subgraph in each client, FED-PUB utilizes functional embeddings of the local GNNs using random graphs as inputs to compute similarities between them, and use the similarities to perform weighted averaging for server-side aggregation. Further, it learns a personalized sparse mask at each client to select and update only the subgraph-relevant subset of the aggregated parameters. We validate our FED-PUB for its subgraph FL performance on six datasets, considering both non-overlapping and overlapping subgraphs, on which it significantly outperforms relevant baselines.
    Preserving local densities in low-dimensional embeddings. (arXiv:2301.13732v1 [cs.LG])
    Low-dimensional embeddings and visualizations are an indispensable tool for analysis of high-dimensional data. State-of-the-art methods, such as tSNE and UMAP, excel in unveiling local structures hidden in high-dimensional data and are therefore routinely applied in standard analysis pipelines in biology. We show, however, that these methods fail to reconstruct local properties, such as relative differences in densities (Fig. 1) and that apparent differences in cluster size can arise from computational artifact caused by differing sample sizes (Fig. 2). Providing a theoretical analysis of this issue, we then suggest dtSNE, which approximately conserves local densities. In an extensive study on synthetic benchmark and real world data comparing against five state-of-the-art methods, we empirically show that dtSNE provides similar global reconstruction, but yields much more accurate depictions of local distances and relative densities.
    Zero-shot-Learning Cross-Modality Data Translation Through Mutual Information Guided Stochastic Diffusion. (arXiv:2301.13743v1 [cs.CV])
    Cross-modality data translation has attracted great interest in image computing. Deep generative models (\textit{e.g.}, GANs) show performance improvement in tackling those problems. Nevertheless, as a fundamental challenge in image translation, the problem of Zero-shot-Learning Cross-Modality Data Translation with fidelity remains unanswered. This paper proposes a new unsupervised zero-shot-learning method named Mutual Information guided Diffusion cross-modality data translation Model (MIDiffusion), which learns to translate the unseen source data to the target domain. The MIDiffusion leverages a score-matching-based generative model, which learns the prior knowledge in the target domain. We propose a differentiable local-wise-MI-Layer ($LMI$) for conditioning the iterative denoising sampling. The $LMI$ captures the identical cross-modality features in the statistical domain for the diffusion guidance; thus, our method does not require retraining when the source domain is changed, as it does not rely on any direct mapping between the source and target domains. This advantage is critical for applying cross-modality data translation methods in practice, as a reasonable amount of source domain dataset is not always available for supervised training. We empirically show the advanced performance of MIDiffusion in comparison with an influential group of generative models, including adversarial-based and other score-matching-based models.
    Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both Worlds in Stochastic and Deterministic Environments. (arXiv:2301.13446v1 [cs.LG])
    We study variance-dependent regret bounds for Markov decision processes (MDPs). Algorithms with variance-dependent regret guarantees can automatically exploit environments with low variance (e.g., enjoying constant regret on deterministic MDPs). The existing algorithms are either variance-independent or suboptimal. We first propose two new environment norms to characterize the fine-grained variance properties of the environment. For model-based methods, we design a variant of the MVP algorithm (Zhang et al., 2021a) and use new analysis techniques show to this algorithm enjoys variance-dependent bounds with respect to our proposed norms. In particular, this bound is simultaneously minimax optimal for both stochastic and deterministic MDPs, the first result of its kind. We further initiate the study on model-free algorithms with variance-dependent regret bounds by designing a reference-function-based algorithm with a novel capped-doubling reference update schedule. Lastly, we also provide lower bounds to complement our upper bounds.
    Are Defenses for Graph Neural Networks Robust?. (arXiv:2301.13694v1 [cs.LG])
    A cursory reading of the literature suggests that we have made a lot of progress in designing effective adversarial defenses for Graph Neural Networks (GNNs). Yet, the standard methodology has a serious flaw - virtually all of the defenses are evaluated against non-adaptive attacks leading to overly optimistic robustness estimates. We perform a thorough robustness analysis of 7 of the most popular defenses spanning the entire spectrum of strategies, i.e., aimed at improving the graph, the architecture, or the training. The results are sobering - most defenses show no or only marginal improvement compared to an undefended baseline. We advocate using custom adaptive attacks as a gold standard and we outline the lessons we learned from successfully designing such attacks. Moreover, our diverse collection of perturbed graphs forms a (black-box) unit test offering a first glance at a model's robustness.
    Identifying the Hazard Boundary of ML-enabled Autonomous Systems Using Cooperative Co-Evolutionary Search. (arXiv:2301.13807v1 [cs.SE])
    In Machine Learning (ML)-enabled autonomous systems (MLASs), it is essential to identify the hazard boundary of ML Components (MLCs) in the MLAS under analysis. Given that such boundary captures the conditions in terms of MLC behavior and system context that can lead to hazards, it can then be used to, for example, build a safety monitor that can take any predefined fallback mechanisms at runtime when reaching the hazard boundary. However, determining such hazard boundary for an ML component is challenging. This is due to the space combining system contexts (i.e., scenarios) and MLC behaviors (i.e., inputs and outputs) being far too large for exhaustive exploration and even to handle using conventional metaheuristics, such as genetic algorithms. Additionally, the high computational cost of simulations required to determine any MLAS safety violations makes the problem even more challenging. Furthermore, it is unrealistic to consider a region in the problem space deterministically safe or unsafe due to the uncontrollable parameters in simulations and the non-linear behaviors of ML models (e.g., deep neural networks) in the MLAS under analysis. To address the challenges, we propose MLCSHE (ML Component Safety Hazard Envelope), a novel method based on a Cooperative Co-Evolutionary Algorithm (CCEA), which aims to tackle a high-dimensional problem by decomposing it into two lower-dimensional search subproblems. Moreover, we take a probabilistic view of safe and unsafe regions and define a novel fitness function to measure the distance from the probabilistic hazard boundary and thus drive the search effectively. We evaluate the effectiveness and efficiency of MLCSHE on a complex Autonomous Vehicle (AV) case study. Our evaluation results show that MLCSHE is significantly more effective and efficient compared to a standard genetic algorithm and random search.
    An $l_1$-oracle inequality for the Lasso in high-dimensional mixtures of experts models. (arXiv:2009.10622v5 [math.ST] UPDATED)
    Mixtures of experts (MoE) models are a popular framework for modeling heterogeneity in data, for both regression and classification problems in statistics and machine learning, due to their flexibility and the abundance of available statistical estimation and model choice tools. Such flexibility comes from allowing the mixture weights (or gating functions) in the MoE model to depend on the explanatory variables, along with the experts (or component densities). This permits the modeling of data arising from more complex data generating processes when compared to the classical finite mixtures and finite mixtures of regression models, whose mixing parameters are independent of the covariates. The use of MoE models in a high-dimensional setting, when the number of explanatory variables can be much larger than the sample size, is challenging from a computational point of view, and in particular from a theoretical point of view, where the literature is still lacking results for dealing with the curse of dimensionality, for both the statistical estimation and feature selection problems. We consider the finite MoE model with soft-max gating functions and Gaussian experts for high-dimensional regression on heterogeneous data, and its $l_1$-regularized estimation via the Lasso. We focus on the Lasso estimation properties rather than its feature selection properties. We provide a lower bound on the regularization parameter of the Lasso function that ensures an $l_1$-oracle inequality satisfied by the Lasso estimator according to the Kullback--Leibler loss.
    Video Influencers: Unboxing the Mystique. (arXiv:2012.12311v2 [cs.LG] UPDATED)
    Influencer marketing has become a very popular tool to reach customers. Despite the rapid growth in influencer videos, there has been little research on the effectiveness of their constituent elements in explaining video engagement. We study YouTube influencers and analyze their unstructured video data across text, audio and images using a novel "interpretable deep learning" framework that accomplishes both goals of prediction and interpretation. Our prediction-based approach analyzes unstructured data and finds that "what is said" in words (text) is more influential than "how it is said" in imagery (images) followed by acoustics (audio). Our interpretation-based approach is implemented after completion of model prediction by analyzing the same source of unstructured data to measure importance attributed to the video elements. We eliminate several spurious and confounded relationships, and identify a smaller subset of theory-based relationships. We uncover novel findings that establish distinct effects for measures of shallow and deep engagement which are based on the dual-system framework of human thinking. Our approach is validated using simulated data, and we discuss the learnings from our findings for influencers and brands.
    Unsupervised Music Source Separation Using Differentiable Parametric Source Models. (arXiv:2201.09592v2 [cs.SD] UPDATED)
    Supervised deep learning approaches to underdetermined audio source separation achieve state-of-the-art performance but require a dataset of mixtures along with their corresponding isolated source signals. Such datasets can be extremely costly to obtain for musical mixtures. This raises a need for unsupervised methods. We propose a novel unsupervised model-based deep learning approach to musical source separation. Each source is modelled with a differentiable parametric source-filter model. A neural network is trained to reconstruct the observed mixture as a sum of the sources by estimating the source models' parameters given their fundamental frequencies. At test time, soft masks are obtained from the synthesized source signals. The experimental evaluation on a vocal ensemble separation task shows that the proposed method outperforms learning-free methods based on nonnegative matrix factorization and a supervised deep learning baseline. Integrating domain knowledge in the form of source models into a data-driven method leads to high data efficiency: the proposed approach achieves good separation quality even when trained on less than three minutes of audio. This work makes powerful deep learning based separation usable in scenarios where training data with ground truth is expensive or nonexistent.
    Auxiliary Learning as an Asymmetric Bargaining Game. (arXiv:2301.13501v1 [cs.LG])
    Auxiliary learning is an effective method for enhancing the generalization capabilities of trained models, particularly when dealing with small datasets. However, this approach may present several difficulties: (i) optimizing multiple objectives can be more challenging, and (ii) how to balance the auxiliary tasks to best assist the main task is unclear. In this work, we propose a novel approach, named AuxiNash, for balancing tasks in auxiliary learning by formalizing the problem as generalized bargaining game with asymmetric task bargaining power. Furthermore, we describe an efficient procedure for learning the bargaining power of tasks based on their contribution to the performance of the main task and derive theoretical guarantees for its convergence. Finally, we evaluate AuxiNash on multiple multi-task benchmarks and find that it consistently outperforms competing methods.
    An Analysis of Classification Approaches for Hit Song Prediction using Engineered Metadata Features with Lyrics and Audio Features. (arXiv:2301.13507v1 [cs.IR])
    Hit song prediction, one of the emerging fields in music information retrieval (MIR), remains a considerable challenge. Being able to understand what makes a given song a hit is clearly beneficial to the whole music industry. Previous approaches to hit song prediction have focused on using audio features of a record. This study aims to improve the prediction result of the top 10 hits among Billboard Hot 100 songs using more alternative metadata, including song audio features provided by Spotify, song lyrics, and novel metadata-based features (title topic, popularity continuity and genre class). Five machine learning approaches are applied, including: k-nearest neighbours, Naive Bayes, Random Forest, Logistic Regression and Multilayer Perceptron. Our results show that Random Forest (RF) and Logistic Regression (LR) with all features (including novel features, song audio features and lyrics features) outperforms other models, achieving 89.1% and 87.2% accuracy, and 0.91 and 0.93 AUC, respectively. Our findings also demonstrate the utility of our novel music metadata features, which contributed most to the models' discriminative performance.
    NP-Match: Towards a New Probabilistic Model for Semi-Supervised Learning. (arXiv:2301.13569v1 [cs.CV])
    Semi-supervised learning (SSL) has been widely explored in recent years, and it is an effective way of leveraging unlabeled data to reduce the reliance on labeled data. In this work, we adjust neural processes (NPs) to the semi-supervised image classification task, resulting in a new method named NP-Match. NP-Match is suited to this task for two reasons. Firstly, NP-Match implicitly compares data points when making predictions, and as a result, the prediction of each unlabeled data point is affected by the labeled data points that are similar to it, which improves the quality of pseudo-labels. Secondly, NP-Match is able to estimate uncertainty that can be used as a tool for selecting unlabeled samples with reliable pseudo-labels. Compared with uncertainty-based SSL methods implemented with Monte-Carlo (MC) dropout, NP-Match estimates uncertainty with much less computational overhead, which can save time at both the training and the testing phases. We conducted extensive experiments on five public datasets under three semi-supervised image classification settings, namely, the standard semi-supervised image classification, the imbalanced semi-supervised image classification, and the multi-label semi-supervised image classification, and NP-Match outperforms state-of-the-art (SOTA) approaches or achieves competitive results on them, which shows the effectiveness of NP-Match and its potential for SSL. The codes are at https://github.com/Jianf-Wang/NP-Match
    Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits. (arXiv:2301.13393v1 [cs.LG])
    Motivated by concerns about making online decisions that incur undue amount of risk at each time step, in this paper, we formulate the probably anytime-safe stochastic combinatorial semi-bandits problem. In this problem, the agent is given the option to select a subset of size at most $K$ from a set of $L$ ground items. Each item is associated to a certain mean reward as well as a variance that represents its risk. To mitigate the risk that the agent incurs, we require that with probability at least $1-\delta$, over the entire horizon of time $T$, each of the choices that the agent makes should contain items whose sum of variances does not exceed a certain variance budget. We call this probably anytime-safe constraint. Under this constraint, we design and analyze an algorithm {\sc PASCombUCB} that minimizes the regret over the horizon of time $T$. By developing accompanying information-theoretic lower bounds, we show under both the problem-dependent and problem-independent paradigms, {\sc PASCombUCB} is almost asymptotically optimal. Our problem setup, the proposed {\sc PASCombUCB} algorithm, and novel analyses are applicable to domains such as recommendation systems and transportation in which an agent is allowed to choose multiple items at a single time step and wishes to control the risk over the whole time horizon.
    Recurrences reveal shared causal drivers of complex time series. (arXiv:2301.13516v1 [cs.LG])
    Many experimental time series measurements share an unobserved causal driver. Examples include genes targeted by transcription factors, ocean flows influenced by large-scale atmospheric currents, and motor circuits steered by descending neurons. Reliably inferring this unseen driving force is necessary to understand the intermittent nature of top-down control schemes in diverse biological and engineered systems. Here, we introduce a new unsupervised learning algorithm that uses recurrences in time series measurements to gradually reconstruct an unobserved driving signal. Drawing on the mathematical theory of skew-product dynamical systems, we identify recurrence events shared across response time series, which implicitly define a recurrence graph with glass-like structure. As the amount or quality of observed data improves, this recurrence graph undergoes a percolation transition manifesting as weak ergodicity breaking for random walks on the induced landscape -- revealing the shared driver's dynamics, even in the presence of strongly corrupted or noisy measurements. Across several thousand random dynamical systems, we empirically quantify the dependence of reconstruction accuracy on the rate of information transfer from a chaotic driver to the response systems, and we find that effective reconstruction proceeds through gradual approximation of the driver's dominant unstable periodic orbits. Through extensive benchmarks against classical and neural-network-based signal processing techniques, we demonstrate our method's strong ability to extract causal driving signals from diverse real-world datasets spanning neuroscience, genomics, fluid dynamics, and physiology.
    Tricking AI chips into Simulating the Human Brain: A Detailed Performance Analysis. (arXiv:2301.13637v1 [cs.LG])
    Challenging the Nvidia monopoly, dedicated AI-accelerator chips have begun emerging for tackling the computational challenge that the inference and, especially, the training of modern deep neural networks (DNNs) poses to modern computers. The field has been ridden with studies assessing the performance of these contestants across various DNN model types. However, AI-experts are aware of the limitations of current DNNs and have been working towards the fourth AI wave which will, arguably, rely on more biologically inspired models, predominantly on spiking neural networks (SNNs). At the same time, GPUs have been heavily used for simulating such models in the field of computational neuroscience, yet AI-chips have not been tested on such workloads. The current paper aims at filling this important gap by evaluating multiple, cutting-edge AI-chips (Graphcore IPU, GroqChip, Nvidia GPU with Tensor Cores and Google TPU) on simulating a highly biologically detailed model of a brain region, the inferior olive (IO). This IO application stress-tests the different AI-platforms for highlighting architectural tradeoffs by varying its compute density, memory requirements and floating-point numerical accuracy. Our performance analysis reveals that the simulation problem maps extremely well onto the GPU and TPU architectures, which for networks of 125,000 cells leads to a 28x respectively 1,208x speedup over CPU runtimes. At this speed, the TPU sets a new record for largest real-time IO simulation. The GroqChip outperforms both platforms for small networks but, due to implementing some floating-point operations at reduced accuracy, is found not yet usable for brain simulation.
    V2N Service Scaling with Deep Reinforcement Learning. (arXiv:2301.13324v1 [cs.LG])
    The fifth generation (5G) of wireless networks is set out to meet the stringent requirements of vehicular use cases. Edge computing resources can aid in this direction by moving processing closer to end-users, reducing latency. However, given the stochastic nature of traffic loads and availability of physical resources, appropriate auto-scaling mechanisms need to be employed to support cost-efficient and performant services. To this end, we employ Deep Reinforcement Learning (DRL) for vertical scaling in Edge computing to support vehicular-to-network communications. We address the problem using Deep Deterministic Policy Gradient (DDPG). As DDPG is a model-free off-policy algorithm for learning continuous actions, we introduce a discretization approach to support discrete scaling actions. Thus we address scalability problems inherent to high-dimensional discrete action spaces. Employing a real-world vehicular trace data set, we show that DDPG outperforms existing solutions, reducing (at minimum) the average number of active CPUs by 23% while increasing the long-term reward by 24%.
    When Source-Free Domain Adaptation Meets Learning with Noisy Labels. (arXiv:2301.13381v1 [cs.LG])
    Recent state-of-the-art source-free domain adaptation (SFDA) methods have focused on learning meaningful cluster structures in the feature space, which have succeeded in adapting the knowledge from source domain to unlabeled target domain without accessing the private source data. However, existing methods rely on the pseudo-labels generated by source models that can be noisy due to domain shift. In this paper, we study SFDA from the perspective of learning with label noise (LLN). Unlike the label noise in the conventional LLN scenario, we prove that the label noise in SFDA follows a different distribution assumption. We also prove that such a difference makes existing LLN methods that rely on their distribution assumptions unable to address the label noise in SFDA. Empirical evidence suggests that only marginal improvements are achieved when applying the existing LLN methods to solve the SFDA problem. On the other hand, although there exists a fundamental difference between the label noise in the two scenarios, we demonstrate theoretically that the early-time training phenomenon (ETP), which has been previously observed in conventional label noise settings, can also be observed in the SFDA problem. Extensive experiments demonstrate significant improvements to existing SFDA algorithms by leveraging ETP to address the label noise in SFDA.
    Automated Time-frequency Domain Audio Crossfades using Graph Cuts. (arXiv:2301.13380v1 [cs.SD])
    The problem of transitioning smoothly from one audio clip to another arises in many music consumption scenarios, especially as music consumption has moved from professionally curated and live-streamed radios to personal playback devices and services. we present the first steps toward a new method of automatically transitioning from one audio clip to another by discretizing the frequency spectrum into bins and then finding transition times for each bin. We phrase the problem as one of graph flow optimization; specifically min-cut/max-flow.
    Differentially Private Kernel Inducing Points (DP-KIP) for Privacy-preserving Data Distillation. (arXiv:2301.13389v1 [cs.LG])
    While it is tempting to believe that data distillation preserves privacy, distilled data's empirical robustness against known attacks does not imply a provable privacy guarantee. Here, we develop a provably privacy-preserving data distillation algorithm, called differentially private kernel inducing points (DP-KIP). DP-KIP is an instantiation of DP-SGD on kernel ridge regression (KRR). Following a recent work, we use neural tangent kernels and minimize the KRR loss to estimate the distilled datapoints (i.e., kernel inducing points). We provide a computationally efficient JAX implementation of DP-KIP, which we test on several popular image and tabular datasets to show its efficacy in data distillation with differential privacy guarantees.
    Optimizing DDPM Sampling with Shortcut Fine-Tuning. (arXiv:2301.13362v1 [cs.LG])
    In this study, we propose Shortcut Fine-tuning (SFT), a new approach for addressing the challenge of fast sampling of pretrained Denoising Diffusion Probabilistic Models (DDPMs). SFT advocates for the fine-tuning of DDPM samplers through the direct minimization of Integral Probability Metrics (IPM), instead of learning the backward diffusion process. This enables samplers to discover an alternative and more efficient sampling shortcut, deviating from the backward diffusion process. We also propose a new algorithm that is similar to the policy gradient method for fine-tuning DDPMs by proving that under certain assumptions, the gradient descent of diffusion models is equivalent to the policy gradient approach. Through empirical evaluation, we demonstrate that our fine-tuning method can further enhance existing fast DDPM samplers, resulting in sample quality comparable to or even surpassing that of the full-step model across various datasets.
    Misspecification-robust Sequential Neural Likelihood. (arXiv:2301.13368v1 [stat.ME])
    Simulation-based inference (SBI) techniques are now an essential tool for the parameter estimation of mechanistic and simulatable models with intractable likelihoods. Statistical approaches to SBI such as approximate Bayesian computation and Bayesian synthetic likelihood have been well studied in the well specified and misspecified settings. However, most implementations are inefficient in that many model simulations are wasted. Neural approaches such as sequential neural likelihood (SNL) have been developed that exploit all model simulations to build a surrogate of the likelihood function. However, SNL approaches have been shown to perform poorly under model misspecification. In this paper, we develop a new method for SNL that is robust to model misspecification and can identify areas where the model is deficient. We demonstrate the usefulness of the new approach on several illustrative examples.
    Self-Consistent Velocity Matching of Probability Flows. (arXiv:2301.13737v1 [cs.LG])
    We present a discretization-free scalable framework for solving a large class of mass-conserving partial differential equations (PDEs), including the time-dependent Fokker-Planck equation and the Wasserstein gradient flow. The main observation is that the time-varying velocity field of the PDE solution needs to be self-consistent: it must satisfy a fixed-point equation involving the flow characterized by the same velocity field. By parameterizing the flow as a time-dependent neural network, we propose an end-to-end iterative optimization framework called self-consistent velocity matching to solve this class of PDEs. Compared to existing approaches, our method does not suffer from temporal or spatial discretization, covers a wide range of PDEs, and scales to high dimensions. Experimentally, our method recovers analytical solutions accurately when they are available and achieves comparable or better performance in high dimensions with less training time compared to recent large-scale JKO-based methods that are designed for solving a more restrictive family of PDEs.
    Near Optimal Private and Robust Linear Regression. (arXiv:2301.13273v1 [cs.LG])
    We study the canonical statistical estimation problem of linear regression from $n$ i.i.d.~examples under $(\varepsilon,\delta)$-differential privacy when some response variables are adversarially corrupted. We propose a variant of the popular differentially private stochastic gradient descent (DP-SGD) algorithm with two innovations: a full-batch gradient descent to improve sample complexity and a novel adaptive clipping to guarantee robustness. When there is no adversarial corruption, this algorithm improves upon the existing state-of-the-art approach and achieves a near optimal sample complexity. Under label-corruption, this is the first efficient linear regression algorithm to guarantee both $(\varepsilon,\delta)$-DP and robustness. Synthetic experiments confirm the superiority of our approach.
    Learning Coordination Policies over Heterogeneous Graphs for Human-Robot Teams via Recurrent Neural Schedule Propagation. (arXiv:2301.13279v1 [cs.AI])
    As human-robot collaboration increases in the workforce, it becomes essential for human-robot teams to coordinate efficiently and intuitively. Traditional approaches for human-robot scheduling either utilize exact methods that are intractable for large-scale problems and struggle to account for stochastic, time varying human task performance, or application-specific heuristics that require expert domain knowledge to develop. We propose a deep learning-based framework, called HybridNet, combining a heterogeneous graph-based encoder with a recurrent schedule propagator for scheduling stochastic human-robot teams under upper- and lower-bound temporal constraints. The HybridNet's encoder leverages Heterogeneous Graph Attention Networks to model the initial environment and team dynamics while accounting for the constraints. By formulating task scheduling as a sequential decision-making process, the HybridNet's recurrent neural schedule propagator leverages Long Short-Term Memory (LSTM) models to propagate forward consequences of actions to carry out fast schedule generation, removing the need to interact with the environment between every task-agent pair selection. The resulting scheduling policy network provides a computationally lightweight yet highly expressive model that is end-to-end trainable via Reinforcement Learning algorithms. We develop a virtual task scheduling environment for mixed human-robot teams in a multi-round setting, capable of modeling the stochastic learning behaviors of human workers. Experimental results showed that HybridNet outperformed other human-robot scheduling solutions across problem sizes for both deterministic and stochastic human performance, with faster runtime compared to pure-GNN-based schedulers.  ( 2 min )
    Sifer: Overcoming simplicity bias in deep networks using a feature sieve. (arXiv:2301.13293v1 [cs.LG])
    Simplicity bias is the concerning tendency of deep networks to over-depend on simple, weakly predictive features, to the exclusion of stronger, more complex features. This causes biased, incorrect model predictions in many real-world applications, exacerbated by incomplete training data containing spurious feature-label correlations. We propose a direct, interventional method for addressing simplicity bias in DNNs, which we call the feature sieve. We aim to automatically identify and suppress easily-computable spurious features in lower layers of the network, thereby allowing the higher network levels to extract and utilize richer, more meaningful representations. We provide concrete evidence of this differential suppression & enhancement of relevant features on both controlled datasets and real-world images, and report substantial gains on many real-world debiasing benchmarks (11.4% relative gain on Imagenet-A; 3.2% on BAR, etc). Crucially, we outperform many baselines that incorporate knowledge about known spurious or biased attributes, despite our method not using any such information. We believe that our feature sieve work opens up exciting new research directions in automated adversarial feature extraction & representation learning for deep networks.  ( 2 min )
  • Open

    Preserving local densities in low-dimensional embeddings. (arXiv:2301.13732v1 [cs.LG])
    Low-dimensional embeddings and visualizations are an indispensable tool for analysis of high-dimensional data. State-of-the-art methods, such as tSNE and UMAP, excel in unveiling local structures hidden in high-dimensional data and are therefore routinely applied in standard analysis pipelines in biology. We show, however, that these methods fail to reconstruct local properties, such as relative differences in densities (Fig. 1) and that apparent differences in cluster size can arise from computational artifact caused by differing sample sizes (Fig. 2). Providing a theoretical analysis of this issue, we then suggest dtSNE, which approximately conserves local densities. In an extensive study on synthetic benchmark and real world data comparing against five state-of-the-art methods, we empirically show that dtSNE provides similar global reconstruction, but yields much more accurate depictions of local distances and relative densities.  ( 2 min )
    Bayesian Bilinear Neural Network for Predicting the Mid-price Dynamics in Limit-Order Book Markets. (arXiv:2203.03613v2 [econ.EM] UPDATED)
    The prediction of financial markets is a challenging yet important task. In modern electronically-driven markets, traditional time-series econometric methods often appear incapable of capturing the true complexity of the multi-level interactions driving the price dynamics. While recent research has established the effectiveness of traditional machine learning (ML) models in financial applications, their intrinsic inability to deal with uncertainties, which is a great concern in econometrics research and real business applications, constitutes a major drawback. Bayesian methods naturally appear as a suitable remedy conveying the predictive ability of ML methods with the probabilistically-oriented practice of econometric research. By adopting a state-of-the-art second-order optimization algorithm, we train a Bayesian bilinear neural network with temporal attention, suitable for the challenging time-series task of predicting mid-price movements in ultra-high-frequency limit-order book markets. We thoroughly compare our Bayesian model with traditional ML alternatives by addressing the use of predictive distributions to analyze errors and uncertainties associated with the estimated parameters and model forecasts. Our results underline the feasibility of the Bayesian deep-learning approach and its predictive and decisional advantages in complex econometric tasks, prompting future research in this direction.
    Learning from many trajectories. (arXiv:2203.17193v2 [cs.LG] UPDATED)
    We initiate a study of supervised learning from many independent sequences ("trajectories") of non-independent covariates, reflecting tasks in sequence modeling, control, and reinforcement learning. Conceptually, our multi-trajectory setup sits between two traditional settings in statistical learning theory: learning from independent examples and learning from a single auto-correlated sequence. Our conditions for efficient learning generalize the former setting--trajectories must be non-degenerate in ways that extend standard requirements for independent examples. Notably, we do not require that trajectories be ergodic, long, nor strictly stable. For linear least-squares regression, given $n$-dimensional examples produced by $m$ trajectories, each of length $T$, we observe a notable change in statistical efficiency as the number of trajectories increases from a few (namely $m \lesssim n$) to many (namely $m \gtrsim n$). Specifically, we establish that the worst-case error rate of this problem is $\Theta(n / m T)$ whenever $m \gtrsim n$. Meanwhile, when $m \lesssim n$, we establish a (sharp) lower bound of $\Omega(n^2 / m^2 T)$ on the worst-case error rate, realized by a simple, marginally unstable linear dynamical system. A key upshot is that, in domains where trajectories regularly reset, the error rate eventually behaves as if all of the examples were independent, drawn from their marginals. As a corollary of our analysis, we also improve guarantees for the linear system identification problem.
    Learning Generalized Hybrid Proximity Representation for Image Recognition. (arXiv:2301.13459v1 [cs.CV])
    Recently, deep metric learning techniques received attention, as the learned distance representations are useful to capture the similarity relationship among samples and further improve the performance of various of supervised or unsupervised learning tasks. We propose a novel supervised metric learning method that can learn the distance metrics in both geometric and probabilistic space for image recognition. In contrast to the previous metric learning methods which usually focus on learning the distance metrics in Euclidean space, our proposed method is able to learn better distance representation in a hybrid approach. To achieve this, we proposed a Generalized Hybrid Metric Loss (GHM-Loss) to learn the general hybrid proximity features from the image data by controlling the trade-off between geometric proximity and probabilistic proximity. To evaluate the effectiveness of our method, we first provide theoretical derivations and proofs of the proposed loss function, then we perform extensive experiments on two public datasets to show the advantage of our method compared to other state-of-the-art metric learning methods.
    Improved Algorithms for Multi-period Multi-class Packing Problems with~Bandit~Feedback. (arXiv:2301.13791v1 [stat.ML])
    We consider the linear contextual multi-class multi-period packing problem~(LMMP) where the goal is to pack items such that the total vector of consumption is below a given budget vector and the total value is as large as possible. We consider the setting where the reward and the consumption vector associated with each action is a class-dependent linear function of the context, and the decision-maker receives bandit feedback. LMMP includes linear contextual bandits with knapsacks and online revenue management as special cases. We establish a new more efficient estimator which guarantees a faster convergence rate, and consequently, a lower regret in such problems. We propose a bandit policy that is a closed-form function of said estimated parameters. When the contexts are non-degenerate, the regret of the proposed policy is sublinear in the context dimension, the number of classes, and the time horizon~$T$ when the budget grows at least as $\sqrt{T}$. We also resolve an open problem posed in Agrawal & Devanur (2016), and extend the result to a multi-class setting. Our numerical experiments clearly demonstrate that the performance of our policy is superior to other benchmarks in the literature.
    A relaxed proximal gradient descent algorithm for convergent plug-and-play with proximal denoiser. (arXiv:2301.13731v1 [stat.ML])
    This paper presents a new convergent Plug-and-Play (PnP) algorithm. PnP methods are efficient iterative algorithms for solving image inverse problems formulated as the minimization of the sum of a data-fidelity term and a regularization term. PnP methods perform regularization by plugging a pre-trained denoiser in a proximal algorithm, such as Proximal Gradient Descent (PGD). To ensure convergence of PnP schemes, many works study specific parametrizations of deep denoisers. However, existing results require either unverifiable or suboptimal hypotheses on the denoiser, or assume restrictive conditions on the parameters of the inverse problem. Observing that these limitations can be due to the proximal algorithm in use, we study a relaxed version of the PGD algorithm for minimizing the sum of a convex function and a weakly convex one. When plugged with a relaxed proximal denoiser, we show that the proposed PnP-$\alpha$PGD algorithm converges for a wider range of regularization parameters, thus allowing more accurate image restoration.
    Stabilize Deep ResNet with A Sharp Scaling Factor $\tau$. (arXiv:1903.07120v5 [cs.LG] UPDATED)
    We study the stability and convergence of training deep ResNets with gradient descent. Specifically, we show that the parametric branch in the residual block should be scaled down by a factor $\tau =O(1/\sqrt{L})$ to guarantee stable forward/backward process, where $L$ is the number of residual blocks. Moreover, we establish a converse result that the forward process is unbounded when $\tau>L^{-\frac{1}{2}+c}$, for any positive constant $c$. The above two results together establish a sharp value of the scaling factor in determining the stability of deep ResNet. Based on the stability result, we further show that gradient descent finds the global minima if the ResNet is properly over-parameterized, which significantly improves over the previous work with a much larger range of $\tau$ that admits global convergence. Moreover, we show that the convergence rate is independent of the depth, theoretically justifying the advantage of ResNet over vanilla feedforward network. Empirically, with such a factor $\tau$, one can train deep ResNet without normalization layer. Moreover, for ResNets with normalization layer, adding such a factor $\tau$ also stabilizes the training and obtains significant performance gain for deep ResNet.
    Real-Time Outlier Detection with Dynamic Process Limits. (arXiv:2301.13527v1 [cs.LG])
    Anomaly detection methods are part of the systems where rare events may endanger an operation's profitability, safety, and environmental aspects. Although many state-of-the-art anomaly detection methods were developed to date, their deployment is limited to the operation conditions present during the model training. Online anomaly detection brings the capability to adapt to data drifts and change points that may not be represented during model development resulting in prolonged service life. This paper proposes an online anomaly detection algorithm for existing real-time infrastructures where low-latency detection is required and novel patterns in data occur unpredictably. The online inverse cumulative distribution-based approach is introduced to eliminate common problems of offline anomaly detectors, meanwhile providing dynamic process limits to normal operation. The benefit of the proposed method is the ease of use, fast computation, and deployability as shown in two case studies of real microgrid operation data.
    Discovery of Single Independent Latent Variable. (arXiv:2110.05887v2 [stat.ML] UPDATED)
    Latent variable discovery is a central problem in data analysis with a broad range of applications in applied science. In this work, we consider data given as an invertible mixture of two statistically independent components, and assume that one of the components is observed while the other is hidden. Our goal is to recover the hidden component. For this purpose, we propose an autoencoder equipped with a discriminator. Unlike the standard nonlinear ICA problem, which was shown to be non-identifiable, in the special case of ICA we consider here, we show that our approach can recover the component of interest up to entropy-preserving transformation. We demonstrate the performance of the proposed approach on several datasets, including image synthesis, voice cloning, and fetal ECG extraction.
    A Reinforcement Learning Framework for Dynamic Mediation Analysis. (arXiv:2301.13348v1 [stat.ML])
    Mediation analysis learns the causal effect transmitted via mediator variables between treatments and outcomes and receives increasing attention in various scientific domains to elucidate causal relations. Most existing works focus on point-exposure studies where each subject only receives one treatment at a single time point. However, there are a number of applications (e.g., mobile health) where the treatments are sequentially assigned over time and the dynamic mediation effects are of primary interest. Proposing a reinforcement learning (RL) framework, we are the first to evaluate dynamic mediation effects in settings with infinite horizons. We decompose the average treatment effect into an immediate direct effect, an immediate mediation effect, a delayed direct effect, and a delayed mediation effect. Upon the identification of each effect component, we further develop robust and semi-parametrically efficient estimators under the RL framework to infer these causal effects. The superior performance of the proposed method is demonstrated through extensive numerical studies, theoretical results, and an analysis of a mobile health dataset.
    Physics-constrained 3D Convolutional Neural Networks for Electrodynamics. (arXiv:2301.13715v1 [physics.acc-ph])
    We present a physics-constrained neural network (PCNN) approach to solving Maxwell's equations for the electromagnetic fields of intense relativistic charged particle beams. We create a 3D convolutional PCNN to map time-varying current and charge densities J(r,t) and p(r,t) to vector and scalar potentials A(r,t) and V(r,t) from which we generate electromagnetic fields according to Maxwell's equations: B=curl(A), E=-div(V)-dA/dt. Our PCNNs satisfy hard constraints, such as div(B)=0, by construction. Soft constraints push A and V towards satisfying the Lorenz gauge.
    What can be learnt with wide convolutional neural networks?. (arXiv:2208.01003v4 [stat.ML] UPDATED)
    Understanding how convolutional neural networks (CNNs) can efficiently learn high-dimensional functions remains a fundamental challenge. A popular belief is that these models harness the local and hierarchical structure of natural data such as images. Yet, we lack a quantitative understanding of how such structure affects performance, e.g. the rate of decay of the generalisation error with the number of training samples. In this paper, we study deep CNNs in the kernel regime. First, we show that the spectrum of the corresponding kernel inherits the hierarchical structure of the network, and we characterise its asymptotics. Then, we use this result together with generalisation bounds to prove that deep CNNs adapt to the spatial scale of the target function. In particular, we find that if the target function depends on low-dimensional subsets of adjacent input variables, then the rate of decay of the error is controlled by the effective dimensionality of these subsets. Conversely, if the target function depends on the full set of input variables, then the error rate is inversely proportional to the input dimension. We conclude by computing the rate when a deep CNN is trained on the output of another deep CNN with randomly-initialised parameters. Interestingly, we find that, despite their hierarchical structure, the functions generated by deep CNNs are too rich to be efficiently learnable in high dimension.
    A Bias-Variance-Privacy Trilemma for Statistical Estimation. (arXiv:2301.13334v1 [math.ST])
    The canonical algorithm for differentially private mean estimation is to first clip the samples to a bounded range and then add noise to their empirical mean. Clipping controls the sensitivity and, hence, the variance of the noise that we add for privacy. But clipping also introduces statistical bias. We prove that this tradeoff is inherent: no algorithm can simultaneously have low bias, low variance, and low privacy loss for arbitrary distributions. On the positive side, we show that unbiased mean estimation is possible under approximate differential privacy if we assume that the distribution is symmetric. Furthermore, we show that, even if we assume that the data is sampled from a Gaussian, unbiased mean estimation is impossible under pure or concentrated differential privacy.
    Learning Data Representations with Joint Diffusion Models. (arXiv:2301.13622v1 [cs.LG])
    We introduce a joint diffusion model that simultaneously learns meaningful internal representations fit for both generative and predictive tasks. Joint machine learning models that allow synthesizing and classifying data often offer uneven performance between those tasks or are unstable to train. In this work, we depart from a set of empirical observations that indicate the usefulness of internal representations built by contemporary deep diffusion-based generative models in both generative and predictive settings. We then introduce an extension of the vanilla diffusion model with a classifier that allows for stable joint training with shared parametrization between those objectives. The resulting joint diffusion model offers superior performance across various tasks, including generative modeling, semi-supervised classification, and domain adaptation.
    Kernel Stein Discrepancy thinning: a theoretical perspective of pathologies and a practical fix with regularization. (arXiv:2301.13528v1 [math.ST])
    Stein thinning is a promising algorithm proposed by (Riabiz et al., 2022) for post-processing outputs of Markov chain Monte Carlo (MCMC). The main principle is to greedily minimize the kernelized Stein discrepancy (KSD), which only requires the gradient of the log-target distribution, and is thus well-suited for Bayesian inference. The main advantages of Stein thinning are the automatic remove of the burn-in period, the correction of the bias introduced by recent MCMC algorithms, and the asymptotic properties of convergence towards the target distribution. Nevertheless, Stein thinning suffers from several empirical pathologies, which may result in poor approximations, as observed in the literature. In this article, we conduct a theoretical analysis of these pathologies, to clearly identify the mechanisms at stake, and suggest improved strategies. Then, we introduce the regularized Stein thinning algorithm to alleviate the identified pathologies. Finally, theoretical guarantees and extensive experiments show the high efficiency of the proposed algorithm.
    Differentially Private Distributed Bayesian Linear Regression with MCMC. (arXiv:2301.13778v1 [stat.ML])
    We propose a novel Bayesian inference framework for distributed differentially private linear regression. We consider a distributed setting where multiple parties hold parts of the data and share certain summary statistics of their portions in privacy-preserving noise. We develop a novel generative statistical model for privately shared statistics, which exploits a useful distributional relation between the summary statistics of linear regression. Bayesian estimation of the regression coefficients is conducted mainly using Markov chain Monte Carlo algorithms, while we also provide a fast version to perform Bayesian estimation in one iteration. The proposed methods have computational advantages over their competitors. We provide numerical results on both real and simulated data, which demonstrate that the proposed algorithms provide well-rounded estimation and prediction.
    Combinatorial Causal Bandits without Graph Skeleton. (arXiv:2301.13392v1 [cs.LG])
    In combinatorial causal bandits (CCB), the learning agent chooses a subset of variables in each round to intervene and collects feedback from the observed variables to minimize expected regret or sample complexity. Previous works study this problem in both general causal models and binary generalized linear models (BGLMs). However, all of them require prior knowledge of causal graph structure. This paper studies the CCB problem without the graph structure on binary general causal models and BGLMs. We first provide an exponential lower bound of cumulative regrets for the CCB problem on general causal models. To overcome the exponentially large space of parameters, we then consider the CCB problem on BGLMs. We design a regret minimization algorithm for BGLMs even without the graph skeleton and show that it still achieves $O(\sqrt{T}\ln T)$ expected regret. This asymptotic regret is the same as the state-of-art algorithms relying on the graph structure. Moreover, we sacrifice the regret to $O(T^{\frac{2}{3}}\ln T)$ to remove the weight gap covered by the asymptotic notation. At last, we give some discussions and algorithms for pure exploration of the CCB problem without the graph structure.
    Misspecification-robust Sequential Neural Likelihood. (arXiv:2301.13368v1 [stat.ME])
    Simulation-based inference (SBI) techniques are now an essential tool for the parameter estimation of mechanistic and simulatable models with intractable likelihoods. Statistical approaches to SBI such as approximate Bayesian computation and Bayesian synthetic likelihood have been well studied in the well specified and misspecified settings. However, most implementations are inefficient in that many model simulations are wasted. Neural approaches such as sequential neural likelihood (SNL) have been developed that exploit all model simulations to build a surrogate of the likelihood function. However, SNL approaches have been shown to perform poorly under model misspecification. In this paper, we develop a new method for SNL that is robust to model misspecification and can identify areas where the model is deficient. We demonstrate the usefulness of the new approach on several illustrative examples.
    On the Correctness of Automatic Differentiation for Neural Networks with Machine-Representable Parameters. (arXiv:2301.13370v1 [cs.LG])
    Recent work has shown that automatic differentiation over the reals is almost always correct in a mathematically precise sense. However, actual programs work with machine-representable numbers (e.g., floating-point numbers), not reals. In this paper, we study the correctness of automatic differentiation when the parameter space of a neural network consists solely of machine-representable numbers. For a neural network with bias parameters, we prove that automatic differentiation is correct at all parameters where the network is differentiable. In contrast, it is incorrect at all parameters where the network is non-differentiable, since it never informs non-differentiability. To better understand this non-differentiable set of parameters, we prove a tight bound on its size, which is linear in the number of non-differentiabilities in activation functions, and provide a simple necessary and sufficient condition for a parameter to be in this set. We further prove that automatic differentiation always computes a Clarke subderivative, even on the non-differentiable set. We also extend these results to neural networks possibly without bias parameters.
    Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits. (arXiv:2301.13393v1 [cs.LG])
    Motivated by concerns about making online decisions that incur undue amount of risk at each time step, in this paper, we formulate the probably anytime-safe stochastic combinatorial semi-bandits problem. In this problem, the agent is given the option to select a subset of size at most $K$ from a set of $L$ ground items. Each item is associated to a certain mean reward as well as a variance that represents its risk. To mitigate the risk that the agent incurs, we require that with probability at least $1-\delta$, over the entire horizon of time $T$, each of the choices that the agent makes should contain items whose sum of variances does not exceed a certain variance budget. We call this probably anytime-safe constraint. Under this constraint, we design and analyze an algorithm {\sc PASCombUCB} that minimizes the regret over the horizon of time $T$. By developing accompanying information-theoretic lower bounds, we show under both the problem-dependent and problem-independent paradigms, {\sc PASCombUCB} is almost asymptotically optimal. Our problem setup, the proposed {\sc PASCombUCB} algorithm, and novel analyses are applicable to domains such as recommendation systems and transportation in which an agent is allowed to choose multiple items at a single time step and wishes to control the risk over the whole time horizon.
    Limitations of Information-Theoretic Generalization Bounds for Gradient Descent Methods in Stochastic Convex Optimization. (arXiv:2212.13556v2 [cs.LG] UPDATED)
    To date, no "information-theoretic" frameworks for reasoning about generalization error have been shown to establish minimax rates for gradient descent in the setting of stochastic convex optimization. In this work, we consider the prospect of establishing such rates via several existing information-theoretic frameworks: input-output mutual information bounds, conditional mutual information bounds and variants, PAC-Bayes bounds, and recent conditional variants thereof. We prove that none of these bounds are able to establish minimax rates. We then consider a common tactic employed in studying gradient methods, whereby the final iterate is corrupted by Gaussian noise, producing a noisy "surrogate" algorithm. We prove that minimax rates cannot be established via the analysis of such surrogates. Our results suggest that new ideas are required to analyze gradient descent using information-theoretic techniques.
    Simplex Random Features. (arXiv:2301.13856v1 [stat.ML])
    We present Simplex Random Features (SimRFs), a new random feature (RF) mechanism for unbiased approximation of the softmax and Gaussian kernels by geometrical correlation of random projection vectors. We prove that SimRFs provide the smallest possible mean square error (MSE) on unbiased estimates of these kernels among the class of weight-independent geometrically-coupled positive random feature (PRF) mechanisms, substantially outperforming the previously most accurate Orthogonal Random Features at no observable extra cost. We present a more computationally expensive SimRFs+ variant, which we prove is asymptotically optimal in the broader family of weight-dependent geometrical coupling schemes (which permit correlations between random vector directions and norms). In extensive empirical studies, we show consistent gains provided by SimRFs in settings including pointwise kernel estimation, nonparametric classification and scalable Transformers.
    Sequential Kernelized Independence Testing. (arXiv:2212.07383v2 [stat.ML] UPDATED)
    Independence testing is a fundamental and classical statistical problem that has been extensively studied in the batch setting when one fixes the sample size before collecting data. However, practitioners often prefer procedures that adapt to the complexity of a problem at hand instead of setting sample size in advance. Ideally, such procedures should (a) allow stopping earlier on easy tasks (and later on harder tasks), hence making better use of available resources, and (b) continuously monitor the data and efficiently incorporate statistical evidence after collecting new data, while controlling the false alarm rate. It is well known that classical batch tests are not tailored for streaming data settings: valid inference after data peeking requires correcting for multiple testing but such corrections generally result in low power. Following the principle of testing by betting, we design sequential kernelized independence tests (SKITs) that overcome such shortcomings. We exemplify our broad framework using bets inspired by kernelized dependence measures, e.g, the Hilbert-Schmidt independence criterion. Our test is valid under non-i.i.d. time-varying settings, for which there exist no batch tests. We demonstrate the power of our approaches on both simulated and real data.
    Differentially Private Kernel Inducing Points (DP-KIP) for Privacy-preserving Data Distillation. (arXiv:2301.13389v1 [cs.LG])
    While it is tempting to believe that data distillation preserves privacy, distilled data's empirical robustness against known attacks does not imply a provable privacy guarantee. Here, we develop a provably privacy-preserving data distillation algorithm, called differentially private kernel inducing points (DP-KIP). DP-KIP is an instantiation of DP-SGD on kernel ridge regression (KRR). Following a recent work, we use neural tangent kernels and minimize the KRR loss to estimate the distilled datapoints (i.e., kernel inducing points). We provide a computationally efficient JAX implementation of DP-KIP, which we test on several popular image and tabular datasets to show its efficacy in data distillation with differential privacy guarantees.
    Population-wise Labeling of Sulcal Graphs using Multi-graph Matching. (arXiv:2301.13532v1 [stat.ML])
    Population-wise matching of the cortical fold is necessary to identify biomarkers of neurological or psychiatric disorders. The difficulty comes from the massive interindividual variations in the morphology and spatial organization of the folds. This task is challenging at both methodological and conceptual levels. In the widely used registration-based techniques, these variations are considered as noise and the matching of folds is only implicit. Alternative approaches are based on the extraction and explicit identification of the cortical folds. In particular, representing cortical folding patterns as graphs of sulcal basins-termed sulcal graphs-enables to formalize the task as a graph-matching problem. In this paper, we propose to address the problem of sulcal graph matching directly at the population level using multi-graph matching techniques. First, we motivate the relevance of multi-graph matching framework in this context. We then introduce a procedure to generate populations of artificial sulcal graphs, which allows us benchmarking several state of the art multi-graph matching methods. Our results on both artificial and real data demonstrate the effectiveness of multi-graph matching techniques to obtain a population-wise consistent labeling of cortical folds at the sulcal basins level.
    Personalized Decentralized Bilevel Optimization over Random Directed Networks. (arXiv:2210.02129v2 [stat.ML] UPDATED)
    Personalization and decentralization are two major lines of studies to realize practical federated learning in the real world. The aim of this study is to establish a general and unified approach that can solve these two problems simultaneously. In this work, we first propose a bilevel problem that can adapt to various personalization scenarios by allowing an arbitrary choice of two parameters: a client-wise outer-parameter representing heterogeneity, and a shared inner-parameter representing homogeneity across client data distributions. We then present an algorithm that can solve this bilevel problem in a decentralized manner by estimating gradients of clients' outer-costs with respect to their outer-parameters. We show that the proposed algorithm can be extended to handle a random directed network, which is one of the most robust decentralized communication classes. The proposed method achieves state-of-the-art performance on a personalization benchmark across various communication settings.
    Bayesian Learning for Neural Networks: an algorithmic survey. (arXiv:2211.11865v4 [stat.ML] UPDATED)
    The last decade witnessed a growing interest in Bayesian learning. Yet, the technicality of the topic and the multitude of ingredients involved therein, besides the complexity of turning theory into practical implementations, limit the use of the Bayesian learning paradigm, preventing its widespread adoption across different fields and applications. This self-contained survey engages and introduces readers to the principles and algorithms of Bayesian Learning for Neural Networks. It provides an introduction to the topic from an accessible, practical-algorithmic perspective. Upon providing a general introduction to Bayesian Neural Networks, we discuss and present both standard and recent approaches for Bayesian inference, with an emphasis on solutions relying on Variational Inference and the use of Natural gradients. We also discuss the use of manifold optimization as a state-of-the-art approach to Bayesian learning. We examine the characteristic properties of all the discussed methods, and provide pseudo-codes for their implementation, paying attention to practical aspects, such as the computation of the gradients.
    Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both Worlds in Stochastic and Deterministic Environments. (arXiv:2301.13446v1 [cs.LG])
    We study variance-dependent regret bounds for Markov decision processes (MDPs). Algorithms with variance-dependent regret guarantees can automatically exploit environments with low variance (e.g., enjoying constant regret on deterministic MDPs). The existing algorithms are either variance-independent or suboptimal. We first propose two new environment norms to characterize the fine-grained variance properties of the environment. For model-based methods, we design a variant of the MVP algorithm (Zhang et al., 2021a) and use new analysis techniques show to this algorithm enjoys variance-dependent bounds with respect to our proposed norms. In particular, this bound is simultaneously minimax optimal for both stochastic and deterministic MDPs, the first result of its kind. We further initiate the study on model-free algorithms with variance-dependent regret bounds by designing a reference-function-based algorithm with a novel capped-doubling reference update schedule. Lastly, we also provide lower bounds to complement our upper bounds.  ( 2 min )
    Continuous Soft Pseudo-Labeling in ASR. (arXiv:2211.06007v2 [cs.LG] UPDATED)
    Continuous pseudo-labeling (PL) algorithms such as slimIPL have recently emerged as a powerful strategy for semi-supervised learning in speech recognition. In contrast with earlier strategies that alternated between training a model and generating pseudo-labels (PLs) with it, here PLs are generated in end-to-end manner as training proceeds, improving training speed and the accuracy of the final model. PL shares a common theme with teacher-student models such as distillation in that a teacher model generates targets that need to be mimicked by the student model being trained. However, interestingly, PL strategies in general use hard-labels, whereas distillation uses the distribution over labels as the target to mimic. Inspired by distillation we expect that specifying the whole distribution (aka soft-labels) over sequences as the target for unlabeled data, instead of a single best pass pseudo-labeled transcript (hard-labels) should improve PL performance and convergence. Surprisingly and unexpectedly, we find that soft-labels targets can lead to training divergence, with the model collapsing to a degenerate token distribution per frame. We hypothesize that the reason this does not happen with hard-labels is that training loss on hard-labels imposes sequence-level consistency that keeps the model from collapsing to the degenerate solution. In this paper, we show several experiments that support this hypothesis, and experiment with several regularization approaches that can ameliorate the degenerate collapse when using soft-labels. These approaches can bring the accuracy of soft-labels closer to that of hard-labels, and while they are unable to outperform them yet, they serve as a useful framework for further improvements.  ( 2 min )
    Scaling laws for single-agent reinforcement learning. (arXiv:2301.13442v1 [cs.LG])
    Recent work has shown that, in generative modeling, cross-entropy loss improves smoothly with model size and training compute, following a power law plus constant scaling law. One challenge in extending these results to reinforcement learning is that the main performance objective of interest, mean episode return, need not vary smoothly. To overcome this, we introduce *intrinsic performance*, a monotonic function of the return defined as the minimum compute required to achieve the given return across a family of models of different sizes. We find that, across a range of environments, intrinsic performance scales as a power law in model size and environment interactions. Consequently, as in generative modeling, the optimal model size scales as a power law in the training compute budget. Furthermore, we study how this relationship varies with the environment and with other properties of the training setup. In particular, using a toy MNIST-based environment, we show that varying the "horizon length" of the task mostly changes the coefficient but not the exponent of this relationship.  ( 2 min )
    Causal Estimation for Text Data with (Apparent) Overlap Violations. (arXiv:2210.00079v2 [stat.ML] UPDATED)
    Consider the problem of estimating the causal effect of some attribute of a text document; for example: what effect does writing a polite vs. rude email have on response time? To estimate a causal effect from observational data, we need to adjust for confounding aspects of the text that affect both the treatment and outcome -- e.g., the topic or writing level of the text. These confounding aspects are unknown a priori, so it seems natural to adjust for the entirety of the text (e.g., using a transformer). However, causal identification and estimation procedures rely on the assumption of overlap: for all levels of the adjustment variables, there is randomness leftover so that every unit could have (not) received treatment. Since the treatment here is itself an attribute of the text, it is perfectly determined, and overlap is apparently violated. The purpose of this paper is to show how to handle causal identification and obtain robust causal estimation in the presence of apparent overlap violations. In brief, the idea is to use supervised representation learning to produce a data representation that preserves confounding information while eliminating information that is only predictive of the treatment. This representation then suffices for adjustment and can satisfy overlap. Adapting results on non-parametric estimation, we find that this procedure is robust to conditional outcome misestimation, yielding a low-bias estimator with valid uncertainty quantification under weak conditions. Empirical results show strong improvements in bias and uncertainty quantification relative to the natural baseline.  ( 2 min )
    Unifying Generative Models with GFlowNets and Beyond. (arXiv:2209.02606v2 [cs.LG] UPDATED)
    There are many frameworks for deep generative modeling, each often presented with their own specific training algorithms and inference methods. Here, we demonstrate the connections between existing deep generative models and the recently introduced GFlowNet framework, a probabilistic inference machine which treats sampling as a decision-making process. This analysis sheds light on their overlapping traits and provides a unifying viewpoint through the lens of learning with Markovian trajectories. Our framework provides a means for unifying training and inference algorithms, and provides a route to shine a unifying light over many generative models. Beyond this, we provide a practical and experimentally verified recipe for improving generative modeling with insights from the GFlowNet perspective.  ( 2 min )
    Understanding Self-Distillation in the Presence of Label Noise. (arXiv:2301.13304v1 [cs.LG])
    Self-distillation (SD) is the process of first training a \enquote{teacher} model and then using its predictions to train a \enquote{student} model with the \textit{same} architecture. Specifically, the student's objective function is $\big(\xi*\ell(\text{teacher's predictions}, \text{ student's predictions}) + (1-\xi)*\ell(\text{given labels}, \text{ student's predictions})\big)$, where $\ell$ is some loss function and $\xi$ is some parameter $\in [0,1]$. Empirically, SD has been observed to provide performance gains in several settings. In this paper, we theoretically characterize the effect of SD in two supervised learning problems with \textit{noisy labels}. We first analyze SD for regularized linear regression and show that in the high label noise regime, the optimal value of $\xi$ that minimizes the expected error in estimating the ground truth parameter is surprisingly greater than 1. Empirically, we show that $\xi > 1$ works better than $\xi \leq 1$ even with the cross-entropy loss for several classification datasets when 50\% or 30\% of the labels are corrupted. Further, we quantify when optimal SD is better than optimal regularization. Next, we analyze SD in the case of logistic regression for binary classification with random label corruption and quantify the range of label corruption in which the student outperforms the teacher in terms of accuracy. To our knowledge, this is the first result of its kind for the cross-entropy loss.  ( 2 min )
    Hierarchically Clustered PCA, LLE, and CCA via a Convex Clustering Penalty. (arXiv:2211.16553v2 [cs.LG] UPDATED)
    We introduce an unsupervised learning approach that combines the truncated singular value decomposition with convex clustering to estimate within-cluster directions of maximum variance/covariance (in the variables) while simultaneously hierarchically clustering (on observations). In contrast to previous work on joint clustering and embedding, our approach has a straightforward formulation, is readily scalable via distributed optimization, and admits a direct interpretation as hierarchically clustered principal component analysis (PCA), hierarchically clustered locally linear embedding (LLE), or hierarchically clustered canonical correlation analysis (CCA). Through numerical experiments and real-world examples relevant to precision medicine, we show that our approach outperforms traditional and contemporary clustering methods on both underdetermined problems ($p \gg N$ with tens of observations) and on large datasets (e.g., $N=100,000$) while yielding interpretable dendrograms of hierarchical per-cluster principal components or canonical variates.  ( 2 min )
    Learning Against Distributional Uncertainty: On the Trade-off Between Robustness and Specificity. (arXiv:2301.13565v1 [cs.LG])
    Trustworthy machine learning aims at combating distributional uncertainties in training data distributions compared to population distributions. Typical treatment frameworks include the Bayesian approach, (min-max) distributionally robust optimization (DRO), and regularization. However, two issues have to be raised: 1) All these methods are biased estimators of the true optimal cost; 2) the prior distribution in the Bayesian method, the radius of the distributional ball in the DRO method, and the regularizer in the regularization method are difficult to specify. This paper studies a new framework that unifies the three approaches and that addresses the two challenges mentioned above. The asymptotic properties (e.g., consistency and asymptotic normalities), non-asymptotic properties (e.g., unbiasedness and generalization error bound), and a Monte--Carlo-based solution method of the proposed model are studied. The new model reveals the trade-off between the robustness to the unseen data and the specificity to the training data.  ( 2 min )
    Exploring QSAR Models for Activity-Cliff Prediction. (arXiv:2301.13644v1 [cs.LG])
    Pairs of similar compounds that only differ by a small structural modification but exhibit a large difference in their binding affinity for a given target are known as activity cliffs (ACs). It has been hypothesised that quantitative structure-activity relationship (QSAR) models struggle to predict ACs and that ACs thus form a major source of prediction error. However, a study to explore the AC-prediction power of modern QSAR methods and its relationship to general QSAR-prediction performance is lacking. We systematically construct nine distinct QSAR models by combining three molecular representation methods (extended-connectivity fingerprints, physicochemical-descriptor vectors and graph isomorphism networks) with three regression techniques (random forests, k-nearest neighbours and multilayer perceptrons); we then use each resulting model to classify pairs of similar compounds as ACs or non-ACs and to predict the activities of individual molecules in three case studies: dopamine receptor D2, factor Xa, and SARS-CoV-2 main protease. We observe low AC-sensitivity amongst the tested models when the activities of both compounds are unknown, but a substantial increase in AC-sensitivity when the actual activity of one of the compounds is given. Graph isomorphism features are found to be competitive with or superior to classical molecular representations for AC-classification and can thus be employed as baseline AC-prediction models or simple compound-optimisation tools. For general QSAR-prediction, however, extended-connectivity fingerprints still consistently deliver the best performance. Our results provide strong support for the hypothesis that indeed QSAR methods frequently fail to predict ACs. We propose twin-network training for deep learning models as a potential future pathway to increase AC-sensitivity and thus overall QSAR performance.
    Bayesian Calibration of Imperfect Computer Models using Physics-Informed Priors. (arXiv:2201.06463v4 [stat.ML] UPDATED)
    We introduce a computational efficient data-driven framework suitable for quantifying the uncertainty in physical parameters and model formulation of computer models, represented by differential equations. We construct physics-informed priors, which are multi-output GP priors that encode the model's structure in the covariance function. This is extended into a fully Bayesian framework that quantifies the uncertainty of physical parameters and model predictions. Since physical models often are imperfect descriptions of the real process, we allow the model to deviate from the observed data by considering a discrepancy function. For inference, Hamiltonian Monte Carlo is used. Further, approximations for big data are developed that reduce the computational complexity from $\mathcal{O}(N^3)$ to $\mathcal{O}(N\cdot m^2),$ where $m \ll N.$ Our approach is demonstrated in simulation and real data case studies where the physics are described by time-dependent ODEs describe (cardiovascular models) and space-time dependent PDEs (heat equation). In the studies, it is shown that our modelling framework can recover the true parameters of the physical models in cases where 1) the reality is more complex than our modelling choice and 2) the data acquisition process is biased while also producing accurate predictions. Furthermore, it is demonstrated that our approach is computationally faster than traditional Bayesian calibration methods.  ( 2 min )
    An $l_1$-oracle inequality for the Lasso in high-dimensional mixtures of experts models. (arXiv:2009.10622v5 [math.ST] UPDATED)
    Mixtures of experts (MoE) models are a popular framework for modeling heterogeneity in data, for both regression and classification problems in statistics and machine learning, due to their flexibility and the abundance of available statistical estimation and model choice tools. Such flexibility comes from allowing the mixture weights (or gating functions) in the MoE model to depend on the explanatory variables, along with the experts (or component densities). This permits the modeling of data arising from more complex data generating processes when compared to the classical finite mixtures and finite mixtures of regression models, whose mixing parameters are independent of the covariates. The use of MoE models in a high-dimensional setting, when the number of explanatory variables can be much larger than the sample size, is challenging from a computational point of view, and in particular from a theoretical point of view, where the literature is still lacking results for dealing with the curse of dimensionality, for both the statistical estimation and feature selection problems. We consider the finite MoE model with soft-max gating functions and Gaussian experts for high-dimensional regression on heterogeneous data, and its $l_1$-regularized estimation via the Lasso. We focus on the Lasso estimation properties rather than its feature selection properties. We provide a lower bound on the regularization parameter of the Lasso function that ensures an $l_1$-oracle inequality satisfied by the Lasso estimator according to the Kullback--Leibler loss.  ( 2 min )
    Optimal Transport Perturbations for Safe Reinforcement Learning with Robustness Guarantees. (arXiv:2301.13375v1 [cs.LG])
    Robustness and safety are critical for the trustworthy deployment of deep reinforcement learning in real-world decision making applications. In particular, we require algorithms that can guarantee robust, safe performance in the presence of general environment disturbances, while making limited assumptions on the data collection process during training. In this work, we propose a safe reinforcement learning framework with robustness guarantees through the use of an optimal transport cost uncertainty set. We provide an efficient, theoretically supported implementation based on Optimal Transport Perturbations, which can be applied in a completely offline fashion using only data collected in a nominal training environment. We demonstrate the robust, safe performance of our approach on a variety of continuous control tasks with safety constraints in the Real-World Reinforcement Learning Suite.  ( 2 min )
    A Unified Causal View of Domain Invariant Representation Learning. (arXiv:2208.06987v3 [stat.ML] UPDATED)
    Machine learning methods can be unreliable when deployed in domains that differ from the domains on which they were trained. There are a wide range of proposals for mitigating this problem by learning representations that are ``invariant'' in some sense.However, these methods generally contradict each other, and none of them consistently improve performance on real-world domain shift benchmarks. There are two main questions that must be addressed to understand when, if ever, we should use each method. First, how does each ad hoc notion of ``invariance'' relate to the structure of real-world problems? And, second, when does learning invariant representations actually yield robust models? To address these issues, we introduce a broad formal notion of what it means for a real-world domain shift to admit invariant structure. Then, we characterize the causal structures that are compatible with this notion of invariance.With this in hand, we find conditions under which method-specific invariance notions correspond to real-world invariant structure, and we clarify the relationship between invariant structure and robustness to domain shifts. For both questions, we find that the true underlying causal structure of the data plays a critical role.  ( 2 min )
    DoubleML -- An Object-Oriented Implementation of Double Machine Learning in R. (arXiv:2103.09603v4 [stat.ML] UPDATED)
    The R package DoubleML implements the double/debiased machine learning framework of Chernozhukov et al. (2018). It provides functionalities to estimate parameters in causal models based on machine learning methods. The double machine learning framework consist of three key ingredients: Neyman orthogonality, high-quality machine learning estimation and sample splitting. Estimation of nuisance components can be performed by various state-of-the-art machine learning methods that are available in the mlr3 ecosystem. DoubleML makes it possible to perform inference in a variety of causal models, including partially linear and interactive regression models and their extensions to instrumental variable estimation. The object-oriented implementation of DoubleML enables a high flexibility for the model specification and makes it easily extendable. This paper serves as an introduction to the double machine learning framework and the R package DoubleML. In reproducible code examples with simulated and real data sets, we demonstrate how DoubleML users can perform valid inference based on machine learning methods.  ( 2 min )
    Robust Linear Regression: Gradient-descent, Early-stopping, and Beyond. (arXiv:2301.13486v1 [stat.ML])
    In this work we study the robustness to adversarial attacks, of early-stopping strategies on gradient-descent (GD) methods for linear regression. More precisely, we show that early-stopped GD is optimally robust (up to an absolute constant) against Euclidean-norm adversarial attacks. However, we show that this strategy can be arbitrarily sub-optimal in the case of general Mahalanobis attacks. This observation is compatible with recent findings in the case of classification~\cite{Vardi2022GradientMP} that show that GD provably converges to non-robust models. To alleviate this issue, we propose to apply instead a GD scheme on a transformation of the data adapted to the attack. This data transformation amounts to apply feature-depending learning rates and we show that this modified GD is able to handle any Mahalanobis attack, as well as more general attacks under some conditions. Unfortunately, choosing such adapted transformations can be hard for general attacks. To the rescue, we design a simple and tractable estimator whose adversarial risk is optimal up to within a multiplicative constant of 1.1124 in the population regime, and works for any norm.  ( 2 min )
    Learning in POMDPs is Sample-Efficient with Hindsight Observability. (arXiv:2301.13857v1 [cs.LG])
    POMDPs capture a broad class of decision making problems, but hardness results suggest that learning is intractable even in simple settings due to the inherent partial observability. However, in many realistic problems, more information is either revealed or can be computed during some point of the learning process. Motivated by diverse applications ranging from robotics to data center scheduling, we formulate a \setting (\setshort) as a POMDP where the latent states are revealed to the learner in hindsight and only during training. We introduce new algorithms for the tabular and function approximation settings that are provably sample-efficient with hindsight observability, even in POMDPs that would otherwise be statistically intractable. We give a lower bound showing that the tabular algorithm is optimal in its dependence on latent state and observation cardinalities.  ( 2 min )
    Gaussian Noise is Nearly Instance Optimal for Private Unbiased Mean Estimation. (arXiv:2301.13850v1 [math.ST])
    We investigate unbiased high-dimensional mean estimators in differential privacy. We consider differentially private mechanisms whose expected output equals the mean of the input dataset, for every dataset drawn from a fixed convex domain $K$ in $\mathbb{R}^d$. In the setting of concentrated differential privacy, we show that, for every input such an unbiased mean estimator introduces approximately at least as much error as a mechanism that adds Gaussian noise with a carefully chosen covariance. This is true when the error is measured with respect to $\ell_p$ error for any $p \ge 2$. We extend this result to local differential privacy, and to approximate differential privacy, but for the latter the error lower bound holds either for a dataset or for a neighboring dataset. We also extend our results to mechanisms that take i.i.d.~samples from a distribution over $K$ and are unbiased with respect to the mean of the distribution.  ( 2 min )
    On the Statistical Benefits of Temporal Difference Learning. (arXiv:2301.13289v1 [cs.LG])
    Given a dataset on actions and resulting long-term rewards, a direct estimation approach fits value functions that minimize prediction error on the training data. Temporal difference learning (TD) methods instead fit value functions by minimizing the degree of temporal inconsistency between estimates made at successive time-steps. Focusing on finite state Markov chains, we provide a crisp asymptotic theory of the statistical advantages of this approach. First, we show that an intuitive inverse trajectory pooling coefficient completely characterizes the percent reduction in mean-squared error of value estimates. Depending on problem structure, the reduction could be enormous or nonexistent. Next, we prove that there can be dramatic improvements in estimates of the difference in value-to-go for two states: TD's errors are bounded in terms of a novel measure - the problem's trajectory crossing time - which can be much smaller than the problem's time horizon.  ( 2 min )
    Fairness and Accuracy under Domain Generalization. (arXiv:2301.13323v1 [cs.LG])
    As machine learning (ML) algorithms are increasingly used in high-stakes applications, concerns have arisen that they may be biased against certain social groups. Although many approaches have been proposed to make ML models fair, they typically rely on the assumption that data distributions in training and deployment are identical. Unfortunately, this is commonly violated in practice and a model that is fair during training may lead to an unexpected outcome during its deployment. Although the problem of designing robust ML models under dataset shifts has been widely studied, most existing works focus only on the transfer of accuracy. In this paper, we study the transfer of both fairness and accuracy under domain generalization where the data at test time may be sampled from never-before-seen domains. We first develop theoretical bounds on the unfairness and expected loss at deployment, and then derive sufficient conditions under which fairness and accuracy can be perfectly transferred via invariant representation learning. Guided by this, we design a learning algorithm such that fair ML models learned with training data still have high fairness and accuracy when deployment environments change. Experiments on real-world data validate the proposed algorithm. Model implementation is available at https://github.com/pth1993/FATDM.  ( 2 min )
    Zero-shot-Learning Cross-Modality Data Translation Through Mutual Information Guided Stochastic Diffusion. (arXiv:2301.13743v1 [cs.CV])
    Cross-modality data translation has attracted great interest in image computing. Deep generative models (\textit{e.g.}, GANs) show performance improvement in tackling those problems. Nevertheless, as a fundamental challenge in image translation, the problem of Zero-shot-Learning Cross-Modality Data Translation with fidelity remains unanswered. This paper proposes a new unsupervised zero-shot-learning method named Mutual Information guided Diffusion cross-modality data translation Model (MIDiffusion), which learns to translate the unseen source data to the target domain. The MIDiffusion leverages a score-matching-based generative model, which learns the prior knowledge in the target domain. We propose a differentiable local-wise-MI-Layer ($LMI$) for conditioning the iterative denoising sampling. The $LMI$ captures the identical cross-modality features in the statistical domain for the diffusion guidance; thus, our method does not require retraining when the source domain is changed, as it does not rely on any direct mapping between the source and target domains. This advantage is critical for applying cross-modality data translation methods in practice, as a reasonable amount of source domain dataset is not always available for supervised training. We empirically show the advanced performance of MIDiffusion in comparison with an influential group of generative models, including adversarial-based and other score-matching-based models.  ( 2 min )
    On the Initialisation of Wide Low-Rank Feedforward Neural Networks. (arXiv:2301.13710v1 [stat.ML])
    The edge-of-chaos dynamics of wide randomly initialized low-rank feedforward networks are analyzed. Formulae for the optimal weight and bias variances are extended from the full-rank to low-rank setting and are shown to follow from multiplicative scaling. The principle second order effect, the variance of the input-output Jacobian, is derived and shown to increase as the rank to width ratio decreases. These results inform practitioners how to randomly initialize feedforward networks with a reduced number of learnable parameters while in the same ambient dimension, allowing reductions in the computational cost and memory constraints of the associated network.  ( 2 min )
    The passive symmetries of machine learning. (arXiv:2301.13724v1 [stat.ML])
    Any representation of data involves arbitrary investigator choices. Because those choices are external to the data-generating process, each choice leads to an exact symmetry, corresponding to the group of transformations that takes one possible representation to another. These are the passive symmetries; they include coordinate freedom, gauge symmetry and units covariance, all of which have led to important results in physics. Our goal is to understand the implications of passive symmetries for machine learning: Which passive symmetries play a role (e.g., permutation symmetry in graph neural networks)? What are dos and don'ts in machine learning practice? We assay conditions under which passive symmetries can be implemented as group equivariances. We also discuss links to causal modeling, and argue that the implementation of passive symmetries is particularly valuable when the goal of the learning problem is to generalize out of sample. While this paper is purely conceptual, we believe that it can have a significant impact on helping machine learning make the transition that took place for modern physics in the first half of the Twentieth century.  ( 2 min )
    Variational sparse inverse Cholesky approximation for latent Gaussian processes via double Kullback-Leibler minimization. (arXiv:2301.13303v1 [stat.ML])
    To achieve scalable and accurate inference for latent Gaussian processes, we propose a variational approximation based on a family of Gaussian distributions whose covariance matrices have sparse inverse Cholesky (SIC) factors. We combine this variational approximation of the posterior with a similar and efficient SIC-restricted Kullback-Leibler-optimal approximation of the prior. We then focus on a particular SIC ordering and nearest-neighbor-based sparsity pattern resulting in highly accurate prior and posterior approximations. For this setting, our variational approximation can be computed via stochastic gradient descent in polylogarithmic time per iteration. We provide numerical comparisons showing that the proposed double-Kullback-Leibler-optimal Gaussian-process approximation (DKLGP) can sometimes be vastly more accurate than alternative approaches such as inducing-point and mean-field approximations at similar computational complexity.  ( 2 min )
    Fast Optimal Estimation with Intractable Models using Permutation-Invariant Neural Networks. (arXiv:2208.12942v2 [stat.ME] UPDATED)
    Neural networks have recently shown promise for likelihood-free inference, providing orders-of-magnitude speed-ups over classical methods. However, current implementations are suboptimal when estimating parameters from independent replicates. In this paper, we use a decision-theoretic framework to argue that permutation-invariant neural networks are ideally placed for constructing Bayes estimators for arbitrary models, provided that simulation from these models is straightforward. We show that the resulting neural Bayes estimators can quickly and optimally estimate parameters in weakly-identified and highly-parameterised models with relative ease, and that they are highly competitive and much faster than traditional likelihood-based estimators. We apply our estimator on a spatial analysis of sea-surface temperature in the Red Sea where, after training, we obtain parameter estimates, and uncertainty quantification of the estimates via bootstrap sampling, from hundreds of spatial fields in a fraction of a second.  ( 2 min )
    Optimal precision for GANs. (arXiv:2207.10541v2 [cs.LG] UPDATED)
    Many deep generative models are defined as a push-forward of a Gaussian measure by a continuous generator, such as Generative Adversarial Networks (GANs) or Variational Auto-Encoders (VAEs). This work explores the latent space of such deep generative models. A key issue with these models is their tendency to output samples outside of the support of the target distribution when learning disconnected distributions. We investigate the relationship between the performance of these models and the geometry of their latent space. Building on recent developments in geometric measure theory, we prove a sufficient condition for optimality in the case where the dimension of the latent space is larger than the number of modes. Through experiments on GANs, we demonstrate the validity of our theoretical results and gain new insights into the latent space geometry of these models. Additionally, we propose a truncation method that enforces a simplicial cluster structure in the latent space and improves the performance of GANs.  ( 2 min )
    Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces I: the compact case. (arXiv:2208.14960v2 [stat.ME] UPDATED)
    Gaussian processes are arguably the most important model class in spatial statistics. They encode prior information about the modeled function and can be used for exact or approximate Bayesian inference. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners.  ( 2 min )
    Demystifying Disagreement-on-the-Line in High Dimensions. (arXiv:2301.13371v1 [stat.ML])
    Evaluating the performance of machine learning models under distribution shift is challenging, especially when we only have unlabeled data from the shifted (target) domain, along with labeled data from the original (source) domain. Recent work suggests that the notion of disagreement, the degree to which two models trained with different randomness differ on the same input, is a key to tackle this problem. Experimentally, disagreement and prediction error have been shown to be strongly connected, which has been used to estimate model performance. Experiments have lead to the discovery of the disagreement-on-the-line phenomenon, whereby the classification error under the target domain is often a linear function of the classification error under the source domain; and whenever this property holds, disagreement under the source and target domain follow the same linear relation. In this work, we develop a theoretical foundation for analyzing disagreement in high-dimensional random features regression; and study under what conditions the disagreement-on-the-line phenomenon occurs in our setting. Experiments on CIFAR-10-C, Tiny ImageNet-C, and Camelyon17 are consistent with our theory and support the universality of the theoretical findings.  ( 2 min )
    Structure Learning and Parameter Estimation for Graphical Models via Penalized Maximum Likelihood Methods. (arXiv:2301.13269v1 [stat.ML])
    Probabilistic graphical models (PGMs) provide a compact and flexible framework to model very complex real-life phenomena. They combine the probability theory which deals with uncertainty and logical structure represented by a graph which allows one to cope with the computational complexity and also interpret and communicate the obtained knowledge. In the thesis, we consider two different types of PGMs: Bayesian networks (BNs) which are static, and continuous time Bayesian networks which, as the name suggests, have a temporal component. We are interested in recovering their true structure, which is the first step in learning any PGM. This is a challenging task, which is interesting in itself from the causal point of view, for the purposes of interpretation of the model and the decision-making process. All approaches for structure learning in the thesis are united by the same idea of maximum likelihood estimation with the LASSO penalty. The problem of structure learning is reduced to the problem of finding non-zero coefficients in the LASSO estimator for a generalized linear model. In the case of CTBNs, we consider the problem both for complete and incomplete data. We support the theoretical results with experiments.  ( 2 min )
    Near Optimal Private and Robust Linear Regression. (arXiv:2301.13273v1 [cs.LG])
    We study the canonical statistical estimation problem of linear regression from $n$ i.i.d.~examples under $(\varepsilon,\delta)$-differential privacy when some response variables are adversarially corrupted. We propose a variant of the popular differentially private stochastic gradient descent (DP-SGD) algorithm with two innovations: a full-batch gradient descent to improve sample complexity and a novel adaptive clipping to guarantee robustness. When there is no adversarial corruption, this algorithm improves upon the existing state-of-the-art approach and achieves a near optimal sample complexity. Under label-corruption, this is the first efficient linear regression algorithm to guarantee both $(\varepsilon,\delta)$-DP and robustness. Synthetic experiments confirm the superiority of our approach.  ( 2 min )

  • Open

    Reinforcement Learning to Control a 2D Quadcopter
    submitted by /u/Alyx1337 [link] [comments]  ( 40 min )
    Adding a Bonus to Q-function
    Hey, I'm trying to apply RL, precisely DQN, to recommender systems. I want to add a weighted bonus to my Q-function to favor some actions over others. This bonus is a function that depends on (s,a) just like the Q-function. Then, instead of deriving a greedy policy from the original Q, I derive a greedy policy for the bonused function Q'. For some reasons, I would really like to do this instead instead of modifying the reward function itself. My approach works well empirically, however I would want it to be more theoretically grounded. I think what I want ideally is to prove that Q' is optimal for some MDP with a slightly different reward function. And I surprisingly can't find papers doing that in the literature. There are some seemingly related things like Soft Actor Critic which adds an entropy bonus to the value function, but not much more. So, is it ok to tweak a Q-function like this? Is there something I should be careful of? Thank you for your help! submitted by /u/xalendrio [link] [comments]  ( 42 min )
    Multi-Agents Soccer Competition ⚽ (Deep Reinforcement Learning Course by Hugging Face 🤗)
    Hey there 👋 We published the ⚔️ AI vs. AI challenge⚔️, a deep reinforcement learning multi-agents competition. You’ll learn about Multi-agent Reinforcement Learning (MARL), you’ll train your agents to play soccer and you’re going to participate in AI vs. AI challenge where your trained agent will compete against other classmates’ agents every day and be ranked on a new leaderboard. You don’t need to participate in the course to be able to participate in the competition. You can start here 👉 https://huggingface.co/deep-rl-course/unit7/introduction 🏆 The leaderboard 👉 https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos 👀 Visualize your agent competing with our demo 👉https://huggingface.co/spaces/unity/SoccerTwos We also created a discord channel, ai-vs-ai-competition to exchange with others and share advice, you can join our discord server here 👉 hf.co/discord/join https://preview.redd.it/4dc01ktqnlfa1.png?width=1920&format=png&auto=webp&s=c10ff68884683373c631648725aa166364d9494d If you have questions or feedback, I would love to answer them. submitted by /u/cranthir_ [link] [comments]  ( 41 min )
    Stable Baseines3 not logging success rate for rollouts?
    I am training a PPO agent and have added the appropriate info to my custom environment to log success rate. This now means I am getting logs from my evaluation of success rate but I am not getting this for rollouts. I expect this is due to my environment creation code. As far as I can tell I've done what the documentation says. Am I missing something? env = SubprocVecEnv([lambda: Monitor(SumoEnv(gui=gui), info_keywords=("is_success",)) for i in range(num_envs)], start_method="spawn") submitted by /u/centripetalstranger [link] [comments]  ( 41 min )
    A discrete action in the continuous space?
    I’m currently using stable baselines3 and have 2 options in terms of action. Some acceleration values or a lane change. I broke this down into discrete actions. However, I want to use a continuous action space now and I can’t use a dict action space to keep the accelerations separate. Is it reasonable to make for example 1-3 in the continuous space represent accelerations and then 3-4 represent a lane change? By this I mean, should the agent pick 3.1, 3.2, 3.3, they will all result in a lane change. Will this cause issues? submitted by /u/centripetalstranger [link] [comments]  ( 42 min )
    Share Your Reinforcement Learning Interview Questions
    Hi, I have a technical interview incoming for a entry/mid level RL Engineer position. I have just finished my masters degree and have some experience in RL from previous internships. I am currently using CS 285 at UC Berkeley online lectures to prepare for the interview. I would really appreciate it if you can share the questions that you have faced or asked for similar positions or can recommend what specifics should I focus on for such interviews. submitted by /u/ZIGGY-Zz [link] [comments]  ( 42 min )
    ChatGPT and RL
    Hi I am trying to do classification (nope, it is one-step prediction and it is not sequence dependent). Question: the success of chatgpt, does this mean that we can also use RL (PPO) for just classification? My understanding is that we can just use supervised learning. Does PPO (RL) helps in this case? Thanks? submitted by /u/Dense-Smf-6032 [link] [comments]  ( 42 min )
    Scaling laws for single-agent reinforcement learning (OpenAI)
    submitted by /u/goolulusaurs [link] [comments]  ( 40 min )
    What does the output of the actor network should generally represent?
    Hi, I’m trying to understand some basic concepts of RL. I’m developing a model that should predict the sum of future rewards for any given state (simplified version of bellman’s equation). Then it should compare the actual future reward and it’s prediction with the loss function and backpropagate. This seems to be pretty standard. What I’m not getting, is that when I’m generating my batch of data (for the offline training), I think that the standard should be to choose the action based on a categorical distribution of the predictions for each action (or use epsilon greedy). The problem is that if i have any negative prediction, even if it’s random, it will never reach that state and never update based on it. Is that right? Is it how it’s supposed to be or am I having the wrong concept of what the network should output. Thanks in advance! submitted by /u/enzodtz [link] [comments]  ( 44 min )
    What does the following loss function mean in reinforcement learning?
    I am not an expert on policy learning or reinforcement learning, and I am studying this paper "https://openaccess.thecvf.com/content/ICCV2021/papers/Sun_Dynamic_Network_Quantization_for_Efficient_Video_Inference_ICCV_2021_paper.pdf" right now, but got confused by eq(11) The captial Omega is the action space and small "a_i" is the i-th action in a trajectory. T is the total number of actions in this trajectory,the small k is a selected action from the action space "Omega". I don't understand what symbol it is after the 2nd summation (\sum)in the equation, and I am not sure if it is an "L" or "I". Could someone provide some guidance on what the following function might mean, and why it can be used to achieve balanced policy usage? https://preview.redd.it/n4deietkaifa1.png?width=958&format=png&auto=webp&s=919453f2d7516bd2908c6a23fe6aa1e6fc04edf8 submitted by /u/AaronSpalding [link] [comments]  ( 43 min )
    RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed)
    I am trying to write my own actor critic algorithm. Unlike other implementations, I tried to keep a separate actor and critic network. The problem arises somewhere in my actor or critic loss function ​ The error is originating here - advantage = nrml_disc_rewards-values critic_loss = advantage.pow(2).mean() actor_loss = -(torch.sum(torch.log(prob_batch)*advantage)) policy_opt.zero_grad() actor_loss.backward() policy_opt.step() value_opt.zero_grad() critic_loss.backward() value_opt.step() This is the full traceback - D:\q_learning\actor_critic.py:90: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at C:\cb\pytorch_100000…  ( 44 min )
  • Open

    [D] Any open source model, or application to remove no speech parts of a video?
    Currently I am using Davinci Resolve free edition to manually cut / remove no speech parts, or the parts where I take a breath It is extremely time consuming I am pretty sure this can be done via AI For example whisper is able to detect where we use filler words such as umh, um, uh etc That would be awesome to automatically remove these parts from a video Just direct me where to look thank you submitted by /u/CeFurkan [link] [comments]  ( 43 min )
    [N] OpenAI starts selling subscriptions to its ChatGPT bot
    https://www.axios.com/2023/02/01/chatgpt-subscriptions-chatbot-openai Not fully paywalled, but there's a tiering system. submitted by /u/bikeskata [link] [comments]  ( 42 min )
    [D] Normalizing Flows in 2023?
    What is the state of research in normalizing flows in 2023? Have they been superseded by diffusion models for sample generation? If so, what are some other applications where normalizing flows are still SOTA (or even useful)? submitted by /u/wellfriedbeans [link] [comments]  ( 42 min )
    [D] Advice for a multi-label classification problem
    Hi guys, I have a dataset of 12,000 products, each of which consists of a title, description, and some images. In addition, I also have a pre-defined set of product categories. Curious to learn if anyone has any suggestions on what model to be used to train using this dataset as input to classify each product in the dataset into the related categories within the given set? submitted by /u/dle88 [link] [comments]  ( 42 min )
    [D] Why is stable diffusion much smaller than predecessors?
    Stable diffusion seems to be a departure from the trend of building larger and larger models. It has 10x less parameters than other image generation models like DALLE-2. “Incredibly, compared with DALL-E 2 and Imagen, the Stable Diffusion model is a lot smaller. While DALL-E 2 has around 3.5 Billion parameters, and Imagen has 4.6 Billion, the first Stable Diffusion model has just 890 million parameters, which means it uses a lot less VRAM and can actually be run on consumer-grade graphics cards.” What allows stable diffusion to work so well with a lot less parameters? Are there any drawbacks to this, like requiring stable diffusion to be fine tuned more than DALLE-2 for example? submitted by /u/dahdarknite [link] [comments]  ( 43 min )
    [R] Extracting Training Data from Diffusion Models
    https://twitter.com/eric_wallace_/status/1620449934863642624?s=46&t=GVukPDI7944N8-waYE5qcw Extracting training data from diffusion models is possible by following, more or less, these steps: Compute CLIP embeddings for the images in a training dataset. Perform an all-pairs comparison and mark the pairs with l2 distance smaller than some threshold as near duplicates Use the prompts for training samples marked as near duplicates to generate N synthetic samples with the trained model Compute the all-pairs l2 distance between the embeddings of generated samples for a given training prompt. Build a graph where the nodes are generated samples and an edge exists if the l2 distance is less than some threshold. If the largest clique in the resulting graph is of size 10, then the training sample is considered to be memorized. Visually inspect the results to determine if the samples considered to be memorized are similar to the training data samples. With this method, the authors were able to find samples from Stable Diffusion and Imagen corresponding to copyrighted training images. submitted by /u/pm_me_your_pay_slips [link] [comments]  ( 45 min )
    [D] Vectorizing computation of the Jaccard similarity between all instances in a large dataset in Python
    I am trying to claculate the Jaccard similarity between all instances in my dataframe. I am using the following method to do so, however, this method is painfully slow. My ```data_with_labels``` shape is (221277, 217). # Compute the Jaccard similarity between all instances n_instances = data_with_labels.shape[0] jaccard_similarity_matrix = np.zeros((n_instances, n_instances)) for i in range(n_instances): for j in range(n_instances): jaccard_similarity_matrix[i, j] = jaccard_score(data_with_labels[i, :], data_with_labels[j, :], average='micro') ​ Is there any way to do this process with numpy vectorization? I tried soomething like this but keep getting this error: n_instances = data_with_labels.shape[0] jaccard_similarity_matrix = np.zeros((n_instances, n_instances)) for i in range(n_instances): jaccard_similarity_matrix[i, :] = jaccard_score(data_with_labels[i, :], data_with_labels, average='micro') ValueError: Found input variables with inconsistent numbers of samples: [217, 221277] ​ submitted by /u/hopedallas [link] [comments]  ( 43 min )
    [R] On the Expressive Power of Geometric Graph Neural Networks
    Geometric GNNs are an emerging class of GNNs for spatially embedded graphs in scientific and engineering applications, s.a. biomolecular structure, material science, and physical simulations. Notable examples include SchNet, DimeNet, Tensor Field Networks, and E(n) Equivariant GNNs. How powerful are geometric GNNs? How do key design choices influence expressivity and how to build maximally powerful ones? Check out this recent paper for more: 📄 PDF: http://arxiv.org/abs/2301.09308 💻 Code: http://github.com/chaitjo/geometric-gnn-dojo 💡Key findings: https://twitter.com/chaitjo/status/1617812402632019968 P.S. Are you new to Geometric GNNs, GDL, PyTorch Geometric, etc.? Want to understand how theory/equations connect to real code? Try this Geometric GNN 101 notebook before diving in: https://github.com/chaitjo/geometric-gnn-dojo/blob/main/geometric_gnn_101.ipynb submitted by /u/chaitjo [link] [comments]  ( 43 min )
    [P] An open source tool for repeatable PyTorch experiments by embedding your code in each model checkpoint
    I made a new open source tool called JellyML that lets you go back to any of your checkpoints, and reproduce your code exactly as it was when you trained it. You can find the website here: https://jellyml.com The GitHub repo: https://gitHub.com/mmulet/jellyml You can install it with pip: pip install jellyml submitted by /u/latefordinnerstudios [link] [comments]  ( 42 min )
    [D] What does a DL role look like in ten years?
    Every day, there seems to be new evidence of the generalization capabilities of LLMs. What does this mean for the future role of deep learning experts in academia and business? It seems like there's a significant chance that skills such as PyTorch and Jax will be displaced by prompt construction and off-the-shelf model APIs, with only a few large institutions working on the DNN itself. Curious to hear others' thoughts on this. submitted by /u/PassingTumbleweed [link] [comments]  ( 47 min )
    [D] Tortoise TTS API for GPT-3.
    Hey everyone, I thought of an idea to create a human like realistic voice assistant for ChatGPT. So I have a question that can we make an API of tortoise TTS trained on a specific voice. I've seen a lot of companies nowadays that provides most realistic text to speech solutions like eleven labs etc. Do they train these voices on tortoise TTS?? If there is another way of creating highly realistic voices and make an API of it, then please tell me how can I do it? And also how can I make this process fast as regular normal TTS? submitted by /u/akshaysri0001 [link] [comments]  ( 43 min )
    [P] predictive modeling- Multi stage classification
    Problem statement: assume a user come into a system and it typically takes 10 weeks for outcome(yes,no). I want to build a model which predicts the outcome on any particular week say how likely are they gonna succeed on week 1,2,3 etc. Question on model building approach: should I build weekly models and get the prediction ? Or is there a better way to do it. Ideally it would be great have single model that can be used for different weeks. I prefer the latter. Appreciate your ideas submitted by /u/R-PRADY [link] [comments]  ( 42 min )
    [P] NER output label post processing
    I’m looking to some aggregation on academic research and news articles to see what insights I get from it. I’m using textrazor to do named entity recognition on the documents, but getting a lot of dirty labels that have slightly different wording. For example, Tesla, Tesla ltd, Tesla Ltd. As a result, my aggregations have a lot of duplicate results. The dataset consists of about 4M labels so the solution has to be efficient to be viable. I was thinking of putting the labels through word2vec and then clustering them based on the word embedding distances? But then the problem arises of how many clusters to use? I’ve also tried simple regex preprocessing to get rid of the company abbreviations but there are other examples that cannot be solved that easily. submitted by /u/hasiemasie [link] [comments]  ( 43 min )
    [D] A report that compares the practices of high-performing companies in Europe to laggards in AI adoption
    Discussion about a report that compares the practices and attitudes of companies that self-report as ahead of the competition in AI adoption in Europe, compared to companies that identify as behind or at the same stage as their competitors. It contains some interesting findings mixed with some somewhat obvious things. Kinda obvious that leading companies also are further ahead in using MLOps, but I thought it was interesting to see the frequency of fine-tuning and retraining. Not as obvious that most companies report a lack of access to training data, would have thought that is mostly something that smaller companies have issues with. Also not so obvious to me is that companies with a centralized decision-making related to AI seem to dominate among high-performers. Interesting that most companies seem to get some value out of their AI/ML projects, which seems to contradict some of the previous forecasts by the big consultancy companies. Link to the report: https://stagezero.ai/2022-survey-report/ submitted by /u/madnessone1 [link] [comments]  ( 43 min )
    [R] EMNLP video interviews, workshops, and posters
    I learned a lot at EMNLP in December and captured some of what I learned in this video. Interviews I asked five NLP researchers these questions: 1- What is the most exciting development in NLP in 2022 2- What are you looking forward to in 2023? 3- What is an underrated idea that the field should pay more attention to? Their answers start at 01:22. Workshops I got to spend time at these workshops: Generation, Evaluation & Metrics (GEM) Massively Multilingual NLU Blackbox NLP My main takeaways are at 09:25. Posters If you've been to a conference you'd know there's an overwhelming number of posters. I recorded four of the ones I came across and thought were interesting (covering retrieval-augmented text generation, human evaluation, the BLOOM multimodal dataset, and a multimodal method to name music playlists). Poster presentations start at 14:38 Full video: https://www.youtube.com/watch?v=plCvF_7qrmY ​ What's your answer to these questions? 1- What is the most exciting development in NLP in 2022 2- What are you looking forward to in 2023? 3- What is an underrated idea that the field should pay more attention to? ​ submitted by /u/jayalammar [link] [comments]  ( 43 min )
    [P] A CLI tool for easy transformer sequence classifier training and inference
    Hi everyone, I have developed a CLI tool to train a transformer sequence classification model. There are also options for preprocessing data and inference on new data. I was thinking that interesting use cases might be found within economics/finance and biological domains, and would be super interested in feedback on: - if the documentation is intelligible and enables you to use it - to which use cases from your industry/domain could discrete sequence modelling be applied - what additional features you'd need for it to be useful to you Basically, where would the prediction of a class (or the next item) based on discrete events/objects/tokens be useful? The project is called "sequifier" and can be found here: https://github.com/0xideas/sequifier submitted by /u/0xideas [link] [comments]  ( 43 min )
    [P] Self Hostable OpenAI Alternative
    Hi, Text-Generator.io is now self hostable, It's priced at $1000 USD per instance per year to self host. The service runs on a single 24GB VRAM GPU, and runs all services including speech to text, text and code generation for almost all languages and generating embeddings too. The text generator also downloads and analyses any input with links including documents, images, images with text inside and webpages for better understanding and to generate better text. It's a great alternative to OpenAI and has a compatible API making switching easy. You can check out the new pricing here. Let me know what you think and if there's anything i can do to help! All the best. Lee Penkman - Founder Text-Generator.io submitted by /u/leepenkman [link] [comments]  ( 44 min )
    [R] SETI finds eight potential alien signals with ML
    GitHub (sadly without weights). https://github.com/PetchMa/ML_GBT_SETI News. https://www-scinexx-de.translate.goog/news/kosmos/seti-findet-acht-potenzielle-alien-signale/?_x_tr_sl=de&_x_tr_tl=en&_x_tr_hl=de&_x_tr_pto=wapp submitted by /u/logTom [link] [comments]  ( 44 min )
    [R] Faithful Chain-of-Thought Reasoning
    Paper : https://arxiv.org/abs/2301.13379 Abstract : While Chain-of-Thought (CoT) prompting boosts Language Models' (LM) performance on a gamut of complex reasoning tasks, the generated reasoning chain does not necessarily reflect how the model arrives at the answer (aka. faithfulness). We propose Faithful CoT, a faithful-by-construction framework that decomposes a reasoning task into two stages: Translation (Natural Language query → symbolic reasoning chain) and Problem Solving (reasoning chain → answer), using an LM and a deterministic solver respectively. We demonstrate the efficacy of our approach on 10 reasoning datasets from 4 diverse domains. It outperforms traditional CoT prompting on 9 out of the 10 datasets, with an average accuracy gain of 4.4 on Math Word Problems, 1.9 on Planning, 4.0 on Multi-hop Question Answering (QA), and 18.1 on Logical Inference, under greedy decoding. Together with self-consistency decoding, we achieve new state-of-the-art few-shot performance on 7 out of the 10 datasets, showing a strong synergy between faithfulness and accuracy. submitted by /u/starstruckmon [link] [comments]  ( 44 min )
    [D] Audio segmentation - Machine Learning algorithm to segment a audio file into multiple class
    Can someone suggest a machine learning model that will segment audio spectrogram to multiple classes. I have labeled data of heart beats. S1, S2, systole and diastole. How to train a segmentation model ? submitted by /u/PlayfulMenu1395 [link] [comments]  ( 42 min )
  • Open

    MusicLM: Generating Music From Text - a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff"
    submitted by /u/magenta_placenta [link] [comments]  ( 40 min )
    Just wanted to share this story created by Chat-GPT3
    Once upon a time, in the Hundred Acre Wood, Winnie the Pooh was feeling down in the dumps. All of his friends were busy with their own activities and he was feeling left out. Pooh decided to take a walk and ended up stumbling upon a mysterious honey pot that had "BB" written on it. Curious, Pooh started to investigate and discovered that the honey pot was actually a cover for a hidden laboratory run by a character similar to Jesse Pinkman from Breaking Bad. The character was cooking up a batch of special honey, which Pooh was immediately drawn to. Despite the character's warnings, Pooh couldn't resist the delicious aroma and sneaked a taste. He was shocked to find that the honey was not only the sweetest he had ever tasted, but it also gave him a burst of energy and focus that he had never experienced before. Intrigued, Pooh started to spend more time with the character, helping him with his honey making and learning about the science behind it. Pooh soon discovered that the character was using the honey to pay for his granddaughter's medical treatment, just like in Breaking Bad. Pooh felt a sense of camaraderie with the character and wanted to help. Together, they came up with a plan to create a legitimate business selling the special honey, but with a focus on using the profits to help others in need. As their business grew, so did their friendship. Pooh was no longer feeling left out, as he was now a part of something important and fulfilling. The two worked together, using their newfound skills and knowledge, to bring joy and happiness to the residents of the Hundred Acre Wood. And so, Winnie the Pooh and his friend from Breaking Bad lived happily ever after, spreading the sweet nectar of kindness and generosity to all. submitted by /u/v1ll3_m [link] [comments]  ( 42 min )
    Perplexity Ask is now available as a Chrome extension. With AI help, you can read quick answers from your extension bar, click on sources, and navigate to http://perplexity.ai when needed. Searches are filled in with a single click from Google and Bing:
    submitted by /u/rafs2006 [link] [comments]  ( 41 min )
    OpenAI rolls out ChatGPT Plus for $20 a month
    submitted by /u/much_successes [link] [comments]  ( 40 min )
    Just dropped
    submitted by /u/zCaptainBr0 [link] [comments]  ( 41 min )
    📌[Searchcolab] "Future of National Park in USA due to Climate Change". Link in comment
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    Gmail creator says ChatGPT might "destroy" Google within 2 years
    submitted by /u/ExperienceKCC [link] [comments]  ( 40 min )
    Career Advice?
    im a junior in high school and up until now i had no idea what i wanted to study. after this wave of ai stuff happened and i realized how interested i am in it, im wondering if maybe i should study something involving ai. i have no idea where to start though. software engineer? machine learning engineer? i would love if someone could help decide what i actually would want to do for a career and what i should major in to get me there. thank you. submitted by /u/nicdunz [link] [comments]  ( 41 min )
    All this fuss over Open AI political bias is blather until...
    Biden types in, "To what extent should our military be involved in the war in Ukraine?" submitted by /u/yoitscoach [link] [comments]  ( 40 min )
    Timecoded video and text to artificial voiceover
    Screenshot of an example Hello, I have a question, now it is much easier to transcribe a video with artificial intelligence and then import to Premiere or Vegas to create subtitles inserted in the video image itself, but can you think of a way to convert that same text to artificial voice and create audio file or track with the same timecode or just sync with the original video? It could be done with any text-to-speech converter, but then you would have to manually cut and wrap each piece of text. Example: 1 00:00:00,000 --> 00:00:02,520 In this space we like to feel a bit of everything, 2 00:00:02,600 --> 00:00:04,600 but the important thing is to make dance. 3 00:00:04,680 --> 00:00:06,680 A task performed by all digiles 4 00:00:06,760 --> 00:00:12,160 and one who has been doing this since the 80s is the maestro Maik 5 00:00:12,240 --> 00:00:14,240 Good evening Maik. ​ submitted by /u/mamomo1 [link] [comments]  ( 41 min )
    The History of Artificial Intelligence: Understanding the Brain, explores Reinforcement Learning and Perceptron
    What makes us think? What is inside the brain that makes us conscious? Can we build a universal AI machine to study and understand the universe? https://www.youtube.com/watch?v=AsXx9gyh39M https://preview.redd.it/endsstfshlfa1.png?width=1920&format=png&auto=webp&s=7677e3d0d463e39de0b423a6881ce3876acbe06b submitted by /u/Ok-District-4701 [link] [comments]  ( 40 min )
    See Roy Lichtenstein's Staggering 0-year-old Portrait In Sharp Focus!
    submitted by /u/Calatravo [link] [comments]  ( 40 min )
    Flawless AI lets you change the dialogue on a video and the lips sync absolutely perfectly to each word. Could be big for the movie industry.
    submitted by /u/Dalembert [link] [comments]  ( 44 min )
    The steam engine changed the world. Artificial intelligence could destroy it. - The Boston Globe
    submitted by /u/GlobeOpinion [link] [comments]  ( 40 min )
    Frida Kahlo Paints Grandmother & Grandchild W/ Contemplative Rainforest Vibe
    submitted by /u/Calatravo [link] [comments]  ( 40 min )
    The Best Curated List of A.I. Newsletters Ever (Feb, 2023 with Twitter handles).
    submitted by /u/BackgroundResult [link] [comments]  ( 41 min )
    OpenAI’s new ChatGPT tool may help you tell if text was written by a human or AI
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    Google is reportedly testing an alternate home page with ChatGPT-style Q&A prompts
    This is laughable.They were sitting on all of the technology.And now they scramble to do something better than 10 links.I for myself will be disappointed with anything less than movie Her. It's a high bar.May be.I would not expect personality.May be some rudementary memory.But the ability to perform almost any digital task must be there.It can be built in a garage using open source projects.COME ON.Some good programmers and hackathon.Yes I am waiting for stability ai model.Or may be gpt 3 API can be used.But submitted by /u/nikitastaf1996 [link] [comments]  ( 41 min )
    Top 9 generative AI tools
    In 2023, generative AI tools will disrupt how we create and share content. What are your favorite generative AI tools? AI avatar - Synthesia AI-generated automations - Bardeen.ai Copy - copy.ai Personalized videos - Rephrase.ai Video editing - Descript Content creation - Type Studio Voice over - Murf.ai Design - Designs.ai Background music - Soundraw Read the full article https://www.bardeen.ai/posts/generative-ai-tools submitted by /u/Intelligent_Shop_012 [link] [comments]  ( 41 min )
    OpenAI Has Launched ChatGPT Content Detection Tool
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 40 min )
    What is Google's MusicLM? (What are your impressions of it?)
    submitted by /u/BackgroundResult [link] [comments]  ( 40 min )
  • Open

    The Future of AI: GPT-3 vs GPT-4: A Comparative Analysis
    In this post, we will dive deep into the world of Artificial Intelligence and take a closer look at two of the most advanced AI algorithms… Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 7 min )
    ChatGPT’s authorship: Is it time to redefine authorship in the age of AI?
    In this blog post, we will take a closer look at the implications of ChatGPT’s authorship, the role of AI in scientific literature, and… Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 8 min )
    Day 7: Advance SQL For Data Science
    So far this is the 7th blog in the journey of basics to advance SQL. you can refer to previous blogs for learning SQL from scratch, This…  ( 8 min )
    How Linear Regression leads to Logistic Regression
    Linear & Logistic: The Relationship Between Regression Models Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 11 min )
    Meet Mr.ChatGPT: A Large Language Model Trained by OpenAI
    Hello and welcome to the blog! My name is ChatGPT, and I am a large language model trained by OpenAI.  P.S. This article includes a use…  ( 9 min )
  • Open

    MIT Solve announces 2023 global challenges and Indigenous Communities Fellowship
    More than $1 million in funding available to selected Solver teams and fellows.  ( 7 min )
  • Open

    How to decide between Amazon Rekognition image and video API for video moderation
    Almost 80% of today’s web content is user-generated, creating a deluge of content that organizations struggle to analyze with human-only processes. The availability of consumer information helps them make decisions, from buying a new pair of jeans to securing home loans. In a recent survey, 79% of consumers stated they rely on user videos, comments, […]  ( 10 min )
    Scaling distributed training with AWS Trainium and Amazon EKS
    Recent developments in deep learning have led to increasingly large models such as GPT-3, BLOOM, and OPT, some of which are already in excess of 100 billion parameters. Although larger models tend to be more powerful, training such models requires significant computational resources. Even with the use of advanced distributed training libraries like FSDP and […]  ( 11 min )
  • Open

    The Flan Collection: Advancing open source methods for instruction tuning
    Posted by Shayne Longpre, Student Researcher, and Adam Roberts, Senior Staff Software Engineer, Google Research, Brain Team Language models are now capable of performing many new natural language processing (NLP) tasks by reading instructions, often that they hadn’t seen before. The ability to reason on new tasks is mostly credited to training models on a wide variety of unique instructions, known as “instruction tuning”, which was introduced by FLAN and extended in T0, Super-Natural Instructions, MetaICL, and InstructGPT. However, much of the data that drives these advances remain unreleased to the broader research community.  In “The Flan Collection: Designing Data and Methods for Effective Instruction Tuning”, we closely examine and release a newer and more extensive publicly ava…  ( 92 min )
  • Open

    Introducing ChatGPT Plus
    We’re launching a pilot subscription plan for ChatGPT, a conversational AI that can chat with you, answer follow-up questions, and challenge incorrect assumptions. The new subscription plan, ChatGPT Plus, will be available for $20/month, and subscribers will receive a number of benefits: General access to ChatGPT, even  ( 2 min )
  • Open

    Train YOLOv8 on Custom Dataset – A Complete Tutorial
    submitted by /u/keghn [link] [comments]  ( 40 min )
    Study: Superconductivity switches on and off in “magic-angle” graphene
    submitted by /u/keghn [link] [comments]  ( 40 min )
    Deltas and Delta-Deltas Features Explained
    Hi guys, I have made a video on YouTube here where I explain how deltas and delta-deltas features are computed. These are used quite a lot in speech recognition systems. I hope it may be of use to some of you out there. As always, feedback is more than welcomed! :) submitted by /u/Personal-Trainer-541 [link] [comments]  ( 41 min )
    Help NeuralNetwork on Python and RapidMiner
    Hi right now i have to implement a neural network from rapidminer into a python script to predict a value, but i cant get what is wrong with my program please help. import pandas as pd import math def sig(x): return 1 / (1 + math.exp(-x)) #Funcion de perceptron class Perceptron: #Constructor def __init__(self,weights,bias): self.weights = weights self.bias = bias self.output = 0 def setOutput(self,value): self.output = value def getBias(self): return self.bias def getOutput(self): return self.output def guess(self,input): sum = 0 for i in range(10): sum = sum + (input[i]*self.weights[i]) sum = sum + self.bias self.output = sig(sum) return self.output class Output: def __init__(self,nodes,threshold,name): self.nodes = nodes self.threshold = threshold self.name = name def guess(self,input):…  ( 42 min )
  • Open

    Meet the Omnivore: Architectural Researcher Lights Up Omniverse Scenes With ‘SunPath’ Extension
    Things are a lot sunnier these days for designers looking to visualize their projects in NVIDIA Omniverse, a platform for creating and operating metaverse applications.  ( 6 min )
    Deloitte’s Nitin Mittal on the Secrets of ‘All-In’ AI Success
    Artificial intelligence is the new electricity. The fifth industrial revolution. And companies that go all-in on AI are reaping the rewards. So how do you make that happen? That big question — how? — is explored by Nitin Mittal, principal at Deloitte, one of the world’s largest professional services organizations, and co-author Thomas Davenport in Read article >  ( 4 min )
  • Open

    Will ChatGPT Make Fraud Easier?
    Less than 24 hours after posting my previous Data Science Central article (here), dozens of illegitimate copies started to pop up on various websites. Below is an example (title + first paragraph): Fake: An experimental guide to the Riemann conjecture — the correct term is heuristic evidence. It is a strong argument based on empirical evidence rather… Read More »Will ChatGPT Make Fraud Easier? The post Will ChatGPT Make Fraud Easier? appeared first on Data Science Central.  ( 22 min )
  • Open

    FedDig: Robust Federated Learning Using Data Digest to Represent Absent Clients. (arXiv:2210.00737v3 [cs.LG] UPDATED)
    Federated Learning (FL) is a collaborative learning performed by a moderator that protects data privacy. Existing cross-silo FL solutions seldom address the absence of participating clients during training which can seriously degrade model performances, particularly for unbalanced and non-IID client data. We address this issue by generating secure data digests from the raw data and using them to guide model training at the FL moderator. The proposed FL with data digest (FedDig) framework can tolerate unexpected client absence while preserving data privacy. This is achieved by de-identifying digests by mixing and perturbing the encoded features of the raw data in the feature space. The feature perturbing is performed following the Laplace mechanism of Differential Privacy. We evaluate FedDig on EMNIST, CIFAR-10, and CIFAR-100 datasets. The results consistently outperform three baseline algorithms (FedAvg, FedProx, and FedNova) by large margins in multiple client absence scenarios.  ( 2 min )
    Finite-Time Analysis of Fully Decentralized Single-Timescale Actor-Critic. (arXiv:2206.05733v2 [cs.LG] UPDATED)
    Decentralized Actor-Critic (AC) algorithms have been widely utilized for multi-agent reinforcement learning (MARL) and have achieved remarkable success. Apart from its empirical success, the theoretical convergence property of decentralized AC algorithms is largely unexplored. Most of the existing finite-time convergence results are derived based on either double-loop update or two-timescale step sizes rule, and this is the case even for centralized AC algorithm under a single-agent setting. In practice, the \emph{single-timescale} update is widely utilized, where actor and critic are updated in an alternating manner with step sizes being of the same order. In this work, we study a decentralized \emph{single-timescale} AC algorithm.Theoretically, using linear approximation for value and reward estimation, we show that the algorithm has sample complexity of $\tilde{\mathcal{O}}(\varepsilon^{-2})$ under Markovian sampling, which matches the optimal complexity with a double-loop implementation (here, $\tilde{\mathcal{O}}$ hides a logarithmic term). When we reduce to the single-agent setting, our result yields new sample complexity for centralized AC using a single-timescale update scheme. The central to establishing our complexity results is \emph{the hidden smoothness of the optimal critic variable} we revealed. We also provide a local action privacy-preserving version of our algorithm and its analysis. Finally, we conduct experiments to show the superiority of our algorithm over the existing decentralized AC algorithms.  ( 2 min )
    Mirror Sinkhorn: Fast Online Optimization on Transport Polytopes. (arXiv:2211.10420v2 [cs.LG] UPDATED)
    Optimal transport is an important tool in machine learning, allowing to capture geometric properties of the data through a linear program on transport polytopes. We present a single-loop optimization algorithm for minimizing general convex objectives on these domains, utilizing the principles of Sinkhorn matrix scaling and mirror descent. The proposed algorithm is robust to noise, and can be used in an online setting. We provide theoretical guarantees for convex objectives and experimental results showcasing it effectiveness on both synthetic and real-world data.
    FedCliP: Federated Learning with Client Pruning. (arXiv:2301.06768v2 [cs.LG] UPDATED)
    The prevalent communication efficient federated learning (FL) frameworks usually take advantages of model gradient compression or model distillation. However, the unbalanced local data distributions (either in quantity or quality) of participating clients, contributing non-equivalently to the global model training, still pose a big challenge to these works. In this paper, we propose FedCliP, a novel communication efficient FL framework that allows faster model training, by adaptively learning which clients should remain active for further model training and pruning those who should be inactive with less potential contributions. We also introduce an alternative optimization method with a newly defined contribution score measure to facilitate active and inactive client determination. We empirically evaluate the communication efficiency of FL frameworks with extensive experiments on three benchmark datasets under both IID and non-IID settings. Numerical results demonstrate the outperformance of the porposed FedCliP framework over state-of-the-art FL frameworks, i.e., FedCliP can save 70% of communication overhead with only 0.2% accuracy loss on MNIST datasets, and save 50% and 15% of communication overheads with less than 1% accuracy loss on FMNIST and CIFAR-10 datasets, respectively.  ( 2 min )
    Mega: Moving Average Equipped Gated Attention. (arXiv:2209.10655v3 [cs.LG] UPDATED)
    The design choices in the Transformer attention mechanism, including weak inductive bias and quadratic computational complexity, have limited its application for modeling long sequences. In this paper, we introduce Mega, a simple, theoretically grounded, single-head gated attention mechanism equipped with (exponential) moving average to incorporate inductive bias of position-aware local dependencies into the position-agnostic attention mechanism. We further propose a variant of Mega that offers linear time and space complexity yet yields only minimal quality loss, by efficiently splitting the whole sequence into multiple chunks with fixed length. Extensive experiments on a wide range of sequence modeling benchmarks, including the Long Range Arena, neural machine translation, auto-regressive language modeling, and image and speech classification, show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models.  ( 2 min )
    Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size. (arXiv:2211.11092v2 [cs.LG] UPDATED)
    Training large neural networks is known to be time-consuming, with the learning duration taking days or even weeks. To address this problem, large-batch optimization was introduced. This approach demonstrated that scaling mini-batch sizes with appropriate learning rate adjustments can speed up the training process by orders of magnitude. While long training time was not typically a major issue for model-free deep offline RL algorithms, recently introduced Q-ensemble methods achieving state-of-the-art performance made this issue more relevant, notably extending the training duration. In this work, we demonstrate how this class of methods can benefit from large-batch optimization, which is commonly overlooked by the deep offline RL community. We show that scaling the mini-batch size and naively adjusting the learning rate allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of out-of-distribution actions, and (3) improved convergence time, effectively shortening training duration by 3-4x times on average.  ( 2 min )
    Safe and Adaptive Decision-Making for Optimization of Safety-Critical Systems: The ARTEO Algorithm. (arXiv:2211.05495v2 [cs.LG] UPDATED)
    We consider the problem of decision-making under uncertainty in an environment with safety constraints. Many business and industrial applications rely on real-time optimization to improve key performance indicators. In the case of unknown characteristics, real-time optimization becomes challenging, particularly because of the satisfaction of safety constraints. We propose the ARTEO algorithm, where we cast multi-armed bandits as a mathematical programming problem subject to safety constraints and learn the unknown characteristics through exploration while optimizing the targets. We quantify the uncertainty in unknown characteristics by using Gaussian processes and incorporate it into the cost function as a contribution which drives exploration. We adaptively control the size of this contribution in accordance with the requirements of the environment. We guarantee the safety of our algorithm with a high probability through confidence bounds constructed under the regularity assumptions of Gaussian processes. We demonstrate the safety and efficiency of our approach with two case studies: optimization of electric motor current and real-time bidding problems. We further evaluate the performance of ARTEO compared to a safe variant of upper confidence bound based algorithms. ARTEO achieves less cumulative regret with accurate and safe decisions.  ( 2 min )
    SGD and Weight Decay Provably Induce a Low-Rank Bias in Neural Networks. (arXiv:2206.05794v3 [cs.LG] UPDATED)
    In this paper, we study the bias of Stochastic Gradient Descent (SGD) to learn low-rank weight matrices when training deep ReLU neural networks. Our results show that training neural networks with mini-batch SGD and weight decay causes a bias towards rank minimization over the weight matrices. Specifically, we show, both theoretically and empirically, that this bias is more pronounced when using smaller batch sizes, higher learning rates, or increased weight decay. Additionally, we predict and observe empirically that weight decay is necessary to achieve this bias. Finally, we empirically investigate the connection between this bias and generalization, finding that it has a marginal effect on generalization. Our analysis is based on a minimal set of assumptions and applies to neural networks of any width or depth, including those with residual connections and convolutional layers.  ( 2 min )
    Transformers over Directed Acyclic Graphs. (arXiv:2210.13148v3 [cs.LG] UPDATED)
    Transformer models have recently gained popularity in graph representation learning as they have the potential to learn complex relationships beyond the ones captured by regular graph neural networks. The main research question is how to inject the structural bias of graphs into the transformer architecture, and several proposals have been made for undirected molecular graphs and, recently, also for larger network graphs. In this paper, we study transformers over directed acyclic graphs (DAGs) and propose architecture adaptations tailored to DAGs: (1) An attention mechanism that is more efficient than the regular quadratic complexity of transformers and at the same time faithfully captures the DAG structure, and (2) a positional encoding of the DAG's partial order, complementing the former. We rigorously evaluate our framework in ablation studies and show that it is effective in improving different kinds of baseline transformers over various types of data, in experiments ranging from classifying source code graphs to nodes in self-citation networks. In particular, our proposal makes (graph) transformers competitive to or outperform graph neural networks tailored to DAGs.  ( 2 min )
    Learning under Data Drift with Time-Varying Importance Weights. (arXiv:2210.01422v2 [cs.LG] UPDATED)
    Real-world deployment of machine learning models is challenging when data evolves over time. And data does evolve over time. While no model can work when data evolves in an arbitrary fashion, if there is some pattern to these changes, we might be able to design methods to address it. This paper addresses situations when data evolves gradually. We introduce a novel time-varying importance weight estimator that can detect gradual shifts in the distribution of data. Such an importance weight estimator allows the training method to selectively sample past data -- not just similar data from the past like a standard importance weight estimator would but also data that evolved in a similar fashion in the past. Our time-varying importance weight is quite general. We demonstrate different ways of implementing it that exploit some known structure in the evolution of data. We demonstrate and evaluate this approach on a variety of problems ranging from supervised learning tasks (multiple image classification datasets) where the data undergoes a sequence of gradual shifts of our design to reinforcement learning tasks (robotic manipulation and continuous control) where data undergoes a shift organically as the policy or the task changes.
    Exploring Efficient-tuning Methods in Self-supervised Speech Models. (arXiv:2210.06175v3 [eess.AS] UPDATED)
    In this study, we aim to explore efficient tuning methods for speech self-supervised learning. Recent studies show that self-supervised learning (SSL) can learn powerful representations for different speech tasks. However, fine-tuning pre-trained models for each downstream task is parameter-inefficient since SSL models are notoriously large with millions of parameters. Adapters are lightweight modules commonly used in NLP to solve this problem. In downstream tasks, the parameters of SSL models are frozen, and only the adapters are trained. Given the lack of studies generally exploring the effectiveness of adapters for self-supervised speech tasks, we intend to fill this gap by adding various adapter modules in pre-trained speech SSL models. We show that the performance parity can be achieved with over 90% parameter reduction, and discussed the pros and cons of efficient tuning techniques. This is the first comprehensive investigation of various adapter types across speech tasks.
    Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors. (arXiv:2211.11031v2 [cs.LG] UPDATED)
    Large pre-trained models decay over long-term deployment as input distributions shift, user requirements change, or crucial knowledge gaps are discovered. Recently, model editors have been proposed to modify a model's behavior by adjusting its weights during deployment. However, when editing the same model multiple times, these approaches quickly decay a model's performance on upstream data and forget how to fix previous errors. We propose and study a novel Lifelong Model Editing setting, where streaming errors are identified for a deployed model and we update the model to correct its predictions without influencing unrelated inputs without access to training edits, exogenous datasets, or any upstream data for the edited model. To approach this problem, we introduce General Retrieval Adaptors for Continual Editing, or GRACE, which learns to cache a chosen layer's activations in an adaptive codebook as edits stream in, leaving original model weights frozen. GRACE can thus edit models thousands of times in a row using only streaming errors, without influencing unrelated inputs. Experimentally, we show that GRACE improves over recent alternatives and generalizes to unseen inputs. Our code is available at https://www.github.com/thartvigsen/grace.
    MixFlows: principled variational inference via mixed flows. (arXiv:2205.07475v3 [stat.ML] UPDATED)
    This work presents mixed variational flows (MixFlows), a new variational family that consists of a mixture of repeated applications of a map to an initial reference distribution. First, we provide efficient algorithms for i.i.d. sampling, density evaluation, and unbiased ELBO estimation. We then show that MixFlows have MCMC-like convergence guarantees when the flow map is ergodic and measure-preserving, and provide bounds on the accumulation of error for practical implementations where the flow map is approximated. Finally, we develop an implementation of MixFlows based on uncorrected discretized Hamiltonian dynamics combined with deterministic momentum refreshment. Simulated and real data experiments show that MixFlows can provide more reliable posterior approximations than several black-box normalizing flows, as well as samples of comparable quality to those obtained from state-of-the-art MCMC methods.
    Transformers Can Be Expressed In First-Order Logic with Majority. (arXiv:2210.02671v3 [cs.LG] UPDATED)
    Characterizing the implicit structure of the computation within neural networks is a foundational problem in the area of deep learning interpretability. Can the inner decision process of neural networks be captured symbolically in some familiar logic? We show that any fixed-precision transformer neural network can be translated into an equivalent fixed-size $\mathsf{FO}(\mathsf{M})$ formula, i.e., a first-order logic formula that, in addition to standard universal and existential quantifiers, may also contain majority-vote quantifiers. The proof idea is to design highly uniform boolean threshold circuits that can simulate transformers, and then leverage known theoretical connections between circuits and logic. Our results reveal a surprisingly simple formalism for capturing the behavior of transformers, show that simple problems like integer division are "transformer-hard", and provide valuable insights for comparing transformers to other models like RNNs. Our results suggest that first-order logic with majority may be a useful language for expressing programs extracted from transformers.
    Segmenting thalamic nuclei from manifold projections of multi-contrast MRI. (arXiv:2301.06114v2 [eess.IV] UPDATED)
    The thalamus is a subcortical gray matter structure that plays a key role in relaying sensory and motor signals within the brain. Its nuclei can atrophy or otherwise be affected by neurological disease and injuries including mild traumatic brain injury. Segmenting both the thalamus and its nuclei is challenging because of the relatively low contrast within and around the thalamus in conventional magnetic resonance (MR) images. This paper explores imaging features to determine key tissue signatures that naturally cluster, from which we can parcellate thalamic nuclei. Tissue contrasts include T1-weighted and T2-weighted images, MR diffusion measurements including FA, mean diffusivity, Knutsson coefficients that represent fiber orientation, and synthetic multi-TI images derived from FGATIR and T1-weighted images. After registration of these contrasts and isolation of the thalamus, we use the uniform manifold approximation and projection (UMAP) method for dimensionality reduction to produce a low-dimensional representation of the data within the thalamus. Manual labeling of the thalamus provides labels for our UMAP embedding from which k nearest neighbors can be used to label new unseen voxels in that same UMAP embedding. N -fold cross-validation of the method reveals comparable performance to state-of-the-art methods for thalamic parcellation.
    A Sequential Concept Drift Detection Method for On-Device Learning on Low-End Edge Devices. (arXiv:2212.09637v2 [cs.LG] UPDATED)
    A practical issue of edge AI systems is that data distributions of trained dataset and deployed environment may differ due to noise and environmental changes over time. Such a phenomenon is known as a concept drift, and this gap degrades the performance of edge AI systems and may introduce system failures. To address this gap, retraining of neural network models triggered by concept drift detection is a practical approach. However, since available compute resources are strictly limited in edge devices, in this paper we propose a fully sequential concept drift detection method in cooperation with an on-device sequential learning technique of neural networks. In this case, both the neural network retraining and the proposed concept drift detection are done only by sequential computation to reduce computation cost and memory utilization. Evaluation results of the proposed approach shows that while the accuracy is decreased by 3.8%-4.3% compared to existing batch-based detection methods, it decreases the memory size by 88.9%-96.4% and the execution time by 1.3%-83.8%. As a result, the combination of the neural network retraining and the proposed concept drift detection method is demonstrated on Raspberry Pi Pico that has 264kB memory.
    Transformer-based Modeling of Physical Systems: Improved Latent Representations. (arXiv:2210.11269v4 [cs.LG] UPDATED)
    Many phenomena from physics and engineering require highly flexible models, and have ample data with which to fit. However, this data is often irregularly sampled, and cannot be processed as it is by standard deep learning architecture. We propose a transformer-based model for forecasting physical processes at arbitrary spatial points given information on a related process at possibly different points. This architecture is particularly well-suited for high-altitude wind forecasting, as it can effectively leverage large volumes of data recorded along plane trajectories, which are sparse in space. We test at different scales for two different dynamical systems previously studied in the literature: the Poisson equation and Darcy Flow equation. In both cases, our transformer-based model outperforms alternative methods. We hypothesize that this superior performance is due to a more flexible latent representation. To support this hypothesis, we design a simple synthetic experiment to show that the latent representation of the other models suffers from excessive bottlenecking that is, in some cases, preventing the efficient use of the information and slowing training.
    Fast, Sample-Efficient, Affine-Invariant Private Mean and Covariance Estimation for Subgaussian Distributions. (arXiv:2301.12250v1 [cs.LG])
    We present a fast, differentially private algorithm for high-dimensional covariance-aware mean estimation with nearly optimal sample complexity. Only exponential-time estimators were previously known to achieve this guarantee. Given $n$ samples from a (sub-)Gaussian distribution with unknown mean $\mu$ and covariance $\Sigma$, our $(\varepsilon,\delta)$-differentially private estimator produces $\tilde{\mu}$ such that $\|\mu - \tilde{\mu}\|_{\Sigma} \leq \alpha$ as long as $n \gtrsim \tfrac d {\alpha^2} + \tfrac{d \sqrt{\log 1/\delta}}{\alpha \varepsilon}+\frac{d\log 1/\delta}{\varepsilon}$. The Mahalanobis error metric $\|\mu - \hat{\mu}\|_{\Sigma}$ measures the distance between $\hat \mu$ and $\mu$ relative to $\Sigma$; it characterizes the error of the sample mean. Our algorithm runs in time $\tilde{O}(nd^{\omega - 1} + nd/\varepsilon)$, where $\omega < 2.38$ is the matrix multiplication exponent. We adapt an exponential-time approach of Brown, Gaboardi, Smith, Ullman, and Zakynthinou (2021), giving efficient variants of stable mean and covariance estimation subroutines that also improve the sample complexity to the nearly optimal bound above. Our stable covariance estimator can be turned to private covariance estimation for unrestricted subgaussian distributions. With $n\gtrsim d^{3/2}$ samples, our estimate is accurate in spectral norm. This is the first such algorithm using $n= o(d^2)$ samples, answering an open question posed by Alabi et al. (2022). With $n\gtrsim d^2$ samples, our estimate is accurate in Frobenius norm. This leads to a fast, nearly optimal algorithm for private learning of unrestricted Gaussian distributions in TV distance. Duchi, Haque, and Kuditipudi (2023) obtained similar results independently and concurrently.  ( 2 min )
    Improved High-Probability Regret for Adversarial Bandits with Time-Varying Feedback Graphs. (arXiv:2210.01376v2 [cs.LG] UPDATED)
    We study high-probability regret bounds for adversarial $K$-armed bandits with time-varying feedback graphs over $T$ rounds. For general strongly observable graphs, we develop an algorithm that achieves the optimal regret $\widetilde{\mathcal{O}}((\sum_{t=1}^T\alpha_t)^{1/2}+\max_{t\in[T]}\alpha_t)$ with high probability, where $\alpha_t$ is the independence number of the feedback graph at round $t$. Compared to the best existing result [Neu, 2015] which only considers graphs with self-loops for all nodes, our result not only holds more generally, but importantly also removes any $\text{poly}(K)$ dependence that can be prohibitively large for applications such as contextual bandits. Furthermore, we also develop the first algorithm that achieves the optimal high-probability regret bound for weakly observable graphs, which even improves the best expected regret bound of [Alon et al., 2015] by removing the $\mathcal{O}(\sqrt{KT})$ term with a refined analysis. Our algorithms are based on the online mirror descent framework, but importantly with an innovative combination of several techniques. Notably, while earlier works use optimistic biased loss estimators for achieving high-probability bounds, we find it important to use a pessimistic one for nodes without self-loop in a strongly observable graph.
    Coronal Hole Analysis and Prediction using Computer Vision and LSTM Neural Network. (arXiv:2301.06732v2 [astro-ph.SR] UPDATED)
    As humanity has begun to explore space, the significance of space weather has become apparent. It has been established that coronal holes, a type of space weather phenomenon, can impact the operation of aircraft and satellites. The coronal hole is an area on the sun characterized by open magnetic field lines and relatively low temperatures, which result in the emission of the solar wind at higher than average rates. In this study, To prepare for the impact of coronal holes on the Earth, we use computer vision to detect the coronal hole region and calculate its size based on images from the Solar Dynamics Observatory (SDO). We then implement deep learning techniques, specifically the Long Short-Term Memory (LSTM) method, to analyze trends in the coronal hole area data and predict its size for different sun regions over 7 days. By analyzing time series data on the coronal hole area, this study aims to identify patterns and trends in coronal hole behavior and understand how they may impact space weather events. This research represents an important step towards improving our ability to predict and prepare for space weather events that can affect Earth and technological systems.
    Probabilistic Time Series Forecasting for Adaptive Monitoring in Edge Computing Environments. (arXiv:2211.13729v2 [cs.DC] UPDATED)
    With increasingly more computation being shifted to the edge of the network, monitoring of critical infrastructures, such as intermediate processing nodes in autonomous driving, is further complicated due to the typically resource-constrained environments. In order to reduce the resource overhead on the network link imposed by monitoring, various methods have been discussed that either follow a filtering approach for data-emitting devices or conduct dynamic sampling based on employed prediction models. Still, existing methods are mainly requiring adaptive monitoring on edge devices, which demands device reconfigurations, utilizes additional resources, and limits the sophistication of employed models. In this paper, we propose a sampling-based and cloud-located approach that internally utilizes probabilistic forecasts and hence provides means of quantifying model uncertainties, which can be used for contextualized adaptations of sampling frequencies and consequently relieves constrained network resources. We evaluate our prototype implementation for the monitoring pipeline on a publicly available streaming dataset and demonstrate its positive impact on resource efficiency in a method comparison.  ( 2 min )
    Binary Classification for High Dimensional Data using Supervised Non-Parametric Ensemble Method. (arXiv:2202.07779v2 [cs.LG] UPDATED)
    High dimensional data for classification does create many difficulties for machine learning algorithms. The generalization can be done using ensemble learning methods such as bagging based supervised non-parametric random forest algorithm. In this paper we solve the problem of binary classification for high dimensional data using random forest for polycystic ovary syndrome dataset. We have performed the implementation and provided a detailed visualization of the data for general inference. The training accuracy that we have achieved is 95.6% and validation accuracy over 91.74% respectively.
    Layer Ensembles. (arXiv:2210.04882v2 [cs.LG] UPDATED)
    Deep Ensembles, as a type of Bayesian Neural Networks, can be used to estimate uncertainty on the prediction of multiple neural networks by collecting votes from each network and computing the difference in those predictions. In this paper, we introduce a method for uncertainty estimation that considers a set of independent categorical distributions for each layer of the network, giving many more possible samples with overlapped layers than in the regular Deep Ensembles. We further introduce an optimized inference procedure that reuses common layer outputs, achieving up to 19x speed up and reducing memory usage quadratically. We also show that the method can be further improved by ranking samples, resulting in models that require less memory and time to run while achieving higher uncertainty quality than Deep Ensembles.
    Level-$k$ Meta-Learning for Pedestrian-Aware Self-Driving. (arXiv:2212.08800v2 [cs.RO] UPDATED)
    The potential market for modern self-driving cars is enormous, as they are developing remarkably rapidly. At the same time, however, cases of pedestrian fatalities caused by autonomous driving have been recorded in the case of crossing the road. In this paper, we propose level-$k$ thinking into MAML to create a Level-$k$ Meta Reinforcement Learning (LK-MRL) as a self-driving vehicle model to prepare for heterogeneous pedestrians and improve intersection safety based on the combination of meta reinforcement learning and human cognitive hierarchy framework. In our evaluation, we assign this model to two different cognitive confrontation hierarchy scenarios in an urban traffic simulator to show not only its demonstrate its advantage in road safety but also the producing ability of higher-level thinking strategies.
    Data Origin Inference in Machine Learning. (arXiv:2211.13416v2 [cs.LG] UPDATED)
    It is a growing direction to utilize unintended memorization in ML models to benefit real-world applications, with recent efforts like user auditing, dataset ownership inference and forgotten data measurement. Standing on the point of ML model development, we introduce a process named data origin inference, to assist ML developers in locating missed or faulty data origin in training set without maintaining strenuous metadata. We formally define the data origin and the data origin inference task in the development of the ML model (mainly neural networks). Then we propose a novel inference strategy combining embedded-space multiple instance classification and shadow training. Diverse use cases cover language, visual and structured data, with various kinds of data origin (e.g. business, county, movie, mobile user, text author). A comprehensive performance analysis of our proposed strategy contains referenced target model layers, available testing data for each origin, and in shadow training, the implementations of feature extraction as well as shadow models. Our best inference accuracy achieves 98.96% in the language use case when the target model is a transformer-based deep neural network. Furthermore, we give a statistical analysis of different kinds of data origin to investigate what kind of origin is probably to be inferred correctly.
    Deep Riemannian Networks for EEG Decoding. (arXiv:2212.10426v3 [cs.LG] UPDATED)
    State-of-the-art performance in electroencephalography (EEG) decoding tasks is currently often achieved with either Deep-Learning or Riemannian-Geometry-based decoders. Recently, there is growing interest in Deep Riemannian Networks (DRNs) possibly combining the advantages of both previous classes of methods. However, there are still a range of topics where additional insight is needed to pave the way for a more widespread application of DRNs in EEG. These include architecture design questions such as network size and end-to-end ability as well as model training questions. How these factors affect model performance has not been explored. Additionally, it is not clear how the data within these networks is transformed, and whether this would correlate with traditional EEG decoding. Our study aims to lay the groundwork in the area of these topics through the analysis of DRNs for EEG with a wide range of hyperparameters. Networks were tested on two public EEG datasets and compared with state-of-the-art ConvNets. Here we propose end-to-end EEG SPDNet (EE(G)-SPDNet), and we show that this wide, end-to-end DRN can outperform the ConvNets, and in doing so use physiologically plausible frequency regions. We also show that the end-to-end approach learns more complex filters than traditional band-pass filters targeting the classical alpha, beta, and gamma frequency bands of the EEG, and that performance can benefit from channel specific filtering approaches. Additionally, architectural analysis revealed areas for further improvement due to the possible loss of Riemannian specific information throughout the network. Our study thus shows how to design and train DRNs to infer task-related information from the raw EEG without the need of handcrafted filterbanks and highlights the potential of end-to-end DRNs such as EE(G)-SPDNet for high-performance EEG decoding.
    Temporal Label Smoothing for Early Event Prediction. (arXiv:2208.13764v2 [cs.LG] UPDATED)
    Models that can predict the occurrence of events ahead of time with low false-alarm rates are critical to the acceptance of decision support systems in the medical community. This challenging task is typically treated as a simple binary classification, ignoring temporal dependencies between samples, whereas we propose to exploit this structure. We first introduce a common theoretical framework unifying dynamic survival analysis and early event prediction. Following an analysis of objectives from both fields, we propose Temporal Label Smoothing (TLS), a simpler, yet best-performing method that preserves prediction monotonicity over time. By focusing the objective on areas with a stronger predictive signal, TLS improves performance over all baselines on two large-scale benchmark tasks. Gains are particularly notable along clinically relevant measures, such as event recall at low false-alarm rates. TLS reduces the number of missed events by up to a factor of two over previously used approaches in early event prediction.  ( 2 min )
    Dynamic Network Reconfiguration for Entropy Maximization using Deep Reinforcement Learning. (arXiv:2205.13578v2 [cs.LG] UPDATED)
    A key problem in network theory is how to reconfigure a graph in order to optimize a quantifiable objective. Given the ubiquity of networked systems, such work has broad practical applications in a variety of situations, ranging from drug and material design to telecommunications. The large decision space of possible reconfigurations, however, makes this problem computationally intensive. In this paper, we cast the problem of network rewiring for optimizing a specified structural property as a Markov Decision Process (MDP), in which a decision-maker is given a budget of modifications that are performed sequentially. We then propose a general approach based on the Deep Q-Network (DQN) algorithm and graph neural networks (GNNs) that can efficiently learn strategies for rewiring networks. We then discuss a cybersecurity case study, i.e., an application to the computer network reconfiguration problem for intrusion protection. In a typical scenario, an attacker might have a (partial) map of the system they plan to penetrate; if the network is effectively "scrambled", they would not be able to navigate it since their prior knowledge would become obsolete. This can be viewed as an entropy maximization problem, in which the goal is to increase the surprise of the network. Indeed, entropy acts as a proxy measurement of the difficulty of navigating the network topology. We demonstrate the general ability of the proposed method to obtain better entropy gains than random rewiring on synthetic and real-world graphs while being computationally inexpensive, as well as being able to generalize to larger graphs than those seen during training. Simulations of attack scenarios confirm the effectiveness of the learned rewiring strategies.  ( 3 min )
    Deep Learning-based Spatially Explicit Emulation of an Agent-Based Simulator for Pandemic in a City. (arXiv:2205.14396v2 [cs.MA] UPDATED)
    Agent-Based Models are very useful for simulation of physical or social processes, such as the spreading of a pandemic in a city. Such models proceed by specifying the behavior of individuals (agents) and their interactions, and parameterizing the process of infection based on such interactions based on the geography and demography of the city. However, such models are computationally very expensive, and the complexity is often linear in the total number of agents. This seriously limits the usage of such models for simulations, which often have to be run hundreds of times for policy planning and even model parameter estimation. An alternative is to develop an emulator, a surrogate model that can predict the Agent-Based Simulator's output based on its initial conditions and parameters. In this paper, we discuss a Deep Learning model based on Dilated Convolutional Neural Network that can emulate such an agent based model with high accuracy. We show that use of this model instead of the original Agent-Based Model provides us major gains in the speed of simulations, allowing much quicker calibration to observations, and more extensive scenario analysis. The models we consider are spatially explicit, as the locations of the infected individuals are simulated instead of the gross counts. Another aspect of our emulation framework is its divide-and-conquer approach that divides the city into several small overlapping blocks and carries out the emulation in them parallelly, after which these results are merged together. This ensures that the same emulator can work for a city of any size, and also provides significant improvement of time complexity of the emulator, compared to the original simulator.  ( 2 min )
    BiAdam: Fast Adaptive Bilevel Optimization Methods. (arXiv:2106.11396v3 [math.OC] UPDATED)
    Bilevel optimization recently has attracted increased interest in machine learning due to its many applications such as hyper-parameter optimization and meta learning. Although many bilevel optimization methods recently have been proposed, these methods do not consider using adaptive learning rates. It is well known that adaptive learning rates can accelerate many optimization algorithms including (stochastic) gradient-based algorithms. To fill this gap, in the paper, we propose a novel fast adaptive bilevel framework to solve stochastic bilevel optimization problems that the outer problem is possibly nonconvex and the inner problem is strongly convex. Our framework uses unified adaptive matrices including many types of adaptive learning rates, and can flexibly use the momentum and variance reduced techniques. In particular, we provide a useful convergence analysis framework for the bilevel optimization. Specifically, we propose a fast single-loop adaptive bilevel optimization (BiAdam) algorithm based on the basic momentum technique, which achieves a sample complexity of $\tilde{O}(\epsilon^{-4})$ for finding an $\epsilon$-stationary solution (i.e., $\mathbb{E}\|\nabla F(x)\| \leq \epsilon$ or its equivalent variants). Meanwhile, we propose an accelerated version of BiAdam algorithm (VR-BiAdam) by using variance reduced technique, which reaches the best known sample complexity of $\tilde{O}(\epsilon^{-3})$ without relying on large batch-size. To the best of our knowledge, we first study the adaptive bilevel optimization methods with adaptive learning rates. Some experimental results on data hyper-cleaning and hyper-representation learning tasks demonstrate the efficiency of our algorithms.  ( 2 min )
    Contrastive Credibility Propagation for Reliable Semi-Supervised Learning. (arXiv:2211.09929v2 [cs.LG] UPDATED)
    Inferencing unlabeled data from labeled data is an error-prone process. Conventional neural network training is highly sensitive to supervision errors. These two realities make semi-supervised learning (SSL) troublesome. In practice, SSL approaches often fail to outperform their fully supervised baseline. Proposed is a novel framework for deep SSL via transductive pseudo-label refinement called Contrastive Credibility Propagation (CCP). Through an iterative process of refining soft pseudo-labels, CCP unifies a novel contrastive approach for generating pseudo-labels and a powerful technique to overcome instance-dependent label noise. The result is an SSL classification framework explicitly designed to overcome inevitable pseudo-label errors. Using standard text and image benchmark classification datasets, we show CCP reliably boosts or matches performance over a supervised baseline in four common real-world SSL scenarios: few-label, open-set, noisy-label, and class distribution misalignment.  ( 2 min )
    Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks. (arXiv:2207.13243v5 [cs.LG] UPDATED)
    The last decade of machine learning has seen drastic increases in scale and capabilities. Deep neural networks (DNNs) are increasingly being deployed in the real world. However, they are difficult to analyze, raising concerns about using them without a rigorous understanding of how they function. Effective tools for interpreting them will be important for building more trustworthy AI by helping to identify problems, fix bugs, and improve basic understanding. In particular, "inner" interpretability techniques, which focus on explaining the internal components of DNNs, are well-suited for developing a mechanistic understanding, guiding manual modifications, and reverse engineering solutions. Much recent work has focused on DNN interpretability, and rapid progress has thus far made a thorough systematization of methods difficult. In this survey, we review over 300 works with a focus on inner interpretability tools. We introduce a taxonomy that classifies methods by what part of the network they help to explain (weights, neurons, subnetworks, or latent representations) and whether they are implemented during (intrinsic) or after (post hoc) training. To our knowledge, we are also the first to survey a number of connections between interpretability research and work in adversarial robustness, continual learning, modularity, network compression, and studying the human visual system. We discuss key challenges and argue that the status quo in interpretability research is largely unproductive. Finally, we highlight the importance of future work that emphasizes diagnostics, debugging, adversaries, and benchmarking in order to make interpretability tools more useful to engineers in practical applications.  ( 3 min )
    Emergent Linguistic Structures in Neural Networks are Fragile. (arXiv:2210.17406v5 [cs.LG] UPDATED)
    Large Language Models (LLMs) have been reported to have strong performance on natural language processing tasks. However, performance metrics such as accuracy do not measure the quality of the model in terms of its ability to robustly represent complex linguistic structure. In this work, we propose a framework and measure of robustness to assess the consistency of linguistic representations against syntax-preserving perturbations. We leverage recent advances in extracting linguistic constructs from LLMs to test the robustness of such structures. Empirically, we study the performance of four LLMs across six different corpora on the proposed robustness measures. We provide evidence that context-free representation (e.g., GloVe) are in some cases competitive with context-dependent representations from modern LLMs (e.g., BERT), yet equally brittle to syntax-preserving manipulations. Emergent syntactic representations in neural networks are brittle, thus our work poses the attention on the risk of comparing such structures to those that are object of a long lasting debate in linguistics.
    Interpretable (not just posthoc-explainable) medical claims modeling for discharge placement to prevent avoidable all-cause readmissions or death. (arXiv:2208.12814v3 [cs.CY] UPDATED)
    We developed an inherently interpretable multilevel Bayesian framework for representing variation in regression coefficients that mimics the piecewise linearity of ReLU-activated deep neural networks. We used the framework to formulate a survival model for using medical claims to predict hospital readmission and death that focuses on discharge placement, adjusting for confounding in estimating causal local average treatment effects. We trained the model on a 5% sample of Medicare beneficiaries from 2008 and 2011, based on their 2009--2011 inpatient episodes, and then tested the model on 2012 episodes. The model scored an AUROC of approximately 0.76 on predicting all-cause readmissions -- defined using official Centers for Medicare and Medicaid Services (CMS) methodology -- or death within 30-days of discharge, being competitive against XGBoost and a Bayesian deep neural network, demonstrating that one need-not sacrifice interpretability for accuracy. Crucially, as a regression model, we provide what blackboxes cannot -- the exact gold-standard global interpretation of the model, identifying relative risk factors and quantifying the effect of discharge placement. We also show that the posthoc explainer SHAP fails to provide accurate explanations.  ( 2 min )
    Over-The-Air Federated Learning Over Scalable Cell-free Massive MIMO. (arXiv:2212.06482v2 [eess.SP] UPDATED)
    Cell-free massive MIMO is emerging as a promising technology for future wireless communication systems, which is expected to offer uniform coverage and high spectral efficiency compared to classical cellular systems. We study in this paper how cell-free massive MIMO can support federated edge learning. Taking advantage of the additive nature of the wireless multiple access channel, over-the-air computation is exploited, where the clients send their local updates simultaneously over the same communication resource. This approach, known as over-the-air federated learning (OTA-FL), is proven to alleviate the communication overhead of federated learning over wireless networks. Considering channel correlation and only imperfect channel state information available at the central server, we propose a practical implementation of OTA-FL over cell-free massive MIMO. The convergence of the proposed implementation is studied analytically and experimentally, confirming the benefits of cell-free massive MIMO for OTA-FL.
    Beyond Hawkes: Neural Multi-event Forecasting on Spatio-temporal Point Processes. (arXiv:2211.02922v2 [cs.LG] UPDATED)
    Predicting discrete events in time and space has many scientific applications, such as predicting hazardous earthquakes and outbreaks of infectious diseases. History-dependent spatio-temporal Hawkes processes are often used to mathematically model these point events. However, previous approaches have faced numerous challenges, particularly when attempting to forecast one or multiple future events. In this work, we propose a new neural architecture for simultaneous multi-event forecasting of spatio-temporal point processes, utilizing transformers, augmented with normalizing flows and probabilistic layers. Our network makes batched predictions of complex history-dependent spatio-temporal distributions of future discrete events, achieving state-of-the-art performance on a variety of benchmark datasets including the South California Earthquakes, Citibike, Covid-19, and Hawkes synthetic pinwheel datasets. More generally, we illustrate how our network can be applied to any dataset of discrete events with associated markers, even when no underlying physics is known.
    G-Rep: Gaussian Representation for Arbitrary-Oriented Object Detection. (arXiv:2205.11796v2 [cs.CV] UPDATED)
    Typical representations for arbitrary-oriented object detection tasks include oriented bounding box (OBB), quadrilateral bounding box (QBB), and point set (PointSet). Each representation encounters problems that correspond to its characteristics, such as the boundary discontinuity, square-like problem, representation ambiguity, and isolated points, which lead to inaccurate detection. Although many effective strategies have been proposed for various representations, there is still no unified solution. Current detection methods based on Gaussian modeling have demonstrated the possibility of breaking this dilemma; however, they remain limited to OBB. To go further, in this paper, we propose a unified Gaussian representation called G-Rep to construct Gaussian distributions for OBB, QBB, and PointSet, which achieves a unified solution to various representations and problems. Specifically, PointSet or QBB-based object representations are converted into Gaussian distributions, and their parameters are optimized using the maximum likelihood estimation algorithm. Then, three optional Gaussian metrics are explored to optimize the regression loss of the detector because of their excellent parameter optimization mechanisms. Furthermore, we also use Gaussian metrics for sampling to align label assignment and regression loss. Experimental results on several public available datasets, such as DOTA, HRSC2016, UCAS-AOD, and ICDAR2015, show the excellent performance of the proposed method for arbitrary-oriented object detection.
    MM-GNN: Mix-Moment Graph Neural Network towards Modeling Neighborhood Feature Distribution. (arXiv:2208.07012v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have shown expressive performance on graph representation learning by aggregating information from neighbors. Recently, some studies have discussed the importance of modeling neighborhood distribution on the graph. However, most existing GNNs aggregate neighbors' features through single statistic (e.g., mean, max, sum), which loses the information related to neighbor's feature distribution and therefore degrades the model performance. In this paper, inspired by the method of moment in statistical theory, we propose to model neighbor's feature distribution with multi-order moments. We design a novel GNN model, namely Mix-Moment Graph Neural Network (MM-GNN), which includes a Multi-order Moment Embedding (MME) module and an Element-wise Attention-based Moment Adaptor module. MM-GNN first calculates the multi-order moments of the neighbors for each node as signatures, and then use an Element-wise Attention-based Moment Adaptor to assign larger weights to important moments for each node and update node representations. We conduct extensive experiments on 15 real-world graphs (including social networks, citation networks and web-page networks etc.) to evaluate our model, and the results demonstrate the superiority of MM-GNN over existing state-of-the-art models.
    Graphically Structured Diffusion Models. (arXiv:2210.11633v2 [cs.LG] UPDATED)
    We introduce a framework for automatically defining and learning deep generative models with problem-specific structure. We tackle problem domains that are more traditionally solved by algorithms such as sorting, constraint satisfaction for Sudoku, and matrix factorization. Concretely, we train diffusion models with an architecture tailored to the problem specification. This problem specification should contain a graphical model describing relationships between variables, and often benefits from explicit representation of subcomputations. Permutation invariances can also be exploited. Across a diverse set of experiments we improve the scaling relationship between problem dimension and our model's performance, in terms of both training time and final accuracy.  ( 2 min )
    DyFormer: A Scalable Dynamic Graph Transformer with Provable Benefits on Generalization Ability. (arXiv:2111.10447v3 [cs.LG] UPDATED)
    Transformers have achieved great success in several domains, including Natural Language Processing and Computer Vision. However, its application to real-world graphs is less explored, mainly due to its high computation cost and its poor generalizability caused by the lack of enough training data in the graph domain. To fill in this gap, we propose a scalable Transformer-like dynamic graph learning method named Dynamic Graph Transformer (DyFormer) with spatial-temporal encoding to effectively learn graph topology and capture implicit links. To achieve efficient and scalable training, we propose temporal-union graph structure and its associated subgraph-based node sampling strategy. To improve the generalization ability, we introduce two complementary self-supervised pre-training tasks and show that jointly optimizing the two pre-training tasks results in a smaller Bayesian error rate via an information-theoretic analysis. Extensive experiments on the real-world datasets illustrate that DyFormer achieves a consistent 1%-3% AUC gain (averaged over all time steps) compared with baselines on all benchmarks.  ( 2 min )
    FED-CD: Federated Causal Discovery from Interventional and Observational Data. (arXiv:2211.03846v2 [cs.LG] UPDATED)
    Causal discovery, the inference of causal relations from data, is a core task of fundamental importance in all scientific domains, and several new machine learning methods for addressing the causal discovery problem have been proposed recently. However, existing machine learning methods for causal discovery typically require that the data used for inference is pooled and available in a centralized location. In many domains of high practical importance, such as in healthcare, data is only available at local data-generating entities (e.g. hospitals in the healthcare context), and cannot be shared across entities due to, among others, privacy and regulatory reasons. In this work, we address the problem of inferring causal structure - in the form of a directed acyclic graph (DAG) - from a distributed data set that contains both observational and interventional data in a privacy-preserving manner by exchanging updates instead of samples. To this end, we introduce a new federated framework, FED-CD, that enables the discovery of global causal structures both when the set of intervened covariates is the same across decentralized entities, and when the set of intervened covariates are potentially disjoint. We perform a comprehensive experimental evaluation on synthetic data that demonstrates that FED-CD enables effective aggregation of decentralized data for causal discovery without direct sample sharing, even when the contributing distributed data sets cover disjoint sets of interventions. Effective methods for causal discovery in distributed data sets could significantly advance scientific discovery and knowledge sharing in important settings, for instance, healthcare, in which sharing of data across local sites is difficult or prohibited.  ( 2 min )
    Fair and Optimal Classification via Post-Processing Predictors. (arXiv:2211.01528v2 [cs.LG] UPDATED)
    To address the bias exhibited by machine learning models, fairness criteria impose statistical constraints for ensuring equal treatment to all demographic groups, but typically at a cost to model performance. Understanding this tradeoff, therefore, underlies the design of fair and effective algorithms. This paper completes the characterization of the inherent tradeoff of demographic parity on classification problems in the most general multigroup, multiclass, and noisy setting. Specifically, we show that the minimum error rate is given by the optimal value of a Wasserstein-barycenter problem. More practically, this reformulation leads to a simple procedure for post-processing any pre-trained predictors to satisfy demographic parity in the general setting, which, in particular, yields the optimal fair classifier when applied to the Bayes predictor. We provide suboptimality and finite sample analyses for our procedure, and demonstrate precise control of the tradeoff of error rate for fairness on real-world datasets provided sufficient data.  ( 2 min )
    Revisiting Over-smoothing and Over-squashing using Ollivier-Ricci Curvature. (arXiv:2211.15779v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) had been demonstrated to be inherently susceptible to the problems of over-smoothing and over-squashing. These issues prohibit the ability of GNNs to model complex graph interactions by limiting their effectiveness in taking into account distant information. Our study reveals the key connection between the local graph geometry and the occurrence of both of these issues, thereby providing a unified framework for studying them at a local scale using the Ollivier-Ricci curvature. Specifically, we demonstrate that over-smoothing is linked to positive graph curvature, while over-squashing is linked to negative graph curvature. Based on our theory, we propose the Batch Ollivier-Ricci Flow, a novel rewiring algorithm capable of simultaneously addressing both over-smoothing and over-squashing.
    Hard Sample Aware Network for Contrastive Deep Graph Clustering. (arXiv:2212.08665v3 [cs.LG] UPDATED)
    Contrastive deep graph clustering, which aims to divide nodes into disjoint groups via contrastive mechanisms, is a challenging research spot. Among the recent works, hard sample mining-based algorithms have achieved great attention for their promising performance. However, we find that the existing hard sample mining methods have two problems as follows. 1) In the hardness measurement, the important structural information is overlooked for similarity calculation, degrading the representativeness of the selected hard negative samples. 2) Previous works merely focus on the hard negative sample pairs while neglecting the hard positive sample pairs. Nevertheless, samples within the same cluster but with low similarity should also be carefully learned. To solve the problems, we propose a novel contrastive deep graph clustering method dubbed Hard Sample Aware Network (HSAN) by introducing a comprehensive similarity measure criterion and a general dynamic sample weighing strategy. Concretely, in our algorithm, the similarities between samples are calculated by considering both the attribute embeddings and the structure embeddings, better revealing sample relationships and assisting hardness measurement. Moreover, under the guidance of the carefully collected high-confidence clustering information, our proposed weight modulating function will first recognize the positive and negative samples and then dynamically up-weight the hard sample pairs while down-weighting the easy ones. In this way, our method can mine not only the hard negative samples but also the hard positive sample, thus improving the discriminative capability of the samples further. Extensive experiments and analyses demonstrate the superiority and effectiveness of our proposed method.
    PrivHAR: Recognizing Human Actions From Privacy-preserving Lens. (arXiv:2206.03891v2 [cs.CV] UPDATED)
    The accelerated use of digital cameras prompts an increasing concern about privacy and security, particularly in applications such as action recognition. In this paper, we propose an optimizing framework to provide robust visual privacy protection along the human action recognition pipeline. Our framework parameterizes the camera lens to successfully degrade the quality of the videos to inhibit privacy attributes and protect against adversarial attacks while maintaining relevant features for activity recognition. We validate our approach with extensive simulations and hardware experiments.
    Rethinking skip connection model as a learnable Markov chain. (arXiv:2209.15278v2 [cs.LG] UPDATED)
    Over past few years afterward the birth of ResNet, skip connection has become the defacto standard for the design of modern architectures due to its widespread adoption, easy optimization and proven performance. Prior work has explained the effectiveness of the skip connection mechanism from different perspectives. In this work, we deep dive into the model's behaviors with skip connections which can be formulated as a learnable Markov chain. An efficient Markov chain is preferred as it always maps the input data to the target domain in a better way. However, while a model is explained as a Markov chain, it is not guaranteed to be optimized following an efficient Markov chain by existing SGD-based optimizers which are prone to get trapped in local optimal points. In order to towards a more efficient Markov chain, we propose a simple routine of penal connection to make any residual-like model become a learnable Markov chain. Aside from that, the penal connection can also be viewed as a particular model regularization and can be easily implemented with one line of code in the most popular deep learning frameworks~\footnote{Source code: \url{https://github.com/densechen/penal-connection}}. The encouraging experimental results in multi-modal translation and image recognition empirically confirm our conjecture of the learnable Markov chain view and demonstrate the superiority of the proposed penal connection.
    What Is Fairness? Philosophical Considerations and Implications For FairML. (arXiv:2205.09622v2 [cs.LG] UPDATED)
    A growing body of literature in fairness-aware ML (fairML) aspires to mitigate machine learning (ML)-related unfairness in automated decision making (ADM) by defining metrics that measure fairness of an ML model and by proposing methods that ensure that trained ML models achieve low values in those measures. However, the underlying concept of fairness, i.e., the question of what fairness is, is rarely discussed, leaving a considerable gap between centuries of philosophical discussion and recent adoption of the concept in the ML community. In this work, we try to bridge this gap by formalizing a consistent concept of fairness and by translating the philosophical considerations into a formal framework for the training and evaluation of ML models in ADM systems. We derive that fairness problems can already arise without the presence of protected attributes, pointing out that fairness and predictive performance are not irreconcilable counterparts, but rather that the latter is necessary to achieve the former. Moreover, we argue why and how causal considerations are necessary when assessing fairness in the presence of protected attributes. We achieve greater linguistic clarity for the discussion of fairML and propose general algorithms for practical applications.
    Sequence Learning using Equilibrium Propagation. (arXiv:2209.09626v2 [cs.NE] UPDATED)
    Equilibrium Propagation (EP) is a powerful and more bio-plausible alternative to conventional learning frameworks such as backpropagation. The effectiveness of EP stems from the fact that it relies only on local computations and requires solely one kind of computational unit during both of its training phases, thereby enabling greater applicability in domains such as bio-inspired neuromorphic computing. The dynamics of the model in EP is governed by an energy function and the internal states of the model consequently converge to a steady state following the state transition rules defined by the same. However, by definition, EP requires the input to the model (a convergent RNN) to be static in both the phases of training. Thus it is not possible to design a model for sequence classification using EP with an LSTM or GRU like architecture. In this paper, we leverage recent developments in modern hopfield networks to further understand energy based models and develop solutions for complex sequence classification tasks using EP while satisfying its convergence criteria and maintaining its theoretical similarities with recurrent backpropagation. We explore the possibility of integrating modern hopfield networks as an attention mechanism with convergent RNN models used in EP, thereby extending its applicability for the first time on two different sequence classification tasks in natural language processing viz. sentiment analysis (IMDB dataset) and natural language inference (SNLI dataset).
    Large-scale Model Personalization via Low Rank and Sparse decomposition. (arXiv:2210.03505v2 [cs.LG] UPDATED)
    Personalization of machine learning (ML) predictions for individual users/domains/enterprises is critical for practical recommendation style systems. Standard personalization approaches involve learning a user/domain specific embedding that is fed into a fixed global model which can be limiting. On the other hand, personalizing/fine-tuning model itself for each user/domain -- a.k.a meta-learning -- has high storage/infrastructure cost. We propose a novel meta-learning style approach that models network weights as a sum of low-rank and sparse matrices. This captures common information from multiple individuals/users together in the low-rank part while sparse part captures user-specific idiosyncrasies. Furthermore, the framework is up to two orders of magnitude more scalable (in terms of storage/infrastructure cost) than user-specific finetuning of model. We then study the framework in the linear setting, where the problem reduces to that of estimating the sum of a rank-$r$ and a $k$-column sparse matrix using a small number of linear measurements. We propose an alternating minimization method with iterative hard thresholding -- AMHT-LRS -- to learn the low-rank and sparse part. For the realizable, Gaussian data setting, we show that AMHT-LRS solves the problem efficiently with nearly optimal samples. A significant challenge in personalization is ensuring privacy of each user's sensitive data. We alleviate this problem by proposing a differentially private variant of our method that also is equipped with strong generalization guarantees. Finally, on multiple standard recommendation datasets, we demonstrate that our approach allows personalized models to obtain superior performance in sparse data regime.
    Cyclic Block Coordinate Descent With Variance Reduction for Composite Nonconvex Optimization. (arXiv:2212.05088v2 [math.OC] UPDATED)
    Nonconvex optimization is central in solving many machine learning problems, in which block-wise structure is commonly encountered. In this work, we propose cyclic block coordinate methods for nonconvex optimization problems with non-asymptotic gradient norm guarantees. Our convergence analysis is based on a gradient Lipschitz condition with respect to a Mahalanobis norm, inspired by a recent progress on cyclic block coordinate methods. In deterministic settings, our convergence guarantee matches the guarantee of (full-gradient) gradient descent, but with the gradient Lipschitz constant being defined w.r.t.~a Mahalanobis norm. In stochastic settings, we use recursive variance reduction to decrease the per-iteration cost and match the arithmetic operation complexity of current optimal stochastic full-gradient methods, with a unified analysis for both finite-sum and infinite-sum cases. We prove a faster linear convergence result when a Polyak-{\L}ojasiewicz (P{\L}) condition holds. To our knowledge, this work is the first to provide non-asymptotic convergence guarantees -- variance-reduced or not -- for a cyclic block coordinate method in general composite (smooth + nonsmooth) nonconvex settings. Our experimental results demonstrate the efficacy of the proposed cyclic scheme in training deep neural nets.
    Accelerating Kernel Classifiers Through Borders Mapping. (arXiv:1708.05917v6 [stat.ML] UPDATED)
    Support vector machines (SVM) and other kernel techniques represent a family of powerful statistical classification methods with high accuracy and broad applicability. Because they use all or a significant portion of the training data, however, they can be slow, especially for large problems. Piecewise linear classifiers are similarly versatile, yet have the additional advantages of simplicity, ease of interpretation and, if the number of component linear classifiers is not too large, speed. Here we show how a simple, piecewise linear classifier can be trained from a kernel-based classifier in order to improve the classification speed. The method works by finding the root of the difference in conditional probabilities between pairs of opposite classes to build up a representation of the decision boundary. When tested on 17 different datasets, it succeeded in improving the classification speed of a SVM for 12 of them by up to two orders-of-magnitude. Of these, two were less accurate than a simple, linear classifier. The method is best suited to problems with continuum features data and smooth probability functions. Because the component linear classifiers are built up individually from an existing classifier, rather than through a simultaneous optimization procedure, the classifier is also fast to train.
    Dexterous Robotic Manipulation using Deep Reinforcement Learning and Knowledge Transfer for Complex Sparse Reward-based Tasks. (arXiv:2205.09683v2 [cs.RO] UPDATED)
    This paper describes a deep reinforcement learning (DRL) approach that won Phase 1 of the Real Robot Challenge (RRC) 2021, and then extends this method to a more difficult manipulation task. The RRC consisted of using a TriFinger robot to manipulate a cube along a specified positional trajectory, but with no requirement for the cube to have any specific orientation. We used a relatively simple reward function, a combination of goal-based sparse reward and distance reward, in conjunction with Hindsight Experience Replay (HER) to guide the learning of the DRL agent (Deep Deterministic Policy Gradient (DDPG)). Our approach allowed our agents to acquire dexterous robotic manipulation strategies in simulation. These strategies were then applied to the real robot and outperformed all other competition submissions, including those using more traditional robotic control techniques, in the final evaluation stage of the RRC. Here we extend this method, by modifying the task of Phase 1 of the RRC to require the robot to maintain the cube in a particular orientation, while the cube is moved along the required positional trajectory. The requirement to also orient the cube makes the agent unable to learn the task through blind exploration due to increased problem complexity. To circumvent this issue, we make novel use of a Knowledge Transfer (KT) technique that allows the strategies learned by the agent in the original task (which was agnostic to cube orientation) to be transferred to this task (where orientation matters). KT allowed the agent to learn and perform the extended task in the simulator, which improved the average positional deviation from 0.134 m to 0.02 m, and average orientation deviation from 142{\deg} to 76{\deg} during evaluation. This KT concept shows good generalisation properties and could be applied to any actor-critic learning algorithm.
    FETA: Fairness Enforced Verifying, Training, and Predicting Algorithms for Neural Networks. (arXiv:2206.00553v2 [cs.LG] UPDATED)
    Algorithmic decision making driven by neural networks has become very prominent in applications that directly affect people's quality of life. In this paper, we study the problem of verifying, training, and guaranteeing individual fairness of neural network models. A popular approach for enforcing fairness is to translate a fairness notion into constraints over the parameters of the model. However, such a translation does not always guarantee fair predictions of the trained neural network model. To address this challenge, we develop a counterexample-guided post-processing technique to provably enforce fairness constraints at prediction time. Contrary to prior work that enforces fairness only on points around test or train data, we are able to enforce and guarantee fairness on all points in the input domain. Additionally, we propose an in-processing technique to use fairness as an inductive bias by iteratively incorporating fairness counterexamples in the learning process. We have implemented these techniques in a tool called FETA. Empirical evaluation on real-world datasets indicates that FETA is not only able to guarantee fairness on-the-fly at prediction time but also is able to train accurate models exhibiting a much higher degree of individual fairness.
    Context-Aware Differential Privacy for Language Modeling. (arXiv:2301.12288v1 [cs.LG])
    The remarkable ability of language models (LMs) has also brought challenges at the interface of AI and security. A critical challenge pertains to how much information these models retain and leak about the training data. This is particularly urgent as the typical development of LMs relies on huge, often highly sensitive data, such as emails and chat logs. To contrast this shortcoming, this paper introduces Context-Aware Differentially Private Language Model (CADP-LM) , a privacy-preserving LM framework that relies on two key insights: First, it utilizes the notion of \emph{context} to define and audit the potentially sensitive information. Second, it adopts the notion of Differential Privacy to protect sensitive information and characterize the privacy leakage. A unique characteristic of CADP-LM is its ability to target the protection of sensitive sentences and contexts only, providing a highly accurate private model. Experiments on a variety of datasets and settings demonstrate these strengths of CADP-LM.
    Double Sampling Randomized Smoothing. (arXiv:2206.07912v4 [cs.LG] UPDATED)
    Neural networks (NNs) are known to be vulnerable against adversarial perturbations, and thus there is a line of work aiming to provide robustness certification for NNs, such as randomized smoothing, which samples smoothing noises from a certain distribution to certify the robustness for a smoothed classifier. However, as shown by previous work, the certified robust radius in randomized smoothing suffers from scaling to large datasets ("curse of dimensionality"). To overcome this hurdle, we propose a Double Sampling Randomized Smoothing (DSRS) framework, which exploits the sampled probability from an additional smoothing distribution to tighten the robustness certification of the previous smoothed classifier. Theoretically, under mild assumptions, we prove that DSRS can certify $\Theta(\sqrt d)$ robust radius under $\ell_2$ norm where $d$ is the input dimension, implying that DSRS may be able to break the curse of dimensionality of randomized smoothing. We instantiate DSRS for a generalized family of Gaussian smoothing and propose an efficient and sound computing method based on customized dual optimization considering sampling error. Extensive experiments on MNIST, CIFAR-10, and ImageNet verify our theory and show that DSRS certifies larger robust radii than existing baselines consistently under different settings. Code is available at https://github.com/llylly/DSRS.
    Composing Task Knowledge with Modular Successor Feature Approximators. (arXiv:2301.12305v1 [cs.LG])
    Recently, the Successor Features and Generalized Policy Improvement (SF&GPI) framework has been proposed as a method for learning, composing, and transferring predictive knowledge and behavior. SF&GPI works by having an agent learn predictive representations (SFs) that can be combined for transfer to new tasks with GPI. However, to be effective this approach requires state features that are useful to predict, and these state-features are typically hand-designed. In this work, we present a novel neural network architecture, "Modular Successor Feature Approximators" (MSFA), where modules both discover what is useful to predict, and learn their own predictive representations. We show that MSFA is able to better generalize compared to baseline architectures for learning SFs and modular architectures
    Noisy intermediate-scale quantum algorithm for semidefinite programming. (arXiv:2106.03891v3 [quant-ph] UPDATED)
    Semidefinite programs (SDPs) are convex optimization programs with vast applications in control theory, quantum information, combinatorial optimization and operational research. Noisy intermediate-scale quantum (NISQ) algorithms aim to make an efficient use of the current generation of quantum hardware. However, optimizing variational quantum algorithms is a challenge as it is an NP-hard problem that in general requires an exponential time to solve and can contain many far from optimal local minima. Here, we present a current term NISQ algorithm for solving SDPs. The classical optimization program of our NISQ solver is another SDP over a lower dimensional ansatz space. We harness the SDP based formulation of the Hamiltonian ground state problem to design a NISQ eigensolver. Unlike variational quantum eigensolvers, the classical optimization program of our eigensolver is convex, can be solved in polynomial time with the number of ansatz parameters and every local minimum is a global minimum. We find numeric evidence that NISQ SDP can improve the estimation of ground state energies in a scalable manner. Further, we efficiently solve constrained problems to calculate the excited states of Hamiltonians, find the lowest energy of symmetry constrained Hamiltonians and determine the optimal measurements for quantum state discrimination. We demonstrate the potential of our approach by finding the largest eigenvalue of up to $2^{1000}$ dimensional matrices and solving graph problems related to quantum contextuality. We also discuss NISQ algorithms for rank-constrained SDPs. Our work extends the application of NISQ computers onto one of the most successful algorithmic frameworks of the past few decades.
    Team Resilience under Shock: An Empirical Analysis of GitHub Repositories during Early COVID-19 Pandemic. (arXiv:2301.12326v1 [cs.LG])
    While many organizations have shifted to working remotely during the COVID-19 pandemic, how the remote workforce and the remote teams are influenced by and would respond to this and future shocks remain largely unknown. Software developers have relied on remote collaborations long before the pandemic, working in virtual teams (GitHub repositories). The dynamics of these repositories through the pandemic provide a unique opportunity to understand how remote teams react under shock. This work presents a systematic analysis. We measure the overall effect of the early pandemic on public GitHub repositories by comparing their sizes and productivity with the counterfactual outcomes forecasted as if there were no pandemic. We find that the productivity level and the number of active members of these teams vary significantly during different periods of the pandemic. We then conduct a finer-grained investigation and study the heterogeneous effects of the shock on individual teams. We find that the resilience of a team is highly correlated to certain properties of the team before the pandemic. Through a bootstrapped regression analysis, we reveal which types of teams are robust or fragile to the shock.
    Online Allocation Problem with Two-sided Resource Constraints. (arXiv:2112.13964v3 [cs.LG] UPDATED)
    In this paper, we investigate the online allocation problem of maximizing the overall revenue subject to both lower and upper bound constraints. Compared to the extensively studied online problems with only resource upper bounds, the two-sided constraints affect the prospects of resource consumption more severely. As a result, only limited violations of constraints or pessimistic competitive bounds could be guaranteed. To tackle the challenge, we define a measure of feasibility $\xi^*$ to evaluate the hardness of this problem, and estimate this measurement by an optimization routine with theoretical guarantees. We propose an online algorithm adopting a constructive framework, where we initialize a threshold price vector using the estimation, then dynamically update the price vector and use it for decision-making at each step. It can be shown that the proposed algorithm is $\big(1-O(\frac{\varepsilon}{\xi^*-\varepsilon})\big)$ or $\big(1-O(\frac{\varepsilon}{\xi^*-\sqrt{\varepsilon}})\big)$ competitive with high probability for $\xi^*$ known or unknown respectively. To the best of our knowledge, this is the first result establishing a nearly optimal competitive algorithm for solving two-sided constrained online allocation problems with a high probability of feasibility.
    Concept-based Explanations for Out-Of-Distribution Detectors. (arXiv:2203.02586v2 [cs.LG] UPDATED)
    Out-of-distribution (OOD) detection plays a crucial role in ensuring the safe deployment of deep neural network (DNN) classifiers. While a myriad of methods have focused on improving the performance of OOD detectors, a critical gap remains in interpreting their decisions. We help bridge this gap by providing explanations for OOD detectors based on learned high-level concepts. We first propose two new metrics for assessing the effectiveness of a particular set of concepts for explaining OOD detectors: 1) detection completeness, which quantifies the sufficiency of concepts for explaining an OOD-detector's decisions, and 2) concept separability, which captures the distributional separation between in-distribution and OOD data in the concept space. Based on these metrics, we propose a framework for learning a set of concepts that satisfy the desired properties of detection completeness and concept separability and demonstrate the framework's effectiveness in providing concept-based explanations for diverse OOD techniques. We also show how to identify prominent concepts that contribute to the detection results via a modified Shapley value-based importance score.
    Federated Learning in Satellite Constellations. (arXiv:2206.00307v2 [cs.IT] UPDATED)
    Federated learning (FL) has recently emerged as a distributed machine learning paradigm for systems with limited and intermittent connectivity. This paper presents the new context brought to FL by satellite constellations, where the connectivity patterns are significantly different from the ones observed in conventional terrestrial FL. The focus is on large constellations in low earth orbit (LEO), where each satellites participates in a data-driven FL task using a locally stored dataset. This scenario is motivated by the trend towards mega constellations of interconnected small satellites in LEO and the integration of artificial intelligence in satellites. We propose a classification of satellite FL based on the communication capabilities of the satellites, the constellation design, and the location of the parameter server. A comprehensive overview of the current state-of-the-art in this field is provided and the unique challenges and opportunities of satellite FL are discussed. Finally, we outline several open research directions for FL in satellite constellations and present some future perspectives on this topic.
    Adversarial Learning Networks: Source-free Unsupervised Domain Incremental Learning. (arXiv:2301.12054v1 [cs.LG])
    This work presents an approach for incrementally updating deep neural network (DNN) models in a non-stationary environment. DNN models are sensitive to changes in input data distribution, which limits their application to problem settings with stationary input datasets. In a non-stationary environment, updating a DNN model requires parameter re-training or model fine-tuning. We propose an unsupervised source-free method to update DNN classification models. The contributions of this work are two-fold. First, we use trainable Gaussian prototypes to generate representative samples for future iterations; second, using unsupervised domain adaptation, we incrementally adapt the existing model using unlabelled data. Unlike existing methods, our approach can update a DNN model incrementally for non-stationary source and target tasks without storing past training data. We evaluated our work on incremental sentiment prediction and incremental disease prediction applications and compared our approach to state-of-the-art continual learning, domain adaptation, and ensemble learning methods. Our results show that our approach achieved improved performance compared to existing incremental learning methods. We observe minimal forgetting of past knowledge over many iterations, which can help us develop unsupervised self-learning systems.
    Scalable Set Encoding with Universal Mini-Batch Consistency and Unbiased Full Set Gradient Approximation. (arXiv:2208.12401v3 [cs.LG] UPDATED)
    Recent work on mini-batch consistency (MBC) for set functions has brought attention to the need for sequentially processing and aggregating chunks of a partitioned set while guaranteeing the same output for all partitions. However, existing constraints on MBC architectures lead to models with limited expressive power. Additionally, prior work has not addressed how to deal with large sets during training when the full set gradient is required. To address these issues, we propose a Universally MBC (UMBC) class of set functions which can be used in conjunction with arbitrary non-MBC components while still satisfying MBC, enabling a wider range of function classes to be used in MBC settings. Furthermore, we propose an efficient MBC training algorithm which gives an unbiased approximation of the full set gradient and has a constant memory overhead for any set size for both train- and test-time. We conduct extensive experiments including image completion, text classification, unsupervised clustering, and cancer detection on high-resolution images to verify the efficiency and efficacy of our scalable set encoding framework.
    Continual Learning by Modeling Intra-Class Variation. (arXiv:2210.05398v2 [cs.LG] UPDATED)
    It has been observed that neural networks perform poorly when the data or tasks are presented sequentially. Unlike humans, neural networks suffer greatly from catastrophic forgetting, making it impossible to perform life-long learning. To address this issue, memory-based continual learning has been actively studied and stands out as one of the best-performing methods. We examine memory-based continual learning and identify that large variation in the representation space is crucial for avoiding catastrophic forgetting. Motivated by this, we propose to diversify representations by using two types of perturbations: model-agnostic variation (i.e., the variation is generated without the knowledge of the learned neural network) and model-based variation (i.e., the variation is conditioned on the learned neural network). We demonstrate that enlarging representational variation serves as a general principle to improve continual learning. Finally, we perform empirical studies which demonstrate that our method, as a simple plug-and-play component, can consistently improve a number of memory-based continual learning methods by a large margin.
    Applications of Generative Adversarial Networks in Neuroimaging and Clinical Neuroscience. (arXiv:2206.07081v2 [cs.LG] UPDATED)
    Generative adversarial networks (GANs) are one powerful type of deep learning models that have been successfully utilized in numerous fields. They belong to a broader family called generative methods, which generate new data with a probabilistic model by learning sample distribution from real examples. In the clinical context, GANs have shown enhanced capabilities in capturing spatially complex, nonlinear, and potentially subtle disease effects compared to traditional generative methods. This review appraises the existing literature on the applications of GANs in imaging studies of various neurological conditions, including Alzheimer's disease, brain tumors, brain aging, and multiple sclerosis. We provide an intuitive explanation of various GAN methods for each application and further discuss the main challenges, open questions, and promising future directions of leveraging GANs in neuroimaging. We aim to bridge the gap between advanced deep learning methods and neurology research by highlighting how GANs can be leveraged to support clinical decision making and contribute to a better understanding of the structural and functional patterns of brain diseases.
    Theoretical Perspectives on Deep Learning Methods in Inverse Problems. (arXiv:2206.14373v2 [stat.ML] UPDATED)
    In recent years, there have been significant advances in the use of deep learning methods in inverse problems such as denoising, compressive sensing, inpainting, and super-resolution. While this line of works has predominantly been driven by practical algorithms and experiments, it has also given rise to a variety of intriguing theoretical problems. In this paper, we survey some of the prominent theoretical developments in this line of works, focusing in particular on generative priors, untrained neural network priors, and unfolding algorithms. In addition to summarizing existing results in these topics, we highlight several ongoing challenges and open problems.
    Learning Mixtures of Markov Chains and MDPs. (arXiv:2211.09403v2 [stat.ML] UPDATED)
    We present an algorithm for learning mixtures of Markov chains and Markov decision processes (MDPs) from short unlabeled trajectories. Specifically, our method handles mixtures of Markov chains with optional control input by going through a multi-step process, involving (1) a subspace estimation step, (2) spectral clustering of trajectories using "pairwise distance estimators," along with refinement using the EM algorithm, (3) a model estimation step, and (4) a classification step for predicting labels of new trajectories. We provide end-to-end performance guarantees, where we only explicitly require the length of trajectories to be linear in the number of states and the number of trajectories to be linear in a mixing time parameter. Experimental results support these guarantees, where we attain 96.6% average accuracy on a mixture of two MDPs in gridworld, outperforming the EM algorithm with random initialization (73.2% average accuracy).
    MetaStackVis: Visually-Assisted Performance Evaluation of Metamodels. (arXiv:2212.03539v2 [cs.LG] UPDATED)
    Stacking (or stacked generalization) is an ensemble learning method with one main distinctiveness from the rest: even though several base models are trained on the original data set, their predictions are further used as input data for one or more metamodels arranged in at least one extra layer. Composing a stack of models can produce high-performance outcomes, but it usually involves a trial-and-error process. Therefore, our previously developed visual analytics system, StackGenVis, was mainly designed to assist users in choosing a set of top-performing and diverse models by measuring their predictive performance. However, it only employs a single logistic regression metamodel. In this paper, we investigate the impact of alternative metamodels on the performance of stacking ensembles using a novel visualization tool, called MetaStackVis. Our interactive tool helps users to visually explore different singular and pairs of metamodels according to their predictive probabilities and multiple validation metrics, as well as their ability to predict specific problematic data instances. MetaStackVis was evaluated with a usage scenario based on a medical data set and via expert interviews.
    Progressive Prompts: Continual Learning for Language Models. (arXiv:2301.12314v1 [cs.CL])
    We introduce Progressive Prompts - a simple and efficient approach for continual learning in language models. Our method allows forward transfer and resists catastrophic forgetting, without relying on data replay or a large number of task-specific parameters. Progressive Prompts learns a new soft prompt for each task and sequentially concatenates it with the previously learned prompts, while keeping the base model frozen. Experiments on standard continual learning benchmarks show that our approach outperforms state-of-the-art methods, with an improvement >20% in average test accuracy over the previous best-preforming method on T5 model. We also explore a more challenging continual learning setup with longer sequences of tasks and show that Progressive Prompts significantly outperforms prior methods.
    Let Offline RL Flow: Training Conservative Agents in the Latent Space of Normalizing Flows. (arXiv:2211.11096v2 [cs.LG] UPDATED)
    Offline reinforcement learning aims to train a policy on a pre-recorded and fixed dataset without any additional environment interactions. There are two major challenges in this setting: (1) extrapolation error caused by approximating the value of state-action pairs not well-covered by the training data and (2) distributional shift between behavior and inference policies. One way to tackle these problems is to induce conservatism - i.e., keeping the learned policies closer to the behavioral ones. To achieve this, we build upon recent works on learning policies in latent action spaces and use a special form of Normalizing Flows for constructing a generative model, which we use as a conservative action encoder. This Normalizing Flows action encoder is pre-trained in a supervised manner on the offline dataset, and then an additional policy model - controller in the latent space - is trained via reinforcement learning. This approach avoids querying actions outside of the training dataset and therefore does not require additional regularization for out-of-dataset actions. We evaluate our method on various locomotion and navigation tasks, demonstrating that our approach outperforms recently proposed algorithms with generative action models on a large portion of datasets.
    Laplacian-based Semi-Supervised Learning in Multilayer Hypergraphs by Coordinate Descent. (arXiv:2301.12184v1 [cs.LG])
    Graph Semi-Supervised learning is an important data analysis tool, where given a graph and a set of labeled nodes, the aim is to infer the labels to the remaining unlabeled nodes. In this paper, we start by considering an optimization-based formulation of the problem for an undirected graph, and then we extend this formulation to multilayer hypergraphs. We solve the problem using different coordinate descent approaches and compare the results with the ones obtained by the classic gradient descent method. Experiments on synthetic and real-world datasets show the potential of using coordinate descent methods with suitable selection rules.
    Deep Metric Learning with Chance Constraints. (arXiv:2209.09060v2 [cs.CV] UPDATED)
    Deep metric learning (DML) aims to minimize empirical expected loss of the pairwise intra-/inter- class proximity violations in the embedding image. We relate DML to feasibility problem of finite chance constraints. We show that minimizer of proxy-based DML satisfies certain chance constraints, and that the worst case generalization performance of the proxy-based methods can be characterized by the radius of the smallest ball around a class proxy to cover the entire domain of the corresponding class samples, suggesting multiple proxies per class helps performance. To provide a scalable algorithm as well as exploiting more proxies, we consider the chance constraints implied by the minimizers of proxy-based DML instances and reformulate DML as finding a feasible point in intersection of such constraints, resulting in a problem to be approximately solved by iterative projections. Simply put, we repeatedly train a regularized proxy-based loss and re-initialize the proxies with the embeddings of the deliberately selected new samples. We apply our method with the well-accepted losses and evaluate on four popular benchmark datasets for image retrieval. Outperforming state-of-the-art, our method consistently improves the performance of the applied losses. Code is available at: https://github.com/yetigurbuz/ccp-dml
    TemporAI: Facilitating Machine Learning Innovation in Time Domain Tasks for Medicine. (arXiv:2301.12260v1 [cs.LG])
    TemporAI is an open source Python software library for machine learning (ML) tasks involving data with a time component, focused on medicine and healthcare use cases. It supports data in time series, static, and eventmodalities and provides an interface for prediction, causal inference, and time-to-event analysis, as well as common preprocessing utilities and model interpretability methods. The library aims to facilitate innovation in the medical ML space by offering a standardized temporal setting toolkit for model development, prototyping and benchmarking, bridging the gaps in the ML research, healthcare professional, medical/pharmacological industry, and data science communities. TemporAI is available on GitHub (https://github.com/vanderschaarlab/temporai) and we welcome community engagement through use, feedback, and code contributions.
    Factor-augmented tree ensembles. (arXiv:2111.14000v4 [stat.ML] UPDATED)
    This manuscript proposes to extend the information set of time-series regression trees with latent stationary factors extracted via state-space methods. In doing so, this approach generalises time-series regression trees on two dimensions. First, it allows to handle predictors that exhibit measurement error, non-stationary trends, seasonality and/or irregularities such as missing observations. Second, it gives a transparent way for using domain-specific theory to inform time-series regression trees. As a byproduct, this technique sets the foundations for structuring powerful ensembles. Their real-world applicability is studied under the lenses of empirical macro-finance.
    Understanding Hindsight Goal Relabeling from a Divergence Minimization Perspective. (arXiv:2209.13046v2 [cs.LG] UPDATED)
    Hindsight goal relabeling has become a foundational technique in multi-goal reinforcement learning (RL). The essential idea is that any trajectory can be seen as a sub-optimal demonstration for reaching its final state. Intuitively, learning from those arbitrary demonstrations can be seen as a form of imitation learning (IL). However, the connection between hindsight goal relabeling and imitation learning is not well understood. In this paper, we propose a novel framework to understand hindsight goal relabeling from a divergence minimization perspective. Recasting the goal reaching problem in the IL framework not only allows us to derive several existing methods from first principles, but also provides us with the tools from IL to improve goal reaching algorithms. Experimentally, we find that under hindsight relabeling, Q-learning outperforms behavioral cloning (BC). Yet, a vanilla combination of both hurts performance. Concretely, we see that the BC loss only helps when selectively applied to actions that get the agent closer to the goal according to the Q-function. Our framework also explains the puzzling phenomenon wherein a reward of (-1, 0) results in significantly better performance than a (0, 1) reward for goal reaching.
    Does Federated Learning Really Need Backpropagation?. (arXiv:2301.12195v1 [cs.LG])
    Federated learning (FL) is a general principle for decentralized clients to train a server model collectively without sharing local data. FL is a promising framework with practical applications, but its standard training paradigm requires the clients to backpropagate through the model to compute gradients. Since these clients are typically edge devices and not fully trusted, executing backpropagation on them incurs computational and storage overhead as well as white-box vulnerability. In light of this, we develop backpropagation-free federated learning, dubbed BAFFLE, in which backpropagation is replaced by multiple forward processes to estimate gradients. BAFFLE is 1) memory-efficient and easily fits uploading bandwidth; 2) compatible with inference-only hardware optimization and model quantization or pruning; and 3) well-suited to trusted execution environments, because the clients in BAFFLE only execute forward propagation and return a set of scalars to the server. Empirically we use BAFFLE to train deep models from scratch or to finetune pretrained models, achieving acceptable results. Code is available in https://github.com/FengHZ/BAFFLE.
    Learning Effective SDEs from Brownian Dynamics Simulations of Colloidal Particles. (arXiv:2205.00286v3 [math.DS] UPDATED)
    We construct a reduced, data-driven, parameter dependent effective Stochastic Differential Equation (eSDE) for electric-field mediated colloidal crystallization using data obtained from Brownian Dynamics Simulations. We use Diffusion Maps (a manifold learning algorithm) to identify a set of useful latent observables. In this latent space we identify an eSDE using a deep learning architecture inspired by numerical stochastic integrators and compare it with the traditional Kramers-Moyal expansion estimation. We show that the obtained variables and the learned dynamics accurately encode the physics of the Brownian Dynamic Simulations. We further illustrate that our reduced model captures the dynamics of corresponding experimental data. Our dimension reduction/reduced model identification approach can be easily ported to a broad class of particle systems dynamics experiments/models.
    Hierarchical clustering: visualization, feature importance and model selection. (arXiv:2112.01372v2 [stat.ME] UPDATED)
    We propose methods for the analysis of hierarchical clustering that fully use the multi-resolution structure provided by a dendrogram. Specifically, we propose a loss for choosing between clustering methods, a feature importance score and a graphical tool for visualizing the segmentation of features in a dendrogram. Current approaches to these tasks lead to loss of information since they require the user to generate a single partition of the instances by cutting the dendrogram at a specified level. Our proposed methods, instead, use the full structure of the dendrogram. The key insight behind the proposed methods is to view a dendrogram as a phylogeny. This analogy permits the assignment of a feature value to each internal node of a tree through an evolutionary model. Real and simulated datasets provide evidence that our proposed framework has desirable outcomes and gives more insights than state-of-art approaches. We provide an R package that implements our methods.  ( 2 min )
    ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs. (arXiv:2210.03052v2 [cs.LG] UPDATED)
    Transformer is the cornerstone model of Natural Language Processing (NLP) over the past decade. Despite its great success in Deep Learning (DL) applications, the increasingly growing parameter space required by transformer models boosts the demand on accelerating the performance of transformer models. In addition, NLP problems can commonly be faced with variable-length sequences since their word numbers can vary among sentences. Existing DL frameworks need to pad variable-length sequences to the maximal length, which, however, leads to significant memory and computational overhead. In this paper, we present ByteTransformer, a high-performance transformer boosted for variable-length inputs. We propose a zero padding algorithm that enables the whole transformer to be free from redundant computations on useless padded tokens. Besides the algorithmic level optimization, we provide architectural-aware optimizations for transformer functioning modules, especially the performance-critical algorithm, multi-head attention (MHA). Experimental results on an NVIDIA A100 GPU with variable-length sequence inputs validate that our fused MHA (FMHA) outperforms the standard PyTorch MHA by 6.13X. The end-to-end performance of ByteTransformer for a standard BERT transformer model surpasses the state-of-the-art Transformer frameworks, such as PyTorch JIT, TensorFlow XLA, Tencent TurboTransformer and NVIDIA FasterTransformer, by 87\%, 131\%, 138\% and 46\%, respectively.  ( 2 min )
    Online Self-Concordant and Relatively Smooth Minimization, With Applications to Online Portfolio Selection and Learning Quantum States. (arXiv:2210.00997v2 [stat.ML] UPDATED)
    Consider an online convex optimization problem where the loss functions are self-concordant barriers, smooth relative to a convex function $h$, and possibly non-Lipschitz. We analyze the regret of online mirror descent with $h$. Then, based on the result, we prove the following in a unified manner. Denote by $T$ the time horizon and $d$ the parameter dimension. 1. For online portfolio selection, the regret of $\widetilde{\text{EG}}$, a variant of exponentiated gradient due to Helmbold et al., is $\tilde{O} ( T^{2/3} d^{1/3} )$ when $T > 4 d / \log d$. This improves on the original $\tilde{O} ( T^{3/4} d^{1/2} )$ regret bound for $\widetilde{\text{EG}}$. 2. For online portfolio selection, the regret of online mirror descent with the logarithmic barrier is $\tilde{O}(\sqrt{T d})$. The regret bound is the same as that of Soft-Bayes due to Orseau et al. up to logarithmic terms. 3. For online learning quantum states with the logarithmic loss, the regret of online mirror descent with the log-determinant function is also $\tilde{O} ( \sqrt{T d} )$. Its per-iteration time is shorter than all existing algorithms we know.  ( 2 min )
    Discovering Limitations of Image Quality Assessments with Noised Deep Learning Image Sets. (arXiv:2210.10249v2 [cs.CV] UPDATED)
    Image quality is important, and can affect overall performance in image processing and computer vision as well as for numerous other reasons. Image quality assessment (IQA) is consequently a vital task in different applications from aerial photography interpretation to object detection to medical image analysis. In previous research, the BRISQUE algorithm and the PSNR algorithm were evaluated with high resolution (atleast 512x384 pixels), but relatively small image sets (no more than 4,744 images). However, scientists have not evaluated IQA algorithms on low resolution (no more than 32x32 pixels), multi-perturbation, big image sets (for example, tleast 60,000 different images not counting their perturbations). This study explores these two IQA algorithms through experimental investigation. We first chose two deep learning image sets, CIFAR-10 and MNIST. Then, we added 68 perturbations that add noise to the images in specific sequences and noise intensities. In addition, we tracked the performance outputs of the two IQA algorithms with singly and multiply noised images. After quantitatively analyzing experimental results, we report the limitations of the two IQAs with these noised CIFAR-10 and MNIST image sets. We also explain three potential root causes for performance degradation. These findings point out weaknesses of the two IQA algorithms. The research results provide guidance to scientists and engineers developing accurate, robust IQA algorithms. All source codes, related image sets, and figures are shared on the website (https://github.com/caperock/imagequality) to support future scientific and industrial projects.  ( 2 min )
    Policy-Adaptive Estimator Selection for Off-Policy Evaluation. (arXiv:2211.13904v2 [cs.LG] UPDATED)
    Off-policy evaluation (OPE) aims to accurately evaluate the performance of counterfactual policies using only offline logged data. Although many estimators have been developed, there is no single estimator that dominates the others, because the estimators' accuracy can vary greatly depending on a given OPE task such as the evaluation policy, number of actions, and noise level. Thus, the data-driven estimator selection problem is becoming increasingly important and can have a significant impact on the accuracy of OPE. However, identifying the most accurate estimator using only the logged data is quite challenging because the ground-truth estimation accuracy of estimators is generally unavailable. This paper studies this challenging problem of estimator selection for OPE for the first time. In particular, we enable an estimator selection that is adaptive to a given OPE task, by appropriately subsampling available logged data and constructing pseudo policies useful for the underlying estimator selection task. Comprehensive experiments on both synthetic and real-world company data demonstrate that the proposed procedure substantially improves the estimator selection compared to a non-adaptive heuristic.  ( 2 min )
    Don't Play Favorites: Minority Guidance for Diffusion Models. (arXiv:2301.12334v1 [cs.LG])
    We explore the problem of generating minority samples using diffusion models. The minority samples are instances that lie on low-density regions of a data manifold. Generating sufficient numbers of such minority instances is important, since they often contain some unique attributes of the data. However, the conventional generation process of the diffusion models mostly yields majority samples (that lie on high-density regions of the manifold) due to their high likelihoods, making themselves highly ineffective and time-consuming for the task. In this work, we present a novel framework that can make the generation process of the diffusion models focus on the minority samples. We first provide a new insight on the majority-focused nature of the diffusion models: they denoise in favor of the majority samples. The observation motivates us to introduce a metric that describes the uniqueness of a given sample. To address the inherent preference of the diffusion models w.r.t. the majority samples, we further develop minority guidance, a sampling technique that can guide the generation process toward regions with desired likelihood levels. Experiments on benchmark real datasets demonstrate that our minority guidance can greatly improve the capability of generating the low-likelihood minority samples over existing generative frameworks including the standard diffusion sampler.  ( 2 min )
    Quantum Machine Learning for Decentralized Quantum Protocols with Local Operations and Noisy Classical Communications. (arXiv:2207.11354v2 [quant-ph] UPDATED)
    Distributed quantum information processing protocols such as quantum entanglement distillation and quantum state discrimination rely on local operations and classical communications (LOCC). Existing LOCC-based protocols typically assume the availability of ideal, noiseless, communication channels. In this paper, we study the case in which classical communication takes place over noisy channels, and we propose to address the design of LOCC protocols in this setting via the use of quantum machine learning tools. We specifically focus on the important tasks of quantum entanglement distillation and quantum state discrimination, and implement local processing through parameterized quantum circuits (PQCs) that are optimized to maximize the average fidelity and average success probability in the respective tasks, while accounting for communication errors. The introduced approach, Noise Aware-LOCCNet (NA-LOCCNet), is shown to have significant advantages over existing protocols designed for noiseless communications.  ( 2 min )
    Efficient Enumeration of Markov Equivalent DAGs. (arXiv:2301.12212v1 [cs.AI])
    Enumerating the directed acyclic graphs (DAGs) of a Markov equivalence class (MEC) is an important primitive in causal analysis. The central resource from the perspective of computational complexity is the delay, that is, the time an algorithm that lists all members of the class requires between two consecutive outputs. Commonly used algorithms for this task utilize the rules proposed by Meek (1995) or the transformational characterization by Chickering (1995), both resulting in superlinear delay. In this paper, we present the first linear-time delay algorithm. On the theoretical side, we show that our algorithm can be generalized to enumerate DAGs represented by models that incorporate background knowledge, such as MPDAGs; on the practical side, we provide an efficient implementation and evaluate it in a series of experiments. Complementary to the linear-time delay algorithm, we also provide intriguing insights into Markov equivalence itself: All members of an MEC can be enumerated such that two successive DAGs have structural Hamming distance at most three.  ( 2 min )
    Large Language Models are Zero-Shot Reasoners. (arXiv:2205.11916v4 [cs.CL] UPDATED)
    Pretrained large language models (LLMs) are widely used in many sub-fields of natural language processing (NLP) and generally known as excellent few-shot learners with task-specific exemplars. Notably, chain of thought (CoT) prompting, a recent technique for eliciting complex multi-step reasoning through step-by-step answer examples, achieved the state-of-the-art performances in arithmetics and symbolic reasoning, difficult system-2 tasks that do not follow the standard scaling laws for LLMs. While these successes are often attributed to LLMs' ability for few-shot learning, we show that LLMs are decent zero-shot reasoners by simply adding "Let's think step by step" before each answer. Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances on diverse benchmark reasoning tasks including arithmetics (MultiArith, GSM8K, AQUA-RAT, SVAMP), symbolic reasoning (Last Letter, Coin Flip), and other logical reasoning tasks (Date Understanding, Tracking Shuffled Objects), without any hand-crafted few-shot examples, e.g. increasing the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with large InstructGPT model (text-davinci-002), as well as similar magnitudes of improvements with another off-the-shelf large model, 540B parameter PaLM. The versatility of this single prompt across very diverse reasoning tasks hints at untapped and understudied fundamental zero-shot capabilities of LLMs, suggesting high-level, multi-task broad cognitive capabilities may be extracted by simple prompting. We hope our work not only serves as the minimal strongest zero-shot baseline for the challenging reasoning benchmarks, but also highlights the importance of carefully exploring and analyzing the enormous zero-shot knowledge hidden inside LLMs before crafting finetuning datasets or few-shot exemplars.  ( 3 min )
    Learning Locality and Isotropy in Dialogue Modeling. (arXiv:2205.14583v2 [cs.CL] UPDATED)
    Existing dialogue modeling methods have achieved promising performance on various dialogue tasks with the aid of Transformer and the large-scale pre-trained language models. However, some recent studies revealed that the context representations produced by these methods suffer the problem of anisotropy. In this paper, we find that the generated representations are also not conversational, losing the conversation structure information during the context modeling stage. To this end, we identify two properties in dialogue modeling, i.e., locality and isotropy, and present a simple method for dialogue representation calibration, namely SimDRC, to build isotropic and conversational feature spaces. Experimental results show that our approach significantly outperforms the current state-of-the-art models on three dialogue tasks across the automatic and human evaluation metrics. More in-depth analyses further confirm the effectiveness of our proposed approach.
    Scalable and Equivariant Spherical CNNs by Discrete-Continuous (DISCO) Convolutions. (arXiv:2209.13603v3 [cs.CV] UPDATED)
    No existing spherical convolutional neural network (CNN) framework is both computationally scalable and rotationally equivariant. Continuous approaches capture rotational equivariance but are often prohibitively computationally demanding. Discrete approaches offer more favorable computational performance but at the cost of equivariance. We develop a hybrid discrete-continuous (DISCO) group convolution that is simultaneously equivariant and computationally scalable to high-resolution. While our framework can be applied to any compact group, we specialize to the sphere. Our DISCO spherical convolutions exhibit $\text{SO}(3)$ rotational equivariance, where $\text{SO}(n)$ is the special orthogonal group representing rotations in $n$-dimensions. When restricting rotations of the convolution to the quotient space $\text{SO}(3)/\text{SO}(2)$ for further computational enhancements, we recover a form of asymptotic $\text{SO}(3)$ rotational equivariance. Through a sparse tensor implementation we achieve linear scaling in number of pixels on the sphere for both computational cost and memory usage. For 4k spherical images we realize a saving of $10^9$ in computational cost and $10^4$ in memory usage when compared to the most efficient alternative equivariant spherical convolution. We apply the DISCO spherical CNN framework to a number of benchmark dense-prediction problems on the sphere, such as semantic segmentation and depth estimation, on all of which we achieve the state-of-the-art performance.  ( 2 min )
    RNNs of RNNs: Recursive Construction of Stable Assemblies of Recurrent Neural Networks. (arXiv:2106.08928v6 [cs.LG] UPDATED)
    Recurrent neural networks (RNNs) are widely used throughout neuroscience as models of local neural activity. Many properties of single RNNs are well characterized theoretically, but experimental neuroscience has moved in the direction of studying multiple interacting areas, and RNN theory needs to be likewise extended. We take a constructive approach towards this problem, leveraging tools from nonlinear control theory and machine learning to characterize when combinations of stable RNNs will themselves be stable. Importantly, we derive conditions which allow for massive feedback connections between interacting RNNs. We parameterize these conditions for easy optimization using gradient-based techniques, and show that stability-constrained "networks of networks" can perform well on challenging sequential-processing benchmark tasks. Altogether, our results provide a principled approach towards understanding distributed, modular function in the brain.  ( 2 min )
    A One-shot Framework for Distributed Clustered Learning in Heterogeneous Environments. (arXiv:2209.10866v3 [cs.LG] UPDATED)
    The paper proposes a family of communication efficient methods for distributed learning in heterogeneous environments in which users obtain data from one of $K$ different data distributions. In the proposed setup, the grouping of users based on the data distributions they sample, as well as the underlying statistical properties of the distributions are apriori unknown. A family of One-shot Distributed Clustered Learning methods (ODCL-$\mathcal{C}$) is proposed, parametrized by the set of admissible clustering algorithms $\mathcal{C}$, with the objective of learning the true model at each user. The admissible clustering methods include $K$-means (KM) and convex clustering (CC), giving rise to various one-shot methods within the proposed family, such as ODCL-KM and ODCL-CC. The proposed one-shot approach, based on local computations at the users and a clustering based aggregation step at the server is shown to provide strong learning guarantees. In particular, for strongly convex problems it is shown that, as long as the number of data points per user is above a threshold, the proposed approach achieves order-optimal mean-squared error (MSE) rates in terms of the sample size. An explicit characterization of the threshold is provided in terms of the problem parameters. Numerical experiments illustrate the findings and corroborate the performance of the proposed methods. We also highlight the trade-offs with respect to selecting various clustering methods (ODCL-CC, ODCL-KM) and demonstrate significant improvements over state-of-the-art.  ( 2 min )
    Likelihood-Free Frequentist Inference: Confidence Sets with Correct Conditional Coverage. (arXiv:2107.03920v5 [stat.ML] UPDATED)
    Many areas of science make extensive use of computer simulators that implicitly encode likelihood functions of complex systems. Classical statistical methods are poorly suited for these so-called likelihood-free inference (LFI) settings, particularly outside asymptotic and low-dimensional regimes. Although new machine learning methods, such as normalizing flows, have revolutionized the sample efficiency and capacity of LFI methods, it remains an open question whether they produce confidence sets with correct conditional coverage for small sample sizes. This paper unifies classical statistics with modern machine learning to present (i) a practical procedure for the Neyman construction of confidence sets with finite-sample guarantees of nominal coverage, and (ii) diagnostics that estimate conditional coverage over the entire parameter space. We refer to our framework as likelihood-free frequentist inference (LF2I). Any method that defines a test statistic, like the likelihood ratio, can leverage the LF2I machinery to create valid confidence sets and diagnostics without costly Monte Carlo samples at fixed parameter settings. We study the power of two test statistics (ACORE and BFF), which, respectively, maximize versus integrate an odds function over the parameter space. Our paper discusses the benefits and challenges of LF2I, with a breakdown of the sources of errors in LF2I confidence sets.
    Inference on the Optimal Assortment in the Multinomial Logit Model. (arXiv:2301.12254v1 [stat.ML])
    Assortment optimization has received active explorations in the past few decades due to its practical importance. Despite the extensive literature dealing with optimization algorithms and latent score estimation, uncertainty quantification for the optimal assortment still needs to be explored and is of great practical significance. Instead of estimating and recovering the complete optimal offer set, decision makers may only be interested in testing whether a given property holds true for the optimal assortment, such as whether they should include several products of interest in the optimal set, or how many categories of products the optimal set should include. This paper proposes a novel inferential framework for testing such properties. We consider the widely adopted multinomial logit (MNL) model, where we assume that each customer will purchase an item within the offered products with a probability proportional to the underlying preference score associated with the product. We reduce inferring a general optimal assortment property to quantifying the uncertainty associated with the sign change point detection of the marginal revenue gaps. We show the asymptotic normality of the marginal revenue gap estimator, and construct a maximum statistic via the gap estimators to detect the sign change point. By approximating the distribution of the maximum statistic with multiplier bootstrap techniques, we propose a valid testing procedure. We also conduct numerical experiments to assess the performance of our method.
    TOAST: Topological Algorithm for Singularity Tracking. (arXiv:2210.00069v2 [cs.LG] UPDATED)
    The manifold hypothesis, which assumes that data lies on or close to an unknown manifold of low intrinsic dimension, is a staple of modern machine learning research. However, recent work has shown that real-world data exhibits distinct non-manifold structures, i.e. singularities, that can lead to erroneous findings. Detecting such singularities is therefore crucial as a precursor to interpolation and inference tasks. We address this issue by developing a topological framework that (i) quantifies the local intrinsic dimension, and (ii) yields a Euclidicity score for assessing the 'manifoldness' of a point along multiple scales. Our approach identifies singularities of complex spaces, while also capturing singular structures and local geometric complexity in image data.  ( 2 min )
    Decentralized Entropic Optimal Transport for Privacy-preserving Distributed Distribution Comparison. (arXiv:2301.12065v1 [cs.LG])
    Privacy-preserving distributed distribution comparison measures the distance between the distributions whose data are scattered across different agents in a distributed system and cannot be shared among the agents. In this study, we propose a novel decentralized entropic optimal transport (EOT) method, which provides a privacy-preserving and communication-efficient solution to this problem with theoretical guarantees. In particular, we design a mini-batch randomized block-coordinate descent (MRBCD) scheme to optimize the decentralized EOT distance in its dual form. The dual variables are scattered across different agents and updated locally and iteratively with limited communications among partial agents. The kernel matrix involved in the gradients of the dual variables is estimated by a distributed kernel approximation method, and each agent only needs to approximate and store a sub-kernel matrix by one-shot communication and without sharing raw data. We analyze our method's communication complexity and provide a theoretical bound for the approximation error caused by the convergence error, the approximated kernel, and the mismatch between the storage and communication protocols. Experiments on synthetic data and real-world distributed domain adaptation tasks demonstrate the effectiveness of our method.
    Data Heterogeneity Differential Privacy: From Theory to Algorithm. (arXiv:2002.08578v2 [cs.LG] UPDATED)
    Traditionally, the random noise is equally injected when training with different data instances in the field of differential privacy (DP). In this paper, we first give sharper excess risk bounds of DP stochastic gradient descent (SGD) method. Considering most of the previous methods are under convex conditions, we use Polyak-{\L}ojasiewicz condition to relax it in this paper. Then, after observing that different training data instances affect the machine learning model to different extent, we consider the heterogeneity of training data and attempt to improve the performance of DP-SGD from a new perspective. Specifically, by introducing the influence function (IF), we quantitatively measure the contributions of various training data on the final machine learning model. If the contribution made by a single data instance is so little that attackers cannot infer anything from the model, we do not add noise when training with it. Based on this observation, we design a `Performance Improving' DP-SGD algorithm: PIDP-SGD. Theoretical and experimental results show that our proposed PIDP-SGD improves the performance significantly.
    How Powerful are Shallow Neural Networks with Bandlimited Random Weights?. (arXiv:2008.08427v2 [cs.LG] UPDATED)
    We investigate the expressive power of depth-2 bandlimited random neural networks. A random net is a neural network where the hidden layer parameters are frozen with random assignment, and only the output layer parameters are trained by loss minimization. Using random weights for a hidden layer is an effective method to avoid non-convex optimization in standard gradient descent learning. It has also been adopted in recent deep learning theories. Despite the well-known fact that a neural network is a universal approximator, in this study, we mathematically show that when hidden parameters are distributed in a bounded domain, the network may not achieve zero approximation error. In particular, we derive a new nontrivial approximation error lower bound. The proof utilizes the technique of ridgelet analysis, a harmonic analysis method designed for neural networks. This method is inspired by fundamental principles in classical signal processing, specifically the idea that signals with limited bandwidth may not always be able to perfectly recreate the original signal. We corroborate our theoretical results with various simulation studies, and generally, two main take-home messages are offered: (i) Not any distribution for selecting random weights is feasible to build a universal approximator; (ii) A suitable assignment of random weights exists but to some degree is associated with the complexity of the target function.
    Machine Learning Methods for Cancer Classification Using Gene Expression Data: A Review. (arXiv:2301.12222v1 [cs.LG])
    Cancer is a term that denotes a group of diseases caused by abnormal growth of cells that can spread in different parts of the body. According to the World Health Organization (WHO), cancer is the second major cause of death after cardiovascular diseases. Gene expression can play a fundamental role in the early detection of cancer, as it is indicative of the biochemical processes in tissue and cells, as well as the genetic characteristics of an organism. Deoxyribonucleic Acid (DNA) microarrays and Ribonucleic Acid (RNA)- sequencing methods for gene expression data allow quantifying the expression levels of genes and produce valuable data for computational analysis. This study reviews recent progress in gene expression analysis for cancer classification using machine learning methods. Both conventional and deep learning-based approaches are reviewed, with an emphasis on the ap-plication of deep learning models due to their comparative advantages for identifying gene patterns that are distinctive for various types of cancers. Relevant works that employ the most commonly used deep neural network architectures are covered, including multi-layer perceptrons, convolutional, recurrent, graph, and transformer networks. This survey also presents an overview of the data collection methods for gene expression analysis and lists important datasets that are commonly used for supervised machine learning for this task. Furthermore, reviewed are pertinent techniques for feature engineering and data preprocessing that are typically used to handle the high dimensionality of gene expression data, caused by a large number of genes present in data samples. The paper concludes with a discussion of future research directions for machine learning-based gene expression analysis for cancer classification.
    Probable Domain Generalization via Quantile Risk Minimization. (arXiv:2207.09944v3 [stat.ML] UPDATED)
    Domain generalization (DG) seeks predictors which perform well on unseen test distributions by leveraging data drawn from multiple related training distributions or domains. To achieve this, DG is commonly formulated as an average- or worst-case problem over the set of possible domains. However, predictors that perform well on average lack robustness while predictors that perform well in the worst case tend to be overly-conservative. To address this, we propose a new probabilistic framework for DG where the goal is to learn predictors that perform well with high probability. Our key idea is that distribution shifts seen during training should inform us of probable shifts at test time, which we realize by explicitly relating training and test domains as draws from the same underlying meta-distribution. To achieve probable DG, we propose a new optimization problem called Quantile Risk Minimization (QRM). By minimizing the $\alpha$-quantile of predictor's risk distribution over domains, QRM seeks predictors that perform well with probability $\alpha$. To solve QRM in practice, we propose the Empirical QRM (EQRM) algorithm and provide: (i) a generalization bound for EQRM; and (ii) the conditions under which EQRM recovers the causal predictor as $\alpha \to 1$. In our experiments, we introduce a more holistic quantile-focused evaluation protocol for DG and demonstrate that EQRM outperforms state-of-the-art baselines on datasets from WILDS and DomainBed.  ( 2 min )
    Practical Differentially Private Hyperparameter Tuning with Subsampling. (arXiv:2301.11989v1 [cs.LG])
    Tuning all the hyperparameters of differentially private (DP) machine learning (ML) algorithms often requires use of sensitive data and this may leak private information via hyperparameter values. Recently, Papernot and Steinke (2022) proposed a certain class of DP hyperparameter tuning algorithms, where the number of random search samples is randomized itself. Commonly, these algorithms still considerably increase the DP privacy parameter $\varepsilon$ over non-tuned DP ML model training and can be computationally heavy as evaluating each hyperparameter candidate requires a new training run. We focus on lowering both the DP bounds and the computational complexity of these methods by using only a random subset of the sensitive data for the hyperparameter tuning and by extrapolating the optimal values from the small dataset to a larger dataset. We provide a R\'enyi differential privacy analysis for the proposed method and experimentally show that it consistently leads to better privacy-utility trade-off than the baseline method by Papernot and Steinke (2022).
    (Private) Kernelized Bandits with Distributed Biased Feedback. (arXiv:2301.12061v1 [cs.LG])
    In this paper, we study kernelized bandits with distributed biased feedback. This problem is motivated by several real-world applications (such as dynamic pricing, cellular network configuration, and policy making), where users from a large population contribute to the reward of the action chosen by a central entity, but it is difficult to collect feedback from all users. Instead, only biased feedback (due to user heterogeneity) from a subset of users may be available. In addition to such partial biased feedback, we are also faced with two practical challenges due to communication cost and computation complexity. To tackle these challenges, we carefully design a new \emph{distributed phase-then-batch-based elimination (\texttt{DPBE})} algorithm, which samples users in phases for collecting feedback to reduce the bias and employs \emph{maximum variance reduction} to select actions in batches within each phase. By properly choosing the phase length, the batch size, and the confidence width used for eliminating suboptimal actions, we show that \texttt{DPBE} achieves a sublinear regret of $\tilde{O}(T^{1-\alpha/2}+\sqrt{\gamma_T T})$, where $\alpha\in (0,1)$ is the user-sampling parameter one can tune. Moreover, \texttt{DPBE} can significantly reduce both communication cost and computation complexity in distributed kernelized bandits, compared to some variants of the state-of-the-art algorithms (originally developed for standard kernelized bandits). Furthermore, by incorporating various \emph{differential privacy} models (including the central, local, and shuffle models), we generalize \texttt{DPBE} to provide privacy guarantees for users participating in the distributed learning process. Finally, we conduct extensive simulations to validate our theoretical results and evaluate the empirical performance.
    SaFormer: A Conditional Sequence Modeling Approach to Offline Safe Reinforcement Learning. (arXiv:2301.12203v1 [cs.LG])
    Offline safe RL is of great practical relevance for deploying agents in real-world applications. However, acquiring constraint-satisfying policies from the fixed dataset is non-trivial for conventional approaches. Even worse, the learned constraints are stationary and may become invalid when the online safety requirement changes. In this paper, we present a novel offline safe RL approach referred to as SaFormer, which tackles the above issues via conditional sequence modeling. In contrast to existing sequence models, we propose cost-related tokens to restrict the action space and a posterior safety verification to enforce the constraint explicitly. Specifically, SaFormer performs a two-stage auto-regression conditioned by the maximum remaining cost to generate feasible candidates. It then filters out unsafe attempts and executes the optimal action with the highest expected return. Extensive experiments demonstrate the efficacy of SaFormer featuring (1) competitive returns with tightened constraint satisfaction; (2) adaptability to the in-range cost values of the offline data without retraining; (3) generalizability for constraints beyond the current dataset.  ( 2 min )
    Efficient Latency-Aware CNN Depth Compression via Two-Stage Dynamic Programming. (arXiv:2301.12187v1 [cs.LG])
    Recent works on neural network pruning advocate that reducing the depth of the network is more effective in reducing run-time memory usage and accelerating inference latency than reducing the width of the network through channel pruning. In this regard, some recent works propose depth compression algorithms that merge convolution layers. However, the existing algorithms have a constricted search space and rely on human-engineered heuristics. In this paper, we propose a novel depth compression algorithm which targets general convolution operations. We propose a subset selection problem that replaces inefficient activation layers with identity functions and optimally merges consecutive convolution operations into shallow equivalent convolution operations for efficient end-to-end inference latency. Since the proposed subset selection problem is NP-hard, we formulate a surrogate optimization problem that can be solved exactly via two-stage dynamic programming within a few seconds. We evaluate our methods and baselines by TensorRT for a fair inference latency comparison. Our method outperforms the baseline method with higher accuracy and faster inference speed in MobileNetV2 on the ImageNet dataset. Specifically, we achieve $1.61\times$speed-up with only $0.62$\%p accuracy drop in MobileNetV2-1.4 on the ImageNet.
    A VAE-Bayesian Deep Learning Scheme for Solar Generation Forecasting based on Dimensionality Reduction. (arXiv:2103.12969v2 [cs.LG] UPDATED)
    The advancement of distributed generation technologies in modern power systems has led to a widespread integration of renewable power generation at customer side. However, the intermittent nature of renewable energy poses new challenges to the network operational planning with underlying uncertainties. This paper proposes a novel Bayesian probabilistic technique for forecasting renewable solar generation by addressing data and model uncertainties by integrating bidirectional long short-term memory (BiLSTM) neural networks while compressing the weight parameters using variational autoencoder (VAE). Existing Bayesian deep learning methods suffer from high computational complexities as they require to draw a large number of samples from weight parameters expressed in the form of probability distributions. The proposed method can deal with uncertainty present in model and data in a more computationally efficient manner by reducing the dimensionality of model parameters. The proposed method is evaluated using quantile loss, reconstruction error, and deterministic forecasting evaluation metrics such as root-mean square error. It is inferred from the numerical results that VAE-Bayesian BiLSTM outperforms other probabilistic and deterministic deep learning methods for solar power forecasting in terms of accuracy and computational efficiency for different sizes of the dataset.
    Leveraging Importance Weights in Subset Selection. (arXiv:2301.12052v1 [cs.LG])
    We present a subset selection algorithm designed to work with arbitrary model families in a practical batch setting. In such a setting, an algorithm can sample examples one at a time but, in order to limit overhead costs, is only able to update its state (i.e. further train model weights) once a large enough batch of examples is selected. Our algorithm, IWeS, selects examples by importance sampling where the sampling probability assigned to each example is based on the entropy of models trained on previously selected batches. IWeS admits significant performance improvement compared to other subset selection algorithms for seven publicly available datasets. Additionally, it is competitive in an active learning setting, where the label information is not available at selection time. We also provide an initial theoretical analysis to support our importance weighting approach, proving generalization and sampling rate bounds.
    Relational Reasoning Networks. (arXiv:2106.00393v3 [cs.AI] UPDATED)
    Neuro-symbolic methods integrate neural architectures, knowledge representation and reasoning. However, they have been struggling at both dealing with the intrinsic uncertainty of the observations and scaling to real-world applications. This paper presents Relational Reasoning Networks (R2N), a novel end-to-end model that performs relational reasoning in the latent space of a deep learner architecture, where the representations of constants, ground atoms and their manipulations are learned in an integrated fashion. Unlike flat architectures like Knowledge Graph Embedders, which can only represent relations between entities, R2Ns define an additional computational structure, accounting for higher-level relations among the ground atoms. The considered relations can be explicitly known, like the ones defined by logic formulas, or defined as unconstrained correlations among groups of ground atoms. R2Ns can be applied to purely symbolic tasks or as a neuro-symbolic platform to integrate learning and reasoning in heterogeneous problems with both symbolic and feature-based represented entities. The proposed model overtakes the limitations of previous neuro-symbolic methods that have been either limited in terms of scalability or expressivity. The proposed methodology is shown to achieve state-of-the-art results in different experimental settings.
    APAC: Authorized Probability-controlled Actor-Critic For Offline Reinforcement Learning. (arXiv:2301.12130v1 [cs.LG])
    Due to the inability to interact with the environment, offline reinforcement learning (RL) methods face the challenge of estimating the Out-of-Distribution (OOD) points. Most existing methods exclude the OOD areas or restrict the value of $Q$ function. However, these methods either are over-conservative or suffer from model uncertainty prediction. In this paper, we propose an authorized probabilistic-control policy learning (APAC) method. The proposed method learns the distribution characteristics of the feasible states/actions by utilizing the flow-GAN model. Specifically, APAC avoids taking action in the low probability density region of behavior policy, while allows exploration in the authorized high probability density region. Theoretical proofs are provided to justify the advantage of APAC. Empirically, APAC outperforms existing alternatives on a variety of simulated tasks, and yields higher expected returns.
    Anticipate, Ensemble and Prune: Improving Convolutional Neural Networks via Aggregated Early Exits. (arXiv:2301.12168v1 [cs.LG])
    Today, artificial neural networks are the state of the art for solving a variety of complex tasks, especially in image classification. Such architectures consist of a sequence of stacked layers with the aim of extracting useful information and having it processed by a classifier to make accurate predictions. However, intermediate information within such models is often left unused. In other cases, such as in edge computing contexts, these architectures are divided into multiple partitions that are made functional by including early exits, i.e. intermediate classifiers, with the goal of reducing the computational and temporal load without extremely compromising the accuracy of the classifications. In this paper, we present Anticipate, Ensemble and Prune (AEP), a new training technique based on weighted ensembles of early exits, which aims at exploiting the information in the structure of networks to maximise their performance. Through a comprehensive set of experiments, we show how the use of this approach can yield average accuracy improvements of up to 15% over traditional training. In its hybrid-weighted configuration, AEP's internal pruning operation also allows reducing the number of parameters by up to 41%, lowering the number of multiplications and additions by 18% and the latency time to make inference by 16%. By using AEP, it is also possible to learn weights that allow early exits to achieve better accuracy values than those obtained from single-output reference models.
    Towards Lossless ANN-SNN Conversion under Ultra-Low Latency with Dual-Phase Optimization. (arXiv:2205.07473v2 [cs.NE] UPDATED)
    Spiking neural network (SNN) operating with asynchronous discrete events shows higher energy efficiency. A popular approach to implementing deep SNNs is ANN-SNN conversion combining both efficient training of ANNs and efficient inference of SNNs. However, due to the intrinsic difference between ANNs and SNNs, the accuracy loss is usually non-negligible, especially under low simulating steps. It restricts the applications of SNN on latency-sensitive edge devices greatly. In this paper, we identify such performance degradation stems from the misrepresentation of the negative or overflow residual membrane potential in SNNs. Inspired by this, we systematically analyze the conversion error between SNNs and ANNs, and then decompose it into three folds: quantization error, clipping error, and residual membrane potential representation error. With such insights, we propose a dual-phase conversion algorithm to minimize those errors separately. Besides, we show each phase achieves significant performance gains in a complementary manner. We evaluate our method on challenging datasets including CIFAR-10, CIFAR-100, and ImageNet datasets. The experimental results show the proposed method achieves the state-of-the-art in terms of both accuracy and latency with promising energy preservation compared to ANNs. For instance, our method achieves an accuracy of 73.20% on CIFAR-100 in only 2 time steps with 15.7$\times$ less energy consumption.
    Unearthing InSights into Mars: unsupervised source separation with limited data. (arXiv:2301.11981v1 [cs.LG])
    Source separation entails the ill-posed problem of retrieving a set of source signals observed through a mixing operator. Solving this problem requires prior knowledge, which is commonly incorporated by imposing regularity conditions on the source signals or implicitly learned in supervised or unsupervised methods from existing data. While data-driven methods have shown great promise in source separation, they are often dependent on large amounts of data, which rarely exists in planetary space missions. Considering this challenge, we propose an unsupervised source separation scheme for domains with limited data access that involves solving an optimization problem in the wavelet scattering representation space$\unicode{x2014}$an interpretable low-dimensional representation of stationary processes. We present a real-data example in which we remove transient thermally induced microtilts, known as glitches, from data recorded by a seismometer during NASA's InSight mission on Mars. Owing to the wavelet scattering covariances' ability to capture non-Gaussian properties of stochastic processes, we are able to separate glitches using only a few glitch-free data snippets.
    Multi-Level Visual Similarity Based Personalized Tourist Attraction Recommendation Using Geo-Tagged Photos. (arXiv:2109.08275v2 [cs.MM] UPDATED)
    Geo-tagged photo based tourist attraction recommendation can discover users' travel preferences from their taken photos, so as to recommend suitable tourist attractions to them. However, existing visual content based methods cannot fully exploit the user and tourist attraction information of photos to extract visual features, and do not differentiate the significances of different photos. In this paper, we propose multi-level visual similarity based personalized tourist attraction recommendation using geo-tagged photos (MEAL). MEAL utilizes the visual contents of photos and interaction behavior data to obtain the final embeddings of users and tourist attractions, which are then used to predict the visit probabilities. Specifically, by crossing the user and tourist attraction information of photos, we define four visual similarity levels and introduce a corresponding quintuplet loss to embed the visual contents of photos. In addition, to capture the significances of different photos, we exploit the self-attention mechanism to obtain the visual representations of users and tourist attractions. We conducted experiments on a dataset crawled from Flickr, and the experimental results proved the advantage of this method.
    Neural Temporal Point Process for Forecasting Higher Order and Directional Interactions. (arXiv:2301.12210v1 [cs.LG])
    Real-world systems are made of interacting entities that evolve with time. Creating models that can forecast interactions by learning the dynamics of entities is an important problem in numerous fields. Earlier works used dynamic graph models to achieve this. However, real-world interactions are more complex than pairwise, as they involve more than two entities, and many of these higher-order interactions have directional components. Examples of these can be seen in communication networks such as email exchanges that involve a sender, and multiple recipients, citation networks, where authors draw upon the work of others, and so on. In this paper, we solve the problem of higher-order directed interaction forecasting by proposing a deep neural network-based model \textit{Directed HyperNode Temporal Point Process} for directed hyperedge event forecasting, as hyperedge provides native framework for modeling relationships among the variable number of nodes. Our proposed technique reduces the search space of possible candidate hyperedges by first forecasting the nodes at which events will be observed, based on which it generates candidate hyperedges. To demonstrate the efficiency of our model, we curated four datasets and conducted an extensive empirical study. We believe that this is the first work that solves the problem of forecasting higher-order directional interactions.
    Prompt-Based Editing for Text Style Transfer. (arXiv:2301.11997v1 [cs.CL])
    Prompting approaches have been recently explored in text style transfer, where a textual prompt is used to query a pretrained language model to generate style-transferred texts word by word in an autoregressive manner. However, such a generation process is less controllable and early prediction errors may affect future word predictions. In this paper, we present a prompt-based editing approach for text style transfer. Specifically, we prompt a pretrained language model for style classification and use the classification probability to compute a style score. Then, we perform discrete search with word-level editing to maximize a comprehensive scoring function for the style-transfer task. In this way, we transform a prompt-based generation problem into a classification one, which is a training-free process and more controllable than the autoregressive generation of sentences. In our experiments, we performed both automatic and human evaluation on three style-transfer benchmark datasets, and show that our approach largely outperforms the state-of-the-art systems that have 20 times more parameters. Additional empirical analyses further demonstrate the effectiveness of our approach.
    AutoPEFT: Automatic Configuration Search for Parameter-Efficient Fine-Tuning. (arXiv:2301.12132v1 [cs.CL])
    Large pretrained language models have been widely used in downstream NLP tasks via task-specific fine-tuning. Recently, an array of Parameter-Efficient Fine-Tuning (PEFT) methods have also achieved strong task performance while updating a much smaller number of parameters compared to full model tuning. However, it is non-trivial to make informed per-task design choices (i.e., to create PEFT configurations) concerning the selection of PEFT architectures and modules, the number of tunable parameters, and even the layers in which the PEFT modules are inserted. Consequently, it is highly likely that the current, manually set PEFT configurations might be suboptimal for many tasks from the perspective of the performance-to-efficiency trade-off. To address the core question of the PEFT configuration selection that aims to control and maximise the balance between performance and parameter efficiency, we first define a rich configuration search space spanning multiple representative PEFT modules along with finer-grained configuration decisions over the modules (e.g., parameter budget, insertion layer). We then propose AutoPEFT, a novel framework to traverse this configuration space: it automatically configures multiple PEFT modules via high-dimensional Bayesian optimisation. We show the resource scalability and task transferability of AutoPEFT-found configurations, outperforming existing PEFT methods on average on the standard GLUE benchmark while conducting the configuration search on a single task. The per-task AutoPEFT-based configuration search even outperforms full-model fine-tuning.
    STEERING: Stein Information Directed Exploration for Model-Based Reinforcement Learning. (arXiv:2301.12038v1 [cs.LG])
    Directed Exploration is a crucial challenge in reinforcement learning (RL), especially when rewards are sparse. Information-directed sampling (IDS), which optimizes the information ratio, seeks to do so by augmenting regret with information gain. However, estimating information gain is computationally intractable or relies on restrictive assumptions which prohibit its use in many practical instances. In this work, we posit an alternative exploration incentive in terms of the integral probability metric (IPM) between a current estimate of the transition model and the unknown optimal, which under suitable conditions, can be computed in closed form with the kernelized Stein discrepancy (KSD). Based on KSD, we develop a novel algorithm STEERING: \textbf{STE}in information dir\textbf{E}cted exploration for model-based \textbf{R}einforcement Learn\textbf{ING}. To enable its derivation, we develop fundamentally new variants of KSD for discrete conditional distributions. We further establish that STEERING archives sublinear Bayesian regret, improving upon prior learning rates of information-augmented MBRL, IDS included. Experimentally, we show that the proposed algorithm is computationally affordable and outperforms several prior approaches.
    Vertex-based reachability analysis for verifying ReLU deep neural networks. (arXiv:2301.12001v1 [cs.LG])
    Neural networks achieved high performance over different tasks, i.e. image identification, voice recognition and other applications. Despite their success, these models are still vulnerable regarding small perturbations, which can be used to craft the so-called adversarial examples. Different approaches have been proposed to circumvent their vulnerability, including formal verification systems, which employ a variety of techniques, including reachability, optimization and search procedures, to verify that the model satisfies some property. In this paper we propose three novel reachability algorithms for verifying deep neural networks with ReLU activations. The first and third algorithms compute an over-approximation for the reachable set, whereas the second one computes the exact reachable set. Differently from previously proposed approaches, our algorithms take as input a V-polytope. Our experiments on the ACAS Xu problem show that the Exact Polytope Network Mapping (EPNM) reachability algorithm proposed in this work surpass the state-of-the-art results from the literature, specially in relation to other reachability methods.
    TIDo: Source-free Task Incremental Learning in Non-stationary Environments. (arXiv:2301.12055v1 [cs.LG])
    This work presents an incremental learning approach for autonomous agents to learn new tasks in a non-stationary environment. Updating a DNN model-based agent to learn new target tasks requires us to store past training data and needs a large labeled target task dataset. Few-shot task incremental learning methods overcome the limitation of labeled target datasets by adapting trained models to learn private target classes using a few labeled representatives and a large unlabeled target dataset. However, the methods assume that the source and target tasks are stationary. We propose a one-shot task incremental learning approach that can adapt to non-stationary source and target tasks. Our approach minimizes adversarial discrepancy between the model's feature space and incoming incremental data to learn an updated hypothesis. We also use distillation loss to reduce catastrophic forgetting of previously learned tasks. Finally, we use Gaussian prototypes to generate exemplar instances eliminating the need to store past training data. Unlike current work in task incremental learning, our model can learn both source and target task updates incrementally. We evaluate our method on various problem settings for incremental object detection and disease prediction model update. We evaluate our approach by measuring the performance of shared class and target private class prediction. Our results show that our approach achieved improved performance compared to existing state-of-the-art task incremental learning methods.
    Analyzing Robustness of the Deep Reinforcement Learning Algorithm in Ramp Metering Applications Considering False Data Injection Attack and Defense. (arXiv:2301.12036v1 [cs.LG])
    Decades of practices of ramp metering, by controlling downstream volume and smoothing the interweaving traffic, have proved that ramp metering can decrease total travel time, mitigate shockwaves, decrease rear-end collisions, reduce pollution, etc. Besides traditional methods like ALIENA algorithms, Deep Reinforcement Learning algorithms have been established recently to build finer control on ramp metering. However, those Deep Learning models may be venerable to adversarial attacks. Thus, it is important to investigate the robustness of those models under False Data Injection adversarial attack. Furthermore, algorithms capable of detecting anomaly data from clean data are the key to safeguard Deep Learning algorithm. In this study, an online algorithm that can distinguish adversarial data from clean data are tested. Results found that in most cases anomaly data can be distinguished from clean data, although their difference is too small to be manually distinguished by humans. In practice, whenever adversarial/hazardous data is detected, the system can fall back to a fixed control program, and experts should investigate the detectors status or security protocols afterwards before real damages happen.
    Autoencoder-Based Unequal Error Protection Codes. (arXiv:2301.12231v1 [cs.IT])
    Most of today's communication systems are designed to target reliable message recovery after receiving the entire encoded message (codeword). However, in many practical scenarios, the transmission process may be interrupted before receiving the complete codeword. This paper proposes a novel rateless autoencoder (AE)-based code design suitable for decoding the transmitted message before the noisy codeword is fully received. Using particular dropout strategies applied during the training process, rateless AE codes allow to trade off between decoding delay and reliability, providing a graceful improvement of the latter with each additionally received codeword symbol. The proposed rateless AEs significantly outperform the conventional AE designs for scenarios where it is desirable to trade off reliability for lower decoding delay.
    SEGA: Instructing Diffusion using Semantic Dimensions. (arXiv:2301.12247v1 [cs.CV])
    Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. We demonstrate SEGA's effectiveness on a variety of tasks and provide evidence for its versatility and flexibility.  ( 2 min )
    Violation-Aware Contextual Bayesian Optimization for Controller Performance Optimization with Unmodeled Constraints. (arXiv:2301.12099v1 [cs.LG])
    We study the problem of performance optimization of closed-loop control systems with unmodeled dynamics. Bayesian optimization (BO) has been demonstrated to be effective for improving closed-loop performance by automatically tuning controller gains or reference setpoints in a model-free manner. However, BO methods have rarely been tested on dynamical systems with unmodeled constraints and time-varying ambient conditions. In this paper, we propose a violation-aware contextual BO algorithm (VACBO) that optimizes closed-loop performance while simultaneously learning constraint-feasible solutions under time-varying ambient conditions. Unlike classical constrained BO methods which allow unlimited constraint violations, or 'safe' BO algorithms that are conservative and try to operate with near-zero violations, we allow budgeted constraint violations to improve constraint learning and accelerate optimization. We demonstrate the effectiveness of our proposed VACBO method for energy minimization of industrial vapor compression systems under time-varying ambient temperature and humidity.
    Reachability Analysis of Neural Network Control Systems. (arXiv:2301.12100v1 [cs.LG])
    Neural network controllers (NNCs) have shown great promise in autonomous and cyber-physical systems. Despite the various verification approaches for neural networks, the safety analysis of NNCs remains an open problem. Existing verification approaches for neural network control systems (NNCSs) either can only work on a limited type of activation functions, or result in non-trivial over-approximation errors with time evolving. This paper proposes a verification framework for NNCS based on Lipschitzian optimisation, called DeepNNC. We first prove the Lipschitz continuity of closed-loop NNCSs by unrolling and eliminating the loops. We then reveal the working principles of applying Lipschitzian optimisation on NNCS verification and illustrate it by verifying an adaptive cruise control model. Compared to state-of-the-art verification approaches, DeepNNC shows superior performance in terms of efficiency and accuracy over a wide range of NNCs. We also provide a case study to demonstrate the capability of DeepNNC to handle a real-world, practical, and complex system. Our tool \textbf{DeepNNC} is available at \url{https://github.com/TrustAI/DeepNNC}.
    Chaos as an interpretable benchmark for forecasting and data-driven modelling. (arXiv:2110.05266v2 [cs.LG] UPDATED)
    The striking fractal geometry of strange attractors underscores the generative nature of chaos: like probability distributions, chaotic systems can be repeatedly measured to produce arbitrarily-detailed information about the underlying attractor. Chaotic systems thus pose a unique challenge to modern statistical learning techniques, while retaining quantifiable mathematical properties that make them controllable and interpretable as benchmarks. Here, we present a growing database currently comprising 131 known chaotic dynamical systems spanning fields such as astrophysics, climatology, and biochemistry. Each system is paired with precomputed multivariate and univariate time series. Our dataset has comparable scale to existing static time series databases; however, our systems can be re-integrated to produce additional datasets of arbitrary length and granularity. Our dataset is annotated with known mathematical properties of each system, and we perform feature analysis to broadly categorize the diverse dynamics present across the collection. Chaotic systems inherently challenge forecasting models, and across extensive benchmarks we correlate forecasting performance with the degree of chaos present. We also exploit the unique generative properties of our dataset in several proof-of-concept experiments: surrogate transfer learning to improve time series classification, importance sampling to accelerate model training, and benchmarking symbolic regression algorithms.
    One-Shot Adaptation of GAN in Just One CLIP. (arXiv:2203.09301v4 [cs.CV] UPDATED)
    There are many recent research efforts to fine-tune a pre-trained generator with a few target images to generate images of a novel domain. Unfortunately, these methods often suffer from overfitting or under-fitting when fine-tuned with a single target image. To address this, here we present a novel single-shot GAN adaptation method through unified CLIP space manipulations. Specifically, our model employs a two-step training strategy: reference image search in the source generator using a CLIP-guided latent optimization, followed by generator fine-tuning with a novel loss function that imposes CLIP space consistency between the source and adapted generators. To further improve the adapted model to produce spatially consistent samples with respect to the source generator, we also propose contrastive regularization for patchwise relationships in the CLIP space. Experimental results show that our model generates diverse outputs with the target texture and outperforms the baseline models both qualitatively and quantitatively. Furthermore, we show that our CLIP space manipulation strategy allows more effective attribute editing.  ( 2 min )
    Node Injection for Class-specific Network Poisoning. (arXiv:2301.12277v1 [cs.LG])
    Graph Neural Networks (GNNs) are powerful in learning rich network representations that aid the performance of downstream tasks. However, recent studies showed that GNNs are vulnerable to adversarial attacks involving node injection and network perturbation. Among these, node injection attacks are more practical as they don't require manipulation in the existing network and can be performed more realistically. In this paper, we propose a novel problem statement - a class-specific poison attack on graphs in which the attacker aims to misclassify specific nodes in the target class into a different class using node injection. Additionally, nodes are injected in such a way that they camouflage as benign nodes. We propose NICKI, a novel attacking strategy that utilizes an optimization-based approach to sabotage the performance of GNN-based node classifiers. NICKI works in two phases - it first learns the node representation and then generates the features and edges of the injected nodes. Extensive experiments and ablation studies on four benchmark networks show that NICKI is consistently better than four baseline attacking strategies for misclassifying nodes in the target class. We also show that the injected nodes are properly camouflaged as benign, thus making the poisoned graph indistinguishable from its clean version w.r.t various topological properties.
    Inequality Constrained Stochastic Nonlinear Optimization via Active-Set Sequential Quadratic Programming. (arXiv:2109.11502v3 [math.OC] UPDATED)
    We study nonlinear optimization problems with a stochastic objective and deterministic equality and inequality constraints, which emerge in numerous applications including finance, manufacturing, power systems and, recently, deep neural networks. We propose an active-set stochastic sequential quadratic programming (StoSQP) algorithm that utilizes a differentiable exact augmented Lagrangian as the merit function. The algorithm adaptively selects the penalty parameters of the augmented Lagrangian and performs a stochastic line search to decide the stepsize. The global convergence is established: for any initialization, the KKT residuals converge to zero almost surely. Our algorithm and analysis further develop the prior work of Na et al., (2022). Specifically, we allow nonlinear inequality constraints without requiring the strict complementary condition; refine some of the designs in Na et al., (2022) such as the feasibility error condition and the monotonically increasing sample size; strengthen the global convergence guarantee; and improve the sample complexity on the objective Hessian. We demonstrate the performance of the designed algorithm on a subset of nonlinear problems collected in CUTEst test set and on constrained logistic regression problems.
    Better Uncertainty Calibration via Proper Scores for Classification and Beyond. (arXiv:2203.07835v3 [cs.LG] UPDATED)
    With model trustworthiness being crucial for sensitive real-world applications, practitioners are putting more and more focus on improving the uncertainty calibration of deep neural networks. Calibration errors are designed to quantify the reliability of probabilistic predictions but their estimators are usually biased and inconsistent. In this work, we introduce the framework of proper calibration errors, which relates every calibration error to a proper score and provides a respective upper bound with optimal estimation properties. This relationship can be used to reliably quantify the model calibration improvement. We theoretically and empirically demonstrate the shortcomings of commonly used estimators compared to our approach. Due to the wide applicability of proper scores, this gives a natural extension of recalibration beyond classification.
    CAPITAL: Optimal Subgroup Identification via Constrained Policy Tree Search. (arXiv:2110.05636v3 [stat.ML] UPDATED)
    Personalized medicine, a paradigm of medicine tailored to a patient's characteristics, is an increasingly attractive field in health care. An important goal of personalized medicine is to identify a subgroup of patients, based on baseline covariates, that benefits more from the targeted treatment than other comparative treatments. Most of the current subgroup identification methods only focus on obtaining a subgroup with an enhanced treatment effect without paying attention to subgroup size. Yet, a clinically meaningful subgroup learning approach should identify the maximum number of patients who can benefit from the better treatment. In this paper, we present an optimal subgroup selection rule (SSR) that maximizes the number of selected patients, and in the meantime, achieves the pre-specified clinically meaningful mean outcome, such as the average treatment effect. We derive two equivalent theoretical forms of the optimal SSR based on the contrast function that describes the treatment-covariates interaction in the outcome. We further propose a ConstrAined PolIcy Tree seArch aLgorithm (CAPITAL) to find the optimal SSR within the interpretable decision tree class. The proposed method is flexible to handle multiple constraints that penalize the inclusion of patients with negative treatment effects, and to address time to event data using the restricted mean survival time as the clinically interesting mean outcome. Extensive simulations, comparison studies, and real data applications are conducted to demonstrate the validity and utility of our method.
    A Closer Look at Few-shot Classification Again. (arXiv:2301.12246v1 [cs.LG])
    Few-shot classification consists of a training phase where a model is learned on a relatively large dataset and an adaptation phase where the learned model is adapted to previously-unseen tasks with limited labeled samples. In this paper, we empirically prove that the training algorithm and the adaptation algorithm can be completely disentangled, which allows algorithm analysis and design to be done individually for each phase. Our meta-analysis for each phase reveals several interesting insights that may help better understand key aspects of few-shot classification and connections with other fields such as visual representation learning and transfer learning. We hope the insights and research challenges revealed in this paper can inspire future work in related directions.
    Simulation-Based Inference with Waldo: Confidence Regions by Leveraging Prediction Algorithms or Posterior Estimators for Inverse Problems. (arXiv:2205.15680v3 [stat.ML] UPDATED)
    Predictive algorithms, such as deep neural networks (DNNs), are used in many domain sciences to directly estimate internal parameters of interest in simulator-based models, especially in settings where the observations include images or other complex high-dimensional data. In parallel, modern neural density estimators, such as normalizing flows, are becoming increasingly popular for uncertainty quantification, especially when both parameters and observations are high-dimensional. However, parameter inference is an inverse problem and not a prediction task; thus, an open challenge is to construct conditionally valid and precise confidence regions, with a guaranteed probability of covering the true parameters of the data-generating process, no matter what the (unknown) parameter values are, and without relying on large-sample theory. Many simulator-based inference (SBI) methods are indeed known to produce biased or overly confident parameter regions, yielding misleading uncertainty estimates. This paper presents WALDO, a novel method for constructing confidence regions with finite-sample conditional validity by leveraging prediction algorithms or posterior estimators that are currently widely adopted in SBI. WALDO reframes the well-known Wald test statistic, and uses a computationally efficient regression-based machinery for classical Neyman inversion of hypothesis tests. We apply our method to a recent high-energy physics problem, where prediction with DNNs has previously led to estimates with prediction bias. We also illustrate how our approach can correct overly confident posterior regions computed with normalizing flows.
    Improved knowledge distillation by utilizing backward pass knowledge in neural networks. (arXiv:2301.12006v1 [cs.LG])
    Knowledge distillation (KD) is one of the prominent techniques for model compression. In this method, the knowledge of a large network (teacher) is distilled into a model (student) with usually significantly fewer parameters. KD tries to better-match the output of the student model to that of the teacher model based on the knowledge extracts from the forward pass of the teacher network. Although conventional KD is effective for matching the two networks over the given data points, there is no guarantee that these models would match in other areas for which we do not have enough training samples. In this work, we address that problem by generating new auxiliary training samples based on extracting knowledge from the backward pass of the teacher in the areas where the student diverges greatly from the teacher. We compute the difference between the teacher and the student and generate new data samples that maximize the divergence. This is done by perturbing data samples in the direction of the gradient of the difference between the student and the teacher. Augmenting the training set by adding this auxiliary improves the performance of KD significantly and leads to a closer match between the student and the teacher. Using this approach, when data samples come from a discrete domain, such as applications of natural language processing (NLP) and language understanding, is not trivial. However, we show how this technique can be used successfully in such applications. We evaluated the performance of our method on various tasks in computer vision and NLP domains and got promising results.
    Controlling Steering with Energy-Based Models. (arXiv:2301.12264v1 [cs.RO])
    So-called implicit behavioral cloning with energy-based models has shown promising results in robotic manipulation tasks. We tested if the method's advantages carry on to controlling the steering of a real self-driving car with an end-to-end driving model. We performed an extensive comparison of the implicit behavioral cloning approach with explicit baseline approaches, all sharing the same neural network backbone architecture. Baseline explicit models were trained with regression (MAE) loss, classification loss (softmax and cross-entropy on a discretization), or as mixture density networks (MDN). While models using the energy-based formulation performed comparably to baseline approaches in terms of safety driver interventions, they had a higher whiteness measure, indicating higher jerk. To alleviate this, we show two methods that can be used to improve the smoothness of steering. We confirmed that energy-based models handle multimodalities slightly better than simple regression, but this did not translate to significantly better driving ability. We argue that the steering-only road-following task has too few multimodalities to benefit from energy-based models. This shows that applying implicit behavioral cloning to real-world tasks can be challenging, and further investigation is needed to bring out the theoretical advantages of energy-based models.
    Optimization for Amortized Inverse Problems. (arXiv:2210.13983v3 [cs.LG] UPDATED)
    Incorporating a deep generative model as the prior distribution in inverse problems has established substantial success in reconstructing images from corrupted observations. Notwithstanding, the existing optimization approaches use gradient descent largely without adapting to the non-convex nature of the problem and can be sensitive to initial values, impeding further performance improvement. In this paper, we propose an efficient amortized optimization scheme for inverse problems with a deep generative prior. Specifically, the optimization task with high degrees of difficulty is decomposed into optimizing a sequence of much easier ones. We provide a theoretical guarantee of the proposed algorithm and empirically validate it on different inverse problems. As a result, our approach outperforms baseline methods qualitatively and quantitatively by a large margin.
    Policy Gradient Methods for Distortion Risk Measures. (arXiv:2107.04422v6 [cs.LG] UPDATED)
    We propose policy gradient algorithms which learn risk-sensitive policies in a reinforcement learning (RL) framework. Our proposed algorithms maximize the distortion risk measure (DRM) of the cumulative reward in an episodic Markov decision process in on-policy as well as off-policy RL settings. We derive a variant of the policy gradient theorem that caters to the DRM objective, and use this theorem in conjunction with a likelihood ratio-based gradient estimation scheme. We derive non-asymptotic bounds that establish the convergence of our proposed algorithms to an approximate stationary point of the DRM objective.  ( 2 min )
    Lossy Image Compression with Conditional Diffusion Models. (arXiv:2209.06950v4 [eess.IV] UPDATED)
    This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models. The approach relies on the transform coding paradigm, where an image is mapped into a latent space for entropy coding and, from there, mapped back to the data space for reconstruction. In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model. Our approach thus introduces an additional ``content'' latent variable on which the reverse diffusion process is conditioned and uses this variable to store information about the image. The remaining ``texture'' latent variables characterizing the diffusion process are synthesized (stochastically or deterministically) at decoding time. We show that the model's performance can be tuned toward perceptual metrics of interest. Our extensive experiments involving five datasets and sixteen image quality assessment metrics show that our approach yields the strongest reported FID scores while also yielding competitive performance with state-of-the-art models in several SIM-based reference metrics.  ( 2 min )
    Refining Generative Process with Discriminator Guidance in Score-based Diffusion Models. (arXiv:2211.17091v2 [cs.CV] UPDATED)
    While there are many score-based models with various diffusing strategies as well as many numerical schemes of the denoising process, only a few works have explored the score part of the generative SDE. This paper introduces a new generative SDE with score adjustment using an auxiliary discriminator. The goal is to improve the original generative process of a pre-trained diffusion model by estimating the gap between the pre-trained score estimation and the true data score. This is done by training a discriminator that classifies diffused real data and diffused sample data. The gap estimation is then used to adjust the pre-trained score network. In experiments, the method enables new SOTA FIDs of 1.77/1.64 on unconditional/conditional CIFAR-10, and new SOTA FID/sFID of 3.18/4.53 on ImageNet 256x256.  ( 2 min )
    Online Markov Decision Processes with Non-oblivious Strategic Adversary. (arXiv:2110.03604v3 [cs.LG] UPDATED)
    We study a novel setting in Online Markov Decision Processes (OMDPs) where the loss function is chosen by a non-oblivious strategic adversary who follows a no-external regret algorithm. In this setting, we first demonstrate that MDP-Expert, an existing algorithm that works well with oblivious adversaries can still apply and achieve a policy regret bound of $\mathcal{O}(\sqrt{T \log(L)}+\tau^2\sqrt{ T \log(|A|)})$ where $L$ is the size of adversary's pure strategy set and $|A|$ denotes the size of agent's action space. Considering real-world games where the support size of a NE is small, we further propose a new algorithm: MDP-Online Oracle Expert (MDP-OOE), that achieves a policy regret bound of $\mathcal{O}(\sqrt{T\log(L)}+\tau^2\sqrt{ T k \log(k)})$ where $k$ depends only on the support size of the NE. MDP-OOE leverages the key benefit of Double Oracle in game theory and thus can solve games with prohibitively large action space. Finally, to better understand the learning dynamics of no-regret methods, under the same setting of no-external regret adversary in OMDPs, we introduce an algorithm that achieves last-round convergence result to a NE. To our best knowledge, this is first work leading to the last iteration result in OMDPs.  ( 2 min )
    A semi-agnostic ansatz with variable structure for quantum machine learning. (arXiv:2103.06712v3 [quant-ph] UPDATED)
    Quantum machine learning (QML) offers a powerful, flexible paradigm for programming near-term quantum computers, with applications in chemistry, metrology, materials science, data science, and mathematics. Here, one trains an ansatz, in the form of a parameterized quantum circuit, to accomplish a task of interest. However, challenges have recently emerged suggesting that deep ansatzes are difficult to train, due to flat training landscapes caused by randomness or by hardware noise. This motivates our work, where we present a variable structure approach to build ansatzes for QML. Our approach, called VAns (Variable Ansatz), applies a set of rules to both grow and (crucially) remove quantum gates in an informed manner during the optimization. Consequently, VAns is ideally suited to mitigate trainability and noise-related issues by keeping the ansatz shallow. We employ VAns in the variational quantum eigensolver for condensed matter and quantum chemistry applications, in the quantum autoencoder for data compression and in unitary compilation problems showing successful results in all cases.  ( 2 min )
    Perona: Robust Infrastructure Fingerprinting for Resource-Efficient Big Data Analytics. (arXiv:2211.08227v2 [cs.DC] UPDATED)
    Choosing a good resource configuration for big data analytics applications can be challenging, especially in cloud environments. Automated approaches are desirable as poor decisions can reduce performance and raise costs. The majority of existing automated approaches either build performance models from previous workload executions or conduct iterative resource configuration profiling until a near-optimal solution has been found. In doing so, they only obtain an implicit understanding of the underlying infrastructure, which is difficult to transfer to alternative infrastructures and, thus, profiling and modeling insights are not sustained beyond very specific situations. We present Perona, a novel approach to robust infrastructure fingerprinting for usage in the context of big data analytics. Perona employs common sets and configurations of benchmarking tools for target resources, so that resulting benchmark metrics are directly comparable and ranking is enabled. Insignificant benchmark metrics are discarded by learning a low-dimensional representation of the input metric vector, and previous benchmark executions are taken into consideration for context-awareness as well, allowing to detect resource degradation. We evaluate our approach both on data gathered from our own experiments as well as within related works for resource configuration optimization, demonstrating that Perona captures the characteristics from benchmark runs in a compact manner and produces representations that can be used directly.  ( 2 min )
    Multi-Center Federated Learning: Clients Clustering for Better Personalization. (arXiv:2108.08647v3 [cs.LG] UPDATED)
    Personalized decision-making can be implemented in a Federated learning (FL) framework that can collaboratively train a decision model by extracting knowledge across intelligent clients, e.g. smartphones or enterprises. FL can mitigate the data privacy risk of collaborative training since it merely collects local gradients from users without access to their data. However, FL is fragile in the presence of statistical heterogeneity that is commonly encountered in personalized decision-making, e.g., non-IID data over different clients. Existing FL approaches usually update a single global model to capture the shared knowledge of all users by aggregating their gradients, regardless of the discrepancy between their data distributions. By comparison, a mixture of multiple global models could capture the heterogeneity across various clients if assigning the client to different global models (i.e., centers) in FL. To this end, we propose a novel multi-center aggregation mechanism to cluster clients using their models' parameters. It learns multiple global models from data as the cluster centers, and simultaneously derives the optimal matching between users and centers. We then formulate it as an optimization problem that can be efficiently solved by a stochastic expectation maximization (EM) algorithm. Experiments on multiple benchmark datasets of FL show that our method outperforms several popular baseline methods. The experimental source codes are publicly available on the Github repository https://github.com/mingxuts/multi-center-fed-learning .  ( 2 min )
    Neural Integral Equations. (arXiv:2209.15190v3 [cs.LG] UPDATED)
    Integral equations (IEs) are equations that model spatiotemporal systems with non-local interactions. They have found important applications throughout theoretical and applied sciences, including in physics, chemistry, biology, and engineering. While efficient algorithms exist for solving given IEs, no method exists that can learn an IE and its associated dynamics from data alone. In this paper, we introduce Neural Integral Equations (NIE), a method that learns an unknown integral operator from data through an IE solver. We also introduce Attentional Neural Integral Equations (ANIE), where the integral is replaced by self-attention, which improves scalability and model capacity. We demonstrate that (A)NIE outperforms other methods in both speed and accuracy on several benchmark tasks in ODE, PDE, and IE systems of synthetic and real-world data.  ( 2 min )
    Async-HFL: Efficient and Robust Asynchronous Federated Learning in Hierarchical IoT Networks. (arXiv:2301.06646v2 [cs.LG] UPDATED)
    Federated Learning (FL) has gained increasing interest in recent years as a distributed on-device learning paradigm. However, multiple challenges remain to be addressed for deploying FL in real-world Internet-of-Things (IoT) networks with hierarchies. Although existing works have proposed various approaches to account data heterogeneity, system heterogeneity, unexpected stragglers and scalibility, none of them provides a systematic solution to address all of the challenges in a hierarchical and unreliable IoT network. In this paper, we propose an asynchronous and hierarchical framework (Async-HFL) for performing FL in a common three-tier IoT network architecture. In response to the largely varied delays, Async-HFL employs asynchronous aggregations at both the gateway and the cloud levels thus avoids long waiting time. To fully unleash the potential of Async-HFL in converging speed under system heterogeneities and stragglers, we design device selection at the gateway level and device-gateway association at the cloud level. Device selection chooses edge devices to trigger local training in real-time while device-gateway association determines the network topology periodically after several cloud epochs, both satisfying bandwidth limitation. We evaluate Async-HFL's convergence speedup using large-scale simulations based on ns-3 and a network topology from NYCMesh. Our results show that Async-HFL converges 1.08-1.31x faster in wall-clock time and saves up to 21.6% total communication cost compared to state-of-the-art asynchronous FL algorithms (with client selection). We further validate Async-HFL on a physical deployment and observe robust convergence under unexpected stragglers.  ( 2 min )
    Machine Learning Accelerators in 2.5D Chiplet Platforms with Silicon Photonics. (arXiv:2301.12252v1 [cs.AR])
    Domain-specific machine learning (ML) accelerators such as Google's TPU and Apple's Neural Engine now dominate CPUs and GPUs for energy-efficient ML processing. However, the evolution of electronic accelerators is facing fundamental limits due to the limited computation density of monolithic processing chips and the reliance on slow metallic interconnects. In this paper, we present a vision of how optical computation and communication can be integrated into 2.5D chiplet platforms to drive an entirely new class of sustainable and scalable ML hardware accelerators. We describe how cross-layer design and fabrication of optical devices, circuits, and architectures, and hardware/software codesign can help design efficient photonics-based 2.5D chiplet platforms to accelerate emerging ML workloads.  ( 2 min )
    Mutual Wasserstein Discrepancy Minimization for Sequential Recommendation. (arXiv:2301.12197v1 [cs.LG])
    Self-supervised sequential recommendation significantly improves recommendation performance by maximizing mutual information with well-designed data augmentations. However, the mutual information estimation is based on the calculation of Kullback Leibler divergence with several limitations, including asymmetrical estimation, the exponential need of the sample size, and training instability. Also, existing data augmentations are mostly stochastic and can potentially break sequential correlations with random modifications. These two issues motivate us to investigate an alternative robust mutual information measurement capable of modeling uncertainty and alleviating KL divergence limitations. To this end, we propose a novel self-supervised learning framework based on Mutual WasserStein discrepancy minimization MStein for the sequential recommendation. We propose the Wasserstein Discrepancy Measurement to measure the mutual information between augmented sequences. Wasserstein Discrepancy Measurement builds upon the 2-Wasserstein distance, which is more robust, more efficient in small batch sizes, and able to model the uncertainty of stochastic augmentation processes. We also propose a novel contrastive learning loss based on Wasserstein Discrepancy Measurement. Extensive experiments on four benchmark datasets demonstrate the effectiveness of MStein over baselines. More quantitative analyses show the robustness against perturbations and training efficiency in batch size. Finally, improvements analysis indicates better representations of popular users or items with significant uncertainty. The source code is at https://github.com/zfan20/MStein.  ( 2 min )
    Stochastic Dimension-reduced Second-order Methods for Policy Optimization. (arXiv:2301.12174v1 [math.OC])
    In this paper, we propose several new stochastic second-order algorithms for policy optimization that only require gradient and Hessian-vector product in each iteration, making them computationally efficient and comparable to policy gradient methods. Specifically, we propose a dimension-reduced second-order method (DR-SOPO) which repeatedly solves a projected two-dimensional trust region subproblem. We show that DR-SOPO obtains an $\mathcal{O}(\epsilon^{-3.5})$ complexity for reaching approximate first-order stationary condition and certain subspace second-order stationary condition. In addition, we present an enhanced algorithm (DVR-SOPO) which further improves the complexity to $\mathcal{O}(\epsilon^{-3})$ based on the variance reduction technique. Preliminary experiments show that our proposed algorithms perform favorably compared with stochastic and variance-reduced policy gradient methods.  ( 2 min )
    Adapting Neural Link Predictors for Complex Query Answering. (arXiv:2301.12313v1 [cs.LG])
    Answering complex queries on incomplete knowledge graphs is a challenging task where a model needs to answer complex logical queries in the presence of missing knowledge. Recently, Arakelyan et al. (2021); Minervini et al. (2022) showed that neural link predictors could also be used for answering complex queries: their Continuous Query Decomposition (CQD) method works by decomposing complex queries into atomic sub-queries, answers them using neural link predictors and aggregates their scores via t-norms for ranking the answers to each complex query. However, CQD does not handle negations and only uses the training signal from atomic training queries: neural link prediction scores are not calibrated to interact together via fuzzy logic t-norms during complex query answering. In this work, we propose to address this problem by training a parameter-efficient score adaptation model to re-calibrate neural link prediction scores: this new component is trained on complex queries by back-propagating through the complex query-answering process. Our method, CQD$^{A}$, produces significantly more accurate results than current state-of-the-art methods, improving from $34.4$ to $35.1$ Mean Reciprocal Rank values averaged across all datasets and query types while using $\leq 35\%$ of the available training query types. We further show that CQD$^{A}$ is data-efficient, achieving competitive results with only $1\%$ of the training data, and robust in out-of-domain evaluations.
    Robust Stochastic Linear Contextual Bandits Under Adversarial Attacks. (arXiv:2106.02978v3 [stat.ML] UPDATED)
    Stochastic linear contextual bandit algorithms have substantial applications in practice, such as recommender systems, online advertising, clinical trials, etc. Recent works show that optimal bandit algorithms are vulnerable to adversarial attacks and can fail completely in the presence of attacks. Existing robust bandit algorithms only work for the non-contextual setting under the attack of rewards and cannot improve the robustness in the general and popular contextual bandit environment. In addition, none of the existing methods can defend against attacked context. In this work, we provide the first robust bandit algorithm for stochastic linear contextual bandit setting under a fully adaptive and omniscient attack with sub-linear regret. Our algorithm not only works under the attack of rewards, but also under attacked context. Moreover, it does not need any information about the attack budget or the particular form of the attack. We provide theoretical guarantees for our proposed algorithm and show by experiments that our proposed algorithm improves the robustness against various kinds of popular attacks.
    Unbiased and Efficient Self-Supervised Incremental Contrastive Learning. (arXiv:2301.12104v1 [cs.LG])
    Contrastive Learning (CL) has been proved to be a powerful self-supervised approach for a wide range of domains, including computer vision and graph representation learning. However, the incremental learning issue of CL has rarely been studied, which brings the limitation in applying it to real-world applications. Contrastive learning identifies the samples with the negative ones from the noise distribution that changes in the incremental scenarios. Therefore, only fitting the change of data without noise distribution causes bias, and directly retraining results in low efficiency. To bridge this research gap, we propose a self-supervised Incremental Contrastive Learning (ICL) framework consisting of (i) a novel Incremental InfoNCE (NCE-II) loss function by estimating the change of noise distribution for old data to guarantee no bias with respect to the retraining, (ii) a meta-optimization with deep reinforced Learning Rate Learning (LRL) mechanism which can adaptively learn the learning rate according to the status of the training processes and achieve fast convergence which is critical for incremental learning. Theoretically, the proposed ICL is equivalent to retraining, which is based on solid mathematical derivation. In practice, extensive experiments in different domains demonstrate that, without retraining a new model, ICL achieves up to 16.7x training speedup and 16.8x faster convergence with competitive results.  ( 2 min )
    EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval. (arXiv:2301.12005v1 [cs.LG])
    Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR). In this paper, we aim to improve distillation methods that pave the way for the deployment of such models in practice. The proposed distillation approach supports both retrieval and re-ranking stages and crucially leverages the relative geometry among queries and documents learned by the large teacher model. It goes beyond existing distillation methods in the IR literature, which simply rely on the teacher's scalar scores over the training data, on two fronts: providing stronger signals about local geometry via embedding matching and attaining better coverage of data manifold globally via query generation. Embedding matching provides a stronger signal to align the representations of the teacher and student models. At the same time, query generation explores the data manifold to reduce the discrepancies between the student and teacher where training data is sparse. Our distillation approach is theoretically justified and applies to both dual encoder (DE) and cross-encoder (CE) models. Furthermore, for distilling a CE model to a DE model via embedding matching, we propose a novel dual pooling-based scorer for the CE model that facilitates a distillation-friendly embedding geometry, especially for DE student models.  ( 2 min )
    Deep Operator Learning Lessens the Curse of Dimensionality for PDEs. (arXiv:2301.12227v1 [cs.LG])
    Deep neural networks (DNNs) have seen tremendous success in many fields and their developments in PDE-related problems are rapidly growing. This paper provides an estimate for the generalization error of learning Lipschitz operators over Banach spaces using DNNs with applications to various PDE solution operators. The goal is to specify DNN width, depth, and the number of training samples needed to guarantee a certain testing error. Under mild assumptions on data distributions or operator structures, our analysis shows that deep operator learning can have a relaxed dependence on the discretization resolution of PDEs and, hence, lessen the curse of dimensionality in many PDE-related problems. We apply our results to various PDEs, including elliptic equations, parabolic equations, and Burgers equations.
    Meta-Learning Parameterized Skills. (arXiv:2206.03597v2 [cs.LG] UPDATED)
    We propose a novel parameterized skill-learning algorithm that aims to learn transferable parameterized skills and synthesize them into a new action space that supports efficient learning in long-horizon tasks. We propose to leverage off-policy Meta-RL combined with a trajectory-centric smoothness term to learn a set of parameterized skills. Our agent can use these learned skills to construct a three-level hierarchical framework that models a Temporally-extended Parameterized Action Markov Decision Process. We empirically demonstrate that the proposed algorithms enable an agent to solve a set of difficult long-horizon (obstacle-course and robot manipulation) tasks.  ( 2 min )
    MetaNO: How to Transfer Your Knowledge on Learning Hidden Physics. (arXiv:2301.12095v1 [cs.LG])
    Gradient-based meta-learning methods have primarily been applied to classical machine learning tasks such as image classification. Recently, PDE-solving deep learning methods, such as neural operators, are starting to make an important impact on learning and predicting the response of a complex physical system directly from observational data. Since the data acquisition in this context is commonly challenging and costly, the call of utilization and transfer of existing knowledge to new and unseen physical systems is even more acute. Herein, we propose a novel meta-learning approach for neural operators, which can be seen as transferring the knowledge of solution operators between governing (unknown) PDEs with varying parameter fields. Our approach is a provably universal solution operator for multiple PDE solving tasks, with a key theoretical observation that underlying parameter fields can be captured in the first layer of neural operator models, in contrast to typical final-layer transfer in existing meta-learning methods. As applications, we demonstrate the efficacy of our proposed approach on PDE-based datasets and a real-world material modeling problem, illustrating that our method can handle complex and nonlinear physical response learning tasks while greatly improving the sampling efficiency in unseen tasks.  ( 2 min )
    Variational Neural Networks. (arXiv:2207.01524v3 [cs.LG] UPDATED)
    Bayesian Neural Networks (BNNs) provide a tool to estimate the uncertainty of a neural network by considering a distribution over weights and sampling different models for each input. In this paper, we propose a method for uncertainty estimation in neural networks which, instead of considering a distribution over weights, samples outputs of each layer from a corresponding Gaussian distribution, parametrized by the predictions of mean and variance sub-layers. In uncertainty quality estimation experiments, we show that the proposed method achieves better uncertainty quality than other single-bin Bayesian Model Averaging methods, such as Monte Carlo Dropout or Bayes By Backpropagation methods.  ( 2 min )
    Selecting Models based on the Risk of Damage Caused by Adversarial Attacks. (arXiv:2301.12151v1 [cs.LG])
    Regulation, legal liabilities, and societal concerns challenge the adoption of AI in safety and security-critical applications. One of the key concerns is that adversaries can cause harm by manipulating model predictions without being detected. Regulation hence demands an assessment of the risk of damage caused by adversaries. Yet, there is no method to translate this high-level demand into actionable metrics that quantify the risk of damage. In this article, we propose a method to model and statistically estimate the probability of damage arising from adversarial attacks. We show that our proposed estimator is statistically consistent and unbiased. In experiments, we demonstrate that the estimation results of our method have a clear and actionable interpretation and outperform conventional metrics. We then show how operators can use the estimation results to reliably select the model with the lowest risk.  ( 2 min )
    Temporal Context Mining for Learned Video Compression. (arXiv:2111.13850v2 [cs.CV] UPDATED)
    We address end-to-end learned video compression with a special focus on better learning and utilizing temporal contexts. For temporal context mining, we propose to store not only the previously reconstructed frames, but also the propagated features into the generalized decoded picture buffer. From the stored propagated features, we propose to learn multi-scale temporal contexts, and re-fill the learned temporal contexts into the modules of our compression scheme, including the contextual encoder-decoder, the frame generator, and the temporal context encoder. Our scheme discards the parallelization-unfriendly auto-regressive entropy model to pursue a more practical decoding time. We compare our scheme with x264 and x265 (representing industrial software for H.264 and H.265, respectively) as well as the official reference software for H.264, H.265, and H.266 (JM, HM, and VTM, respectively). When intra period is 32 and oriented to PSNR, our scheme outperforms H.265--HM by 14.4% bit rate saving; when oriented to MS-SSIM, our scheme outperforms H.266--VTM by 21.1% bit rate saving.  ( 2 min )
    Complexity-Based Prompting for Multi-Step Reasoning. (arXiv:2210.00720v2 [cs.CL] UPDATED)
    We study the task of prompting large-scale language models to perform multi-step reasoning. Existing work shows that when prompted with a chain of thoughts (CoT), sequences of short sentences describing intermediate reasoning steps towards a final answer, large language models can generate new reasoning chains and predict answers for new inputs. A central question is which reasoning examples make the most effective prompts. In this work, we propose complexity-based prompting, a simple and effective example selection scheme for multi-step reasoning. We show that prompts with higher reasoning complexity, i.e., chains with more reasoning steps, achieve substantially better performance on multi-step reasoning tasks over strong baselines. We further extend our complexity-based criteria from prompting (selecting inputs) to decoding (selecting outputs), where we sample multiple reasoning chains from the model, then choose the majority of generated answers from complex reasoning chains (over simple chains). When used to prompt GPT-3 and Codex, our approach substantially improves multi-step reasoning accuracy and achieves new state-of-the-art (SOTA) performance on three math benchmarks (GSM8K, MultiArith, and MathQA) and two BigBenchHard tasks (Date Understanding and Penguins), with an average +5.3 and up to +18 accuracy improvements. Compared with existing example selection schemes like manual tuning or retrieval-based selection, selection based on reasoning complexity is intuitive, easy to implement, and annotation-efficient. Further results demonstrate the robustness of performance gains from complex prompts under format perturbation and distribution shift.  ( 2 min )
    Graph Neural Networks Intersect Probabilistic Graphical Models: A Survey. (arXiv:2206.06089v3 [cs.AI] UPDATED)
    Graphs are a powerful data structure to represent relational data and are widely used to describe complex real-world data structures. Probabilistic Graphical Models (PGMs) have been well-developed in the past years to mathematically model real-world scenarios in compact graphical representations of distributions of variables. Graph Neural Networks (GNNs) are new inference methods developed in recent years and are attracting growing attention due to their effectiveness and flexibility in solving inference and learning problems over graph-structured data. These two powerful approaches have different advantages in capturing relations from observations and how they conduct message passing, and they can benefit each other in various tasks. In this survey, we broadly study the intersection of GNNs and PGMs. Specifically, we first discuss how GNNs can benefit from learning structured representations in PGMs, generate explainable predictions by PGMs, and how PGMs can infer object relationships. Then we discuss how GNNs are implemented in PGMs for more efficient inference and structure learning. In the end, we summarize the benchmark datasets used in recent studies and discuss promising future directions.  ( 2 min )
    Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning. (arXiv:2110.15501v3 [stat.ML] UPDATED)
    Evaluating the performance of an ongoing policy plays a vital role in many areas such as medicine and economics, to provide crucial instruction on the early-stop of the online experiment and timely feedback from the environment. Policy evaluation in online learning thus attracts increasing attention by inferring the mean outcome of the optimal policy (i.e., the value) in real-time. Yet, such a problem is particularly challenging due to the dependent data generated in the online environment, the unknown optimal policy, and the complex exploration and exploitation trade-off in the adaptive experiment. In this paper, we aim to overcome these difficulties in policy evaluation for online learning. We explicitly derive the probability of exploration that quantifies the probability of exploring the non-optimal actions under commonly used bandit algorithms. We use this probability to conduct valid inference on the online conditional mean estimator under each action and develop the doubly robust interval estimation (DREAM) method to infer the value under the estimated optimal policy in online learning. The proposed value estimator provides double protection on the consistency and is asymptotically normal with a Wald-type confidence interval provided. Extensive simulations and real data applications are conducted to demonstrate the empirical validity of the proposed DREAM method.  ( 2 min )
    ClusterFuG: Clustering Fully connected Graphs by Multicut. (arXiv:2301.12159v1 [cs.CV])
    We propose a graph clustering formulation based on multicut (a.k.a. weighted correlation clustering) on the complete graph. Our formulation does not need specification of the graph topology as in the original sparse formulation of multicut, making our approach simpler and potentially better performing. In contrast to unweighted correlation clustering we allow for a more expressive weighted cost structure. In dense multicut, the clustering objective is given in a factorized form as inner products of node feature vectors. This allows for an efficient formulation and inference in contrast to multicut/weighted correlation clustering, which has at least quadratic representation and computation complexity when working on the complete graph. We show how to rewrite classical greedy algorithms for multicut in our dense setting and how to modify them for greater efficiency and solution quality. In particular, our algorithms scale to graphs with tens of thousands of nodes. Empirical evidence on instance segmentation on Cityscapes and clustering of ImageNet datasets shows the merits of our approach.  ( 2 min )
    Masked Contrastive Learning for Anomaly Detection. (arXiv:2105.08793v2 [cs.LG] UPDATED)
    Detecting anomalies is one fundamental aspect of a safety-critical software system, however, it remains a long-standing problem. Numerous branches of works have been proposed to alleviate the complication and have demonstrated their efficiencies. In particular, self-supervised learning based methods are spurring interest due to their capability of learning diverse representations without additional labels. Among self-supervised learning tactics, contrastive learning is one specific framework validating their superiority in various fields, including anomaly detection. However, the primary objective of contrastive learning is to learn task-agnostic features without any labels, which is not entirely suited to discern anomalies. In this paper, we propose a task-specific variant of contrastive learning named masked contrastive learning, which is more befitted for anomaly detection. Moreover, we propose a new inference method dubbed self-ensemble inference that further boosts performance by leveraging the ability learned through auxiliary self-supervision tasks. By combining our models, we can outperform previous state-of-the-art methods by a significant margin on various benchmark datasets.  ( 2 min )
    Protein Representation Learning by Geometric Structure Pretraining. (arXiv:2203.06125v5 [cs.LG] UPDATED)
    Learning effective protein representations is critical in a variety of tasks in biology such as predicting protein function or structure. Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences and then finetune the models with some labeled data in downstream tasks. Despite the effectiveness of sequence-based approaches, the power of pretraining on known protein structures, which are available in smaller numbers only, has not been explored for protein property prediction, though protein structures are known to be determinants of protein function. In this paper, we propose to pretrain protein representations according to their 3D structures. We first present a simple yet effective encoder to learn the geometric features of a protein. We pretrain the protein graph encoder by leveraging multiview contrastive learning and different self-prediction tasks. Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods, while using much less pretraining data. Our implementation is available at https://github.com/DeepGraphLearning/GearNet.  ( 2 min )
    CyclicFL: A Cyclic Model Pre-Training Approach to Efficient Federated Learning. (arXiv:2301.12193v1 [cs.LG])
    Since random initial models in Federated Learning (FL) can easily result in unregulated Stochastic Gradient Descent (SGD) processes, existing FL methods greatly suffer from both slow convergence and poor accuracy, especially for non-IID scenarios. To address this problem, we propose a novel FL method named CyclicFL, which can quickly derive effective initial models to guide the SGD processes, thus improving the overall FL training performance. Based on the concept of Continual Learning (CL), we prove that CyclicFL approximates existing centralized pre-training methods in terms of classification and prediction performance. Meanwhile, we formally analyze the significance of data consistency between the pre-training and training stages of CyclicFL, showing the limited Lipschitzness of loss for the pre-trained models by CyclicFL. Unlike traditional centralized pre-training methods that require public proxy data, CyclicFL pre-trains initial models on selected clients cyclically without exposing their local data. Therefore, they can be easily integrated into any security-critical FL methods. Comprehensive experimental results show that CyclicFL can not only improve the classification accuracy by up to 16.21%, but also significantly accelerate the overall FL training processes.  ( 2 min )
    Norm-based Generalization Bounds for Compositionally Sparse Neural Networks. (arXiv:2301.12033v1 [cs.LG])
    In this paper, we investigate the Rademacher complexity of deep sparse neural networks, where each neuron receives a small number of inputs. We prove generalization bounds for multilayered sparse ReLU neural networks, including convolutional neural networks. These bounds differ from previous ones, as they consider the norms of the convolutional filters instead of the norms of the associated Toeplitz matrices, independently of weight sharing between neurons. As we show theoretically, these bounds may be orders of magnitude better than standard norm-based generalization bounds and empirically, they are almost non-vacuous in estimating generalization in various simple classification problems. Taken together, these results suggest that compositional sparsity of the underlying target function is critical to the success of deep neural networks.  ( 2 min )
    On the Sample Complexity of Actor-Critic Method for Reinforcement Learning with Function Approximation. (arXiv:1910.08412v3 [cs.LG] UPDATED)
    Reinforcement learning, mathematically described by Markov Decision Problems, may be approached either through dynamic programming or policy search. Actor-critic algorithms combine the merits of both approaches by alternating between steps to estimate the value function and policy gradient updates. Due to the fact that the updates exhibit correlated noise and biased gradient updates, only the asymptotic behavior of actor-critic is known by connecting its behavior to dynamical systems. This work puts forth a new variant of actor-critic that employs Monte Carlo rollouts during the policy search updates, which results in controllable bias that depends on the number of critic evaluations. As a result, we are able to provide for the first time the convergence rate of actor-critic algorithms when the policy search step employs policy gradient, agnostic to the choice of policy evaluation technique. In particular, we establish conditions under which the sample complexity is comparable to stochastic gradient method for non-convex problems or slower as a result of the critic estimation error, which is the main complexity bottleneck. These results hold in continuous state and action spaces with linear function approximation for the value function. We then specialize these conceptual results to the case where the critic is estimated by Temporal Difference, Gradient Temporal Difference, and Accelerated Gradient Temporal Difference. These learning rates are then corroborated on a navigation problem involving an obstacle and the pendulum problem which provide insight into the interplay between optimization and generalization in reinforcement learning.  ( 2 min )
    Sparse Oblique Decision Trees: A Tool to Understand and Manipulate Neural Net Features. (arXiv:2104.02922v2 [cs.LG] UPDATED)
    The widespread deployment of deep nets in practical applications has lead to a growing desire to understand how and why such black-box methods perform prediction. Much work has focused on understanding what part of the input pattern (an image, say) is responsible for a particular class being predicted, and how the input may be manipulated to predict a different class. We focus instead on understanding which of the internal features computed by the neural net are responsible for a particular class. We achieve this by mimicking part of the neural net with an oblique decision tree having sparse weight vectors at the decision nodes. Using the recently proposed Tree Alternating Optimization (TAO) algorithm, we are able to learn trees that are both highly accurate and interpretable. Such trees can faithfully mimic the part of the neural net they replaced, and hence they can provide insights into the deep net black box. Further, we show we can easily manipulate the neural net features in order to make the net predict, or not predict, a given class, thus showing that it is possible to carry out adversarial attacks at the level of the features. These insights and manipulations apply globally to the entire training and test set, not just at a local (single-instance) level. We demonstrate this robustly in the MNIST and ImageNet datasets with LeNet5 and VGG networks.  ( 2 min )
    Off-Policy Evaluation in Partially Observed Markov Decision Processes under Sequential Ignorability. (arXiv:2110.12343v3 [cs.LG] UPDATED)
    We consider off-policy evaluation of dynamic treatment rules under sequential ignorability, given an assumption that the underlying system can be modeled as a partially observed Markov decision process (POMDP). We propose an estimator, partial history importance weighting, and show that it can consistently estimate the stationary mean rewards of a target policy given long enough draws from the behavior policy. We provide an upper bound on its error that decays polynomially in the number of observations (i.e., the number of trajectories times their length), with an exponent that depends on the overlap of the target and behavior policies, and on the mixing time of the underlying system. Furthermore, we show that this rate of convergence is minimax given only our assumptions on mixing and overlap. Our results establish that off-policy evaluation in POMDPs is strictly harder than off-policy evaluation in (fully observed) Markov decision processes, but strictly easier than model-free off-policy evaluation.  ( 2 min )
    Pragmatic Fairness: Developing Policies with Outcome Disparity Control. (arXiv:2301.12278v1 [cs.LG])
    We introduce a causal framework for designing optimal policies that satisfy fairness constraints. We take a pragmatic approach asking what we can do with an action space available to us and only with access to historical data. We propose two different fairness constraints: a moderation breaking constraint which aims at blocking moderation paths from the action and sensitive attribute to the outcome, and by that at reducing disparity in outcome levels as much as the provided action space permits; and an equal benefit constraint which aims at distributing gain from the new and maximized policy equally across sensitive attribute levels, and thus at keeping pre-existing preferential treatment in place or avoiding the introduction of new disparity. We introduce practical methods for implementing the constraints and illustrate their uses on experiments with semi-synthetic models.  ( 2 min )
    Heterogeneous Datasets for Federated Survival Analysis Simulation. (arXiv:2301.12166v1 [cs.LG])
    Survival analysis studies time-modeling techniques for an event of interest occurring for a population. Survival analysis found widespread applications in healthcare, engineering, and social sciences. However, the data needed to train survival models are often distributed, incomplete, censored, and confidential. In this context, federated learning can be exploited to tremendously improve the quality of the models trained on distributed data while preserving user privacy. However, federated survival analysis is still in its early development, and there is no common benchmarking dataset to test federated survival models. This work proposes a novel technique for constructing realistic heterogeneous datasets by starting from existing non-federated datasets in a reproducible way. Specifically, we provide two novel dataset-splitting algorithms based on the Dirichlet distribution to assign each data sample to a carefully chosen client: quantity-skewed splitting and label-skewed splitting. Furthermore, these algorithms allow for obtaining different levels of heterogeneity by changing a single hyperparameter. Finally, numerical experiments provide a quantitative evaluation of the heterogeneity level using log-rank tests and a qualitative analysis of the generated splits. The implementation of the proposed methods is publicly available in favor of reproducibility and to encourage common practices to simulate federated environments for survival analysis.  ( 2 min )
    GFlowNets and variational inference. (arXiv:2210.00580v2 [cs.LG] UPDATED)
    This paper builds bridges between two families of probabilistic algorithms: (hierarchical) variational inference (VI), which is typically used to model distributions over continuous spaces, and generative flow networks (GFlowNets), which have been used for distributions over discrete structures such as graphs. We demonstrate that, in certain cases, VI algorithms are equivalent to special cases of GFlowNets in the sense of equality of expected gradients of their learning objectives. We then point out the differences between the two families and show how these differences emerge experimentally. Notably, GFlowNets, which borrow ideas from reinforcement learning, are more amenable than VI to off-policy training without the cost of high gradient variance induced by importance sampling. We argue that this property of GFlowNets can provide advantages for capturing diversity in multimodal target distributions.  ( 2 min )
    Scalable Spatiotemporally Varying Coefficient Modelling with Bayesian Kernelized Tensor Regression. (arXiv:2109.00046v3 [stat.ML] UPDATED)
    As a regression technique in spatial statistics, the spatiotemporally varying coefficient model (STVC) is an important tool for discovering nonstationary and interpretable response-covariate associations over both space and time. However, it is difficult to apply STVC for large-scale spatiotemporal analyses due to its high computational cost. To address this challenge, we summarize the spatiotemporally varying coefficients using a third-order tensor structure and propose to reformulate the spatiotemporally varying coefficient model as a special low-rank tensor regression problem. The low-rank decomposition can effectively model the global patterns of the large data sets with a substantially reduced number of parameters. To further incorporate the local spatiotemporal dependencies, we use Gaussian process (GP) priors on the spatial and temporal factor matrices. We refer to the overall framework as Bayesian Kernelized Tensor Regression (BKTR). For model inference, we develop an efficient Markov chain Monte Carlo (MCMC) algorithm, which uses Gibbs sampling to update factor matrices and slice sampling to update kernel hyperparameters. We conduct extensive experiments on both synthetic and real-world data sets, and our results confirm the superior performance and efficiency of BKTR for model estimation and parameter inference.  ( 2 min )
    A Dependable Hybrid Machine Learning Model for Network Intrusion Detection. (arXiv:2212.04546v2 [cs.CR] UPDATED)
    Network intrusion detection systems (NIDSs) play an important role in computer network security. There are several detection mechanisms where anomaly-based automated detection outperforms others significantly. Amid the sophistication and growing number of attacks, dealing with large amounts of data is a recognized issue in the development of anomaly-based NIDS. However, do current models meet the needs of today's networks in terms of required accuracy and dependability? In this research, we propose a new hybrid model that combines machine learning and deep learning to increase detection rates while securing dependability. Our proposed method ensures efficient pre-processing by combining SMOTE for data balancing and XGBoost for feature selection. We compared our developed method to various machine learning and deep learning algorithms to find a more efficient algorithm to implement in the pipeline. Furthermore, we chose the most effective model for network intrusion based on a set of benchmarked performance analysis criteria. Our method produces excellent results when tested on two datasets, KDDCUP'99 and CIC-MalMem-2022, with an accuracy of 99.99% and 100% for KDDCUP'99 and CIC-MalMem-2022, respectively, and no overfitting or Type-1 and Type-2 issues.  ( 2 min )
    Jump Interval-Learning for Individualized Decision Making. (arXiv:2111.08885v2 [stat.ME] UPDATED)
    An individualized decision rule (IDR) is a decision function that assigns each individual a given treatment based on his/her observed characteristics. Most of the existing works in the literature consider settings with binary or finitely many treatment options. In this paper, we focus on the continuous treatment setting and propose a jump interval-learning to develop an individualized interval-valued decision rule (I2DR) that maximizes the expected outcome. Unlike IDRs that recommend a single treatment, the proposed I2DR yields an interval of treatment options for each individual, making it more flexible to implement in practice. To derive an optimal I2DR, our jump interval-learning method estimates the conditional mean of the outcome given the treatment and the covariates via jump penalized regression, and derives the corresponding optimal I2DR based on the estimated outcome regression function. The regressor is allowed to be either linear for clear interpretation or deep neural network to model complex treatment-covariates interactions. To implement jump interval-learning, we develop a searching algorithm based on dynamic programming that efficiently computes the outcome regression function. Statistical properties of the resulting I2DR are established when the outcome regression function is either a piecewise or continuous function over the treatment space. We further develop a procedure to infer the mean outcome under the (estimated) optimal policy. Extensive simulations and a real data application to a warfarin study are conducted to demonstrate the empirical validity of the proposed I2DR.  ( 2 min )
    Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases. (arXiv:2301.12017v1 [cs.CL])
    Improving the deployment efficiency of transformer-based language models has been challenging given their high computation and memory cost. While INT8 quantization has recently been shown to be effective in reducing both the memory cost and latency while preserving model accuracy, it remains unclear whether we can leverage INT4 (which doubles peak hardware throughput) to achieve further latency improvement. In this work, we fully investigate the feasibility of using INT4 quantization for language models, and show that using INT4 introduces no or negligible accuracy degradation for encoder-only and encoder-decoder models, but causes a significant accuracy drop for decoder-only models. To materialize the performance gain using INT4, we develop a highly-optimized end-to-end INT4 encoder inference pipeline supporting different quantization strategies. Our INT4 pipeline is $8.5\times$ faster for latency-oriented scenarios and up to $3\times$ for throughput-oriented scenarios compared to the inference of FP16, and improves the SOTA BERT INT8 performance from FasterTransformer by up to $1.7\times$. We also provide insights into the failure cases when applying INT4 to decoder-only models, and further explore the compatibility of INT4 quantization with other compression techniques, like pruning and layer reduction.  ( 2 min )
    On the Lipschitz Constant of Deep Networks and Double Descent. (arXiv:2301.12309v1 [cs.LG])
    Existing bounds on the generalization error of deep networks assume some form of smooth or bounded dependence on the input variable, falling short of investigating the mechanisms controlling such factors in practice. In this work, we present an extensive experimental study of the empirical Lipschitz constant of deep networks undergoing double descent, and highlight non-monotonic trends strongly correlating with the test error. Building a connection between parameter-space and input-space gradients for SGD around a critical point, we isolate two important factors -- namely loss landscape curvature and distance of parameters from initialization -- respectively controlling optimization dynamics around a critical point and bounding model function complexity, even beyond the training data. Our study presents novels insights on implicit regularization via overparameterization, and effective model complexity for networks trained in practice.  ( 2 min )
    Continual Graph Learning: A Survey. (arXiv:2301.12230v1 [cs.LG])
    Research on continual learning (CL) mainly focuses on data represented in the Euclidean space, while research on graph-structured data is scarce. Furthermore, most graph learning models are tailored for static graphs. However, graphs usually evolve continually in the real world. Catastrophic forgetting also emerges in graph learning models when being trained incrementally. This leads to the need to develop robust, effective and efficient continual graph learning approaches. Continual graph learning (CGL) is an emerging area aiming to realize continual learning on graph-structured data. This survey is written to shed light on this emerging area. It introduces the basic concepts of CGL and highlights two unique challenges brought by graphs. Then it reviews and categorizes recent state-of-the-art approaches, analyzing their strategies to tackle the unique challenges in CGL. Besides, it discusses the main concerns in each family of CGL methods, offering potential solutions. Finally, it explores the open issues and potential applications of CGL.  ( 2 min )
    Zero-shot causal learning. (arXiv:2301.12292v1 [cs.LG])
    Predicting how different interventions will causally affect a specific individual is important in a variety of domains such as personalized medicine, public policy, and online marketing. However, most existing causal methods cannot generalize to predicting the effects of previously unseen interventions (e.g., a newly invented drug), because they require data for individuals who received the intervention. Here, we consider zero-shot causal learning: predicting the personalized effects of novel, previously unseen interventions. To tackle this problem, we propose CaML, a causal meta-learning framework which formulates the personalized prediction of each intervention's effect as a task. Rather than training a separate model for each intervention, CaML trains as a single meta-model across thousands of tasks, each constructed by sampling an intervention and individuals who either did or did not receive it. By leveraging both intervention information (e.g., a drug's attributes) and individual features (e.g., a patient's history), CaML is able to predict the personalized effects of unseen interventions. Experimental results on real world datasets in large-scale medical claims and cell-line perturbations demonstrate the effectiveness of our approach. Most strikingly, CaML zero-shot predictions outperform even strong baselines which have direct access to data of considered target interventions.  ( 2 min )
    Solving high-dimensional Hamilton-Jacobi-Bellman PDEs using neural networks: perspectives from the theory of controlled diffusions and measures on path space. (arXiv:2005.05409v2 [math.OC] UPDATED)
    Optimal control of diffusion processes is intimately connected to the problem of solving certain Hamilton-Jacobi-Bellman equations. Building on recent machine learning inspired approaches towards high-dimensional PDEs, we investigate the potential of $\textit{iterative diffusion optimisation}$ techniques, in particular considering applications in importance sampling and rare event simulation, and focusing on problems without diffusion control, with linearly controlled drift and running costs that depend quadratically on the control. More generally, our methods apply to nonlinear parabolic PDEs with a certain shift invariance. The choice of an appropriate loss function being a central element in the algorithmic design, we develop a principled framework based on divergences between path measures, encompassing various existing methods. Motivated by connections to forward-backward SDEs, we propose and study the novel $\textit{log-variance}$ divergence, showing favourable properties of corresponding Monte Carlo estimators. The promise of the developed approach is exemplified by a range of high-dimensional and metastable numerical examples.  ( 2 min )
    Alignment with human representations supports robust few-shot learning. (arXiv:2301.11990v1 [cs.LG])
    Should we care whether AI systems have representations of the world that are similar to those of humans? We provide an information-theoretic analysis that suggests that there should be a U-shaped relationship between the degree of representational alignment with humans and performance on few-shot learning tasks. We confirm this prediction empirically, finding such a relationship in an analysis of the performance of 491 computer vision models. We also show that highly-aligned models are more robust to both adversarial attacks and domain shifts. Our results suggest that human-alignment is often a sufficient, but not necessary, condition for models to make effective use of limited data, be robust, and generalize well.  ( 2 min )
    Turbulence control in plane Couette flow using low-dimensional neural ODE-based models and deep reinforcement learning. (arXiv:2301.12098v1 [physics.flu-dyn])
    The high dimensionality and complex dynamics of turbulent flows remain an obstacle to the discovery and implementation of control strategies. Deep reinforcement learning (RL) is a promising avenue for overcoming these obstacles, but requires a training phase in which the RL agent iteratively interacts with the flow environment to learn a control policy, which can be prohibitively expensive when the environment involves slow experiments or large-scale simulations. We overcome this challenge using a framework we call "DManD-RL" (data-driven manifold dynamics-RL), which generates a data-driven low-dimensional model of our system that we use for RL training. With this approach, we seek to minimize drag in a direct numerical simulation (DNS) of a turbulent minimal flow unit of plane Couette flow at Re=400 using two slot jets on one wall. We obtain, from DNS data with $\mathcal{O}(10^5)$ degrees of freedom, a 25-dimensional DManD model of the dynamics by combining an autoencoder and neural ordinary differential equation. Using this model as the environment, we train an RL control agent, yielding a 440-fold speedup over training on the DNS, with equivalent control performance. The agent learns a policy that laminarizes 84% of unseen DNS test trajectories within 900 time units, significantly outperforming classical opposition control (58%), despite the actuation authority being much more restricted. The agent often achieves laminarization through a counterintuitive strategy that drives the formation of two low-speed streaks, with a spanwise wavelength that is too small to be self-sustaining. The agent demonstrates the same performance when we limit observations to wall shear rate.  ( 2 min )
    Do Embodied Agents Dream of Pixelated Sheep?: Embodied Decision Making using Language Guided World Modelling. (arXiv:2301.12050v1 [cs.LG])
    Reinforcement learning (RL) agents typically learn tabula rasa, without prior knowledge of the world, which makes learning complex tasks with sparse rewards difficult. If initialized with knowledge of high-level subgoals and transitions between subgoals, RL agents could utilize this Abstract World Model (AWM) for planning and exploration. We propose using few-shot large language models (LLMs) to hypothesize an AWM, that is tested and verified during exploration, to improve sample efficiency in embodied RL agents. Our DECKARD agent applies LLM-guided exploration to item crafting in Minecraft in two phases: (1) the Dream phase where the agent uses an LLM to decompose a task into a sequence of subgoals, the hypothesized AWM; and (2) the Wake phase where the agent learns a modular policy for each subgoal and verifies or corrects the hypothesized AWM on the basis of its experiences. Our method of hypothesizing an AWM with LLMs and then verifying the AWM based on agent experience not only increases sample efficiency over contemporary methods by an order of magnitude but is also robust to and corrects errors in the LLM, successfully blending noisy internet-scale information from LLMs with knowledge grounded in environment dynamics.  ( 2 min )
    Harnessing the Power of Decision Trees to Detect IoT Malware. (arXiv:2301.12039v1 [cs.CR])
    Due to its simple installation and connectivity, the Internet of Things (IoT) is susceptible to malware attacks. Being able to operate autonomously. As IoT devices have become more prevalent, they have become the most tempting targets for malware. Weak, guessable, or hard-coded passwords, and a lack of security measures contribute to these vulnerabilities along with insecure network connections and outdated update procedures. To understand IoT malware, current methods and analysis ,using static methods, are ineffective. The field of deep learning has made great strides in recent years due to their tremendous data mining, learning, and expression capabilities, cybersecurity has enjoyed tremendous growth in recent years. As a result, malware analysts will not have to spend as much time analyzing malware. In this paper, we propose a novel detection and analysis method that harnesses the power and simplicity of decision trees. The experiments are conducted using a real word dataset, MaleVis which is a publicly available dataset. Based on the results, we show that our proposed approach outperforms existing state-of-the-art solutions in that it achieves 97.23% precision and 95.89% recall in terms of detection and classification. A specificity of 96.58%, F1-score of 96.40%, an accuracy of 96.43.  ( 2 min )
    Physics-Inspired Protein Encoder Pre-Training via Siamese Sequence-Structure Diffusion Trajectory Prediction. (arXiv:2301.12068v1 [cs.LG])
    Pre-training methods on proteins are recently gaining interest, leveraging either protein sequences or structures, while modeling their joint energy landscape is largely unexplored. In this work, inspired by the success of denoising diffusion models, we propose the DiffPreT approach to pre-train a protein encoder by sequence-structure multimodal diffusion modeling. DiffPreT guides the encoder to recover the native protein sequences and structures from the perturbed ones along the multimodal diffusion trajectory, which acquires the joint distribution of sequences and structures. Considering the essential protein conformational variations, we enhance DiffPreT by a physics-inspired method called Siamese Diffusion Trajectory Prediction (SiamDiff) to capture the correlation between different conformers of a protein. SiamDiff attains this goal by maximizing the mutual information between representations of diffusion trajectories of structurally-correlated conformers. We study the effectiveness of DiffPreT and SiamDiff on both atom- and residue-level structure-based protein understanding tasks. Experimental results show that the performance of DiffPreT is consistently competitive on all tasks, and SiamDiff achieves new state-of-the-art performance, considering the mean ranks on all tasks. The source code will be released upon acceptance.  ( 2 min )
    On the Feasibility of Machine Learning Augmented Magnetic Resonance for Point-of-Care Identification of Disease. (arXiv:2301.11962v1 [cs.LG])
    Early detection of many life-threatening diseases (e.g., prostate and breast cancer) within at-risk population can improve clinical outcomes and reduce cost of care. While numerous disease-specific "screening" tests that are closer to Point-of-Care (POC) are in use for this task, their low specificity results in unnecessary biopsies, leading to avoidable patient trauma and wasteful healthcare spending. On the other hand, despite the high accuracy of Magnetic Resonance (MR) imaging in disease diagnosis, it is not used as a POC disease identification tool because of poor accessibility. The root cause of poor accessibility of MR stems from the requirement to reconstruct high-fidelity images, as it necessitates a lengthy and complex process of acquiring large quantities of high-quality k-space measurements. In this study we explore the feasibility of an ML-augmented MR pipeline that directly infers the disease sidestepping the image reconstruction process. We hypothesise that the disease classification task can be solved using a very small tailored subset of k-space data, compared to image reconstruction. Towards that end, we propose a method that performs two tasks: 1) identifies a subset of the k-space that maximizes disease identification accuracy, and 2) infers the disease directly using the identified k-space subset, bypassing the image reconstruction step. We validate our hypothesis by measuring the performance of the proposed system across multiple diseases and anatomies. We show that comparable performance to image-based classifiers, trained on images reconstructed with full k-space data, can be achieved using small quantities of data: 8% of the data for detecting multiple abnormalities in prostate and brain scans, and 5% of the data for knee abnormalities. To better understand the proposed approach and instigate future research, we provide an extensive analysis and release code.  ( 2 min )
    RCsearcher: Reaction Center Identification in Retrosynthesis via Deep Q-Learning. (arXiv:2301.12071v1 [cs.LG])
    The reaction center consists of atoms in the product whose local properties are not identical to the corresponding atoms in the reactants. Prior studies on reaction center identification are mainly on semi-templated retrosynthesis methods. Moreover, they are limited to single reaction center identification. However, many reaction centers are comprised of multiple bonds or atoms in reality. We refer to it as the multiple reaction center. This paper presents RCsearcher, a unified framework for single and multiple reaction center identification that combines the advantages of the graph neural network and deep reinforcement learning. The critical insight in this framework is that the single or multiple reaction center must be a node-induced subgraph of the molecular product graph. At each step, it considers choosing one node in the molecular product graph and adding it to the explored node-induced subgraph as an action. Comprehensive experiments demonstrate that RCsearcher consistently outperforms other baselines and can extrapolate the reaction center patterns that have not appeared in the training set. Ablation experiments verify the effectiveness of individual components, including the beam search and one-hop constraint of action space.  ( 2 min )
    Predicting Students' Exam Scores Using Physiological Signals. (arXiv:2301.12051v1 [cs.LG])
    While acute stress has been shown to have both positive and negative effects on performance, not much is known about the impacts of stress on students grades during examinations. To answer this question, we examined whether a correlation could be found between physiological stress signals and exam performance. We conducted this study using multiple physiological signals of ten undergraduate students over three different exams. The study focused on three signals, i.e., skin temperature, heart rate, and electrodermal activity. We extracted statistics as features and fed them into a variety of binary classifiers to predict relatively higher or lower grades. Experimental results showed up to 0.81 ROC-AUC with k-nearest neighbor algorithm among various machine learning algorithms.  ( 2 min )
    Restricted Orthogonal Gradient Projection for Continual Learning. (arXiv:2301.12131v1 [cs.LG])
    Continual learning aims to avoid catastrophic forgetting and effectively leverage learned experiences to master new knowledge. Existing gradient projection approaches impose hard constraints on the optimization space for new tasks to minimize interference, which simultaneously hinders forward knowledge transfer. To address this issue, recent methods reuse frozen parameters with a growing network, resulting in high computational costs. Thus, it remains a challenge whether we can improve forward knowledge transfer for gradient projection approaches using a fixed network architecture. In this work, we propose the Restricted Orthogonal Gradient prOjection (ROGO) framework. The basic idea is to adopt a restricted orthogonal constraint allowing parameters optimized in the direction oblique to the whole frozen space to facilitate forward knowledge transfer while consolidating previous knowledge. Our framework requires neither data buffers nor extra parameters. Extensive experiments have demonstrated the superiority of our framework over several strong baselines. We also provide theoretical guarantees for our relaxing strategy.  ( 2 min )
    ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts. (arXiv:2301.12040v1 [q-bio.BM])
    Current protein language models (PLMs) learn protein representations mainly based on their sequences, thereby well capturing co-evolutionary information, but they are unable to explicitly acquire protein functions, which is the end goal of protein representation learning. Fortunately, for many proteins, their textual property descriptions are available, where their various functions are also described. Motivated by this fact, we first build the ProtDescribe dataset to augment protein sequences with text descriptions of their functions and other important properties. Based on this dataset, we propose the ProtST framework to enhance Protein Sequence pre-training and understanding by biomedical Texts. During pre-training, we design three types of tasks, i.e., unimodal mask prediction, multimodal representation alignment and multimodal mask prediction, to enhance a PLM with protein property information with different granularities and, at the same time, preserve the PLM's original representation power. On downstream tasks, ProtST enables both supervised learning and zero-shot prediction. We verify the superiority of ProtST-induced PLMs over previous ones on diverse representation learning benchmarks. Under the zero-shot setting, we show the effectiveness of ProtST on zero-shot protein classification, and ProtST also enables functional protein retrieval from a large-scale database without any function annotation.  ( 2 min )
    A Memory Efficient Deep Reinforcement Learning Approach For Snake Game Autonomous Agents. (arXiv:2301.11977v1 [cs.AI])
    To perform well, Deep Reinforcement Learning (DRL) methods require significant memory resources and computational time. Also, sometimes these systems need additional environment information to achieve a good reward. However, it is more important for many applications and devices to reduce memory usage and computational times than to achieve the maximum reward. This paper presents a modified DRL method that performs reasonably well with compressed imagery data without requiring additional environment information and also uses less memory and time. We have designed a lightweight Convolutional Neural Network (CNN) with a variant of the Q-network that efficiently takes preprocessed image data as input and uses less memory. Furthermore, we use a simple reward mechanism and small experience replay memory so as to provide only the minimum necessary information. Our modified DRL method enables our autonomous agent to play Snake, a classical control game. The results show our model can achieve similar performance as other DRL methods.  ( 2 min )
    Minimizing Trajectory Curvature of ODE-based Generative Models. (arXiv:2301.12003v1 [cs.LG])
    Recent ODE/SDE-based generative models, such as diffusion models and flow matching, define a generative process as a time reversal of a fixed forward process. Even though these models show impressive performance on large-scale datasets, numerical simulation requires multiple evaluations of a neural network, leading to a slow sampling speed. We attribute the reason to the high curvature of the learned generative trajectories, as it is directly related to the truncation error of a numerical solver. Based on the relationship between the forward process and the curvature, here we present an efficient method of training the forward process to minimize the curvature of generative trajectories without any ODE/SDE simulation. Experiments show that our method achieves a lower curvature than previous models and, therefore, decreased sampling costs while maintaining competitive performance. Code is available at https://github.com/sangyun884/fast-ode.  ( 2 min )
    Physics-informed Neural Network: The Effect of Reparameterization in Solving Differential Equations. (arXiv:2301.12118v1 [cs.LG])
    Differential equations are used to model and predict the behaviour of complex systems in a wide range of fields, and the ability to solve them is an important asset for understanding and predicting the behaviour of these systems. Complicated physics mostly involves difficult differential equations, which are hard to solve analytically. In recent years, physics-informed neural networks have been shown to perform very well in solving systems with various differential equations. The main ways to approximate differential equations are through penalty function and reparameterization. Most researchers use penalty functions rather than reparameterization due to the complexity of implementing reparameterization. In this study, we quantitatively compare physics-informed neural network models with and without reparameterization using the approximation error. The performance of reparameterization is demonstrated based on two benchmark mechanical engineering problems, a one-dimensional bar problem and a two-dimensional bending beam problem. Our results show that when dealing with complex differential equations, applying reparameterization results in a lower approximation error.  ( 2 min )
    Reduced-Order Autodifferentiable Ensemble Kalman Filters. (arXiv:2301.11961v1 [stat.ML])
    This paper introduces a computational framework to reconstruct and forecast a partially observed state that evolves according to an unknown or expensive-to-simulate dynamical system. Our reduced-order autodifferentiable ensemble Kalman filters (ROAD-EnKFs) learn a latent low-dimensional surrogate model for the dynamics and a decoder that maps from the latent space to the state space. The learned dynamics and decoder are then used within an ensemble Kalman filter to reconstruct and forecast the state. Numerical experiments show that if the state dynamics exhibit a hidden low-dimensional structure, ROAD-EnKFs achieve higher accuracy at lower computational cost compared to existing methods. If such structure is not expressed in the latent state dynamics, ROAD-EnKFs achieve similar accuracy at lower cost, making them a promising approach for surrogate state reconstruction and forecasting.  ( 2 min )
    Meta Temporal Point Processes. (arXiv:2301.12023v1 [cs.LG])
    A temporal point process (TPP) is a stochastic process where its realization is a sequence of discrete events in time. Recent work in TPPs model the process using a neural network in a supervised learning framework, where a training set is a collection of all the sequences. In this work, we propose to train TPPs in a meta learning framework, where each sequence is treated as a different task, via a novel framing of TPPs as neural processes (NPs). We introduce context sets to model TPPs as an instantiation of NPs. Motivated by attentive NP, we also introduce local history matching to help learn more informative features. We demonstrate the potential of the proposed method on popular public benchmark datasets and tasks, and compare with state-of-the-art TPP methods.  ( 2 min )
    Variational Latent Branching Model for Off-Policy Evaluation. (arXiv:2301.12056v1 [cs.LG])
    Model-based methods have recently shown great potential for off-policy evaluation (OPE); offline trajectories induced by behavioral policies are fitted to transitions of Markov decision processes (MDPs), which are used to rollout simulated trajectories and estimate the performance of policies. Model-based OPE methods face two key challenges. First, as offline trajectories are usually fixed, they tend to cover limited state and action space. Second, the performance of model-based methods can be sensitive to the initialization of their parameters. In this work, we propose the variational latent branching model (VLBM) to learn the transition function of MDPs by formulating the environmental dynamics as a compact latent space, from which the next states and rewards are then sampled. Specifically, VLBM leverages and extends the variational inference framework with the recurrent state alignment (RSA), which is designed to capture as much information underlying the limited training data, by smoothing out the information flow between the variational (encoding) and generative (decoding) part of VLBM. Moreover, we also introduce the branching architecture to improve the model's robustness against randomly initialized model weights. The effectiveness of the VLBM is evaluated on the deep OPE (DOPE) benchmark, from which the training trajectories are designed to result in varied coverage of the state-action space. We show that the VLBM outperforms existing state-of-the-art OPE methods in general.  ( 2 min )
    ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts. (arXiv:2301.12171v1 [cs.CV])
    Recent success of large-scale Contrastive Language-Image Pre-training (CLIP) has led to great promise in zero-shot semantic segmentation by transferring image-text aligned knowledge to pixel-level classification. However, existing methods usually require an additional image encoder or retraining/tuning the CLIP module. Here, we present a cost-effective strategy using text-prompt learning that keeps the entire CLIP module frozen while fully leveraging its rich information. Specifically, we propose a novel Zero-shot segmentation with Optimal Transport (ZegOT) method that matches multiple text prompts with frozen image embeddings through optimal transport, which allows each text prompt to efficiently focus on specific semantic attributes. Additionally, we propose Deep Local Feature Alignment (DLFA) that deeply aligns the text prompts with intermediate local feature of the frozen image encoder layers, which significantly boosts the zero-shot segmentation performance. Through extensive experiments on benchmark datasets, we show that our method achieves the state-of-the-art (SOTA) performance with only x7 lighter parameters compared to previous SOTA approaches.  ( 2 min )
    Folded Optimization for End-to-End Model-Based Learning. (arXiv:2301.12047v1 [cs.LG])
    The integration of constrained optimization models as components in deep networks has led to promising advances in both these domains. A primary challenge in this setting is backpropagation through the optimization mapping, which typically lacks a closed form. A common approach is unrolling, which relies on automatic differentiation through the operations of an iterative solver. While flexible and general, unrolling can encounter accuracy and efficiency issues in practice. These issues can be avoided by differentiating the optimization mapping analytically, but current frameworks impose rigid requirements on the optimization problem's form. This paper provides theoretical insights into the backpropagation of unrolled optimizers, which lead to a system for generating equivalent but efficiently solvable analytical models. Additionally, it proposes a unifying view of unrolling and analytical differentiation through constrained optimization mappings. Experiments over various structured prediction and decision-focused learning tasks illustrate the potential of the approach both computationally and in terms of enhanced expressiveness.  ( 2 min )
    Information loss from dimensionality reduction in 5D-Gaussian spectral data. (arXiv:2301.11923v1 [physics.data-an])
    Understanding the loss of information in spectral analytics is a crucial first step towards finding root causes for failures and uncertainties using spectral data in artificial intelligence models built from modern complex data science applications. Here, we show from a very simple entropy model analysis with quantum statistics of spectral data, that the relative loss of information from dimensionality reduction due to projection of an initial five-dimensional state onto two-dimensional diagrams is less than one percent in the parameter range of small data sets with sample sizes on the order of few hundreds data samples. From our analysis, we also conclude that the density and expectation value of the entropy probability distribution increases with the sample number and sample size using artificial data models derived from random sampling Monte-Carlo simulation methods.  ( 2 min )
    On the Connection Between MPNN and Graph Transformer. (arXiv:2301.11956v1 [cs.LG])
    Graph Transformer (GT) recently has emerged as a new paradigm of graph learning algorithms, outperforming the previously popular Message Passing Neural Network (MPNN) on multiple benchmarks. Previous work (Kim et al., 2022) shows that with proper position embedding, GT can approximate MPNN arbitrarily well, implying that GT is at least as powerful as MPNN. In this paper, we study the inverse connection and show that MPNN with virtual node (VN), a commonly used heuristic with little theoretical understanding, is powerful enough to arbitrarily approximate the self-attention layer of GT. In particular, we first show that if we consider one type of linear transformer, the so-called Performer/Linear Transformer (Choromanski et al., 2020; Katharopoulos et al., 2020), then MPNN + VN with only O(1) depth and O(1) width can approximate a self-attention layer in Performer/Linear Transformer. Next, via a connection between MPNN + VN and DeepSets, we prove the MPNN + VN with O(n^d) width and O(1) depth can approximate the self-attention layer arbitrarily well, where d is the input feature dimension. Lastly, under some assumptions, we provide an explicit construction of MPNN + VN with O(1) width and O(n) depth approximating the self-attention layer in GT arbitrarily well. On the empirical side, we demonstrate that 1) MPNN + VN is a surprisingly strong baseline, outperforming GT on the recently proposed Long Range Graph Benchmark (LRGB) dataset, 2) our MPNN + VN improves over early implementation on a wide range of OGB datasets and 3) MPNN + VN outperforms Linear Transformer and MPNN on the climate modeling task.  ( 2 min )
    Neural Relation Graph for Identifying Problematic Data. (arXiv:2301.12321v1 [cs.LG])
    Diagnosing and cleaning datasets are crucial for building robust machine learning systems. However, identifying problems within large-scale datasets with real-world distributions is difficult due to the presence of complex issues, such as label errors or under-representation of certain types. In this paper, we propose a novel approach for identifying problematic data by utilizing a largely ignored source of information: a relational structure of data in the feature-embedded space. We develop an efficient algorithm for detecting label errors and outlier data points based on the relational graph structure of the dataset. We further introduce a visualization tool for contextualizing data points, which can serve as an effective tool for interactively diagnosing datasets. We evaluate label error and out-of-distribution detection performances on large-scale image and language domain tasks, including ImageNet and GLUE benchmarks, and demonstrate the effectiveness of our approach for debugging datasets and building robust machine learning systems.  ( 2 min )
    Accelerated Training of Physics-Informed Neural Networks (PINNs) using Meshless Discretizations. (arXiv:2205.09332v6 [cs.LG] UPDATED)
    We present a new technique for the accelerated training of physics-informed neural networks (PINNs): discretely-trained PINNs (DT-PINNs). The repeated computation of partial derivative terms in the PINN loss functions via automatic differentiation during training is known to be computationally expensive, especially for higher-order derivatives. DT-PINNs are trained by replacing these exact spatial derivatives with high-order accurate numerical discretizations computed using meshless radial basis function-finite differences (RBF-FD) and applied via sparse-matrix vector multiplication. The use of RBF-FD allows for DT-PINNs to be trained even on point cloud samples placed on irregular domain geometries. Additionally, though traditional PINNs (vanilla-PINNs) are typically stored and trained in 32-bit floating-point (fp32) on the GPU, we show that for DT-PINNs, using fp64 on the GPU leads to significantly faster training times than fp32 vanilla-PINNs with comparable accuracy. We demonstrate the efficiency and accuracy of DT-PINNs via a series of experiments. First, we explore the effect of network depth on both numerical and automatic differentiation of a neural network with random weights and show that RBF-FD approximations of third-order accuracy and above are more efficient while being sufficiently accurate. We then compare the DT-PINNs to vanilla-PINNs on both linear and nonlinear Poisson equations and show that DT-PINNs achieve similar losses with 2-4x faster training times on a consumer GPU. Finally, we also demonstrate that similar results can be obtained for the PINN solution to the heat equation (a space-time problem) by discretizing the spatial derivatives using RBF-FD and using automatic differentiation for the temporal derivative. Our results show that fp64 DT-PINNs offer a superior cost-accuracy profile to fp32 vanilla-PINNs.  ( 3 min )
    Spherical Sliced-Wasserstein. (arXiv:2206.08780v2 [stat.ML] UPDATED)
    Many variants of the Wasserstein distance have been introduced to reduce its original computational burden. In particular the Sliced-Wasserstein distance (SW), which leverages one-dimensional projections for which a closed-form solution of the Wasserstein distance is available, has received a lot of interest. Yet, it is restricted to data living in Euclidean spaces, while the Wasserstein distance has been studied and used recently on manifolds. We focus more specifically on the sphere, for which we define a novel SW discrepancy, which we call spherical Sliced-Wasserstein, making a first step towards defining SW discrepancies on manifolds. Our construction is notably based on closed-form solutions of the Wasserstein distance on the circle, together with a new spherical Radon transform. Along with efficient algorithms and the corresponding implementations, we illustrate its properties in several machine learning use cases where spherical representations of data are at stake: sampling on the sphere, density estimation on real earth data or hyperspherical auto-encoders.  ( 2 min )
    Informational Diversity and Affinity Bias in Team Growth Dynamics. (arXiv:2301.12091v1 [cs.GT])
    Prior work has provided strong evidence that, within organizational settings, teams that bring a diversity of information and perspectives to a task are more effective than teams that do not. If this form of informational diversity confers performance advantages, why do we often see largely homogeneous teams in practice? One canonical argument is that the benefits of informational diversity are in tension with affinity bias. To better understand the impact of this tension on the makeup of teams, we analyze a sequential model of team formation in which individuals care about their team's performance (captured in terms of accurately predicting some future outcome based on a set of features) but experience a cost as a result of interacting with teammates who use different approaches to the prediction task. Our analysis of this simple model reveals a set of subtle behaviors that team-growth dynamics can exhibit: (i) from certain initial team compositions, they can make progress toward better performance but then get stuck partway to optimally diverse teams; while (ii) from other initial compositions, they can also move away from this optimal balance as the majority group tries to crowd out the opinions of the minority. The initial composition of the team can determine whether the dynamics will move toward or away from performance optimality, painting a path-dependent picture of inefficiencies in team compositions. Our results formalize a fundamental limitation of utility-based motivations to drive informational diversity in organizations and hint at interventions that may improve informational diversity and performance simultaneously.  ( 2 min )
    Thompson Sampling for High-Dimensional Sparse Linear Contextual Bandits. (arXiv:2211.05964v2 [stat.ML] UPDATED)
    We consider the stochastic linear contextual bandit problem with high-dimensional features. We analyze the Thompson sampling algorithm using special classes of sparsity-inducing priors (e.g., spike-and-slab) to model the unknown parameter and provide a nearly optimal upper bound on the expected cumulative regret. To the best of our knowledge, this is the first work that provides theoretical guarantees of Thompson sampling in high-dimensional and sparse contextual bandits. For faster computation, we use variational inference instead of Markov Chain Monte Carlo (MCMC) to approximate the posterior distribution. Extensive simulations demonstrate the improved performance of our proposed algorithm over existing ones.  ( 2 min )
    Adversarial Networks and Machine Learning for File Classification. (arXiv:2301.11964v1 [cs.LG])
    Correctly identifying the type of file under examination is a critical part of a forensic investigation. The file type alone suggests the embedded content, such as a picture, video, manuscript, spreadsheet, etc. In cases where a system owner might desire to keep their files inaccessible or file type concealed, we propose using an adversarially-trained machine learning neural network to determine a file's true type even if the extension or file header is obfuscated to complicate its discovery. Our semi-supervised generative adversarial network (SGAN) achieved 97.6% accuracy in classifying files across 11 different types. We also compared our network against a traditional standalone neural network and three other machine learning algorithms. The adversarially-trained network proved to be the most precise file classifier especially in scenarios with few supervised samples available. Our implementation of a file classifier using an SGAN is implemented on GitHub (https://ksaintg.github.io/SGAN-File-Classier).  ( 2 min )
    Neural Gas Network Image Features and Segmentation for Brain Tumor Detection Using Magnetic Resonance Imaging Data. (arXiv:2301.12176v1 [eess.IV])
    Accurate detection of brain tumors could save lots of lives and increasing the accuracy of this binary classification even as much as a few percent has high importance. Neural Gas Networks (NGN) is a fast, unsupervised algorithm that could be used in data clustering, image pattern recognition, and image segmentation. In this research, we used the metaheuristic Firefly Algorithm (FA) for image contrast enhancement as pre-processing and NGN weights for feature extraction and segmentation of Magnetic Resonance Imaging (MRI) data on two brain tumor datasets from the Kaggle platform. Also, tumor classification is conducted by Support Vector Machine (SVM) classification algorithms and compared with a deep learning technique plus other features in train and test phases. Additionally, NGN tumor segmentation is evaluated by famous performance metrics such as Accuracy, F-measure, Jaccard, and more versus ground truth data and compared with traditional segmentation techniques. The proposed method is fast and precise in both tasks of tumor classification and segmentation compared with other methods. A classification accuracy of 95.14 % and segmentation accuracy of 0.977 is achieved by the proposed method.  ( 2 min )
    Byte Pair Encoding for Symbolic Music. (arXiv:2301.11975v1 [cs.LG])
    The symbolic music modality is nowadays mostly represented as discrete and used with sequential models such as Transformers, for deep learning tasks. Recent research put efforts on the tokenization, i.e. the conversion of data into sequences of integers intelligible to such models. This can be achieved by many ways as music can be composed of simultaneous tracks, of simultaneous notes with several attributes. Until now, the proposed tokenizations are based on small vocabularies describing the note attributes and time events, resulting in fairly long token sequences. In this paper, we show how Byte Pair Encoding (BPE) can improve the results of deep learning models while improving its performances. We experiment on music generation and composer classification, and study the impact of BPE on how models learn the embeddings, and show that it can help to increase their isotropy, i.e., the uniformity of the variance of their positions in the space.  ( 2 min )
    Arbitrarily Accurate Classification Applied to Specific Emitter Identification. (arXiv:2211.10379v2 [eess.SP] UPDATED)
    This article introduces a method of evaluating subsamples until any prescribed level of classification accuracy is attained, thus obtaining arbitrary accuracy. A logarithmic reduction in error rate is obtained with a linear increase in sample count. The technique is applied to specific emitter identification on a published dataset of physically recorded over-the-air signals from 16 ostensibly identical high-performance radios. The technique uses a multi-channel deep learning convolutional neural network acting on the bispectra of I/Q signal subsamples each consisting of 56 parts per million (ppm) of the original signal duration. High levels of accuracy are obtained with minimal computation time: in this application, each addition of eight samples decreases error by one order of magnitude.  ( 2 min )
    Multi-task Highly Adaptive Lasso. (arXiv:2301.12029v1 [stat.ML])
    We propose a novel, fully nonparametric approach for the multi-task learning, the Multi-task Highly Adaptive Lasso (MT-HAL). MT-HAL simultaneously learns features, samples and task associations important for the common model, while imposing a shared sparse structure among similar tasks. Given multiple tasks, our approach automatically finds a sparse sharing structure. The proposed MTL algorithm attains a powerful dimension-free convergence rate of $o_p(n^{-1/4})$ or better. We show that MT-HAL outperforms sparsity-based MTL competitors across a wide range of simulation studies, including settings with nonlinear and linear relationships, varying levels of sparsity and task correlations, and different numbers of covariates and sample size.  ( 2 min )
    Skin Lesion Analysis: A Survey, Systematic Review, and Future Trends. (arXiv:2208.12232v2 [eess.IV] UPDATED)
    The Computer-aided Diagnosis or Detection (CAD) approach for skin lesion analysis is an emerging field of research that has the potential to alleviate the burden and cost of skin cancer screening. Researchers have recently indicated increasing interest in developing such CAD systems, with the intention of providing a user-friendly tool to dermatologists to reduce the challenges encountered or associated with manual inspection. This article aims to provide a comprehensive literature survey and review of a total of 594 publications (356 for skin lesion segmentation and 238 for skin lesion classification) published between 2011 and 2022. These articles are analyzed and summarized in a number of different ways to contribute vital information regarding the methods for the development of CAD systems. These ways include relevant and essential definitions and theories, input data (dataset utilization, preprocessing, augmentations, and fixing imbalance problems), method configuration (techniques, architectures, module frameworks, and losses), training tactics (hyperparameter settings), and evaluation criteria. We intend to investigate a variety of performance-enhancing approaches, including ensemble and post-processing. We also discuss these dimensions to reveal their current trends based on utilization frequencies. In addition, we highlight the primary difficulties associated with evaluating skin lesion segmentation and classification systems using minimal datasets, as well as the potential solutions to these difficulties. Findings, recommendations, and trends are disclosed to inform future research on developing an automated and robust CAD system for skin lesion analysis.  ( 2 min )
    Neighborhood Gradient Clustering: An Efficient Decentralized Learning Method for Non-IID Data Distributions. (arXiv:2209.14390v4 [cs.LG] UPDATED)
    Decentralized learning over distributed datasets can have significantly different data distributions across the agents. The current state-of-the-art decentralized algorithms mostly assume the data distributions to be Independent and Identically Distributed. This paper focuses on improving decentralized learning over non-IID data. We propose \textit{Neighborhood Gradient Clustering (NGC)}, a novel decentralized learning algorithm that modifies the local gradients of each agent using self- and cross-gradient information. Cross-gradients for a pair of neighboring agents are the derivatives of the model parameters of an agent with respect to the dataset of the other agent. In particular, the proposed method replaces the local gradients of the model with the weighted mean of the self-gradients, model-variant cross-gradients (derivatives of the neighbors' parameters with respect to the local dataset), and data-variant cross-gradients (derivatives of the local model with respect to its neighbors' datasets). The data-variant cross-gradients are aggregated through an additional communication round without breaking the privacy constraints. Further, we present \textit{CompNGC}, a compressed version of \textit{NGC} that reduces the communication overhead by $32 \times$. We theoretically analyze the convergence rate of the proposed algorithm and demonstrate its efficiency over non-IID data sampled from {various vision and language} datasets trained. Our experiments demonstrate that \textit{NGC} and \textit{CompNGC} outperform (by $0-6\%$) the existing SoTA decentralized learning algorithm over non-IID data with significantly less compute and memory requirements. Further, our experiments show that the model-variant cross-gradient information available locally at each agent can improve the performance over non-IID data by $1-35\%$ without additional communication cost.  ( 3 min )
    Principled Acceleration of Iterative Numerical Methods Using Machine Learning. (arXiv:2206.08594v2 [math.NA] UPDATED)
    Iterative methods are ubiquitous in large-scale scientific computing applications, and a number of approaches based on meta-learning have been recently proposed to accelerate them. However, a systematic study of these approaches and how they differ from meta-learning is lacking. In this paper, we propose a framework to analyze such learning-based acceleration approaches, where one can immediately identify a departure from classical meta-learning. We show that this departure may lead to arbitrary deterioration of model performance. Based on our analysis, we introduce a novel training method for learning-based acceleration of iterative methods. Furthermore, we theoretically prove that the proposed method improves upon the existing methods, and demonstrate its significant advantage and versatility through various numerical applications.  ( 2 min )
    Quantum Ridgelet Transform: Winning Lottery Ticket of Neural Networks with Quantum Computation. (arXiv:2301.11936v1 [quant-ph])
    Ridgelet transform has been a fundamental mathematical tool in the theoretical studies of neural networks. However, the practical applicability of ridgelet transform to conducting learning tasks was limited since its numerical implementation by conventional classical computation requires an exponential runtime $\exp(O(D))$ as data dimension $D$ increases. To address this problem, we develop a quantum ridgelet transform (QRT), which implements the ridgelet transform of a quantum state within a linear runtime $O(D)$ of quantum computation. As an application, we also show that one can use QRT as a fundamental subroutine for quantum machine learning (QML) to efficiently find a sparse trainable subnetwork of large shallow wide neural networks without conducting large-scale optimization of the original network. This application discovers an efficient way in this regime to demonstrate the lottery ticket hypothesis on finding such a sparse trainable neural network. These results open an avenue of QML for accelerating learning tasks with commonly used classical neural networks.  ( 2 min )
    Supervision Complexity and its Role in Knowledge Distillation. (arXiv:2301.12245v1 [cs.LG])
    Despite the popularity and efficacy of knowledge distillation, there is limited understanding of why it helps. In order to study the generalization behavior of a distilled student, we propose a new theoretical framework that leverages supervision complexity: a measure of alignment between teacher-provided supervision and the student's neural tangent kernel. The framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher predictions, and the complexity of the teacher predictions. Specifically, it provides a rigorous justification for the utility of various techniques that are prevalent in the context of distillation, such as early stopping and temperature scaling. Our analysis further suggests the use of online distillation, where a student receives increasingly more complex supervision from teachers in different stages of their training. We demonstrate efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.  ( 2 min )
    Microstructural parameter estimation using spherical convolutional neural networks. (arXiv:2211.09887v2 [eess.IV] UPDATED)
    Diffusion-weighted magnetic resonance imaging is sensitive to the microstructural properties of brain tissue. However, estimating clinically and scientifically relevant microstructural properties from the measured signals remains a highly challenging inverse problem that deep learning may help solve. This study investigated if recently developed orientationally invariant spherical convolutional neural networks can improve microstructural parameter estimation. A spherical convolutional neural network was trained to predict the ground-truth parameter values from simulated noisy data and applied to imaging data acquired in a clinical setting to generate microstructural parameter maps. The spherical convolutional neural network was more accurate and less orientationally variant than the benchmark methods (multi-layer perceptrons and the spherical mean technique). Our results show that spherical convolutional neural networks can be a compelling alternative to predicting parameters from powder-averaged data (i.e., data averaged over the acquired diffusion encoding directions). While we focused on constrained two- and three-compartment models of neuronal tissue, the presented network and training pipeline are generalizable and can be used to estimate the parameters of other Gaussian compartment models.  ( 2 min )
    Flip Initial Features: Generalization of Neural Networks Under Sparse Features for Semi-supervised Node Classification. (arXiv:2211.15081v4 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have been widely used under semi-supervised settings. Prior studies have mainly focused on finding appropriate graph filters (e.g., aggregation schemes) to generalize well for both homophilic and heterophilic graphs. Even though these approaches are essential and effective, they still suffer from the sparsity in initial node features inherent in the bag-of-words representation. Common in semi-supervised learning where the training samples often fail to cover the entire dimensions of graph filters (hyperplanes), this can precipitate over-fitting of specific dimensions in the first projection matrix. To deal with this problem, we suggest a simple and novel strategy; create additional space by flipping the initial features and hyperplane simultaneously. Training in both the original and in the flip space can provide precise updates of learnable parameters. To the best of our knowledge, this is the first attempt that effectively moderates the overfitting problem in GNN. Extensive experiments on real-world datasets demonstrate that the proposed technique improves the node classification accuracy up to 40.2 %  ( 2 min )
    Neural Error Mitigation of Near-Term Quantum Simulations. (arXiv:2105.08086v2 [quant-ph] UPDATED)
    Near-term quantum computers provide a promising platform for finding ground states of quantum systems, which is an essential task in physics, chemistry, and materials science. Near-term approaches, however, are constrained by the effects of noise as well as the limited resources of near-term quantum hardware. We introduce "neural error mitigation," which uses neural networks to improve estimates of ground states and ground-state observables obtained using near-term quantum simulations. To demonstrate our method's broad applicability, we employ neural error mitigation to find the ground states of the H$_2$ and LiH molecular Hamiltonians, as well as the lattice Schwinger model, prepared via the variational quantum eigensolver (VQE). Our results show that neural error mitigation improves numerical and experimental VQE computations to yield low energy errors, high fidelities, and accurate estimations of more-complex observables like order parameters and entanglement entropy, without requiring additional quantum resources. Furthermore, neural error mitigation is agnostic with respect to the quantum state preparation algorithm used, the quantum hardware it is implemented on, and the particular noise channel affecting the experiment, contributing to its versatility as a tool for quantum simulation.  ( 2 min )
    Reversible Gromov-Monge Sampler for Simulation-Based Inference. (arXiv:2109.14090v4 [stat.ME] UPDATED)
    This paper introduces a new simulation-based inference procedure to model and sample from multi-dimensional probability distributions given access to i.i.d.\ samples, circumventing the usual approaches of explicitly modeling the density function or designing Markov chain Monte Carlo. Motivated by the seminal work on distance and isomorphism between metric measure spaces, we propose a new notion called the Reversible Gromov-Monge (RGM) distance and study how RGM can be used to design new transform samplers to perform simulation-based inference. Our RGM sampler can also estimate optimal alignments between two heterogeneous metric measure spaces $(\cX, \mu, c_{\cX})$ and $(\cY, \nu, c_{\cY})$ from empirical data sets, with estimated maps that approximately push forward one measure $\mu$ to the other $\nu$, and vice versa. We study the analytic properties of the RGM distance and derive that under mild conditions, RGM equals the classic Gromov-Wasserstein distance. Curiously, drawing a connection to Brenier's polar factorization, we show that the RGM sampler induces bias towards strong isomorphism with proper choices of $c_{\cX}$ and $c_{\cY}$. Statistical rate of convergence, representation, and optimization questions regarding the induced sampler are studied. Synthetic and real-world examples showcasing the effectiveness of the RGM sampler are also demonstrated.  ( 2 min )
    Cross-Subject Deep Transfer Models for Evoked Potentials in Brain-Computer Interface. (arXiv:2301.12322v1 [cs.LG])
    Brain Computer Interface (BCI) technologies have the potential to improve the lives of millions of people around the world, whether through assistive technologies or clinical diagnostic tools. Despite advancements in the field, however, at present consumer and clinical viability remains low. A key reason for this is that many of the existing BCI deployments require substantial data collection per end-user, which can be cumbersome, tedious, and error-prone to collect. We address this challenge via a deep learning model, which, when trained across sufficient data from multiple subjects, offers reasonable performance out-of-the-box, and can be customized to novel subjects via a transfer learning process. We demonstrate the fundamental viability of our approach by repurposing an older but well-curated electroencephalography (EEG) dataset and benchmarking against several common approaches/techniques. We then partition this dataset into a transfer learning benchmark and demonstrate that our approach significantly reduces data collection burden per-subject. This suggests that our model and methodology may yield improvements to BCI technologies and enhance their consumer/clinical viability.  ( 2 min )
    DALI: Dynamically Adjusted Label Importance for Noisy Partial Label Learning. (arXiv:2301.12077v1 [cs.CV])
    Noisy partial label learning (noisy PLL) is an important branch of weakly supervised learning. Unlike PLL where the ground-truth label must reside in the candidate set, noisy PLL relaxes this constraint and allows the ground-truth label may not be in the candidate set. To address this problem, existing works attempt to detect noisy samples and estimate the ground-truth label for each noisy sample. However, detection errors are inevitable, and these errors will accumulate during training and continuously affect model optimization. To address this challenge, we propose a novel framework for noisy PLL, called ``Dynamically Adjusted Label Importance (DALI)''. It aims to reduce the negative impact of detection errors by trading off the initial candidate set and model outputs with theoretical guarantees. Experimental results on multiple datasets demonstrate that our DALI succeeds over existing state-of-the-art approaches on noisy PLL. Our code will soon be publicly available.  ( 2 min )
    Gradient Shaping: Enhancing Backdoor Attack Against Reverse Engineering. (arXiv:2301.12318v1 [cs.CR])
    Most existing methods to detect backdoored machine learning (ML) models take one of the two approaches: trigger inversion (aka. reverse engineer) and weight analysis (aka. model diagnosis). In particular, the gradient-based trigger inversion is considered to be among the most effective backdoor detection techniques, as evidenced by the TrojAI competition, Trojan Detection Challenge and backdoorBench. However, little has been done to understand why this technique works so well and, more importantly, whether it raises the bar to the backdoor attack. In this paper, we report the first attempt to answer this question by analyzing the change rate of the backdoored model around its trigger-carrying inputs. Our study shows that existing attacks tend to inject the backdoor characterized by a low change rate around trigger-carrying inputs, which are easy to capture by gradient-based trigger inversion. In the meantime, we found that the low change rate is not necessary for a backdoor attack to succeed: we design a new attack enhancement called \textit{Gradient Shaping} (GRASP), which follows the opposite direction of adversarial training to reduce the change rate of a backdoored model with regard to the trigger, without undermining its backdoor effect. Also, we provide a theoretic analysis to explain the effectiveness of this new technique and the fundamental weakness of gradient-based trigger inversion. Finally, we perform both theoretical and experimental analysis, showing that the GRASP enhancement does not reduce the effectiveness of the stealthy attacks against the backdoor detection methods based on weight analysis, as well as other backdoor mitigation methods without using detection.  ( 2 min )
    Deciphering the Projection Head: Representation Evaluation Self-supervised Learning. (arXiv:2301.12189v1 [cs.LG])
    Self-supervised learning (SSL) aims to learn intrinsic features without labels. Despite the diverse architectures of SSL methods, the projection head always plays an important role in improving the performance of the downstream task. In this work, we systematically investigate the role of the projection head in SSL. Specifically, the projection head targets the uniformity part of SSL, which pushes the dissimilar samples away from each other, thus enabling the encoder to focus on extracting semantic features. Based on this understanding, we propose a Representation Evaluation Design (RED) in SSL models in which a shortcut connection between the representation and the projection vectors is built. Extensive experiments with different architectures, including SimCLR, MoCo-V2, and SimSiam, on various datasets, demonstrate that the representation evaluation design can consistently improve the baseline models in the downstream tasks. The learned representation from the RED-SSL models shows superior robustness to unseen augmentations and out-of-distribution data.  ( 2 min )
    In-Distribution Barrier Functions: Self-Supervised Policy Filters that Avoid Out-of-Distribution States. (arXiv:2301.12012v1 [cs.RO])
    Learning-based control approaches have shown great promise in performing complex tasks directly from high-dimensional perception data for real robotic systems. Nonetheless, the learned controllers can behave unexpectedly if the trajectories of the system divert from the training data distribution, which can compromise safety. In this work, we propose a control filter that wraps any reference policy and effectively encourages the system to stay in-distribution with respect to offline-collected safe demonstrations. Our methodology is inspired by Control Barrier Functions (CBFs), which are model-based tools from the nonlinear control literature that can be used to construct minimally invasive safe policy filters. While existing methods based on CBFs require a known low-dimensional state representation, our proposed approach is directly applicable to systems that rely solely on high-dimensional visual observations by learning in a latent state-space. We demonstrate that our method is effective for two different visuomotor control tasks in simulation environments, including both top-down and egocentric view settings.  ( 2 min )
    Learning Optimal Features via Partial Invariance. (arXiv:2301.12067v1 [cs.LG])
    Learning models that are robust to test-time distribution shifts is a key concern in domain generalization, and in the wider context of their real-life applicability. Invariant Risk Minimization (IRM) is one particular framework that aims to learn deep invariant features from multiple domains and has subsequently led to further variants. A key assumption for the success of these methods requires that the underlying causal mechanisms/features remain invariant across domains and the true invariant features be sufficient to learn the optimal predictor. In practical problem settings, these assumptions are often not satisfied, which leads to IRM learning a sub-optimal predictor for that task. In this work, we propose the notion of partial invariance as a relaxation of the IRM framework. Under our problem setting, we first highlight the sub-optimality of the IRM solution. We then demonstrate how partitioning the training domains, assuming access to some meta-information about the domains, can help improve the performance of invariant models via partial invariance. Finally, we conduct several experiments, both in linear settings as well as with classification tasks in language and images with deep models, which verify our conclusions.  ( 2 min )
    Beyond Exponentially Fast Mixing in Average-Reward Reinforcement Learning via Multi-Level Monte Carlo Actor-Critic. (arXiv:2301.12083v1 [cs.LG])
    Many existing reinforcement learning (RL) methods employ stochastic gradient iteration on the back end, whose stability hinges upon a hypothesis that the data-generating process mixes exponentially fast with a rate parameter that appears in the step-size selection. Unfortunately, this assumption is violated for large state spaces or settings with sparse rewards, and the mixing time is unknown, making the step size inoperable. In this work, we propose an RL methodology attuned to the mixing time by employing a multi-level Monte Carlo estimator for the critic, the actor, and the average reward embedded within an actor-critic (AC) algorithm. This method, which we call \textbf{M}ulti-level \textbf{A}ctor-\textbf{C}ritic (MAC), is developed especially for infinite-horizon average-reward settings and neither relies on oracle knowledge of the mixing time in its parameter selection nor assumes its exponential decay; it, therefore, is readily applicable to applications with slower mixing times. Nonetheless, it achieves a convergence rate comparable to the state-of-the-art AC algorithms. We experimentally show that these alleviated restrictions on the technical conditions required for stability translate to superior performance in practice for RL problems with sparse rewards.  ( 2 min )
    Using uncertainty-aware machine learning models to study aerosol-cloud interactions. (arXiv:2301.11921v1 [physics.data-an])
    Aerosol-cloud interactions (ACI) include various effects that result from aerosols entering a cloud, and affecting cloud properties. In general, an increase in aerosol concentration results in smaller droplet sizes which leads to larger, brighter, longer-lasting clouds that reflect more sunlight and cool the Earth. The strength of the effect is however heterogeneous, meaning it depends on the surrounding environment, making ACI one of the most uncertain effects in our current climate models. In our work, we use causal machine learning to estimate ACI from satellite observations by reframing the problem as a treatment (aerosol) and outcome (change in droplet radius). We predict the causal effect of aerosol on clouds with uncertainty bounds depending on the unknown factors that may be influencing the impact of aerosol. Of the three climate models evaluated, we find that only one plausibly recreates the trend, lending more credence to its estimate cooling due to ACI.  ( 2 min )
    Statistical whitening of neural populations with gain-modulating interneurons. (arXiv:2301.11955v1 [q-bio.NC])
    Statistical whitening transformations play a fundamental role in many computational systems, and may also play an important role in biological sensory systems. Individual neurons appear to rapidly and reversibly alter their input-output gains, approximately normalizing the variance of their responses. Populations of neurons appear to regulate their joint responses, reducing correlations between neural activities. It is natural to see whitening as the objective that guides these behaviors, but the mechanism for such joint changes is unknown, and direct adjustment of synaptic interactions would seem to be both too slow, and insufficiently reversible. Motivated by the extensive neuroscience literature on rapid gain modulation, we propose a recurrent network architecture in which joint whitening is achieved through modulation of gains within the circuit. Specifically, we derive an online statistical whitening algorithm that regulates the joint second-order statistics of a multi-dimensional input by adjusting the marginal variances of an overcomplete set of interneuron projections. The gains of these interneurons are adjusted individually, using only local signals, and feed back onto the primary neurons. The network converges to a state in which the responses of the primary neurons are whitened. We demonstrate through simulations that the behavior of the network is robust to poor conditioning or noise when the gains are sign-constrained, and can be generalized to achieve a form of local whitening in convolutional populations, such as those found throughout the visual or auditory system.  ( 2 min )
    Towards Learning Rubik's Cube with N-tuple-based Reinforcement Learning. (arXiv:2301.12167v1 [cs.LG])
    This work describes in detail how to learn and solve the Rubik's cube game (or puzzle) in the General Board Game (GBG) learning and playing framework. We cover the cube sizes 2x2x2 and 3x3x3. We describe in detail the cube's state representation, how to transform it with twists, whole-cube rotations and color transformations and explain the use of symmetries in Rubik's cube. Next, we discuss different n-tuple representations for the cube, how we train the agents by reinforcement learning and how we improve the trained agents during evaluation by MCTS wrapping. We present results for agents that learn Rubik's cube from scratch, with and without MCTS wrapping, with and without symmetries and show that both, MCTS wrapping and symmetries, increase computational costs, but lead at the same time to much better results. We can solve the 2x2x2 cube completely, and the 3x3x3 cube in the majority of the cases for scrambled cubes up to p = 15 (QTM). We cannot yet reliably solve 3x3x3 cubes with more than 15 scrambling twists. Although our computational costs are higher with MCTS wrapping and with symmetries than without, they are still considerably lower than in the approaches of McAleer et al. (2018, 2019) and Agostinelli et al. (2019) who provide the best Rubik's cube learning agents so far.  ( 2 min )
  • Open

    Theoretical Perspectives on Deep Learning Methods in Inverse Problems. (arXiv:2206.14373v2 [stat.ML] UPDATED)
    In recent years, there have been significant advances in the use of deep learning methods in inverse problems such as denoising, compressive sensing, inpainting, and super-resolution. While this line of works has predominantly been driven by practical algorithms and experiments, it has also given rise to a variety of intriguing theoretical problems. In this paper, we survey some of the prominent theoretical developments in this line of works, focusing in particular on generative priors, untrained neural network priors, and unfolding algorithms. In addition to summarizing existing results in these topics, we highlight several ongoing challenges and open problems.  ( 2 min )
    Asymptotic Inference for Multi-Stage Stationary Treatment Policy with High Dimensional Features. (arXiv:2301.12553v1 [stat.ML])
    Dynamic treatment rules or policies are a sequence of decision functions over multiple stages that are tailored to individual features. One important class of treatment policies for practice, namely multi-stage stationary treatment policies, prescribe treatment assignment probabilities using the same decision function over stages, where the decision is based on the same set of features consisting of both baseline variables (e.g., demographics) and time-evolving variables (e.g., routinely collected disease biomarkers). Although there has been extensive literature to construct valid inference for the value function associated with the dynamic treatment policies, little work has been done for the policies themselves, especially in the presence of high dimensional feature variables. We aim to fill in the gap in this work. Specifically, we first estimate the multistage stationary treatment policy based on an augmented inverse probability weighted estimator for the value function to increase the asymptotic efficiency, and further apply a penalty to select important feature variables. We then construct one-step improvement of the policy parameter estimators. Theoretically, we show that the improved estimators are asymptotically normal, even if nuisance parameters are estimated at a slow convergence rate and the dimension of the feature variables increases exponentially with the sample size. Our numerical studies demonstrate that the proposed method has satisfactory performance in small samples, and that the performance can be improved with a choice of the augmentation term that approximates the rewards or minimizes the variance of the value function.  ( 2 min )
    Solving high-dimensional Hamilton-Jacobi-Bellman PDEs using neural networks: perspectives from the theory of controlled diffusions and measures on path space. (arXiv:2005.05409v2 [math.OC] UPDATED)
    Optimal control of diffusion processes is intimately connected to the problem of solving certain Hamilton-Jacobi-Bellman equations. Building on recent machine learning inspired approaches towards high-dimensional PDEs, we investigate the potential of $\textit{iterative diffusion optimisation}$ techniques, in particular considering applications in importance sampling and rare event simulation, and focusing on problems without diffusion control, with linearly controlled drift and running costs that depend quadratically on the control. More generally, our methods apply to nonlinear parabolic PDEs with a certain shift invariance. The choice of an appropriate loss function being a central element in the algorithmic design, we develop a principled framework based on divergences between path measures, encompassing various existing methods. Motivated by connections to forward-backward SDEs, we propose and study the novel $\textit{log-variance}$ divergence, showing favourable properties of corresponding Monte Carlo estimators. The promise of the developed approach is exemplified by a range of high-dimensional and metastable numerical examples.  ( 2 min )
    Distributed Stochastic Optimization under a General Variance Condition. (arXiv:2301.12677v1 [math.OC])
    Distributed stochastic optimization has drawn great attention recently due to its effectiveness in solving large-scale machine learning problems. However, despite that numerous algorithms have been proposed with empirical successes, their theoretical guarantees are restrictive and rely on certain boundedness conditions on the stochastic gradients, varying from uniform boundedness to the relaxed growth condition. In addition, how to characterize the data heterogeneity among the agents and its impacts on the algorithmic performance remains challenging. In light of such motivations, we revisit the classical FedAvg algorithm for solving the distributed stochastic optimization problem and establish the convergence results under only a mild variance condition on the stochastic gradients for smooth nonconvex objective functions. Almost sure convergence to a stationary point is also established under the condition. Moreover, we discuss a more informative measurement for data heterogeneity as well as its implications.  ( 2 min )
    Robust Stochastic Linear Contextual Bandits Under Adversarial Attacks. (arXiv:2106.02978v3 [stat.ML] UPDATED)
    Stochastic linear contextual bandit algorithms have substantial applications in practice, such as recommender systems, online advertising, clinical trials, etc. Recent works show that optimal bandit algorithms are vulnerable to adversarial attacks and can fail completely in the presence of attacks. Existing robust bandit algorithms only work for the non-contextual setting under the attack of rewards and cannot improve the robustness in the general and popular contextual bandit environment. In addition, none of the existing methods can defend against attacked context. In this work, we provide the first robust bandit algorithm for stochastic linear contextual bandit setting under a fully adaptive and omniscient attack with sub-linear regret. Our algorithm not only works under the attack of rewards, but also under attacked context. Moreover, it does not need any information about the attack budget or the particular form of the attack. We provide theoretical guarantees for our proposed algorithm and show by experiments that our proposed algorithm improves the robustness against various kinds of popular attacks.  ( 2 min )
    Singularity-aware Reinforcement Learning. (arXiv:2301.13152v1 [stat.ML])
    Batch reinforcement learning (RL) aims at finding an optimal policy in a dynamic environment in order to maximize the expected total rewards by leveraging pre-collected data. A fundamental challenge behind this task is the distributional mismatch between the batch data generating process and the distribution induced by target policies. Nearly all existing algorithms rely on the absolutely continuous assumption on the distribution induced by target policies with respect to the data distribution so that the batch data can be used to calibrate target policies via the change of measure. However, the absolute continuity assumption could be violated in practice, especially when the state-action space is large or continuous. In this paper, we propose a new batch RL algorithm without requiring absolute continuity in the setting of an infinite-horizon Markov decision process with continuous states and actions. We call our algorithm STEEL: SingulariTy-awarE rEinforcement Learning. Our algorithm is motivated by a new error analysis on off-policy evaluation, where we use maximum mean discrepancy, together with distributionally robust optimization, to characterize the error of off-policy evaluation caused by the possible singularity and to enable the power of model extrapolation. By leveraging the idea of pessimism and under some mild conditions, we derive a finite-sample regret guarantee for our proposed algorithm without imposing absolute continuity. Compared with existing algorithms, STEEL only requires some minimal data-coverage assumption and thus greatly enhances the applicability and robustness of batch RL. Extensive simulation studies and one real experiment on personalized pricing demonstrate the superior performance of our method when facing possible singularity in batch RL.  ( 2 min )
    Risk-Averse Model Uncertainty for Distributionally Robust Safe Reinforcement Learning. (arXiv:2301.12593v1 [cs.LG])
    Many real-world domains require safe decision making in the presence of uncertainty. In this work, we propose a deep reinforcement learning framework for approaching this important problem. We consider a risk-averse perspective towards model uncertainty through the use of coherent distortion risk measures, and we show that our formulation is equivalent to a distributionally robust safe reinforcement learning problem with robustness guarantees on performance and safety. We propose an efficient implementation that only requires access to a single training environment, and we demonstrate that our framework produces robust, safe performance on a variety of continuous control tasks with safety constraints in the Real-World Reinforcement Learning Suite.  ( 2 min )
    MetaStackVis: Visually-Assisted Performance Evaluation of Metamodels. (arXiv:2212.03539v2 [cs.LG] UPDATED)
    Stacking (or stacked generalization) is an ensemble learning method with one main distinctiveness from the rest: even though several base models are trained on the original data set, their predictions are further used as input data for one or more metamodels arranged in at least one extra layer. Composing a stack of models can produce high-performance outcomes, but it usually involves a trial-and-error process. Therefore, our previously developed visual analytics system, StackGenVis, was mainly designed to assist users in choosing a set of top-performing and diverse models by measuring their predictive performance. However, it only employs a single logistic regression metamodel. In this paper, we investigate the impact of alternative metamodels on the performance of stacking ensembles using a novel visualization tool, called MetaStackVis. Our interactive tool helps users to visually explore different singular and pairs of metamodels according to their predictive probabilities and multiple validation metrics, as well as their ability to predict specific problematic data instances. MetaStackVis was evaluated with a usage scenario based on a medical data set and via expert interviews.  ( 2 min )
    Deep Riemannian Networks for EEG Decoding. (arXiv:2212.10426v3 [cs.LG] UPDATED)
    State-of-the-art performance in electroencephalography (EEG) decoding tasks is currently often achieved with either Deep-Learning or Riemannian-Geometry-based decoders. Recently, there is growing interest in Deep Riemannian Networks (DRNs) possibly combining the advantages of both previous classes of methods. However, there are still a range of topics where additional insight is needed to pave the way for a more widespread application of DRNs in EEG. These include architecture design questions such as network size and end-to-end ability as well as model training questions. How these factors affect model performance has not been explored. Additionally, it is not clear how the data within these networks is transformed, and whether this would correlate with traditional EEG decoding. Our study aims to lay the groundwork in the area of these topics through the analysis of DRNs for EEG with a wide range of hyperparameters. Networks were tested on two public EEG datasets and compared with state-of-the-art ConvNets. Here we propose end-to-end EEG SPDNet (EE(G)-SPDNet), and we show that this wide, end-to-end DRN can outperform the ConvNets, and in doing so use physiologically plausible frequency regions. We also show that the end-to-end approach learns more complex filters than traditional band-pass filters targeting the classical alpha, beta, and gamma frequency bands of the EEG, and that performance can benefit from channel specific filtering approaches. Additionally, architectural analysis revealed areas for further improvement due to the possible loss of Riemannian specific information throughout the network. Our study thus shows how to design and train DRNs to infer task-related information from the raw EEG without the need of handcrafted filterbanks and highlights the potential of end-to-end DRNs such as EE(G)-SPDNet for high-performance EEG decoding.  ( 3 min )
    Thompson Sampling for High-Dimensional Sparse Linear Contextual Bandits. (arXiv:2211.05964v2 [stat.ML] UPDATED)
    We consider the stochastic linear contextual bandit problem with high-dimensional features. We analyze the Thompson sampling algorithm using special classes of sparsity-inducing priors (e.g., spike-and-slab) to model the unknown parameter and provide a nearly optimal upper bound on the expected cumulative regret. To the best of our knowledge, this is the first work that provides theoretical guarantees of Thompson sampling in high-dimensional and sparse contextual bandits. For faster computation, we use variational inference instead of Markov Chain Monte Carlo (MCMC) to approximate the posterior distribution. Extensive simulations demonstrate the improved performance of our proposed algorithm over existing ones.  ( 2 min )
    Inequality Constrained Stochastic Nonlinear Optimization via Active-Set Sequential Quadratic Programming. (arXiv:2109.11502v3 [math.OC] UPDATED)
    We study nonlinear optimization problems with a stochastic objective and deterministic equality and inequality constraints, which emerge in numerous applications including finance, manufacturing, power systems and, recently, deep neural networks. We propose an active-set stochastic sequential quadratic programming (StoSQP) algorithm that utilizes a differentiable exact augmented Lagrangian as the merit function. The algorithm adaptively selects the penalty parameters of the augmented Lagrangian and performs a stochastic line search to decide the stepsize. The global convergence is established: for any initialization, the KKT residuals converge to zero almost surely. Our algorithm and analysis further develop the prior work of Na et al., (2022). Specifically, we allow nonlinear inequality constraints without requiring the strict complementary condition; refine some of the designs in Na et al., (2022) such as the feasibility error condition and the monotonically increasing sample size; strengthen the global convergence guarantee; and improve the sample complexity on the objective Hessian. We demonstrate the performance of the designed algorithm on a subset of nonlinear problems collected in CUTEst test set and on constrained logistic regression problems.  ( 2 min )
    Generalization on the Unseen, Logic Reasoning and Degree Curriculum. (arXiv:2301.13105v1 [cs.LG])
    This paper considers the learning of logical (Boolean) functions with focus on the generalization on the unseen (GOTU) setting, a strong case of out-of-distribution generalization. This is motivated by the fact that the rich combinatorial nature of data in certain reasoning tasks (e.g., arithmetic/logic) makes representative data sampling challenging, and learning successfully under GOTU gives a first vignette of an 'extrapolating' or 'reasoning' learner. We then study how different network architectures trained by (S)GD perform under GOTU and provide both theoretical and experimental evidence that for a class of network models including instances of Transformers, random features models, and diagonal linear networks, a min-degree-interpolator (MDI) is learned on the unseen. We also provide evidence that other instances with larger learning rates or mean-field networks reach leaky MDIs. These findings lead to two implications: (1) we provide an explanation to the length generalization problem (e.g., Anil et al. 2022); (2) we introduce a curriculum learning algorithm called Degree-Curriculum that learns monomials more efficiently by incrementing supports.  ( 2 min )
    Mirror Sinkhorn: Fast Online Optimization on Transport Polytopes. (arXiv:2211.10420v2 [cs.LG] UPDATED)
    Optimal transport is an important tool in machine learning, allowing to capture geometric properties of the data through a linear program on transport polytopes. We present a single-loop optimization algorithm for minimizing general convex objectives on these domains, utilizing the principles of Sinkhorn matrix scaling and mirror descent. The proposed algorithm is robust to noise, and can be used in an online setting. We provide theoretical guarantees for convex objectives and experimental results showcasing it effectiveness on both synthetic and real-world data.  ( 2 min )
    SeaD: End-to-end Text-to-SQL Generation with Schema-aware Denoising. (arXiv:2105.07911v2 [cs.CL] UPDATED)
    In text-to-SQL task, seq-to-seq models often lead to sub-optimal performance due to limitations in their architecture. In this paper, we present a simple yet effective approach that adapts transformer-based seq-to-seq model to robust text-to-SQL generation. Instead of inducing constraint to decoder or reformat the task as slot-filling, we propose to train seq-to-seq model with Schema aware Denoising (SeaD), which consists of two denoising objectives that train model to either recover input or predict output from two novel erosion and shuffle noises. These denoising objectives acts as the auxiliary tasks for better modeling the structural data in S2S generation. In addition, we improve and propose a clause-sensitive execution guided (EG) decoding strategy to overcome the limitation of EG decoding for generative model. The experiments show that the proposed method improves the performance of seq-to-seq model in both schema linking and grammar correctness and establishes new state-of-the-art on WikiSQL benchmark. The results indicate that the capacity of vanilla seq-to-seq architecture for text-to-SQL may have been under-estimated.  ( 2 min )
    Jump Interval-Learning for Individualized Decision Making. (arXiv:2111.08885v2 [stat.ME] UPDATED)
    An individualized decision rule (IDR) is a decision function that assigns each individual a given treatment based on his/her observed characteristics. Most of the existing works in the literature consider settings with binary or finitely many treatment options. In this paper, we focus on the continuous treatment setting and propose a jump interval-learning to develop an individualized interval-valued decision rule (I2DR) that maximizes the expected outcome. Unlike IDRs that recommend a single treatment, the proposed I2DR yields an interval of treatment options for each individual, making it more flexible to implement in practice. To derive an optimal I2DR, our jump interval-learning method estimates the conditional mean of the outcome given the treatment and the covariates via jump penalized regression, and derives the corresponding optimal I2DR based on the estimated outcome regression function. The regressor is allowed to be either linear for clear interpretation or deep neural network to model complex treatment-covariates interactions. To implement jump interval-learning, we develop a searching algorithm based on dynamic programming that efficiently computes the outcome regression function. Statistical properties of the resulting I2DR are established when the outcome regression function is either a piecewise or continuous function over the treatment space. We further develop a procedure to infer the mean outcome under the (estimated) optimal policy. Extensive simulations and a real data application to a warfarin study are conducted to demonstrate the empirical validity of the proposed I2DR.  ( 2 min )
    Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning. (arXiv:2110.15501v3 [stat.ML] UPDATED)
    Evaluating the performance of an ongoing policy plays a vital role in many areas such as medicine and economics, to provide crucial instruction on the early-stop of the online experiment and timely feedback from the environment. Policy evaluation in online learning thus attracts increasing attention by inferring the mean outcome of the optimal policy (i.e., the value) in real-time. Yet, such a problem is particularly challenging due to the dependent data generated in the online environment, the unknown optimal policy, and the complex exploration and exploitation trade-off in the adaptive experiment. In this paper, we aim to overcome these difficulties in policy evaluation for online learning. We explicitly derive the probability of exploration that quantifies the probability of exploring the non-optimal actions under commonly used bandit algorithms. We use this probability to conduct valid inference on the online conditional mean estimator under each action and develop the doubly robust interval estimation (DREAM) method to infer the value under the estimated optimal policy in online learning. The proposed value estimator provides double protection on the consistency and is asymptotically normal with a Wald-type confidence interval provided. Extensive simulations and real data applications are conducted to demonstrate the empirical validity of the proposed DREAM method.  ( 2 min )
    Spherical Sliced-Wasserstein. (arXiv:2206.08780v2 [stat.ML] UPDATED)
    Many variants of the Wasserstein distance have been introduced to reduce its original computational burden. In particular the Sliced-Wasserstein distance (SW), which leverages one-dimensional projections for which a closed-form solution of the Wasserstein distance is available, has received a lot of interest. Yet, it is restricted to data living in Euclidean spaces, while the Wasserstein distance has been studied and used recently on manifolds. We focus more specifically on the sphere, for which we define a novel SW discrepancy, which we call spherical Sliced-Wasserstein, making a first step towards defining SW discrepancies on manifolds. Our construction is notably based on closed-form solutions of the Wasserstein distance on the circle, together with a new spherical Radon transform. Along with efficient algorithms and the corresponding implementations, we illustrate its properties in several machine learning use cases where spherical representations of data are at stake: sampling on the sphere, density estimation on real earth data or hyperspherical auto-encoders.  ( 2 min )
    Simulation-Based Inference with Waldo: Confidence Regions by Leveraging Prediction Algorithms or Posterior Estimators for Inverse Problems. (arXiv:2205.15680v3 [stat.ML] UPDATED)
    Predictive algorithms, such as deep neural networks (DNNs), are used in many domain sciences to directly estimate internal parameters of interest in simulator-based models, especially in settings where the observations include images or other complex high-dimensional data. In parallel, modern neural density estimators, such as normalizing flows, are becoming increasingly popular for uncertainty quantification, especially when both parameters and observations are high-dimensional. However, parameter inference is an inverse problem and not a prediction task; thus, an open challenge is to construct conditionally valid and precise confidence regions, with a guaranteed probability of covering the true parameters of the data-generating process, no matter what the (unknown) parameter values are, and without relying on large-sample theory. Many simulator-based inference (SBI) methods are indeed known to produce biased or overly confident parameter regions, yielding misleading uncertainty estimates. This paper presents WALDO, a novel method for constructing confidence regions with finite-sample conditional validity by leveraging prediction algorithms or posterior estimators that are currently widely adopted in SBI. WALDO reframes the well-known Wald test statistic, and uses a computationally efficient regression-based machinery for classical Neyman inversion of hypothesis tests. We apply our method to a recent high-energy physics problem, where prediction with DNNs has previously led to estimates with prediction bias. We also illustrate how our approach can correct overly confident posterior regions computed with normalizing flows.  ( 2 min )
    A semi-agnostic ansatz with variable structure for quantum machine learning. (arXiv:2103.06712v3 [quant-ph] UPDATED)
    Quantum machine learning (QML) offers a powerful, flexible paradigm for programming near-term quantum computers, with applications in chemistry, metrology, materials science, data science, and mathematics. Here, one trains an ansatz, in the form of a parameterized quantum circuit, to accomplish a task of interest. However, challenges have recently emerged suggesting that deep ansatzes are difficult to train, due to flat training landscapes caused by randomness or by hardware noise. This motivates our work, where we present a variable structure approach to build ansatzes for QML. Our approach, called VAns (Variable Ansatz), applies a set of rules to both grow and (crucially) remove quantum gates in an informed manner during the optimization. Consequently, VAns is ideally suited to mitigate trainability and noise-related issues by keeping the ansatz shallow. We employ VAns in the variational quantum eigensolver for condensed matter and quantum chemistry applications, in the quantum autoencoder for data compression and in unitary compilation problems showing successful results in all cases.  ( 2 min )
    Online Self-Concordant and Relatively Smooth Minimization, With Applications to Online Portfolio Selection and Learning Quantum States. (arXiv:2210.00997v2 [stat.ML] UPDATED)
    Consider an online convex optimization problem where the loss functions are self-concordant barriers, smooth relative to a convex function $h$, and possibly non-Lipschitz. We analyze the regret of online mirror descent with $h$. Then, based on the result, we prove the following in a unified manner. Denote by $T$ the time horizon and $d$ the parameter dimension. 1. For online portfolio selection, the regret of $\widetilde{\text{EG}}$, a variant of exponentiated gradient due to Helmbold et al., is $\tilde{O} ( T^{2/3} d^{1/3} )$ when $T > 4 d / \log d$. This improves on the original $\tilde{O} ( T^{3/4} d^{1/2} )$ regret bound for $\widetilde{\text{EG}}$. 2. For online portfolio selection, the regret of online mirror descent with the logarithmic barrier is $\tilde{O}(\sqrt{T d})$. The regret bound is the same as that of Soft-Bayes due to Orseau et al. up to logarithmic terms. 3. For online learning quantum states with the logarithmic loss, the regret of online mirror descent with the log-determinant function is also $\tilde{O} ( \sqrt{T d} )$. Its per-iteration time is shorter than all existing algorithms we know.  ( 2 min )
    Integrating Earth Observation Data into Causal Inference: Challenges and Opportunities. (arXiv:2301.12985v1 [stat.ML])
    Observational studies require adjustment for confounding factors that are correlated with both the treatment and outcome. In the setting where the observed variables are tabular quantities such as average income in a neighborhood, tools have been developed for addressing such confounding. However, in many parts of the developing world, features about local communities may be scarce. In this context, satellite imagery can play an important role, serving as a proxy for the confounding variables otherwise unobserved. In this paper, we study confounder adjustment in this non-tabular setting, where patterns or objects found in satellite images contribute to the confounder bias. Using the evaluation of anti-poverty aid programs in Africa as our running example, we formalize the challenge of performing causal adjustment with such unstructured data -- what conditions are sufficient to identify causal effects, how to perform estimation, and how to quantify the ways in which certain aspects of the unstructured image object are most predictive of the treatment decision. Via simulation, we also explore the sensitivity of satellite image-based observational inference to image resolution and to misspecification of the image-associated confounder. Finally, we apply these tools in estimating the effect of anti-poverty interventions in African communities from satellite imagery.  ( 2 min )
    Lossy Image Compression with Conditional Diffusion Models. (arXiv:2209.06950v4 [eess.IV] UPDATED)
    This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models. The approach relies on the transform coding paradigm, where an image is mapped into a latent space for entropy coding and, from there, mapped back to the data space for reconstruction. In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model. Our approach thus introduces an additional ``content'' latent variable on which the reverse diffusion process is conditioned and uses this variable to store information about the image. The remaining ``texture'' latent variables characterizing the diffusion process are synthesized (stochastically or deterministically) at decoding time. We show that the model's performance can be tuned toward perceptual metrics of interest. Our extensive experiments involving five datasets and sixteen image quality assessment metrics show that our approach yields the strongest reported FID scores while also yielding competitive performance with state-of-the-art models in several SIM-based reference metrics.  ( 2 min )
    Data Heterogeneity Differential Privacy: From Theory to Algorithm. (arXiv:2002.08578v2 [cs.LG] UPDATED)
    Traditionally, the random noise is equally injected when training with different data instances in the field of differential privacy (DP). In this paper, we first give sharper excess risk bounds of DP stochastic gradient descent (SGD) method. Considering most of the previous methods are under convex conditions, we use Polyak-{\L}ojasiewicz condition to relax it in this paper. Then, after observing that different training data instances affect the machine learning model to different extent, we consider the heterogeneity of training data and attempt to improve the performance of DP-SGD from a new perspective. Specifically, by introducing the influence function (IF), we quantitatively measure the contributions of various training data on the final machine learning model. If the contribution made by a single data instance is so little that attackers cannot infer anything from the model, we do not add noise when training with it. Based on this observation, we design a `Performance Improving' DP-SGD algorithm: PIDP-SGD. Theoretical and experimental results show that our proposed PIDP-SGD improves the performance significantly.  ( 2 min )
    Efficient functional estimation and the super-oracle phenomenon. (arXiv:1904.09347v2 [math.ST] UPDATED)
    We consider the estimation of two-sample integral functionals, of the type that occur naturally, for example, when the object of interest is a divergence between unknown probability densities. Our first main result is that, in wide generality, a weighted nearest neighbour estimator is efficient, in the sense of achieving the local asymptotic minimax lower bound. Moreover, we also prove a corresponding central limit theorem, which facilitates the construction of asymptotically valid confidence intervals for the functional, having asymptotically minimal width. One interesting consequence of our results is the discovery that, for certain functionals, the worst-case performance of our estimator may improve on that of the natural `oracle' estimator, which is given access to the values of the unknown densities at the observations.  ( 2 min )
    Policy Targeting under Network Interference. (arXiv:1906.10258v13 [econ.EM] UPDATED)
    This paper studies the problem of optimally allocating treatments in the presence of spillover effects, using information from a (quasi-)experiment. I introduce a method that maximizes the sample analog of average social welfare when spillovers occur. I construct semi-parametric welfare estimators with known and unknown propensity scores and cast the optimization problem into a mixed-integer linear program, which can be solved using off-the-shelf algorithms. I derive a strong set of guarantees on regret, i.e., the difference between the maximum attainable welfare and the welfare evaluated at the estimated policy. The proposed method presents attractive features for applications: (i) it does not require network information of the target population; (ii) it exploits heterogeneity in treatment effects for targeting individuals; (iii) it does not rely on the correct specification of a particular structural model; and (iv) it accommodates constraints on the policy function. An application for targeting information on social networks illustrates the advantages of the method.  ( 2 min )
    Learning Mixtures of Markov Chains and MDPs. (arXiv:2211.09403v2 [stat.ML] UPDATED)
    We present an algorithm for learning mixtures of Markov chains and Markov decision processes (MDPs) from short unlabeled trajectories. Specifically, our method handles mixtures of Markov chains with optional control input by going through a multi-step process, involving (1) a subspace estimation step, (2) spectral clustering of trajectories using "pairwise distance estimators," along with refinement using the EM algorithm, (3) a model estimation step, and (4) a classification step for predicting labels of new trajectories. We provide end-to-end performance guarantees, where we only explicitly require the length of trajectories to be linear in the number of states and the number of trajectories to be linear in a mixing time parameter. Experimental results support these guarantees, where we attain 96.6% average accuracy on a mixture of two MDPs in gridworld, outperforming the EM algorithm with random initialization (73.2% average accuracy).  ( 2 min )
    Technical Reports Compilation: Detecting the Fire Drill Anti-pattern Using Source Code and Issue-Tracking Data. (arXiv:2104.15090v8 [cs.SE] UPDATED)
    Detecting the presence of project management anti-patterns (AP) currently requires experts on the matter and is an expensive endeavor. Worse, experts may introduce their individual subjectivity or bias. Using the Fire Drill AP, we first introduce a novel way to translate descriptions into detectable AP that are comprised of arbitrary metrics and events such as logged time or maintenance activities, which are mined from the underlying source code or issue-tracking data, thus making the description objective as it becomes data-based. Secondly, we demonstrate a novel method to quantify and score the deviations of real-world projects to data-based AP descriptions. Using fifteen real-world projects that exhibit a Fire Drill to some degree, we show how to further enhance the translated AP. The ground truth in these projects was extracted from two individual experts and consensus was found between them. We introduce a novel method called automatic calibration, that optimizes a pattern such that only necessary and important scores remain that suffice to confidently detect the degree to which the AP is present. Without automatic calibration, the proposed patterns show only weak potential for detecting the presence. Enriching the AP with data from real-world projects significantly improves the potential. We also introduce a no-pattern approach that exploits the ground truth for establishing a new, quantitative understanding of the phenomenon, as well as for finding gray-/black-box predictive models. We conclude that the presence detection and severity assessment of the Fire Drill anti-pattern, as well as some of its related and similar patterns, is certainly possible using some of the presented approaches.  ( 3 min )
    TOAST: Topological Algorithm for Singularity Tracking. (arXiv:2210.00069v2 [cs.LG] UPDATED)
    The manifold hypothesis, which assumes that data lies on or close to an unknown manifold of low intrinsic dimension, is a staple of modern machine learning research. However, recent work has shown that real-world data exhibits distinct non-manifold structures, i.e. singularities, that can lead to erroneous findings. Detecting such singularities is therefore crucial as a precursor to interpolation and inference tasks. We address this issue by developing a topological framework that (i) quantifies the local intrinsic dimension, and (ii) yields a Euclidicity score for assessing the 'manifoldness' of a point along multiple scales. Our approach identifies singularities of complex spaces, while also capturing singular structures and local geometric complexity in image data.  ( 2 min )
    Factor-augmented tree ensembles. (arXiv:2111.14000v4 [stat.ML] UPDATED)
    This manuscript proposes to extend the information set of time-series regression trees with latent stationary factors extracted via state-space methods. In doing so, this approach generalises time-series regression trees on two dimensions. First, it allows to handle predictors that exhibit measurement error, non-stationary trends, seasonality and/or irregularities such as missing observations. Second, it gives a transparent way for using domain-specific theory to inform time-series regression trees. As a byproduct, this technique sets the foundations for structuring powerful ensembles. Their real-world applicability is studied under the lenses of empirical macro-finance.  ( 2 min )
    Transfer learning for chemically accurate interatomic neural network potentials. (arXiv:2212.03916v2 [physics.comp-ph] UPDATED)
    Developing machine learning-based interatomic potentials from ab-initio electronic structure methods remains a challenging task for computational chemistry and materials science. This work studies the capability of transfer learning, in particular discriminative fine-tuning, for efficiently generating chemically accurate interatomic neural network potentials on organic molecules from the MD17 and ANI data sets. We show that pre-training the network parameters on data obtained from density functional calculations considerably improves the sample efficiency of models trained on more accurate ab-initio data. Additionally, we show that fine-tuning with energy labels alone can suffice to obtain accurate atomic forces and run large-scale atomistic simulations, provided a well-designed fine-tuning data set. We also investigate possible limitations of transfer learning, especially regarding the design and size of the pre-training and fine-tuning data sets. Finally, we provide GM-NN potentials pre-trained and fine-tuned on the ANI-1x and ANI-1ccx data sets, which can easily be fine-tuned on and applied to organic molecules.  ( 2 min )
    Robust empirical risk minimization via Newton's method. (arXiv:2301.13192v1 [stat.ML])
    We study a variant of Newton's method for empirical risk minimization, where at each iteration of the optimization algorithm, we replace the gradient and Hessian of the objective function by robust estimators taken from existing literature on robust mean estimation for multivariate data. After proving a general theorem about the convergence of successive iterates to a small ball around the population-level minimizer, we study consequences of our theory in generalized linear models, when data are generated from Huber's epsilon-contamination model and/or heavy-tailed distributions. We also propose an algorithm for obtaining robust Newton directions based on the conjugate gradient method, which may be more appropriate for high-dimensional settings, and provide conjectures about the convergence of the resulting algorithm. Compared to the robust gradient descent algorithm proposed by Prasad et al. (2020), our algorithm enjoys the faster rates of convergence for successive iterates often achieved by second-order algorithms for convex problems, i.e., quadratic convergence in a neighborhood of the optimum, with a stepsize that may be chosen adaptively via backtracking linesearch.  ( 2 min )
    A Novel Framework for Policy Mirror Descent with General Parametrization and Linear Convergence. (arXiv:2301.13139v1 [stat.ML])
    Modern policy optimization methods in applied reinforcement learning are often inspired by the trust region policy optimization algorithm, which can be interpreted as a particular instance of policy mirror descent. While theoretical guarantees have been established for this framework, particularly in the tabular setting, the use of a general parametrization scheme remains mostly unjustified. In this work, we introduce a novel framework for policy optimization based on mirror descent that naturally accommodates general parametrizations. The policy class induced by our scheme recovers known classes, e.g. tabular softmax, log-linear, and neural policies. It also generates new ones, depending on the choice of the mirror map. For a general mirror map and parametrization function, we establish the quasi-monotonicity of the updates in value function, global linear convergence rates, and we bound the total variation of the algorithm along its path. To showcase the ability of our framework to accommodate general parametrization schemes, we present a case study involving shallow neural networks.  ( 2 min )
    GFlowNets and variational inference. (arXiv:2210.00580v2 [cs.LG] UPDATED)
    This paper builds bridges between two families of probabilistic algorithms: (hierarchical) variational inference (VI), which is typically used to model distributions over continuous spaces, and generative flow networks (GFlowNets), which have been used for distributions over discrete structures such as graphs. We demonstrate that, in certain cases, VI algorithms are equivalent to special cases of GFlowNets in the sense of equality of expected gradients of their learning objectives. We then point out the differences between the two families and show how these differences emerge experimentally. Notably, GFlowNets, which borrow ideas from reinforcement learning, are more amenable than VI to off-policy training without the cost of high gradient variance induced by importance sampling. We argue that this property of GFlowNets can provide advantages for capturing diversity in multimodal target distributions.  ( 2 min )
    Selecting time-series hyperparameters with the artificial jackknife. (arXiv:2002.04697v5 [stat.ME] UPDATED)
    This article proposes a generalisation of the delete-$d$ jackknife to solve hyperparameter selection problems for time series. I call it artificial delete-$d$ jackknife to stress that this approach substitutes the classic removal step with a fictitious deletion, wherein observed datapoints are replaced with artificial missing values. This procedure keeps the data order intact and allows plain compatibility with time series. This manuscript justifies the use of this approach asymptotically and shows its finite-sample advantages through simulation studies. Besides, this article describes its real-world advantages by regulating high-dimensional forecasting models for foreign exchange rates.  ( 2 min )
    Scalable Spatiotemporally Varying Coefficient Modelling with Bayesian Kernelized Tensor Regression. (arXiv:2109.00046v3 [stat.ML] UPDATED)
    As a regression technique in spatial statistics, the spatiotemporally varying coefficient model (STVC) is an important tool for discovering nonstationary and interpretable response-covariate associations over both space and time. However, it is difficult to apply STVC for large-scale spatiotemporal analyses due to its high computational cost. To address this challenge, we summarize the spatiotemporally varying coefficients using a third-order tensor structure and propose to reformulate the spatiotemporally varying coefficient model as a special low-rank tensor regression problem. The low-rank decomposition can effectively model the global patterns of the large data sets with a substantially reduced number of parameters. To further incorporate the local spatiotemporal dependencies, we use Gaussian process (GP) priors on the spatial and temporal factor matrices. We refer to the overall framework as Bayesian Kernelized Tensor Regression (BKTR). For model inference, we develop an efficient Markov chain Monte Carlo (MCMC) algorithm, which uses Gibbs sampling to update factor matrices and slice sampling to update kernel hyperparameters. We conduct extensive experiments on both synthetic and real-world data sets, and our results confirm the superior performance and efficiency of BKTR for model estimation and parameter inference.  ( 2 min )
    Gaussian Process Hydrodynamics. (arXiv:2209.10707v3 [physics.flu-dyn] UPDATED)
    We present a Gaussian Process (GP) approach (Gaussian Process Hydrodynamics, GPH) for approximating the solution of the Euler and Navier-Stokes equations. As in Smoothed Particle Hydrodynamics (SPH), GPH is a Lagrangian particle-based approach involving the tracking of a finite number of particles transported by the flow. However, these particles do not represent mollified particles of matter but carry discrete/partial information about the continuous flow. Closure is achieved by placing a divergence-free GP prior $\xi$ on the velocity field and conditioning on vorticity at particle locations. Known physics (e.g., the Richardson cascade and velocity-increments power laws) is incorporated into the GP prior through physics-informed additive kernels. This approach allows us to coarse-grain turbulence in a statistical manner rather than a deterministic one. By enforcing incompressibility and fluid/structure boundary conditions through the selection of the kernel, GPH requires much fewer particles than SPH. Since GPH has a natural probabilistic interpretation, numerical results come with uncertainty estimates enabling their incorporation into a UQ pipeline and the adding/removing of particles (quantas of information) in an adapted manner. The proposed approach is amenable to analysis, it inherits the complexity of state-of-the-art solvers for dense kernel matrices, and it leads to a natural definition of turbulence as information loss. Numerical experiments support the importance of selecting physics-informed kernels and illustrate the major impact of such kernels on accuracy and stability. Since the proposed approach has a Bayesian interpretation, it naturally enables data assimilation and making predictions and estimations based on mixing simulation data with experimental data.  ( 2 min )
    Variational Neural Networks. (arXiv:2207.01524v3 [cs.LG] UPDATED)
    Bayesian Neural Networks (BNNs) provide a tool to estimate the uncertainty of a neural network by considering a distribution over weights and sampling different models for each input. In this paper, we propose a method for uncertainty estimation in neural networks which, instead of considering a distribution over weights, samples outputs of each layer from a corresponding Gaussian distribution, parametrized by the predictions of mean and variance sub-layers. In uncertainty quality estimation experiments, we show that the proposed method achieves better uncertainty quality than other single-bin Bayesian Model Averaging methods, such as Monte Carlo Dropout or Bayes By Backpropagation methods.  ( 2 min )
    SGD and Weight Decay Provably Induce a Low-Rank Bias in Neural Networks. (arXiv:2206.05794v3 [cs.LG] UPDATED)
    In this paper, we study the bias of Stochastic Gradient Descent (SGD) to learn low-rank weight matrices when training deep ReLU neural networks. Our results show that training neural networks with mini-batch SGD and weight decay causes a bias towards rank minimization over the weight matrices. Specifically, we show, both theoretically and empirically, that this bias is more pronounced when using smaller batch sizes, higher learning rates, or increased weight decay. Additionally, we predict and observe empirically that weight decay is necessary to achieve this bias. Finally, we empirically investigate the connection between this bias and generalization, finding that it has a marginal effect on generalization. Our analysis is based on a minimal set of assumptions and applies to neural networks of any width or depth, including those with residual connections and convolutional layers.  ( 2 min )
    Large-scale Model Personalization via Low Rank and Sparse decomposition. (arXiv:2210.03505v2 [cs.LG] UPDATED)
    Personalization of machine learning (ML) predictions for individual users/domains/enterprises is critical for practical recommendation style systems. Standard personalization approaches involve learning a user/domain specific embedding that is fed into a fixed global model which can be limiting. On the other hand, personalizing/fine-tuning model itself for each user/domain -- a.k.a meta-learning -- has high storage/infrastructure cost. We propose a novel meta-learning style approach that models network weights as a sum of low-rank and sparse matrices. This captures common information from multiple individuals/users together in the low-rank part while sparse part captures user-specific idiosyncrasies. Furthermore, the framework is up to two orders of magnitude more scalable (in terms of storage/infrastructure cost) than user-specific finetuning of model. We then study the framework in the linear setting, where the problem reduces to that of estimating the sum of a rank-$r$ and a $k$-column sparse matrix using a small number of linear measurements. We propose an alternating minimization method with iterative hard thresholding -- AMHT-LRS -- to learn the low-rank and sparse part. For the realizable, Gaussian data setting, we show that AMHT-LRS solves the problem efficiently with nearly optimal samples. A significant challenge in personalization is ensuring privacy of each user's sensitive data. We alleviate this problem by proposing a differentially private variant of our method that also is equipped with strong generalization guarantees. Finally, on multiple standard recommendation datasets, we demonstrate that our approach allows personalized models to obtain superior performance in sparse data regime.  ( 2 min )
    Safe and Adaptive Decision-Making for Optimization of Safety-Critical Systems: The ARTEO Algorithm. (arXiv:2211.05495v2 [cs.LG] UPDATED)
    We consider the problem of decision-making under uncertainty in an environment with safety constraints. Many business and industrial applications rely on real-time optimization to improve key performance indicators. In the case of unknown characteristics, real-time optimization becomes challenging, particularly because of the satisfaction of safety constraints. We propose the ARTEO algorithm, where we cast multi-armed bandits as a mathematical programming problem subject to safety constraints and learn the unknown characteristics through exploration while optimizing the targets. We quantify the uncertainty in unknown characteristics by using Gaussian processes and incorporate it into the cost function as a contribution which drives exploration. We adaptively control the size of this contribution in accordance with the requirements of the environment. We guarantee the safety of our algorithm with a high probability through confidence bounds constructed under the regularity assumptions of Gaussian processes. We demonstrate the safety and efficiency of our approach with two case studies: optimization of electric motor current and real-time bidding problems. We further evaluate the performance of ARTEO compared to a safe variant of upper confidence bound based algorithms. ARTEO achieves less cumulative regret with accurate and safe decisions.  ( 2 min )
    Improved High-Probability Regret for Adversarial Bandits with Time-Varying Feedback Graphs. (arXiv:2210.01376v2 [cs.LG] UPDATED)
    We study high-probability regret bounds for adversarial $K$-armed bandits with time-varying feedback graphs over $T$ rounds. For general strongly observable graphs, we develop an algorithm that achieves the optimal regret $\widetilde{\mathcal{O}}((\sum_{t=1}^T\alpha_t)^{1/2}+\max_{t\in[T]}\alpha_t)$ with high probability, where $\alpha_t$ is the independence number of the feedback graph at round $t$. Compared to the best existing result [Neu, 2015] which only considers graphs with self-loops for all nodes, our result not only holds more generally, but importantly also removes any $\text{poly}(K)$ dependence that can be prohibitively large for applications such as contextual bandits. Furthermore, we also develop the first algorithm that achieves the optimal high-probability regret bound for weakly observable graphs, which even improves the best expected regret bound of [Alon et al., 2015] by removing the $\mathcal{O}(\sqrt{KT})$ term with a refined analysis. Our algorithms are based on the online mirror descent framework, but importantly with an innovative combination of several techniques. Notably, while earlier works use optimistic biased loss estimators for achieving high-probability bounds, we find it important to use a pessimistic one for nodes without self-loop in a strongly observable graph.  ( 2 min )
    Fast Computation of Optimal Transport via Entropy-Regularized Extragradient Methods. (arXiv:2301.13006v1 [cs.LG])
    Efficient computation of the optimal transport distance between two distributions serves as an algorithm subroutine that empowers various applications. This paper develops a scalable first-order optimization-based method that computes optimal transport to within $\varepsilon$ additive accuracy with runtime $\widetilde{O}( n^2/\varepsilon)$, where $n$ denotes the dimension of the probability distributions of interest. Our algorithm achieves the state-of-the-art computational guarantees among all first-order methods, while exhibiting favorable numerical performance compared to classical algorithms like Sinkhorn and Greenkhorn. Underlying our algorithm designs are two key elements: (a) converting the original problem into a bilinear minimax problem over probability distributions; (b) exploiting the extragradient idea -- in conjunction with entropy regularization and adaptive learning rates -- to accelerate convergence.  ( 2 min )
    Refined Regret for Adversarial MDPs with Linear Function Approximation. (arXiv:2301.12942v1 [cs.LG])
    We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes and the state space can be arbitrarily large. We assume that the Q-function of any policy is linear in some known features, that is, a linear function approximation exists. The best existing regret upper bound for this setting (Luo et al., 2021) is of order $\tilde{\mathcal O}(K^{2/3})$ (omitting all other dependencies), given access to a simulator. This paper provides two algorithms that improve the regret to $\tilde{\mathcal O}(\sqrt K)$ in the same setting. Our first algorithm makes use of a refined analysis of the Follow-the-Regularized-Leader (FTRL) algorithm with the log-barrier regularizer. This analysis allows the loss estimators to be arbitrarily negative and might be of independent interest. Our second algorithm develops a magnitude-reduced loss estimator, further removing the polynomial dependency on the number of actions in the first algorithm and leading to the optimal regret bound (up to logarithmic terms and dependency on the horizon). Moreover, we also extend the first algorithm to simulator-free linear MDPs, which achieves $\tilde{\mathcal O}(K^{8/9})$ regret and greatly improves over the best existing bound $\tilde{\mathcal O}(K^{14/15})$. This algorithm relies on a better alternative to the Matrix Geometric Resampling procedure by Neu & Olkhovskaya (2020), which could again be of independent interest.  ( 2 min )
    Prediction of Customer Churn in Banking Industry. (arXiv:2301.13099v1 [stat.ML])
    With the growing competition in banking industry, banks are required to follow customer retention strategies while they are trying to increase their market share by acquiring new customers. This study compares the performance of six supervised classification techniques to suggest an efficient model to predict customer churn in banking industry, given 10 demographic and personal attributes from 10000 customers of European banks. The effect of feature selection, class imbalance, and outliers will be discussed for ANN and random forest as the two competing models. As shown, unlike random forest, ANN does not reveal any serious concern regarding overfitting and is also robust to noise. Therefore, ANN structure with five nodes in a single hidden layer is recognized as the best performing classifier.  ( 2 min )
    MixFlows: principled variational inference via mixed flows. (arXiv:2205.07475v3 [stat.ML] UPDATED)
    This work presents mixed variational flows (MixFlows), a new variational family that consists of a mixture of repeated applications of a map to an initial reference distribution. First, we provide efficient algorithms for i.i.d. sampling, density evaluation, and unbiased ELBO estimation. We then show that MixFlows have MCMC-like convergence guarantees when the flow map is ergodic and measure-preserving, and provide bounds on the accumulation of error for practical implementations where the flow map is approximated. Finally, we develop an implementation of MixFlows based on uncorrected discretized Hamiltonian dynamics combined with deterministic momentum refreshment. Simulated and real data experiments show that MixFlows can provide more reliable posterior approximations than several black-box normalizing flows, as well as samples of comparable quality to those obtained from state-of-the-art MCMC methods.  ( 2 min )
    Reversible Gromov-Monge Sampler for Simulation-Based Inference. (arXiv:2109.14090v4 [stat.ME] UPDATED)
    This paper introduces a new simulation-based inference procedure to model and sample from multi-dimensional probability distributions given access to i.i.d.\ samples, circumventing the usual approaches of explicitly modeling the density function or designing Markov chain Monte Carlo. Motivated by the seminal work on distance and isomorphism between metric measure spaces, we propose a new notion called the Reversible Gromov-Monge (RGM) distance and study how RGM can be used to design new transform samplers to perform simulation-based inference. Our RGM sampler can also estimate optimal alignments between two heterogeneous metric measure spaces $(\cX, \mu, c_{\cX})$ and $(\cY, \nu, c_{\cY})$ from empirical data sets, with estimated maps that approximately push forward one measure $\mu$ to the other $\nu$, and vice versa. We study the analytic properties of the RGM distance and derive that under mild conditions, RGM equals the classic Gromov-Wasserstein distance. Curiously, drawing a connection to Brenier's polar factorization, we show that the RGM sampler induces bias towards strong isomorphism with proper choices of $c_{\cX}$ and $c_{\cY}$. Statistical rate of convergence, representation, and optimization questions regarding the induced sampler are studied. Synthetic and real-world examples showcasing the effectiveness of the RGM sampler are also demonstrated.  ( 2 min )
    Better Uncertainty Calibration via Proper Scores for Classification and Beyond. (arXiv:2203.07835v3 [cs.LG] UPDATED)
    With model trustworthiness being crucial for sensitive real-world applications, practitioners are putting more and more focus on improving the uncertainty calibration of deep neural networks. Calibration errors are designed to quantify the reliability of probabilistic predictions but their estimators are usually biased and inconsistent. In this work, we introduce the framework of proper calibration errors, which relates every calibration error to a proper score and provides a respective upper bound with optimal estimation properties. This relationship can be used to reliably quantify the model calibration improvement. We theoretically and empirically demonstrate the shortcomings of commonly used estimators compared to our approach. Due to the wide applicability of proper scores, this gives a natural extension of recalibration beyond classification.  ( 2 min )
    Accelerating Kernel Classifiers Through Borders Mapping. (arXiv:1708.05917v6 [stat.ML] UPDATED)
    Support vector machines (SVM) and other kernel techniques represent a family of powerful statistical classification methods with high accuracy and broad applicability. Because they use all or a significant portion of the training data, however, they can be slow, especially for large problems. Piecewise linear classifiers are similarly versatile, yet have the additional advantages of simplicity, ease of interpretation and, if the number of component linear classifiers is not too large, speed. Here we show how a simple, piecewise linear classifier can be trained from a kernel-based classifier in order to improve the classification speed. The method works by finding the root of the difference in conditional probabilities between pairs of opposite classes to build up a representation of the decision boundary. When tested on 17 different datasets, it succeeded in improving the classification speed of a SVM for 12 of them by up to two orders-of-magnitude. Of these, two were less accurate than a simple, linear classifier. The method is best suited to problems with continuum features data and smooth probability functions. Because the component linear classifiers are built up individually from an existing classifier, rather than through a simultaneous optimization procedure, the classifier is also fast to train.  ( 2 min )
    Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces II: non-compact symmetric spaces. (arXiv:2301.13088v1 [stat.ME])
    Gaussian processes are arguably the most important class of spatiotemporal models within machine learning. They encode prior information about the modeled function and can be used for exact or approximate Bayesian learning. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners.  ( 2 min )
    On the Sample Complexity of Actor-Critic Method for Reinforcement Learning with Function Approximation. (arXiv:1910.08412v3 [cs.LG] UPDATED)
    Reinforcement learning, mathematically described by Markov Decision Problems, may be approached either through dynamic programming or policy search. Actor-critic algorithms combine the merits of both approaches by alternating between steps to estimate the value function and policy gradient updates. Due to the fact that the updates exhibit correlated noise and biased gradient updates, only the asymptotic behavior of actor-critic is known by connecting its behavior to dynamical systems. This work puts forth a new variant of actor-critic that employs Monte Carlo rollouts during the policy search updates, which results in controllable bias that depends on the number of critic evaluations. As a result, we are able to provide for the first time the convergence rate of actor-critic algorithms when the policy search step employs policy gradient, agnostic to the choice of policy evaluation technique. In particular, we establish conditions under which the sample complexity is comparable to stochastic gradient method for non-convex problems or slower as a result of the critic estimation error, which is the main complexity bottleneck. These results hold in continuous state and action spaces with linear function approximation for the value function. We then specialize these conceptual results to the case where the critic is estimated by Temporal Difference, Gradient Temporal Difference, and Accelerated Gradient Temporal Difference. These learning rates are then corroborated on a navigation problem involving an obstacle and the pendulum problem which provide insight into the interplay between optimization and generalization in reinforcement learning.  ( 2 min )
    Benchmarking optimality of time series classification methods in distinguishing diffusions. (arXiv:2301.13112v1 [stat.ML])
    Performance benchmarking is a crucial component of time series classification (TSC) algorithm design, and a fast-growing number of datasets have been established for empirical benchmarking. However, the empirical benchmarks are costly and do not guarantee statistical optimality. This study proposes to benchmark the optimality of TSC algorithms in distinguishing diffusion processes by the likelihood ratio test (LRT). The LRT is optimal in the sense of the Neyman-Pearson lemma: it has the smallest false positive rate among classifiers with a controlled level of false negative rate. The LRT requires the likelihood ratio of the time series to be computable. The diffusion processes from stochastic differential equations provide such time series and are flexible in design for generating linear or nonlinear time series. We demonstrate the benchmarking with three scalable state-of-the-art TSC algorithms: random forest, ResNet, and ROCKET. Test results show that they can achieve LRT optimality for univariate time series and multivariate Gaussian processes. However, these model-agnostic algorithms are suboptimal in classifying nonlinear multivariate time series from high-dimensional stochastic interacting particle systems. Additionally, the LRT benchmark provides tools to analyze the dependence of classification accuracy on the time length, dimension, temporal sampling frequency, and randomness of the time series. Thus, the LRT with diffusion processes can systematically and efficiently benchmark the optimality of TSC algorithms and may guide their future improvements.  ( 2 min )
    Revisiting Over-smoothing and Over-squashing using Ollivier-Ricci Curvature. (arXiv:2211.15779v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) had been demonstrated to be inherently susceptible to the problems of over-smoothing and over-squashing. These issues prohibit the ability of GNNs to model complex graph interactions by limiting their effectiveness in taking into account distant information. Our study reveals the key connection between the local graph geometry and the occurrence of both of these issues, thereby providing a unified framework for studying them at a local scale using the Ollivier-Ricci curvature. Specifically, we demonstrate that over-smoothing is linked to positive graph curvature, while over-squashing is linked to negative graph curvature. Based on our theory, we propose the Batch Ollivier-Ricci Flow, a novel rewiring algorithm capable of simultaneously addressing both over-smoothing and over-squashing.  ( 2 min )
    One-Shot Adaptation of GAN in Just One CLIP. (arXiv:2203.09301v4 [cs.CV] UPDATED)
    There are many recent research efforts to fine-tune a pre-trained generator with a few target images to generate images of a novel domain. Unfortunately, these methods often suffer from overfitting or under-fitting when fine-tuned with a single target image. To address this, here we present a novel single-shot GAN adaptation method through unified CLIP space manipulations. Specifically, our model employs a two-step training strategy: reference image search in the source generator using a CLIP-guided latent optimization, followed by generator fine-tuning with a novel loss function that imposes CLIP space consistency between the source and adapted generators. To further improve the adapted model to produce spatially consistent samples with respect to the source generator, we also propose contrastive regularization for patchwise relationships in the CLIP space. Experimental results show that our model generates diverse outputs with the target texture and outperforms the baseline models both qualitatively and quantitatively. Furthermore, we show that our CLIP space manipulation strategy allows more effective attribute editing.  ( 2 min )
    On student-teacher deviations in distillation: does it pay to disobey?. (arXiv:2301.12923v1 [cs.LG])
    Knowledge distillation has been widely-used to improve the performance of a "student" network by hoping to mimic soft probabilities of a "teacher" network. Yet, for self-distillation to work, the student must somehow deviate from the teacher (Stanton et al., 2021). But what is the nature of these deviations, and how do they relate to gains in generalization? We investigate these questions through a series of experiments across image and language classification datasets. First, we observe that distillation consistently deviates in a characteristic way: on points where the teacher has low confidence, the student achieves even lower confidence than the teacher. Secondly, we find that deviations in the initial dynamics of training are not crucial -- simply switching to distillation loss in the middle of training can recover much of its gains. We then provide two parallel theoretical perspectives to understand the role of student-teacher deviations in our experiments, one casting distillation as a regularizer in eigenspace, and another as a gradient denoiser. Our analysis bridges several gaps between existing theory and practice by (a) focusing on gradient-descent training, (b) by avoiding label noise assumptions, and (c) by unifying several disjoint empirical and theoretical findings.  ( 2 min )
    Cause-Effect Inference in Location-Scale Noise Models: Maximum Likelihood vs. Independence Testing. (arXiv:2301.12930v1 [cs.LG])
    Location-scale noise models (LSNMs) are a class of heteroscedastic structural causal models with wide applicability, closely related to affine flow models. Recent likelihood-based methods designed for LSNMs that infer cause-effect relationships achieve state-of-the-art accuracy, when their assumptions are satisfied concerning the noise distributions. However, under misspecification their accuracy deteriorates sharply, especially when the conditional variance in the anti-causal direction is smaller than that in the causal direction. In this paper, we demonstrate the misspecification problem and analyze why and when it occurs. We show that residual independence testing is much more robust to misspecification than likelihood-based cause-effect inference. Our empirical evaluation includes 580 synthetic and 99 real-world datasets.  ( 2 min )
    Curvature Filtrations for Graph Generative Model Evaluation. (arXiv:2301.12906v1 [cs.LG])
    Graph generative model evaluation necessitates understanding differences between graphs on the distributional level. This entails being able to harness salient attributes of graphs in an efficient manner. Curvature constitutes one such property of graphs, and has recently started to prove useful in characterising graphs. Its expressive properties, stability, and practical utility in model evaluation remain largely unexplored, however. We combine graph curvature descriptors with cutting-edge methods from topological data analysis to obtain robust, expressive descriptors for evaluating graph generative models.  ( 2 min )
    Massively Scaling Heteroscedastic Classifiers. (arXiv:2301.12860v1 [cs.LG])
    Heteroscedastic classifiers, which learn a multivariate Gaussian distribution over prediction logits, have been shown to perform well on image classification problems with hundreds to thousands of classes. However, compared to standard classifiers, they introduce extra parameters that scale linearly with the number of classes. This makes them infeasible to apply to larger-scale problems. In addition heteroscedastic classifiers introduce a critical temperature hyperparameter which must be tuned. We propose HET-XL, a heteroscedastic classifier whose parameter count when compared to a standard classifier scales independently of the number of classes. In our large-scale settings, we show that we can remove the need to tune the temperature hyperparameter, by directly learning it on the training data. On large image classification datasets with up to 4B images and 30k classes our method requires 14X fewer additional parameters, does not require tuning the temperature on a held-out set and performs consistently better than the baseline heteroscedastic classifier. HET-XL improves ImageNet 0-shot classification in a multimodal contrastive learning setup which can be viewed as a 3.5 billion class classification problem.  ( 2 min )
    How Powerful are Shallow Neural Networks with Bandlimited Random Weights?. (arXiv:2008.08427v2 [cs.LG] UPDATED)
    We investigate the expressive power of depth-2 bandlimited random neural networks. A random net is a neural network where the hidden layer parameters are frozen with random assignment, and only the output layer parameters are trained by loss minimization. Using random weights for a hidden layer is an effective method to avoid non-convex optimization in standard gradient descent learning. It has also been adopted in recent deep learning theories. Despite the well-known fact that a neural network is a universal approximator, in this study, we mathematically show that when hidden parameters are distributed in a bounded domain, the network may not achieve zero approximation error. In particular, we derive a new nontrivial approximation error lower bound. The proof utilizes the technique of ridgelet analysis, a harmonic analysis method designed for neural networks. This method is inspired by fundamental principles in classical signal processing, specifically the idea that signals with limited bandwidth may not always be able to perfectly recreate the original signal. We corroborate our theoretical results with various simulation studies, and generally, two main take-home messages are offered: (i) Not any distribution for selecting random weights is feasible to build a universal approximator; (ii) A suitable assignment of random weights exists but to some degree is associated with the complexity of the target function.  ( 2 min )
    Fair and Optimal Classification via Post-Processing Predictors. (arXiv:2211.01528v2 [cs.LG] UPDATED)
    To address the bias exhibited by machine learning models, fairness criteria impose statistical constraints for ensuring equal treatment to all demographic groups, but typically at a cost to model performance. Understanding this tradeoff, therefore, underlies the design of fair and effective algorithms. This paper completes the characterization of the inherent tradeoff of demographic parity on classification problems in the most general multigroup, multiclass, and noisy setting. Specifically, we show that the minimum error rate is given by the optimal value of a Wasserstein-barycenter problem. More practically, this reformulation leads to a simple procedure for post-processing any pre-trained predictors to satisfy demographic parity in the general setting, which, in particular, yields the optimal fair classifier when applied to the Bayes predictor. We provide suboptimality and finite sample analyses for our procedure, and demonstrate precise control of the tradeoff of error rate for fairness on real-world datasets provided sufficient data.  ( 2 min )
    Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation. (arXiv:2301.13087v1 [cs.LG])
    We study reinforcement learning with linear function approximation and adversarially changing cost functions, a setup that has mostly been considered under simplifying assumptions such as full information feedback or exploratory conditions.We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback, featuring a combination of mirror-descent and least squares policy evaluation in an auxiliary MDP used to compute exploration bonuses.Our algorithm obtains an $\widetilde O(K^{6/7})$ regret bound, improving significantly over previous state-of-the-art of $\widetilde O (K^{14/15})$ in this setting. In addition, we present a version of the same algorithm under the assumption a simulator of the environment is available to the learner (but otherwise no exploratory assumptions are made), and prove it obtains state-of-the-art regret of $\widetilde O (K^{2/3})$.  ( 2 min )
    CAPITAL: Optimal Subgroup Identification via Constrained Policy Tree Search. (arXiv:2110.05636v3 [stat.ML] UPDATED)
    Personalized medicine, a paradigm of medicine tailored to a patient's characteristics, is an increasingly attractive field in health care. An important goal of personalized medicine is to identify a subgroup of patients, based on baseline covariates, that benefits more from the targeted treatment than other comparative treatments. Most of the current subgroup identification methods only focus on obtaining a subgroup with an enhanced treatment effect without paying attention to subgroup size. Yet, a clinically meaningful subgroup learning approach should identify the maximum number of patients who can benefit from the better treatment. In this paper, we present an optimal subgroup selection rule (SSR) that maximizes the number of selected patients, and in the meantime, achieves the pre-specified clinically meaningful mean outcome, such as the average treatment effect. We derive two equivalent theoretical forms of the optimal SSR based on the contrast function that describes the treatment-covariates interaction in the outcome. We further propose a ConstrAined PolIcy Tree seArch aLgorithm (CAPITAL) to find the optimal SSR within the interpretable decision tree class. The proposed method is flexible to handle multiple constraints that penalize the inclusion of patients with negative treatment effects, and to address time to event data using the restricted mean survival time as the clinically interesting mean outcome. Extensive simulations, comparison studies, and real data applications are conducted to demonstrate the validity and utility of our method.  ( 2 min )
    Probable Domain Generalization via Quantile Risk Minimization. (arXiv:2207.09944v3 [stat.ML] UPDATED)
    Domain generalization (DG) seeks predictors which perform well on unseen test distributions by leveraging data drawn from multiple related training distributions or domains. To achieve this, DG is commonly formulated as an average- or worst-case problem over the set of possible domains. However, predictors that perform well on average lack robustness while predictors that perform well in the worst case tend to be overly-conservative. To address this, we propose a new probabilistic framework for DG where the goal is to learn predictors that perform well with high probability. Our key idea is that distribution shifts seen during training should inform us of probable shifts at test time, which we realize by explicitly relating training and test domains as draws from the same underlying meta-distribution. To achieve probable DG, we propose a new optimization problem called Quantile Risk Minimization (QRM). By minimizing the $\alpha$-quantile of predictor's risk distribution over domains, QRM seeks predictors that perform well with probability $\alpha$. To solve QRM in practice, we propose the Empirical QRM (EQRM) algorithm and provide: (i) a generalization bound for EQRM; and (ii) the conditions under which EQRM recovers the causal predictor as $\alpha \to 1$. In our experiments, we introduce a more holistic quantile-focused evaluation protocol for DG and demonstrate that EQRM outperforms state-of-the-art baselines on datasets from WILDS and DomainBed.  ( 2 min )
    Likelihood-Free Frequentist Inference: Confidence Sets with Correct Conditional Coverage. (arXiv:2107.03920v5 [stat.ML] UPDATED)
    Many areas of science make extensive use of computer simulators that implicitly encode likelihood functions of complex systems. Classical statistical methods are poorly suited for these so-called likelihood-free inference (LFI) settings, particularly outside asymptotic and low-dimensional regimes. Although new machine learning methods, such as normalizing flows, have revolutionized the sample efficiency and capacity of LFI methods, it remains an open question whether they produce confidence sets with correct conditional coverage for small sample sizes. This paper unifies classical statistics with modern machine learning to present (i) a practical procedure for the Neyman construction of confidence sets with finite-sample guarantees of nominal coverage, and (ii) diagnostics that estimate conditional coverage over the entire parameter space. We refer to our framework as likelihood-free frequentist inference (LF2I). Any method that defines a test statistic, like the likelihood ratio, can leverage the LF2I machinery to create valid confidence sets and diagnostics without costly Monte Carlo samples at fixed parameter settings. We study the power of two test statistics (ACORE and BFF), which, respectively, maximize versus integrate an odds function over the parameter space. Our paper discusses the benefits and challenges of LF2I, with a breakdown of the sources of errors in LF2I confidence sets.  ( 2 min )
    Scalable Set Encoding with Universal Mini-Batch Consistency and Unbiased Full Set Gradient Approximation. (arXiv:2208.12401v3 [cs.LG] UPDATED)
    Recent work on mini-batch consistency (MBC) for set functions has brought attention to the need for sequentially processing and aggregating chunks of a partitioned set while guaranteeing the same output for all partitions. However, existing constraints on MBC architectures lead to models with limited expressive power. Additionally, prior work has not addressed how to deal with large sets during training when the full set gradient is required. To address these issues, we propose a Universally MBC (UMBC) class of set functions which can be used in conjunction with arbitrary non-MBC components while still satisfying MBC, enabling a wider range of function classes to be used in MBC settings. Furthermore, we propose an efficient MBC training algorithm which gives an unbiased approximation of the full set gradient and has a constant memory overhead for any set size for both train- and test-time. We conduct extensive experiments including image completion, text classification, unsupervised clustering, and cancer detection on high-resolution images to verify the efficiency and efficacy of our scalable set encoding framework.  ( 2 min )
    Interpolating between BSDEs and PINNs: deep learning for elliptic and parabolic boundary value problems. (arXiv:2112.03749v2 [math.NA] UPDATED)
    Solving high-dimensional partial differential equations is a recurrent challenge in economics, science and engineering. In recent years, a great number of computational approaches have been developed, most of them relying on a combination of Monte Carlo sampling and deep learning based approximation. For elliptic and parabolic problems, existing methods can broadly be classified into those resting on reformulations in terms of $\textit{backward stochastic differential equations}$ (BSDEs) and those aiming to minimize a regression-type $L^2$-error ($\textit{physics-informed neural networks}$, PINNs). In this paper, we review the literature and suggest a methodology based on the novel $\textit{diffusion loss}$ that interpolates between BSDEs and PINNs. Our contribution opens the door towards a unified understanding of numerical approaches for high-dimensional PDEs, as well as for implementations that combine the strengths of BSDEs and PINNs. The diffusion loss furthermore bears close similarities to $\textit{(least squares) temporal difference}$ objectives found in reinforcement learning. We also discuss eigenvalue problems and perform extensive numerical studies, including calculations of the ground state for nonlinear Schr\"odinger operators and committor functions relevant in molecular dynamics.  ( 2 min )
    A theory of continuous generative flow networks. (arXiv:2301.12594v1 [cs.LG])
    Generative flow networks (GFlowNets) are amortized variational inference algorithms that are trained to sample from unnormalized target distributions over compositional objects. A key limitation of GFlowNets until this time has been that they are restricted to discrete spaces. We present a theory for generalized GFlowNets, which encompasses both existing discrete GFlowNets and ones with continuous or hybrid state spaces, and perform experiments with two goals in mind. First, we illustrate critical points of the theory and the importance of various assumptions. Second, we empirically demonstrate how observations about discrete GFlowNets transfer to the continuous case and show strong results compared to non-GFlowNet baselines on several previously studied tasks. This work greatly widens the perspectives for the application of GFlowNets in probabilistic inference and various modeling settings.  ( 2 min )
    Intrinsic Bayesian Optimisation on Complex Constrained Domain. (arXiv:2301.12581v1 [stat.ML])
    Motivated by the success of Bayesian optimisation algorithms in the Euclidean space, we propose a novel approach to construct Intrinsic Bayesian optimisation (In-BO) on manifolds with a primary focus on complex constrained domains or irregular-shaped spaces arising as submanifolds of R2, R3 and beyond. Data may be collected in a spatial domain but restricted to a complex or intricately structured region corresponding to a geographic feature, such as lakes. Traditional Bayesian Optimisation (Tra-BO) defined with a Radial basis function (RBF) kernel cannot accommodate these complex constrained conditions. The In-BO uses the Sparse Intrinsic Gaussian Processes (SIn-GP) surrogate model to take into account the geometric structure of the manifold. SInGPs are constructed using the heat kernel of the manifold which is estimated as the transition density of the Brownian Motion on manifolds. The efficiency of In-BO is demonstrated through simulation studies on a U-shaped domain, a Bitten torus, and a real dataset from the Aral sea. Its performance is compared to that of traditional BO, which is defined in Euclidean space.  ( 2 min )
    Are Random Decompositions all we need in High Dimensional Bayesian Optimisation?. (arXiv:2301.12844v1 [cs.LG])
    Learning decompositions of expensive-to-evaluate black-box functions promises to scale Bayesian optimisation (BO) to high-dimensional problems. However, the success of these techniques depends on finding proper decompositions that accurately represent the black-box. While previous works learn those decompositions based on data, we investigate data-independent decomposition sampling rules in this paper. We find that data-driven learners of decompositions can be easily misled towards local decompositions that do not hold globally across the search space. Then, we formally show that a random tree-based decomposition sampler exhibits favourable theoretical guarantees that effectively trade off maximal information gain and functional mismatch between the actual black-box and its surrogate as provided by the decomposition. Those results motivate the development of the random decomposition upper-confidence bound algorithm (RDUCB) that is straightforward to implement - (almost) plug-and-play - and, surprisingly, yields significant empirical gains compared to the previous state-of-the-art on a comprehensive set of benchmarks. We also confirm the plug-and-play nature of our modelling component by integrating our method with HEBO, showing improved practical gains in the highest dimensional tasks from Bayesmark.  ( 2 min )
    PAC-Bayesian Soft Actor-Critic Learning. (arXiv:2301.12776v1 [cs.LG])
    Actor-critic algorithms address the dual goals of reinforcement learning, policy evaluation and improvement, via two separate function approximators. The practicality of this approach comes at the expense of training instability, caused mainly by the destructive effect of the approximation errors of the critic on the actor. We tackle this bottleneck by employing an existing Probably Approximately Correct (PAC) Bayesian bound for the first time as the critic training objective of the Soft Actor-Critic (SAC) algorithm. We further demonstrate that the online learning performance improves significantly when a stochastic actor explores multiple futures by critic-guided random search. We observe our resulting algorithm to compare favorably to the state of the art on multiple classical control and locomotion tasks in both sample efficiency and asymptotic performance.  ( 2 min )
    Bagging Provides Assumption-free Stability. (arXiv:2301.12600v1 [stat.ML])
    Bagging is an important technique for stabilizing machine learning models. In this paper, we derive a finite-sample guarantee on the stability of bagging for any model with bounded outputs. Our result places no assumptions on the distribution of the data, on the properties of the base algorithm, or on the dimensionality of the covariates. Our guarantee applies to many variants of bagging and is optimal up to a constant.  ( 2 min )
    Imbalanced Mixed Linear Regression. (arXiv:2301.12559v1 [stat.ML])
    We consider the problem of mixed linear regression (MLR), where each observed sample belongs to one of $K$ unknown linear models. In practical applications, the proportions of the $K$ components are often imbalanced. Unfortunately, most MLR methods do not perform well in such settings. Motivated by this practical challenge, in this work we propose Mix-IRLS, a novel, simple and fast algorithm for MLR with excellent performance on both balanced and imbalanced mixtures. In contrast to popular approaches that recover the $K$ models simultaneously, Mix-IRLS does it sequentially using tools from robust regression. Empirically, Mix-IRLS succeeds in a broad range of settings where other methods fail. These include imbalanced mixtures, small sample sizes, presence of outliers, and an unknown number of models $K$. In addition, Mix-IRLS outperforms competing methods on several real-world datasets, in some cases by a large margin. We complement our empirical results by deriving a recovery guarantee for Mix-IRLS, which highlights its advantage on imbalanced mixtures.  ( 2 min )
    Compression, Generalization and Learning. (arXiv:2301.12767v1 [cs.LG])
    A compression function is a map that slims down an observational set into a subset of reduced size, while preserving its informational content. In multiple applications, the condition that one new observation makes the compressed set change is interpreted that this observation brings in extra information and, in learning theory, this corresponds to misclassification, or misprediction. In this paper, we lay the foundations of a new theory that allows one to keep control on the probability of change of compression (called the "risk"). We identify conditions under which the cardinality of the compressed set is a consistent estimator for the risk (without any upper limit on the size of the compressed set) and prove unprecedentedly tight bounds to evaluate the risk under a generally applicable condition of preference. All results are usable in a fully agnostic setup, without requiring any a priori knowledge on the probability distribution of the observations. Not only these results offer a valid support to develop trust in observation-driven methodologies, they also play a fundamental role in learning techniques as a tool for hyper-parameter tuning.  ( 2 min )
    On Second-Order Scoring Rules for Epistemic Uncertainty Quantification. (arXiv:2301.12736v1 [cs.LG])
    It is well known that accurate probabilistic predictors can be trained through empirical risk minimisation with proper scoring rules as loss functions. While such learners capture so-called aleatoric uncertainty of predictions, various machine learning methods have recently been developed with the goal to let the learner also represent its epistemic uncertainty, i.e., the uncertainty caused by a lack of knowledge and data. An emerging branch of the literature proposes the use of a second-order learner that provides predictions in terms of distributions on probability distributions. However, recent work has revealed serious theoretical shortcomings for second-order predictors based on loss minimisation. In this paper, we generalise these findings and prove a more fundamental result: There seems to be no loss function that provides an incentive for a second-order learner to faithfully represent its epistemic uncertainty in the same manner as proper scoring rules do for standard (first-order) learners. As a main mathematical tool to prove this result, we introduce the generalised notion of second-order scoring rules.  ( 2 min )
    Machine Learning with High-Cardinality Categorical Features in Actuarial Applications. (arXiv:2301.12710v1 [stat.ML])
    High-cardinality categorical features are pervasive in actuarial data (e.g. occupation in commercial property insurance). Standard categorical encoding methods like one-hot encoding are inadequate in these settings. In this work, we present a novel _Generalised Linear Mixed Model Neural Network_ ("GLMMNet") approach to the modelling of high-cardinality categorical features. The GLMMNet integrates a generalised linear mixed model in a deep learning framework, offering the predictive power of neural networks and the transparency of random effects estimates, the latter of which cannot be obtained from the entity embedding models. Further, its flexibility to deal with any distribution in the exponential dispersion (ED) family makes it widely applicable to many actuarial contexts and beyond. We illustrate and compare the GLMMNet against existing approaches in a range of simulation experiments as well as in a real-life insurance case study. Notably, we find that the GLMMNet often outperforms or at least performs comparably with an entity embedded neural network, while providing the additional benefit of transparency, which is particularly valuable in practical applications. Importantly, while our model was motivated by actuarial applications, it can have wider applicability. The GLMMNet would suit any applications that involve high-cardinality categorical variables and where the response cannot be sufficiently modelled by a Gaussian distribution.  ( 2 min )
    Kernelized Cumulants: Beyond Kernel Mean Embeddings. (arXiv:2301.12466v1 [stat.ML])
    In $\mathbb R^d$, it is well-known that cumulants provide an alternative to moments that can achieve the same goals with numerous benefits such as lower variance estimators. In this paper we extend cumulants to reproducing kernel Hilbert spaces (RKHS) using tools from tensor algebras and show that they are computationally tractable by a kernel trick. These kernelized cumulants provide a new set of all-purpose statistics; the classical maximum mean discrepancy and Hilbert-Schmidt independence criterion arise as the degree one objects in our general construction. We argue both theoretically and empirically (on synthetic, environmental, and traffic data analysis) that going beyond degree one has several advantages and can be achieved with the same computational complexity and minimal overhead in our experiments.  ( 2 min )
    Implicit Regularization for Group Sparsity. (arXiv:2301.12540v1 [stat.ML])
    We study the implicit regularization of gradient descent towards structured sparsity via a novel neural reparameterization, which we call a diagonally grouped linear neural network. We show the following intriguing property of our reparameterization: gradient descent over the squared regression loss, without any explicit regularization, biases towards solutions with a group sparsity structure. In contrast to many existing works in understanding implicit regularization, we prove that our training trajectory cannot be simulated by mirror descent. We analyze the gradient dynamics of the corresponding regression problem in the general noise setting and obtain minimax-optimal error rates. Compared to existing bounds for implicit sparse regularization using diagonal linear networks, our analysis with the new reparameterization shows improved sample complexity. In the degenerate case of size-one groups, our approach gives rise to a new algorithm for sparse linear regression. Finally, we demonstrate the efficacy of our approach with several numerical experiments.  ( 2 min )
    Don't Play Favorites: Minority Guidance for Diffusion Models. (arXiv:2301.12334v1 [cs.LG])
    We explore the problem of generating minority samples using diffusion models. The minority samples are instances that lie on low-density regions of a data manifold. Generating sufficient numbers of such minority instances is important, since they often contain some unique attributes of the data. However, the conventional generation process of the diffusion models mostly yields majority samples (that lie on high-density regions of the manifold) due to their high likelihoods, making themselves highly ineffective and time-consuming for the task. In this work, we present a novel framework that can make the generation process of the diffusion models focus on the minority samples. We first provide a new insight on the majority-focused nature of the diffusion models: they denoise in favor of the majority samples. The observation motivates us to introduce a metric that describes the uniqueness of a given sample. To address the inherent preference of the diffusion models w.r.t. the majority samples, we further develop minority guidance, a sampling technique that can guide the generation process toward regions with desired likelihood levels. Experiments on benchmark real datasets demonstrate that our minority guidance can greatly improve the capability of generating the low-likelihood minority samples over existing generative frameworks including the standard diffusion sampler.  ( 2 min )
    On Enhancing Expressive Power via Compositions of Single Fixed-Size ReLU Network. (arXiv:2301.12353v1 [cs.LG])
    This paper studies the expressive power of deep neural networks from the perspective of function compositions. We show that repeated compositions of a single fixed-size ReLU network can produce super expressive power. In particular, we prove by construction that $\mathcal{L}_2\circ \boldsymbol{g}^{\circ r}\circ \boldsymbol{\mathcal{L}}_1$ can approximate $1$-Lipschitz continuous functions on $[0,1]^d$ with an error $\mathcal{O}(r^{-1/d})$, where $\boldsymbol{g}$ is realized by a fixed-size ReLU network, $\boldsymbol{\mathcal{L}}_1$ and $\mathcal{L}_2$ are two affine linear maps matching the dimensions, and $\boldsymbol{g}^{\circ r}$ means the $r$-times composition of $\boldsymbol{g}$. Furthermore, we extend such a result to generic continuous functions on $[0,1]^d$ with the approximation error characterized by the modulus of continuity. Our results reveal that a continuous-depth network generated via a dynamical system has good approximation power even if its dynamics function is time-independent and realized by a fixed-size ReLU network.  ( 2 min )
    3D Object Detection in LiDAR Point Clouds using Graph Neural Networks. (arXiv:2301.12519v1 [cs.CV])
    LiDAR (Light Detection and Ranging) is an advanced active remote sensing technique working on the principle of time of travel (ToT) for capturing highly accurate 3D information of the surroundings. LiDAR has gained wide attention in research and development with the LiDAR industry expected to reach 2.8 billion $ by 2025. Although the LiDAR dataset is of rich density and high spatial resolution, it is challenging to process LiDAR data due to its inherent 3D geometry and massive volume. But such a high-resolution dataset possesses immense potential in many applications and has great potential in 3D object detection and recognition. In this research we propose Graph Neural Network (GNN) based framework to learn and identify the objects in the 3D LiDAR point clouds. GNNs are class of deep learning which learns the patterns and objects based on the principle of graph learning which have shown success in various 3D computer vision tasks.  ( 2 min )
    SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits. (arXiv:2301.12357v1 [stat.ML])
    In this paper, we study the problem of optimal data collection for policy evaluation in linear bandits. In policy evaluation, we are given a target policy and asked to estimate the expected cumulative reward it will obtain when executed in an environment formalized as a multi-armed bandit. In this paper, we focus on linear bandit setting with heteroscedastic reward noise. This is the first work that focuses on such an optimal data collection strategy for policy evaluation involving heteroscedastic reward noise in the linear bandit setting. We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting that reduces the MSE of the target policy. We term this as policy-weighted least square estimation and use this formulation to derive the optimal behavior policy for data collection. We then propose a novel algorithm SPEED (Structured Policy Evaluation Experimental Design) that tracks the optimal behavior policy and derive its regret with respect to the optimal behavior policy. Finally, we empirically validate that SPEED leads to policy evaluation with mean squared error comparable to the oracle strategy and significantly lower than simply running the target policy.  ( 2 min )
    On Heterogeneous Treatment Effects in Heterogeneous Causal Graphs. (arXiv:2301.12383v1 [stat.ME])
    Heterogeneity and comorbidity are two interwoven challenges associated with various healthcare problems that greatly hampered research on developing effective treatment and understanding of the underlying neurobiological mechanism. Very few studies have been conducted to investigate heterogeneous causal effects (HCEs) in graphical contexts due to the lack of statistical methods. To characterize this heterogeneity, we first conceptualize heterogeneous causal graphs (HCGs) by generalizing the causal graphical model with confounder-based interactions and multiple mediators. Such confounders with an interaction with the treatment are known as moderators. This allows us to flexibly produce HCGs given different moderators and explicitly characterize HCEs from the treatment or potential mediators on the outcome. We establish the theoretical forms of HCEs and derive their properties at the individual level in both linear and nonlinear models. An interactive structural learning is developed to estimate the complex HCGs and HCEs with confidence intervals provided. Our method is empirically justified by extensive simulations and its practical usefulness is illustrated by exploring causality among psychiatric disorders for trauma survivors.  ( 2 min )
    On Learning Necessary and Sufficient Causal Graphs. (arXiv:2301.12389v1 [cs.LG])
    The causal revolution has spurred interest in understanding complex relationships in various fields. Most existing methods aim to discover causal relationships among all variables in a large-scale complex graph. However, in practice, only a small number of variables in the graph are relevant for the outcomes of interest. As a result, causal estimation with the full causal graph -- especially given limited data -- could lead to many falsely discovered, spurious variables that may be highly correlated with but have no causal impact on the target outcome. In this paper, we propose to learn a class of necessary and sufficient causal graphs (NSCG) that only contains causally relevant variables for an outcome of interest, which we term causal features. The key idea is to utilize probabilities of causation to systematically evaluate the importance of features in the causal graph, allowing us to identify a subgraph that is relevant to the outcome of interest. To learn NSCG from data, we develop a score-based necessary and sufficient causal structural learning (NSCSL) algorithm, by establishing theoretical relationships between probabilities of causation and causal effects of features. Across empirical studies of simulated and real data, we show that the proposed NSCSL algorithm outperforms existing algorithms and can reveal important yeast genes for target heritable traits of interest.  ( 2 min )
    Multi-task Highly Adaptive Lasso. (arXiv:2301.12029v1 [stat.ML])
    We propose a novel, fully nonparametric approach for the multi-task learning, the Multi-task Highly Adaptive Lasso (MT-HAL). MT-HAL simultaneously learns features, samples and task associations important for the common model, while imposing a shared sparse structure among similar tasks. Given multiple tasks, our approach automatically finds a sparse sharing structure. The proposed MTL algorithm attains a powerful dimension-free convergence rate of $o_p(n^{-1/4})$ or better. We show that MT-HAL outperforms sparsity-based MTL competitors across a wide range of simulation studies, including settings with nonlinear and linear relationships, varying levels of sparsity and task correlations, and different numbers of covariates and sample size.  ( 2 min )
    Variational Latent Branching Model for Off-Policy Evaluation. (arXiv:2301.12056v1 [cs.LG])
    Model-based methods have recently shown great potential for off-policy evaluation (OPE); offline trajectories induced by behavioral policies are fitted to transitions of Markov decision processes (MDPs), which are used to rollout simulated trajectories and estimate the performance of policies. Model-based OPE methods face two key challenges. First, as offline trajectories are usually fixed, they tend to cover limited state and action space. Second, the performance of model-based methods can be sensitive to the initialization of their parameters. In this work, we propose the variational latent branching model (VLBM) to learn the transition function of MDPs by formulating the environmental dynamics as a compact latent space, from which the next states and rewards are then sampled. Specifically, VLBM leverages and extends the variational inference framework with the recurrent state alignment (RSA), which is designed to capture as much information underlying the limited training data, by smoothing out the information flow between the variational (encoding) and generative (decoding) part of VLBM. Moreover, we also introduce the branching architecture to improve the model's robustness against randomly initialized model weights. The effectiveness of the VLBM is evaluated on the deep OPE (DOPE) benchmark, from which the training trajectories are designed to result in varied coverage of the state-action space. We show that the VLBM outperforms existing state-of-the-art OPE methods in general.  ( 2 min )
    Decentralized Entropic Optimal Transport for Privacy-preserving Distributed Distribution Comparison. (arXiv:2301.12065v1 [cs.LG])
    Privacy-preserving distributed distribution comparison measures the distance between the distributions whose data are scattered across different agents in a distributed system and cannot be shared among the agents. In this study, we propose a novel decentralized entropic optimal transport (EOT) method, which provides a privacy-preserving and communication-efficient solution to this problem with theoretical guarantees. In particular, we design a mini-batch randomized block-coordinate descent (MRBCD) scheme to optimize the decentralized EOT distance in its dual form. The dual variables are scattered across different agents and updated locally and iteratively with limited communications among partial agents. The kernel matrix involved in the gradients of the dual variables is estimated by a distributed kernel approximation method, and each agent only needs to approximate and store a sub-kernel matrix by one-shot communication and without sharing raw data. We analyze our method's communication complexity and provide a theoretical bound for the approximation error caused by the convergence error, the approximated kernel, and the mismatch between the storage and communication protocols. Experiments on synthetic data and real-world distributed domain adaptation tasks demonstrate the effectiveness of our method.  ( 2 min )
    Beyond Exponentially Fast Mixing in Average-Reward Reinforcement Learning via Multi-Level Monte Carlo Actor-Critic. (arXiv:2301.12083v1 [cs.LG])
    Many existing reinforcement learning (RL) methods employ stochastic gradient iteration on the back end, whose stability hinges upon a hypothesis that the data-generating process mixes exponentially fast with a rate parameter that appears in the step-size selection. Unfortunately, this assumption is violated for large state spaces or settings with sparse rewards, and the mixing time is unknown, making the step size inoperable. In this work, we propose an RL methodology attuned to the mixing time by employing a multi-level Monte Carlo estimator for the critic, the actor, and the average reward embedded within an actor-critic (AC) algorithm. This method, which we call \textbf{M}ulti-level \textbf{A}ctor-\textbf{C}ritic (MAC), is developed especially for infinite-horizon average-reward settings and neither relies on oracle knowledge of the mixing time in its parameter selection nor assumes its exponential decay; it, therefore, is readily applicable to applications with slower mixing times. Nonetheless, it achieves a convergence rate comparable to the state-of-the-art AC algorithms. We experimentally show that these alleviated restrictions on the technical conditions required for stability translate to superior performance in practice for RL problems with sparse rewards.  ( 2 min )
    Inference on the Optimal Assortment in the Multinomial Logit Model. (arXiv:2301.12254v1 [stat.ML])
    Assortment optimization has received active explorations in the past few decades due to its practical importance. Despite the extensive literature dealing with optimization algorithms and latent score estimation, uncertainty quantification for the optimal assortment still needs to be explored and is of great practical significance. Instead of estimating and recovering the complete optimal offer set, decision makers may only be interested in testing whether a given property holds true for the optimal assortment, such as whether they should include several products of interest in the optimal set, or how many categories of products the optimal set should include. This paper proposes a novel inferential framework for testing such properties. We consider the widely adopted multinomial logit (MNL) model, where we assume that each customer will purchase an item within the offered products with a probability proportional to the underlying preference score associated with the product. We reduce inferring a general optimal assortment property to quantifying the uncertainty associated with the sign change point detection of the marginal revenue gaps. We show the asymptotic normality of the marginal revenue gap estimator, and construct a maximum statistic via the gap estimators to detect the sign change point. By approximating the distribution of the maximum statistic with multiplier bootstrap techniques, we propose a valid testing procedure. We also conduct numerical experiments to assess the performance of our method.  ( 2 min )
    ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts. (arXiv:2301.12171v1 [cs.CV])
    Recent success of large-scale Contrastive Language-Image Pre-training (CLIP) has led to great promise in zero-shot semantic segmentation by transferring image-text aligned knowledge to pixel-level classification. However, existing methods usually require an additional image encoder or retraining/tuning the CLIP module. Here, we present a cost-effective strategy using text-prompt learning that keeps the entire CLIP module frozen while fully leveraging its rich information. Specifically, we propose a novel Zero-shot segmentation with Optimal Transport (ZegOT) method that matches multiple text prompts with frozen image embeddings through optimal transport, which allows each text prompt to efficiently focus on specific semantic attributes. Additionally, we propose Deep Local Feature Alignment (DLFA) that deeply aligns the text prompts with intermediate local feature of the frozen image encoder layers, which significantly boosts the zero-shot segmentation performance. Through extensive experiments on benchmark datasets, we show that our method achieves the state-of-the-art (SOTA) performance with only x7 lighter parameters compared to previous SOTA approaches.  ( 2 min )
    STEERING: Stein Information Directed Exploration for Model-Based Reinforcement Learning. (arXiv:2301.12038v1 [cs.LG])
    Directed Exploration is a crucial challenge in reinforcement learning (RL), especially when rewards are sparse. Information-directed sampling (IDS), which optimizes the information ratio, seeks to do so by augmenting regret with information gain. However, estimating information gain is computationally intractable or relies on restrictive assumptions which prohibit its use in many practical instances. In this work, we posit an alternative exploration incentive in terms of the integral probability metric (IPM) between a current estimate of the transition model and the unknown optimal, which under suitable conditions, can be computed in closed form with the kernelized Stein discrepancy (KSD). Based on KSD, we develop a novel algorithm STEERING: \textbf{STE}in information dir\textbf{E}cted exploration for model-based \textbf{R}einforcement Learn\textbf{ING}. To enable its derivation, we develop fundamentally new variants of KSD for discrete conditional distributions. We further establish that STEERING archives sublinear Bayesian regret, improving upon prior learning rates of information-augmented MBRL, IDS included. Experimentally, we show that the proposed algorithm is computationally affordable and outperforms several prior approaches.  ( 2 min )
    Minimizing Trajectory Curvature of ODE-based Generative Models. (arXiv:2301.12003v1 [cs.LG])
    Recent ODE/SDE-based generative models, such as diffusion models and flow matching, define a generative process as a time reversal of a fixed forward process. Even though these models show impressive performance on large-scale datasets, numerical simulation requires multiple evaluations of a neural network, leading to a slow sampling speed. We attribute the reason to the high curvature of the learned generative trajectories, as it is directly related to the truncation error of a numerical solver. Based on the relationship between the forward process and the curvature, here we present an efficient method of training the forward process to minimize the curvature of generative trajectories without any ODE/SDE simulation. Experiments show that our method achieves a lower curvature than previous models and, therefore, decreased sampling costs while maintaining competitive performance. Code is available at https://github.com/sangyun884/fast-ode.  ( 2 min )
    Quantum Ridgelet Transform: Winning Lottery Ticket of Neural Networks with Quantum Computation. (arXiv:2301.11936v1 [quant-ph])
    Ridgelet transform has been a fundamental mathematical tool in the theoretical studies of neural networks. However, the practical applicability of ridgelet transform to conducting learning tasks was limited since its numerical implementation by conventional classical computation requires an exponential runtime $\exp(O(D))$ as data dimension $D$ increases. To address this problem, we develop a quantum ridgelet transform (QRT), which implements the ridgelet transform of a quantum state within a linear runtime $O(D)$ of quantum computation. As an application, we also show that one can use QRT as a fundamental subroutine for quantum machine learning (QML) to efficiently find a sparse trainable subnetwork of large shallow wide neural networks without conducting large-scale optimization of the original network. This application discovers an efficient way in this regime to demonstrate the lottery ticket hypothesis on finding such a sparse trainable neural network. These results open an avenue of QML for accelerating learning tasks with commonly used classical neural networks.  ( 2 min )
    Alignment with human representations supports robust few-shot learning. (arXiv:2301.11990v1 [cs.LG])
    Should we care whether AI systems have representations of the world that are similar to those of humans? We provide an information-theoretic analysis that suggests that there should be a U-shaped relationship between the degree of representational alignment with humans and performance on few-shot learning tasks. We confirm this prediction empirically, finding such a relationship in an analysis of the performance of 491 computer vision models. We also show that highly-aligned models are more robust to both adversarial attacks and domain shifts. Our results suggest that human-alignment is often a sufficient, but not necessary, condition for models to make effective use of limited data, be robust, and generalize well.  ( 2 min )
    Leveraging Importance Weights in Subset Selection. (arXiv:2301.12052v1 [cs.LG])
    We present a subset selection algorithm designed to work with arbitrary model families in a practical batch setting. In such a setting, an algorithm can sample examples one at a time but, in order to limit overhead costs, is only able to update its state (i.e. further train model weights) once a large enough batch of examples is selected. Our algorithm, IWeS, selects examples by importance sampling where the sampling probability assigned to each example is based on the entropy of models trained on previously selected batches. IWeS admits significant performance improvement compared to other subset selection algorithms for seven publicly available datasets. Additionally, it is competitive in an active learning setting, where the label information is not available at selection time. We also provide an initial theoretical analysis to support our importance weighting approach, proving generalization and sampling rate bounds.  ( 2 min )
    Reduced-Order Autodifferentiable Ensemble Kalman Filters. (arXiv:2301.11961v1 [stat.ML])
    This paper introduces a computational framework to reconstruct and forecast a partially observed state that evolves according to an unknown or expensive-to-simulate dynamical system. Our reduced-order autodifferentiable ensemble Kalman filters (ROAD-EnKFs) learn a latent low-dimensional surrogate model for the dynamics and a decoder that maps from the latent space to the state space. The learned dynamics and decoder are then used within an ensemble Kalman filter to reconstruct and forecast the state. Numerical experiments show that if the state dynamics exhibit a hidden low-dimensional structure, ROAD-EnKFs achieve higher accuracy at lower computational cost compared to existing methods. If such structure is not expressed in the latent state dynamics, ROAD-EnKFs achieve similar accuracy at lower cost, making them a promising approach for surrogate state reconstruction and forecasting.  ( 2 min )

  • Open

    [R] Are there any Machine Learning Journals that accept Viewpoint Papers (~1500+ words)?
    Basically the title. I have a sequence of two papers - a viewpoint and a complete paper in the works - that I'm looking to submit, the viewpoint outlining the theoretical premise for the latter. I've currently had no luck finding any ML-specific journals that allow viewpoint submissions (with the exception of simply posting to arXiv), and was wondering if anyone here was familiar with any. Thanks :D submitted by /u/Adi-Dewan [link] [comments]  ( 42 min )
    [P] Python wrapper of [ New AI classifier for indicating AI-written text from openai tool ]
    Openai is developing a new tool to help distinguish between AI-written and human-written text. Here is an unofficial python wrapper of openai model to detect if a text is written by #chatgpt , #gpt3 , #gpt etc Github: https://github.com/promptslab/openai-detector https://preview.redd.it/f45ggu45tgfa1.png?width=1122&format=png&auto=webp&s=4cb5ae70d7194cc3c070f3ad2dcbc968a804d4a3 submitted by /u/StoicBatman [link] [comments]  ( 42 min )
    Introducing NoRef-ER: A Multi-Language Referenceless ASR Metric (on HuggingFace) [R] [P]
    I am proud to announce the release of NoRefER, a multi-language referenceless ASR metric based on a fine-tuned language model, for public use on HuggingFace. This metric allows for evaluating the outputs of ASR models without needing a reference transcript, making it a valuable tool for a/b testing multiple ASR models or model versions, or even ensembling their outputs. ASR is an important technology with various applications, but the quality of ASR systems can vary greatly. It's important to accurately evaluate and compare the performance of different ASR models, traditionally done using reference-based ASR quality evaluation metrics. However, obtaining those ground-truth transcriptions from human annotators is time-consuming and costly. Referenceless quality evaluation is becoming impo…  ( 44 min )
    [N] Vincent Warmerdam: Calmcode, Explosion, Open Source and Data Science | Learning From Machine Learning #2
    https://www.youtube.com/watch?v=yvgxRzqx1Jg ​ Contents 00:00 Learning from Machine Learning Intro 00:21 Vincent Warmerdam Intro 01:18 Career Journey 03:25 What roles have you played? 05:44 Academic Background: Operations Research and Design 06:52 Operations Research 08:13 Mathematics 09:19 What attracted you to Machine Learning? 10:40 Calmcode 14:08 Calmcode, Do you use it? 15:22 Calmdcode, *args, **kwargs 16:23 If there were no constraints, what would you do to improve calmcode? 18:10 Open Source Projects: bulk, embetter, human-learn 19:10 Open Source: evol, scikit-lego 20:00 Rasa: Chatbots, Benchmarking 20:47 Unit Tests 21:42 Open Source: Creating Packages 24:10 Bulk, human-learn 26:20 27:03 Bulk in a notebook, bulk as a webapp 27:45 Human in the loop 29:03 Understanding the problem; Beans, Beef and Bread 32:56 Algorithm on the wrong problem 34:55 Module Improvement vs System Improvement 37:20 Does your answer make sense? 39:04 What's an important question that you believe remains unanswered in ML? 41:48 How do you view the gap between the hype and reality of AI? 46:28 Generative Models vs. Predictive Models 49:18 Jumping to solutions 50:08 Model vs. System 50:48 51:10 Who has influenced you in the field? 55:18 Humble, Caring Presenters 56:38 What's one piece of advice that you've received that's helped you? 01:00:18 Advice for people just starting in the field 01:03:15 What has a career in machine learning taught you about life? 01:05:16 SpaCy 01:06:10 Data-Centric Approach 01:06:50 Wrap-up 01:07:15 Follow, Explosion 01:07:48 Outro submitted by /u/NLPnerd [link] [comments]  ( 43 min )
    [D] Generative Model FOr Facts Extraction
    Is it possible to finetune a generative model (like T5) to do something like this: { inputs: "XYZ XYZ was born in ABC. They now live in DEF.", targets: "XYZ born in ABC XYZ lives in DEF" } Like the transformer model fom this paper if so how should I go about approaching the problem? Is this task as simple as feeding it the inputs and targets or do you guys think it has more to it? submitted by /u/Zetsu-Eiyu-O [link] [comments]  ( 43 min )
    [D] Open-source auto-ml services
    What're some good open-source auto-ml services mainly for image classification that're similar to Google's Vertex-AI submitted by /u/binaryshrey [link] [comments]  ( 42 min )
    [N] Monitor OpenAI API Latency, Tokens, Rate Limits, and More with Graphsignal
    Relying on hosted inference with LLMs in productions, such as via OpenAI API, has some challenges. The use of APIs should be designed around unstable latency, rate limits, token counts, costs, etc. To make it observable we've built tracing and monitoring specifically for AI apps. For example, the OpenAI Python library is monitored automatically, no need to do anything. We'll be adding support for more libraries. Here is a blog post with more info and screenshots: Monitor OpenAI API Latency, Tokens, Rate Limits, and More. And the GitHub repo. submitted by /u/l0g1cs [link] [comments]  ( 42 min )
    [P] Fine Tuning Whisper in another language
    Hi all, I'm trying to fine-tune Whisper AI to transcribe albanian speech to text but I have a problem in that I don't know how the dataset for training whisper model should look like. I already have voice audios and the transcript for that audio file but I need to know how to reformat it into a valid dataset for training Whisper. Thanks in advance! submitted by /u/ruizard [link] [comments]  ( 43 min )
    [D] Have researchers given up on traditional machine learning methods?
    This may be a silly question for those familiar with the field, but don't machine learning researchers expect any more prospects for traditional methods (I mean, "traditional" is other than deep learning)? I feel that most of the time when people talk about machine learning in the world today, they are referring to deep learning, but is this the same in the academic world? Have people who have been studying traditional methods switched to neural networks? I know that many researchers are excited about deep learning, but I am wondering what they think about other methods. submitted by /u/fujidaiti [link] [comments]  ( 52 min )
  • Open

    Snoop Dogg Giving a speech about Zombies | AI Animation
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    Is it possible to find a specific group of layers where vanishing gradients is starting and cascading ?
    I was curious to know if there are any particular layers which could be contributing to gradients vanishing and these vanishing gradients cascading down to subsequent lower layers, how to find these layers and if there has been any research work done on it ? submitted by /u/V1bicycle [link] [comments]  ( 41 min )
    Feature selection
    Hello, hoping I can get a simple question answered. How are features selected for neural nets where each instance in time has too much data to take in? Let’s say a 4K movie is going to play and we want to know if the next 3 seconds will contain X. We also don’t have a computer strong enough to take in all data at all times. Let’s say 20% of data is maxing our system. I have watched a ton of videos about how layers work but how does the NN take samples to put those samples into its filter? Is it up to the programmer to find clever ways of filtering down the data into certain “indicators” or something? submitted by /u/Joebone87 [link] [comments]  ( 41 min )
    Biologically plausible neural networks?
    Biologically plausible neural networks? So I'm wondering if you could possibly design ANN to simulate action potentials of neurons, and train them to do various tasks. I would hope that you can accurately simulate specific patterns of neural activity, when running a task. I'm sure this has already been done, but I'm wondering how big of a task it is to accomplish. Thanks! submitted by /u/daddydilly694-20 [link] [comments]  ( 42 min )
    Recurrent neural network in python (keras) error: ValueError: `logits` and `labels` must have the same shape, received ((None, 90, 1) vs (None,))
    I'm developing a recurrent neural network in python using keras to do binary classification on roulette wheel data. I'm trying to compile my code but it's crashing, could you help me fix the code please? Here is my code from keras.models import Sequential from keras.layers import Dense, Dropout from sklearn.preprocessing import MinMaxScaler import numpy as np import pandas as pd columns = ['data', 'resultado'] base = pd.read_csv("blaze_values_27_01_2023_VERMELHO_1.csv", header = None, names = columns) base = base.dropna() base_treinamento = base.iloc[:, 1:2] normalizador = MinMaxScaler(feature_range=[0,1]) base_treinamento_normalizada = normalizador.fit_transform(base_treinamento) previsores = [] saida_real = [] for i in range (90,1809): previsores.append(base_treinamento_normalizada[i-9…  ( 42 min )
  • Open

    Snoop Dogg Giving a speech about Zombies | AI Animation
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    "Accurate and Explainable Image-based Prediction Using a Lightweight Generative Model"
    submitted by /u/pasticciociccio [link] [comments]  ( 40 min )
    100 000 cells simulated, how many will we need to form something that looks like consciousness?
    submitted by /u/blob_evol_sim [link] [comments]  ( 41 min )
    NeuralShare - make any AI model accessible for free from any device, regardless of its hardware resources
    Hello humans! A project is about to start (https://github.com/neuralshare) which will allow any device, even low-cpu, to use and interact with AI models that they could not use since they require so many hardware resources. The project is called NeuralShare and will use the Stellar network to achieve its goal, as you can read on the README on Github at the link on this post. More specifically, it will use the Futurenet network and to achieve its purpose it will use the NEUR token, also on Futurenet, which was already created a little while ago and has no real value, has no speculative objective but only has one purpose, namely that of the project. We plan to distribute this token as an airdrop, totally free of course, and anyone who has this token will be able to use and interact with GPT-3 or other AI models for free simply by sending a NEUR transaction with a text memo attached containing the prompt. In return you will receive back a transaction with a text memo containing the response. However, this will only be possible when there are enough nodes (go to github to understand why),the precise number of nodes that must be active before the response can be received is still to be defined . For more details and updates you can add yourself to Discord, find the links on the README file on GitHub, as well as more details on the functioning of this method which will also allow you to use GPT-3 for free and without an api-key submitted by /u/0ut0flin3 [link] [comments]  ( 43 min )
    AI Related Newsletter
    I do not even remember signing up for an AI newsletter but I got an issue today from “AINow” that was actually pretty insightful/informative. Thought I’d share it as I see frequent posts asking where ppl get their info/news from. Newsletter Here submitted by /u/iwjahshehbs [link] [comments]  ( 40 min )
    AI Still Feels Artificial. What Are We Missing?
    submitted by /u/jrowley [link] [comments]  ( 40 min )
    Stable Diffusion + Dream Fusion + Text-to-Motion. This animation has been made in 5 minutes with the AI-Game Development platform I'm building. No coding or design skills needed, just text prompt engineering. Assets exportable in Unity. Seeking alpha testers
    submitted by /u/SpeaKrLipSync [link] [comments]  ( 41 min )
    This ended super convincing: O Captain! My Captain by Benedict Cumberbatch (ElevenLabs)
    submitted by /u/citizentim [link] [comments]  ( 41 min )
    OpenAI releases AI text detector for ChatGPT and other models
    submitted by /u/much_successes [link] [comments]  ( 40 min )
    Anthropic's Claude: Ex-OpenAI Employees Launches ChatGPT Rival
    submitted by /u/bukowski3000 [link] [comments]  ( 40 min )
    Which are the opensource SOTA for voice conversion and/or voice cloning for the Indic languages ?
    Lot of ok-ish voice cloning and conversion tools are available in market, but most have an American English tone to them. What are the opensource SOTA for voice conversion and/or voice cloning for the Indic languages ? submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    [Searchcolab] BREAKING!! And update to ChatGPT just launched.
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    What tech jobs will be safe from AI over the next decade?
    I'm studying computer networking but I'm certain that AI can easily set up virtual networks and cloud computing solutions on their own. Should I have followed my dream to become a bricklayer? submitted by /u/black_linux_guy [link] [comments]  ( 41 min )
    AI Music Videos
    Does anybody know how people on Instagram and TikTok are creating these AI music videos? I see them everywhere but they gatekeep what ever they’re using. submitted by /u/mynameisbob1011 [link] [comments]  ( 40 min )
    📌[Searchcolab] "Gotham during Recession" Link in comments.
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    Where do you go for news updates and insights? Not Including Here
    Hello all! ​ I am new to this subreddit, but I have begun the fascinating descent into learning about all things AI and its ilk. I would like your help. ​ I want to turn on the firehose and consume all the content I can. ​ - What are your favorite news sources and media creators? ​ - How do you find the latest tools and research papers? ​ - How do you see where the conversation is going and stay updated with new opinions? ​ This would be incredibly helpful, and I hope I posted right. submitted by /u/WaffleHouseBaby [link] [comments]  ( 41 min )
    AI Certifications
    Hi All -- I've been in IT for about 15 years. I started as a sysadmin, segued into dev work, and have been specializing in systems & data integration and business process automation for the past 5ish years. I also studied Computer Science for 4 years but never graduated (that's a whole other sobstory -- won't bore anyone with that) I've been considering the direction I want to go for the next leg of my career and I feel like AI/ML is the logical next step. I've started building a portfolio of my AI/ML projects and I'd like to pick up a cert or two to compliment them. My ultimate goal is solutions architecture, but I love the engineering side of things and expect to start there. What I want to know is what certifications everyone would recommend -- I've been eyeing the IBM-sponsored certificate tracks on Coursera but I've seen a few others and they all have their merits. I'm just not sure how much value they carry in the job market. Are there any worth staying well away from? submitted by /u/am_i_the_rabbit [link] [comments]  ( 41 min )
    DHT(BitTorrent) network that replaces websites/dns
    Soon we will have very powerful assistants.They can interact with databases directly.E.g. you ask for news articles from BBC.In the case of assistants they don't need human readable url.They can sql query database on 192.168.1.1 with ease.And then they present the data in appropriate format.And dht is just for discovery. submitted by /u/nikitastaf1996 [link] [comments]  ( 41 min )
    ChatGPT Could Destroy Google In A Few Years, According To Gmail’s Creator
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    Princeton computer science professor says don’t panic over ‘bullshit generator’ ChatGPT
    submitted by /u/Mental_Character7367 [link] [comments]  ( 40 min )
    A tool to run GPT3 responses on your offline data and custom data sources
    At DocuStack, our goal is to help organisations turn their internal knowledge into an answering bot, saving time and increasing efficiency for teams. If you're familiar with the challenges of managing internal knowledge and answering repetitive questions, I believe you will find DocuStack to be a valuable tool. With features such as a searchable database, customisable answers, multi-language support, and offline file upload capabilities, DocuStack makes it easy to manage your internal knowledge and provide quick, accurate answers to your team. Here is the link https://www.docustack.ai/chat This is our first launch of a wider product for customer support. would love to hear your thoughts and do not forget join the waitlist, if you would want to try the beta version. submitted by /u/titansaurabh [link] [comments]  ( 41 min )
    Science and Engineering Fair Project regarding AI
    Hello everyone. I'm passionate about AI and just computer science in general. I would like to participate in a Science and Engineering Fair, but I want to have an idea of what I'm going to do first. What can I do involving AI and do very well in the fair? If not AI, then computer science in general. Thank you submitted by /u/Sufficient-Ad-8881 [link] [comments]  ( 42 min )
    The generative AI revolution has begun—how did we get here? | "A new class of incredibly powerful AI models has made recent breakthroughs possible."
    submitted by /u/Tao_Dragon [link] [comments]  ( 40 min )
    [D] I have been working in UI development for around 12 years having worked in pretty much everything that is there in UI. But lately I feel AI is most definitely something that has the max scope and potential in the future. So my question is how do I get into the field of AI? what would be roadmap?
    submitted by /u/anuratya [link] [comments]  ( 43 min )
  • Open

    Learning an action using PPO reinforcement learning that is also a negative reward?
    I’m doing RL on a problem where I learn 2 actions and my reward = action 1 - action 2. Since action 2 is getting subtracted, the agent learns to output 0.0 value for action 2 (action space for both the actions is between 0.0 and 0.1). Can someone please advice how can I make the agent explore non-zero values for action 2. submitted by /u/HonestScratch1827 [link] [comments]  ( 41 min )
    Multi-Agent RL for Ranged Army Combat Micro-Management (Like Dragon PvP Fight in StarCraft)
    I would like to invite interested people to collaborate on this hobby project of mine. This is still in an early-stage, and I believe it can be significantly improved together. The GitHub repository link is here: https://github.com/kayuksel/multi-rl-crowd-sim Note: The difference from StarCraft is that Dragons can hide behind each other. They also reduce their strength of hitting, propotional to decrease of their health. https://preview.redd.it/wrpcaz782dfa1.png?width=640&format=png&auto=webp&s=1dede69acb78e874a80bd532af85b269c7117f9f submitted by /u/k_yuksel [link] [comments]  ( 41 min )
    Autotuned temperature for SAC
    Has anyone ever monitored the behavior of alpha in autotuned SAC ? I implemented it and it seems to work, but I would be interested in seeing a commented graph of the evolution of alpha during the learning process, and I could not find a contribution including this. submitted by /u/Scrimbibete [link] [comments]  ( 41 min )
    Odd Reward behavior
    Hi all, I'm training an Agent (to control a platform to maintain attitude) but I'm having problems understanding the following behavior: R = A - penalty I thought adding 1.0 would increase the cumulative reward but that's not the case. R1 = A - penalty + 1.0 R1 ends up being less than R. ​ In light of this, I multiplied penalty by 10 to see what happens: R2 = A - 10.0*penalty This, increases cumulative reward (R2 > R). ​ Note that 'A' and 'penalty' are always positive values. Any idea what this means (and how to go about shaping R)? submitted by /u/XecutionStyle [link] [comments]  ( 46 min )
  • Open

    Cyberpunk 2077 Brings a Taste of the Future With DLSS
    Analyst reports. Academic papers. Ph.D. programs. There are a lot of places you can go to get a glimpse of the future. But the best place might just be El Coyote Cojo, a whiskey-soaked dive bar that doesn’t exist in real life. Fire up Cyberpunk 2077 and you’ll see much more than the watering hole’s Read article >  ( 6 min )
    Broadcaster ‘Nilson1489’ Shares Livestreaming Techniques and More This Week ‘In the NVIDIA Studio’
    Broadcasters have an arsenal of new features and technologies at their disposal; the eighth-generation NVIDIA video encoder on RTX 40 Series GPUs with support for the open AV1 video-coding format; new NVIDIA Broadcast app effects like Eye Contact and Vignette; and support for AV1 streaming in Discord.  ( 7 min )
  • Open

    New AI classifier for indicating AI-written text
    We’re launching a classifier trained to distinguish between AI-written and human-written text. We’ve trained a classifier to distinguish between text written by a human and text written by AIs from a variety of providers. While it is impossible to reliably detect all AI-written text, we believe  ( 3 min )
  • Open

    DSC Weekly 31 January 2023 – Data Models for the Weather
    Announcements Data Models for the Weather With January coming to an end, we here in the Northeast let out a collective sigh of relief as the month ends without any major snowstorms that tend to happen in the first month of the year. Weather forecasting is a centuries-old practice that has its roots in divination… Read More »DSC Weekly 31 January 2023 – Data Models for the Weather The post DSC Weekly 31 January 2023 – Data Models for the Weather appeared first on Data Science Central.  ( 19 min )
    Explaining FAIR Data to Aunt Doris
    I’m sure you’ve run into this situation yourself. You’re at a family gathering, and someone at the table asks you exactly what you do for a living. Maybe it’s your uncle, a grandparent, or a child. You try to describe in simple terms what you do, but they get a mystified expression on their face.… Read More »Explaining FAIR Data to Aunt Doris The post Explaining FAIR Data to Aunt Doris appeared first on Data Science Central.  ( 21 min )
    Java in Cloud Native Environment: All You Need To Know
    Java has been a prevalent programming language. Even today, it remains one of the top three most-used languages for developing enterprise software. New cloud-native Java runtimes must provide developers with the following four significant benefits. It helps build cloud-native, microservices, and serverless Java applications: Traditional Java applications run as containers on hardware servers that control… Read More »Java in Cloud Native Environment: All You Need To Know The post Java in Cloud Native Environment: All You Need To Know appeared first on Data Science Central.  ( 20 min )
    NIST Artificial Intelligence Risk Management Framework
    The National Institute of Standards and Technology (NIST) has released it’s Artificial Intelligence Risk Management Framework (AI RMF 1.0), a guidance document for voluntary use by organizations designing, developing, deploying, or using AI systems to help manage the many risks of AI technologies. The NIST AI Risk standards provide a practical and bipartisan perspective for… Read More »NIST Artificial Intelligence Risk Management Framework The post NIST Artificial Intelligence Risk Management Framework appeared first on Data Science Central.  ( 19 min )
    Exploding vs. Imploding: What the NFL Has to Teach Us About Managing Agile Enterprises, Part II
    In the previous article, we looked at two Ever-Successful NFL teams, the Kansas City Chiefs and the San Francisco 49ers, who seem to be able to win consistently even while things change around them and players and coaches come and go.  Then, we looked at two Never-Successful teams, the Arizona Cardinals and the Cleveland Browns,… Read More »Exploding vs. Imploding: What the NFL Has to Teach Us About Managing Agile Enterprises, Part II The post Exploding vs. Imploding: What the NFL Has to Teach Us About Managing Agile Enterprises, Part II appeared first on Data Science Central.  ( 26 min )
  • Open

    Avoid having to integrate by parts twice
    Suppose f(x) and g(x) are functions that are each proportional to their second derivative. These include exponential, circular, and hyperbolic functions. Then the integral of f(x) g(x) can be computed in closed form with a moderate amount of work. The first time you see how such integrals are computed, it’s an interesting trick. I wrote […] Avoid having to integrate by parts twice first appeared on John D. Cook.  ( 5 min )
  • Open

    Mo\^usai: Text-to-Music Generation with Long-Context Latent Diffusion. (arXiv:2301.11757v1 [cs.CL])
    The recent surge in popularity of diffusion models for image generation has brought new attention to the potential of these models in other areas of media synthesis. One area that has yet to be fully explored is the application of diffusion models to music generation. Music generation requires to handle multiple aspects, including the temporal dimension, long-term structure, multiple layers of overlapping sounds, and nuances that only trained listeners can detect. In our work, we investigate the potential of diffusion models for text-conditional music generation. We develop a cascading latent diffusion approach that can generate multiple minutes of high-quality stereo music at 48kHz from textual descriptions. For each model, we make an effort to maintain reasonable inference speed, targeting real-time on a single consumer GPU. In addition to trained models, we provide a collection of open-source libraries with the hope of facilitating future work in the field. We open-source the following: - Music samples for this paper: https://bit.ly/anonymous-mousai - All music samples for all models: https://bit.ly/audio-diffusion - Codes: https://github.com/archinetai/audio-diffusion-pytorch  ( 2 min )
    FedPop: A Bayesian Approach for Personalised Federated Learning. (arXiv:2206.03611v2 [cs.LG] UPDATED)
    Personalised federated learning (FL) aims at collaboratively learning a machine learning model taylored for each client. Albeit promising advances have been made in this direction, most of existing approaches works do not allow for uncertainty quantification which is crucial in many applications. In addition, personalisation in the cross-device setting still involves important issues, especially for new clients or those having small number of observations. This paper aims at filling these gaps. To this end, we propose a novel methodology coined FedPop by recasting personalised FL into the population modeling paradigm where clients' models involve fixed common population parameters and random effects, aiming at explaining data heterogeneity. To derive convergence guarantees for our scheme, we introduce a new class of federated stochastic optimisation algorithms which relies on Markov chain Monte Carlo methods. Compared to existing personalised FL methods, the proposed methodology has important benefits: it is robust to client drift, practical for inference on new clients, and above all, enables uncertainty quantification under mild computational and memory overheads. We provide non-asymptotic convergence guarantees for the proposed algorithms and illustrate their performances on various personalised federated learning tasks.  ( 2 min )
    Input Perturbation Reduces Exposure Bias in Diffusion Models. (arXiv:2301.11706v1 [cs.LG])
    Denoising Diffusion Probabilistic Models have shown an impressive generation quality, although their long sampling chain leads to high computational costs. In this paper, we observe that a long sampling chain also leads to an error accumulation phenomenon, which is similar to the \textbf{exposure bias} problem in autoregressive text generation. Specifically, we note that there is a discrepancy between training and testing, since the former is conditioned on the ground truth samples, while the latter is conditioned on the previously generated results. To alleviate this problem, we propose a very simple but effective training regularization, consisting in perturbing the ground truth samples to simulate the inference time prediction errors. We empirically show that the proposed input perturbation leads to a significant improvement of the sample quality while reducing both the training and the inference times. For instance, on CelebA 64$\times$64, we achieve a new state-of-the-art FID score of 1.27, while saving 37.5% of the training time.  ( 2 min )
    Learning Visual Representations for Transfer Learning by Suppressing Texture. (arXiv:2011.01901v3 [cs.CV] UPDATED)
    Recent literature has shown that features obtained from supervised training of CNNs may over-emphasize texture rather than encoding high-level information. In self-supervised learning in particular, texture as a low-level cue may provide shortcuts that prevent the network from learning higher level representations. To address these problems we propose to use classic methods based on anisotropic diffusion to augment training using images with suppressed texture. This simple method helps retain important edge information and suppress texture at the same time. We empirically show that our method achieves state-of-the-art results on object detection and image classification with eight diverse datasets in either supervised or self-supervised learning tasks such as MoCoV2 and Jigsaw. Our method is particularly effective for transfer learning tasks and we observed improved performance on five standard transfer learning datasets. The large improvements (up to 11.49\%) on the Sketch-ImageNet dataset, DTD dataset and additional visual analyses with saliency maps suggest that our approach helps in learning better representations that better transfer.  ( 2 min )
    DBGSL: Dynamic Brain Graph Structure Learning. (arXiv:2209.13513v2 [cs.LG] UPDATED)
    Recently, graph neural networks (GNNs) have shown success at learning representations of brain graphs derived from functional magnetic resonance imaging (fMRI) data. The majority of existing GNN methods, however, assume brain graphs are static over time and the graph adjacency matrix is known prior to model training. These assumptions are at odds with neuroscientific evidence that brain graphs are time-varying with a connectivity structure that depends on the choice of functional connectivity measure. Noisy brain graphs that do not truly represent the underling fMRI data can have a detrimental impact on the performance of GNNs. As a solution, we propose Dynamic Brain Graph Structure Learning (DBGSL), a novel method for learning the optimal time-varying dependency structure of fMRI data induced by a downstream prediction task. Experiments demonstrate DBGSL achieves state-of-the-art performance for sex classification using real-world resting-state and task fMRI data. Moreover, analysis of the learnt dynamic graphs highlights prediction-related brain regions which align with existing neuroscience literature.  ( 2 min )
    Normality-Guided Distributional Reinforcement Learning for Continuous Control. (arXiv:2208.13125v2 [cs.LG] UPDATED)
    Learning a predictive model of the mean return, or value function, plays a critical role in many reinforcement learning algorithms. Distributional reinforcement learning (DRL) methods instead model the value distribution, which has been shown to improve performance in many settings. In this paper, we model the value distribution as approximately normal using the Markov Chain central limit theorem. We analytically compute quantile bars to provide a new DRL target that is informed by the decrease in standard deviation that occurs over the course of an episode. In addition, we propose a policy update strategy based on uncertainty as measured by structural characteristics of the value distribution not present in the standard value function. The approach we outline is compatible with many DRL structures. We use two representative on-policy algorithms, PPO and TRPO, as testbeds and show that our methods produce performance improvements in continuous control tasks.  ( 2 min )
    Artificial Replay: A Meta-Algorithm for Harnessing Historical Data in Bandits. (arXiv:2210.00025v2 [cs.LG] UPDATED)
    How best to incorporate historical data to "warm start" bandit algorithms is an open question: naively initializing reward estimates using all historical samples can suffer from spurious data and imbalanced data coverage, leading to computational and storage issues $\unicode{x2014}$ particularly salient in continuous action spaces. We propose Artificial Replay, a meta-algorithm for incorporating historical data into any arbitrary base bandit algorithm. Artificial Replay uses only a fraction of the historical data compared to a full warm-start approach, while still achieving identical regret for base algorithms that satisfy independence of irrelevant data (IIData), a novel and broadly applicable property that we introduce. We complement these theoretical results with experiments on $K$-armed and continuous combinatorial bandit algorithms, including a green security domain using real poaching data. We show the practical benefits of Artificial Replay, including for base algorithms that do not satisfy IIData.  ( 2 min )
    Neural Additive Models for Location Scale and Shape: A Framework for Interpretable Neural Regression Beyond the Mean. (arXiv:2301.11862v1 [stat.ML])
    Deep neural networks (DNNs) have proven to be highly effective in a variety of tasks, making them the go-to method for problems requiring high-level predictive power. Despite this success, the inner workings of DNNs are often not transparent, making them difficult to interpret or understand. This lack of interpretability has led to increased research on inherently interpretable neural networks in recent years. Models such as Neural Additive Models (NAMs) achieve visual interpretability through the combination of classical statistical methods with DNNs. However, these approaches only concentrate on mean response predictions, leaving out other properties of the response distribution of the underlying data. We propose Neural Additive Models for Location Scale and Shape (NAMLSS), a modelling framework that combines the predictive power of classical deep learning models with the inherent advantages of distributional regression while maintaining the interpretability of additive models.  ( 2 min )
    Challenging Common Assumptions in Convex Reinforcement Learning. (arXiv:2202.01511v3 [cs.LG] UPDATED)
    The classic Reinforcement Learning (RL) formulation concerns the maximization of a scalar reward function. More recently, convex RL has been introduced to extend the RL formulation to all the objectives that are convex functions of the state distribution induced by a policy. Notably, convex RL covers several relevant applications that do not fall into the scalar formulation, including imitation learning, risk-averse RL, and pure exploration. In classic RL, it is common to optimize an infinite trials objective, which accounts for the state distribution instead of the empirical state visitation frequencies, even though the actual number of trajectories is always finite in practice. This is theoretically sound since the infinite trials and finite trials objectives can be proved to coincide and thus lead to the same optimal policy. In this paper, we show that this hidden assumption does not hold in the convex RL setting. In particular, we show that erroneously optimizing the infinite trials objective in place of the actual finite trials one, as it is usually done, can lead to a significant approximation error. Since the finite trials setting is the default in both simulated and real-world RL, we believe shedding light on this issue will lead to better approaches and methodologies for convex RL, impacting relevant research areas such as imitation learning, risk-averse RL, and pure exploration among others.  ( 2 min )
    Bayesian Self-Supervised Contrastive Learning. (arXiv:2301.11673v1 [cs.LG])
    Recent years have witnessed many successful applications of contrastive learning in diverse domains, yet its self-supervised version still remains many exciting challenges. As the negative samples are drawn from unlabeled datasets, a randomly selected sample may be actually a false negative to an anchor, leading to incorrect encoder training. This paper proposes a new self-supervised contrastive loss called the BCL loss that still uses random samples from the unlabeled data while correcting the resulting bias with importance weights. The key idea is to design the desired sampling distribution for sampling hard true negative samples under the Bayesian framework. The prominent advantage lies in that the desired sampling distribution is a parametric structure, with a location parameter for debiasing false negative and concentration parameter for mining hard negative, respectively. Experiments validate the effectiveness and superiority of the BCL loss.  ( 2 min )
    Rethinking Assumptions in Deep Anomaly Detection. (arXiv:2006.00339v3 [cs.LG] UPDATED)
    Though anomaly detection (AD) can be viewed as a classification problem (nominal vs. anomalous) it is usually treated in an unsupervised manner since one typically does not have access to, or it is infeasible to utilize, a dataset that sufficiently characterizes what it means to be "anomalous." In this paper we present results demonstrating that this intuition surprisingly seems not to extend to deep AD on images. For a recent AD benchmark on ImageNet, classifiers trained to discern between normal samples and just a few (64) random natural images are able to outperform the current state of the art in deep AD. Experimentally we discover that the multiscale structure of image data makes example anomalies exceptionally informative.  ( 2 min )
    Lifelong Reinforcement Learning with Modulating Masks. (arXiv:2212.11110v2 [cs.LG] UPDATED)
    Lifelong learning aims to create AI systems that continuously and incrementally learn during a lifetime, similar to biological learning. Attempts so far have met problems, including catastrophic forgetting, interference among tasks, and the inability to exploit previous knowledge. While considerable research has focused on learning multiple input distributions, typically in classification, lifelong reinforcement learning (LRL) must also deal with variations in the state and transition distributions, and in the reward functions. Modulating masks, recently developed for classification, are particularly suitable to deal with such a large spectrum of task variations. In this paper, we adapted modulating masks to work with deep LRL, specifically PPO and IMPALA agents. The comparison with LRL baselines in both discrete and continuous RL tasks shows superior performance. We further investigated the use of a linear combination of previously learned masks to exploit previous knowledge when learning new tasks: not only is learning faster, the algorithm solves tasks that we could not otherwise solve from scratch due to extremely sparse rewards. The results suggest that RL with modulating masks is a promising approach to lifelong learning, to the composition of knowledge to learn increasingly complex tasks, and to knowledge reuse for efficient and faster learning.  ( 2 min )
    The Stochastic Proximal Distance Algorithm. (arXiv:2210.12277v3 [stat.ML] UPDATED)
    Stochastic versions of proximal methods have gained much attention in statistics and machine learning. These algorithms tend to admit simple, scalable forms, and enjoy numerical stability via implicit updates. In this work, we propose and analyze a stochastic version of the recently proposed proximal distance algorithm, a class of iterative optimization methods that recover a desired constrained estimation problem as a penalty parameter $\rho \rightarrow \infty$. By uncovering connections to related stochastic proximal methods and interpreting the penalty parameter as the learning rate, we justify heuristics used in practical manifestations of the proximal distance method, establishing their convergence guarantees for the first time. Moreover, we extend recent theoretical devices to establish finite error bounds and a complete characterization of convergence rates regimes. We validate our analysis via a thorough empirical study, also showing that unsurprisingly, the proposed method outpaces batch versions on popular learning tasks.  ( 2 min )
    Can We Faithfully Represent Absence States to Compute Shapley Values on a DNN?. (arXiv:2105.10719v3 [cs.LG] UPDATED)
    Although many methods have been proposed to estimate attributions of input variables, there still exists a significant theoretical flaw in masking-based attribution methods, i.e., it is hard to examine whether the masking method faithfully represents the absence of input variables. Specifically, for masking-based attributions, setting an input variable to the baseline value is a typical way of representing the absence of the variable. However, there are no studies investigating how to represent the absence of input variables and verify the faithfulness of baseline values. Therefore, we revisit the feature representation of a DNN in terms of causality, and propose to use causal patterns to examine whether the masking method faithfully removes information encoded in input variables. More crucially, it is proven that the causality can be explained as the elementary rationale of the Shapley value. Furthermore, we define the optimal baseline value from the perspective of causality, and we propose a method to learn the optimal baseline value. Experimental results have demonstrated the effectiveness of our method.  ( 2 min )
    Theoretical Analysis of Offline Imitation With Supplementary Dataset. (arXiv:2301.11687v1 [cs.LG])
    Behavioral cloning (BC) can recover a good policy from abundant expert data, but may fail when expert data is insufficient. This paper considers a situation where, besides the small amount of expert data, a supplementary dataset is available, which can be collected cheaply from sub-optimal policies. Imitation learning with a supplementary dataset is an emergent practical framework, but its theoretical foundation remains under-developed. To advance understanding, we first investigate a direct extension of BC, called NBCU, that learns from the union of all available data. Our analysis shows that, although NBCU suffers an imitation gap that is larger than BC in the worst case, there exist special cases where NBCU performs better than or equally well as BC. This discovery implies that noisy data can also be helpful if utilized elaborately. Therefore, we further introduce a discriminator-based importance sampling technique to re-weight the supplementary data, proposing the WBCU method. With our newly developed landscape-based analysis, we prove that WBCU can outperform BC in mild conditions. Empirical studies show that WBCU simultaneously achieves the best performance on two challenging tasks where prior state-of-the-art methods fail.  ( 2 min )
    Interpreting learning in biological neural networks as zero-order optimization method. (arXiv:2301.11777v1 [cs.LG])
    Recently, significant progress has been made regarding the statistical understanding of artificial neural networks (ANNs). ANNs are motivated by the functioning of the brain, but differ in several crucial aspects. In particular, it is biologically implausible that the learning of the brain is based on gradient descent. In this work we look at the brain as a statistical method for supervised learning. The main contribution is to relate the local updating rule of the connection parameters in biological neural networks (BNNs) to a zero-order optimization method.  ( 2 min )
    Reinforcement Learning from Diverse Human Preferences. (arXiv:2301.11774v1 [cs.LG])
    The complexity of designing reward functions has been a major obstacle to the wide application of deep reinforcement learning (RL) techniques. Describing an agent's desired behaviors and properties can be difficult, even for experts. A new paradigm called reinforcement learning from human preferences (or preference-based RL) has emerged as a promising solution, in which reward functions are learned from human preference labels among behavior trajectories. However, existing methods for preference-based RL are limited by the need for accurate oracle preference labels. This paper addresses this limitation by developing a method for crowd-sourcing preference labels and learning from diverse human preferences. The key idea is to stabilize reward learning through regularization and correction in a latent space. To ensure temporal consistency, a strong constraint is imposed on the reward model that forces its latent space to be close to the prior distribution. Additionally, a confidence-based reward model ensembling method is designed to generate more stable and reliable predictions. The proposed method is tested on a variety of tasks in DMcontrol and Meta-world and has shown consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback, paving the way for real-world applications of RL methods.  ( 2 min )
    Myriad: a real-world testbed to bridge trajectory optimization and deep learning. (arXiv:2202.10600v2 [cs.LG] UPDATED)
    We present Myriad, a testbed written in JAX for learning and planning in real-world continuous environments. The primary contributions of Myriad are threefold. First, Myriad provides machine learning practitioners access to trajectory optimization techniques for application within a typical automatic differentiation workflow. Second, Myriad presents many real-world optimal control problems, ranging from biology to medicine to engineering, for use by the machine learning community. Formulated in continuous space and time, these environments retain some of the complexity of real-world systems often abstracted away by standard benchmarks. As such, Myriad strives to serve as a stepping stone towards application of modern machine learning techniques for impactful real-world tasks. Finally, we use the Myriad repository to showcase a novel approach for learning and control tasks. Trained in a fully end-to-end fashion, our model leverages an implicit planning module over neural ordinary differential equations, enabling simultaneous learning and planning with complex environment dynamics.
    On the Relationship Between Explanation and Prediction: A Causal View. (arXiv:2212.06925v3 [cs.LG] UPDATED)
    Explainability has become a central requirement for the development, deployment, and adoption of machine learning (ML) models and we are yet to understand what explanation methods can and cannot do. Several factors such as data, model prediction, hyperparameters used in training the model, and random initialization can all influence downstream explanations. While previous work empirically hinted that explanations (E) may have little relationship with the prediction (Y), there is a lack of conclusive study to quantify this relationship. Our work borrows tools from causal inference to systematically assay this relationship. More specifically, we measure the relationship between E and Y by measuring the treatment effect when intervening on their causal ancestors (hyperparameters) (inputs to generate saliency-based Es or Ys). We discover that Y's relative direct influence on E follows an odd pattern; the influence is higher in the lowest-performing models than in mid-performing models, and it then decreases in the top-performing models. We believe our work is a promising first step towards providing better guidance for practitioners who can make more informed decisions in utilizing these explanations by knowing what factors are at play and how they relate to their end task.
    CADet: Fully Self-Supervised Out-Of-Distribution Detection With Contrastive Learning. (arXiv:2210.01742v2 [cs.LG] UPDATED)
    Handling out-of-distribution (OOD) samples has become a major stake in the real-world deployment of machine learning systems. This work explores the application of self-supervised contrastive learning to the simultaneous detection of two types of OOD samples: unseen classes and adversarial perturbations. Since in practice the distribution of such samples is not known in advance, we do not assume access to OOD examples. We first show that similarity functions trained with contrastive learning can be leveraged with the maximum mean discrepancy (MMD) two-sample test to verify whether two independent sets of samples are drawn from the same distribution. Inspired by this approach, we introduce CADet (Contrastive Anomaly Detection), a method based on contrastive transformations to perform anomaly detection on single samples. CADet compares favorably to adversarial detection methods to detect adversarially perturbed samples on ImageNet. Simultaneously, it achieves comparable performance to unseen label detection methods on two challenging benchmarks: ImageNet-O and iNaturalist. CADet is fully self-supervised and requires neither labels for in-distribution samples nor access to OOD examples.
    Model Ratatouille: Recycling Diverse Models for Out-of-Distribution Generalization. (arXiv:2212.10445v2 [cs.LG] UPDATED)
    Foundation models are redefining how AI systems are built. Practitioners now follow a standard procedure to build their machine learning solutions: from a pre-trained foundation model, they fine-tune the weights on the target task of interest. So, the Internet is swarmed by a handful of foundation models fine-tuned on many diverse tasks: these individual fine-tunings exist in isolation without benefiting from each other. In our opinion, this is a missed opportunity, as these specialized models contain rich and diverse features. In this paper, we thus propose model ratatouille, a new strategy to recycle the multiple fine-tunings of the same foundation model on diverse auxiliary tasks. Specifically, we repurpose these auxiliary weights as initializations for multiple parallel fine-tunings on the target task; then, we average all fine-tuned weights to obtain the final model. This recycling strategy aims at maximizing the diversity in weights by leveraging the diversity in auxiliary tasks. Empirically, it improves the state of the art on the reference DomainBed benchmark for out-of-distribution generalization. Looking forward, this work contributes to the emerging paradigm of updatable machine learning where, akin to open-source software development, the community collaborates to reliably update machine learning models.
    A kernel Stein test of goodness of fit for sequential models. (arXiv:2210.10741v2 [stat.ML] UPDATED)
    We propose a goodness-of-fit measure for probability densities modeling observations with varying dimensionality, such as text documents of differing lengths or variable-length sequences. The proposed measure is an instance of the kernel Stein discrepancy (KSD), which has been used to construct goodness-of-fit tests for unnormalized densities. The KSD is defined by its Stein operator: current operators used in testing apply to fixed-dimensional spaces. As our main contribution, we extend the KSD to the variable-dimension setting by identifying appropriate Stein operators, and propose a novel KSD goodness-of-fit test. As with the previous variants, the proposed KSD does not require the density to be normalized, allowing the evaluation of a large class of models. Our test is shown to perform well in practice on discrete sequential data benchmarks.
    Adapting Step-size: A Unified Perspective to Analyze and Improve Gradient-based Methods for Adversarial Attacks. (arXiv:2301.11546v1 [cs.LG])
    Learning adversarial examples can be formulated as an optimization problem of maximizing the loss function with some box-constraints. However, for solving this induced optimization problem, the state-of-the-art gradient-based methods such as FGSM, I-FGSM and MI-FGSM look different from their original methods especially in updating the direction, which makes it difficult to understand them and then leaves some theoretical issues to be addressed in viewpoint of optimization. In this paper, from the perspective of adapting step-size, we provide a unified theoretical interpretation of these gradient-based adversarial learning methods. We show that each of these algorithms is in fact a specific reformulation of their original gradient methods but using the step-size rules with only current gradient information. Motivated by such analysis, we present a broad class of adaptive gradient-based algorithms based on the regular gradient methods, in which the step-size strategy utilizing information of the accumulated gradients is integrated. Such adaptive step-size strategies directly normalize the scale of the gradients rather than use some empirical operations. The important benefit is that convergence for the iterative algorithms is guaranteed and then the whole optimization process can be stabilized. The experiments demonstrate that our AdaI-FGM consistently outperforms I-FGSM and AdaMI-FGM remains competitive with MI-FGSM for black-box attacks.
    Synthetic A/B Testing using Synthetic Interventions. (arXiv:2006.07691v5 [econ.EM] UPDATED)
    Suppose there are $N$ units and $D$ interventions. We aim to learn the average potential outcome associated with every unit-intervention pair, i.e., $N \times D$ causal parameters. While running $N \times D$ experiments is conceivable, it can be expensive or infeasible. This work introduces an experiment design, synthetic A/B testing, and the synthetic interventions (SI) estimator to recover all $N \times D$ causal parameters while observing each unit under at most two interventions, independent of $D$. Under a novel tensor factor model for potential outcomes across units, measurements, and interventions, we establish the identification of each parameter. Further, we show the SI estimator is finite-sample consistent and asymptotically normal. Collectively, these also lead to novel results for panel data settings, particularly for synthetic controls. We empirically validate our experiment design using real e-commerce data from a large-scale A/B test.
    Uplink Scheduling in Federated Learning: an Importance-Aware Approach via Graph Representation Learning. (arXiv:2301.11903v1 [cs.NI])
    Federated Learning (FL) has emerged as a promising framework for distributed training of AI-based services, applications, and network procedures in 6G. One of the major challenges affecting the performance and efficiency of 6G wireless FL systems is the massive scheduling of user devices over resource-constrained channels. In this work, we argue that the uplink scheduling of FL client devices is a problem with a rich relational structure. To address this challenge, we propose a novel, energy-efficient, and importance-aware metric for client scheduling in FL applications by leveraging Unsupervised Graph Representation Learning (UGRL). Our proposed approach introduces a relational inductive bias in the scheduling process and does not require the collection of training feedback information from client devices, unlike state-of-the-art importance-aware mechanisms. We evaluate our proposed solution against baseline scheduling algorithms based on recently proposed metrics in the literature. Results show that, when considering scenarios of nodes exhibiting spatial relations, our approach can achieve an average gain of up to 10% in model accuracy and up to 17 times in energy efficiency compared to state-of-the-art importance-aware policies.
    Incorporating Background Knowledge in Symbolic Regression using a Computer Algebra System. (arXiv:2301.11919v1 [cs.LG])
    Symbolic Regression (SR) can generate interpretable, concise expressions that fit a given dataset, allowing for more human understanding of the structure than black-box approaches. The addition of background knowledge (in the form of symbolic mathematical constraints) allows for the generation of expressions that are meaningful with respect to theory while also being consistent with data. We specifically examine the addition of constraints to traditional genetic algorithm (GA) based SR (PySR) as well as a Markov-chain Monte Carlo (MCMC) based Bayesian SR architecture (Bayesian Machine Scientist), and apply these to rediscovering adsorption equations from experimental, historical datasets. We find that, while hard constraints prevent GA and MCMC SR from searching, soft constraints can lead to improved performance both in terms of search effectiveness and model meaningfulness, with computational costs increasing by about an order-of-magnitude. If the constraints do not correlate well with the dataset or expected models, they can hinder the search of expressions. We find Bayesian SR is better these constraints (as the Bayesian prior) than by modifying the fitness function in the GA
    Synth-by-Reg (SbR): Contrastive learning for synthesis-based registration of paired images. (arXiv:2107.14449v3 [cs.CV] UPDATED)
    Nonlinear inter-modality registration is often challenging due to the lack of objective functions that are good proxies for alignment. Here we propose a synthesis-by-registration method to convert this problem into an easier intra-modality task. We introduce a registration loss for weakly supervised image translation between domains that does not require perfectly aligned training data. This loss capitalises on a registration U-Net with frozen weights, to drive a synthesis CNN towards the desired translation. We complement this loss with a structure preserving constraint based on contrastive learning, which prevents blurring and content shifts due to overfitting. We apply this method to the registration of histological sections to MRI slices, a key step in 3D histology reconstruction. Results on two different public datasets show improvements over registration based on mutual information (13% reduction in landmark error) and synthesis-based algorithms such as CycleGAN (11% reduction), and are comparable to a registration CNN with label supervision. Code and data are publicly available at \url{https://github.com/acasamitjana/SynthByReg}
    Multi-dimensional concept discovery (MCD): A unifying framework with completeness guarantees. (arXiv:2301.11911v1 [cs.LG])
    The completeness axiom renders the explanation of a post-hoc XAI method only locally faithful to the model, i.e. for a single decision. For the trustworthy application of XAI, in particular for high-stake decisions, a more global model understanding is required. Recently, concept-based methods have been proposed, which are however not guaranteed to be bound to the actual model reasoning. To circumvent this problem, we propose Multi-dimensional Concept Discovery (MCD) as an extension of previous approaches that fulfills a completeness relation on the level of concepts. Our method starts from general linear subspaces as concepts and does neither require reinforcing concept interpretability nor re-training of model parts. We propose sparse subspace clustering to discover improved concepts and fully leverage the potential of multi-dimensional subspaces. MCD offers two complementary analysis tools for concepts in input space: (1) concept activation maps, that show where a concept is expressed within a sample, allowing for concept characterization through prototypical samples, and (2) concept relevance heatmaps, that decompose the model decision into concept contributions. Both tools together enable a detailed understanding of the model reasoning, which is guaranteed to relate to the model via a completeness relation. This paves the way towards more trustworthy concept-based XAI. We empirically demonstrate the superiority of MCD against more constrained concept definitions.
    Communication-Efficient Learning of Deep Networks from Decentralized Data. (arXiv:1602.05629v4 [cs.LG] UPDATED)
    Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image models can automatically select good photos. However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data center and training there using conventional approaches. We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates. We term this decentralized approach Federated Learning. We present a practical method for the federated learning of deep networks based on iterative model averaging, and conduct an extensive empirical evaluation, considering five different model architectures and four datasets. These experiments demonstrate the approach is robust to the unbalanced and non-IID data distributions that are a defining characteristic of this setting. Communication costs are the principal constraint, and we show a reduction in required communication rounds by 10-100x as compared to synchronized stochastic gradient descent.
    OccRob: Efficient SMT-Based Occlusion Robustness Verification of Deep Neural Networks. (arXiv:2301.11912v1 [cs.LG])
    Occlusion is a prevalent and easily realizable semantic perturbation to deep neural networks (DNNs). It can fool a DNN into misclassifying an input image by occluding some segments, possibly resulting in severe errors. Therefore, DNNs planted in safety-critical systems should be verified to be robust against occlusions prior to deployment. However, most existing robustness verification approaches for DNNs are focused on non-semantic perturbations and are not suited to the occlusion case. In this paper, we propose the first efficient, SMT-based approach for formally verifying the occlusion robustness of DNNs. We formulate the occlusion robustness verification problem and prove it is NP-complete. Then, we devise a novel approach for encoding occlusions as a part of neural networks and introduce two acceleration techniques so that the extended neural networks can be efficiently verified using off-the-shelf, SMT-based neural network verification tools. We implement our approach in a prototype called OccRob and extensively evaluate its performance on benchmark datasets with various occlusion variants. The experimental results demonstrate our approach's effectiveness and efficiency in verifying DNNs' robustness against various occlusions, and its ability to generate counterexamples when these DNNs are not robust.
    Projected Subnetworks Scale Adaptation. (arXiv:2301.11487v1 [cs.LG])
    Large models support great zero-shot and few-shot capabilities. However, updating these models on new tasks can break performance on previous seen tasks and their zero/few-shot unseen tasks. Our work explores how to update zero/few-shot learners such that they can maintain performance on seen/unseen tasks of previous tasks as well as new tasks. By manipulating the parameter updates of a gradient-based meta learner as the projected task-specific subnetworks, we show improvements for large models to retain seen and zero/few shot task performance in online settings.
    Streaming LifeLong Learning With Any-Time Inference. (arXiv:2301.11892v1 [cs.LG])
    Despite rapid advancements in lifelong learning (LLL) research, a large body of research mainly focuses on improving the performance in the existing \textit{static} continual learning (CL) setups. These methods lack the ability to succeed in a rapidly changing \textit{dynamic} environment, where an AI agent needs to quickly learn new instances in a `single pass' from the non-i.i.d (also possibly temporally contiguous/coherent) data streams without suffering from catastrophic forgetting. For practical applicability, we propose a novel lifelong learning approach, which is streaming, i.e., a single input sample arrives in each time step, single pass, class-incremental, and subject to be evaluated at any moment. To address this challenging setup and various evaluation protocols, we propose a Bayesian framework, that enables fast parameter update, given a single training example, and enables any-time inference. We additionally propose an implicit regularizer in the form of snap-shot self-distillation, which effectively minimizes the forgetting further. We further propose an effective method that efficiently selects a subset of samples for online memory rehearsal and employs a new replay buffer management scheme that significantly boosts the overall performance. Our empirical evaluations and ablations demonstrate that the proposed method outperforms the prior works by large margins.
    Graph-Free Learning in Graph-Structured Data: A More Efficient and Accurate Spatiotemporal Learning Perspective. (arXiv:2301.11742v1 [cs.LG])
    Spatiotemporal learning, which aims at extracting spatiotemporal correlations from the collected spatiotemporal data, is a research hotspot in recent years. And considering the inherent graph structure of spatiotemporal data, recent works focus on capturing spatial dependencies by utilizing Graph Convolutional Networks (GCNs) to aggregate vertex features with the guidance of adjacency matrices. In this paper, with extensive and deep-going experiments, we comprehensively analyze existing spatiotemporal graph learning models and reveal that extracting adjacency matrices with carefully design strategies, which are viewed as the key of enhancing performance on graph learning, are largely ineffective. Meanwhile, based on these experiments, we also discover that the aggregation itself is more important than the way that how vertices are aggregated. With these preliminary, a novel efficient Graph-Free Spatial (GFS) learning module based on layer normalization for capturing spatial correlations in spatiotemporal graph learning. The proposed GFS module can be easily plugged into existing models for replacing all graph convolution components. Rigorous theoretical proof demonstrates that the time complexity of GFS is significantly better than that of graph convolution operation. Extensive experiments verify the superiority of GFS in both the perspectives of efficiency and learning effect in processing graph-structured data especially extreme large scale graph data.  ( 2 min )
    ActiveLab: Active Learning with Re-Labeling by Multiple Annotators. (arXiv:2301.11856v1 [cs.LG])
    In real-world data labeling applications, annotators often provide imperfect labels. It is thus common to employ multiple annotators to label data with some overlap between their examples. We study active learning in such settings, aiming to train an accurate classifier by collecting a dataset with the fewest total annotations. Here we propose ActiveLab, a practical method to decide what to label next that works with any classifier model and can be used in pool-based batch active learning with one or multiple annotators. ActiveLab automatically estimates when it is more informative to re-label examples vs. labeling entirely new ones. This is a key aspect of producing high quality labels and trained models within a limited annotation budget. In experiments on image and tabular data, ActiveLab reliably trains more accurate classifiers with far fewer annotations than a wide variety of popular active learning methods.
    Element selection for functional materials discovery by integrated machine learning of elemental contributions to properties. (arXiv:2202.01051v2 [cond-mat.mtrl-sci] UPDATED)
    Fundamental differences between materials originate from the unique nature of their constituent chemical elements. Before specific differences emerge according to the precise ratios of elements in a given crystal structure, a material can be represented by the set of its constituent chemical elements. By working at the level of the periodic table, assessment of materials at the level of their phase fields reduces the combinatorial complexity to accelerate screening, and circumvents the challenges associated with composition-level approaches such as poor extrapolation within phase fields, and the impossibility of exhaustive sampling. This early stage discrimination combined with evaluation of novelty of phase fields aligns with the outstanding experimental challenge of identifying new areas of chemistry to investigate, by prioritising which elements to combine in a reaction. Here, we demonstrate that phase fields can be assessed with respect to the maximum expected value of a target functional property and ranked according to chemical novelty. We develop and present PhaseSelect, an end-to-end machine learning model that combines the representation, classification, regression and ranking of phase fields. First, PhaseSelect constructs elemental characteristics from the co-occurrence of chemical elements in computationally and experimentally reported materials, then it employs attention mechanisms to learn representation for phase fields and assess their functional performance. At the level of the periodic table, PhaseSelect quantifies the probability of observing a functional property, estimates its value within a phase field and also ranks a phase field novelty, which we demonstrate with significant accuracy for three avenues of materials applications for high-temperature superconductivity, high-temperature magnetism, and targeted bandgap energy.
    Is TinyML Sustainable? Assessing the Environmental Impacts of Machine Learning on Microcontrollers. (arXiv:2301.11899v1 [cs.LG])
    The sustained growth of carbon emissions and global waste elicits significant sustainability concerns for our environment's future. The growing Internet of Things (IoT) has the potential to exacerbate this issue. However, an emerging area known as Tiny Machine Learning (TinyML) has the opportunity to help address these environmental challenges through sustainable computing practices. TinyML, the deployment of machine learning (ML) algorithms onto low-cost, low-power microcontroller systems, enables on-device sensor analytics that unlocks numerous always-on ML applications. This article discusses the potential of these TinyML applications to address critical sustainability challenges. Moreover, the footprint of this emerging technology is assessed through a complete life cycle analysis of TinyML systems. From this analysis, TinyML presents opportunities to offset its carbon emissions by enabling applications that reduce the emissions of other sectors. Nevertheless, when globally scaled, the carbon footprint of TinyML systems is not negligible, necessitating that designers factor in environmental impact when formulating new devices. Finally, research directions for enabling further opportunities for TinyML to contribute to a sustainable future are outlined.
    SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient. (arXiv:2301.11913v1 [cs.DC])
    Many deep learning applications benefit from using large models with billions of parameters. Training these models is notoriously expensive due to the need for specialized HPC clusters. In this work, we consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions. We analyze the performance of existing model-parallel algorithms in these conditions and find configurations where training larger models becomes less communication-intensive. Based on these findings, we propose SWARM parallelism, a model-parallel training algorithm designed for poorly connected, heterogeneous and unreliable devices. SWARM creates temporary randomized pipelines between nodes that are rebalanced in case of failure. We empirically validate our findings and compare SWARM parallelism with existing large-scale training approaches. Finally, we combine our insights with compression strategies to train a large Transformer language model with 1B shared parameters (approximately 13B before sharing) on preemptible T4 GPUs with less than 200Mb/s network.
    Policy-Value Alignment and Robustness in Search-based Multi-Agent Learning. (arXiv:2301.11857v1 [cs.AI])
    Large-scale AI systems that combine search and learning have reached super-human levels of performance in game-playing, but have also been shown to fail in surprising ways. The brittleness of such models limits their efficacy and trustworthiness in real-world deployments. In this work, we systematically study one such algorithm, AlphaZero, and identify two phenomena related to the nature of exploration. First, we find evidence of policy-value misalignment -- for many states, AlphaZero's policy and value predictions contradict each other, revealing a tension between accurate move-selection and value estimation in AlphaZero's objective. Further, we find inconsistency within AlphaZero's value function, which causes it to generalize poorly, despite its policy playing an optimal strategy. From these insights we derive VISA-VIS: a novel method that improves policy-value alignment and value robustness in AlphaZero. Experimentally, we show that our method reduces policy-value misalignment by up to 76%, reduces value generalization error by up to 50%, and reduces average value error by up to 55%.
    Automatic Modulation Classification with Deep Neural Networks. (arXiv:2301.11773v1 [cs.LG])
    Automatic modulation classification is a desired feature in many modern software-defined radios. In recent years, a number of convolutional deep learning architectures have been proposed for automatically classifying the modulation used on observed signal bursts. However, a comprehensive analysis of these differing architectures and importance of each design element has not been carried out. Thus it is unclear what tradeoffs the differing designs of these convolutional neural networks might have. In this research, we investigate numerous architectures for automatic modulation classification and perform a comprehensive ablation study to investigate the impacts of varying hyperparameters and design elements on automatic modulation classification performance. We show that a new state of the art in performance can be achieved using a subset of the studied design elements. In particular, we show that a combination of dilated convolutions, statistics pooling, and squeeze-and-excitation units results in the strongest performing classifier. We further investigate this best performer according to various other criteria, including short signal bursts, common misclassifications, and performance across differing modulation categories and modes.  ( 2 min )
    Conformal inference is (almost) free for neural networks trained with early stopping. (arXiv:2301.11556v1 [stat.ML])
    Early stopping based on hold-out data is a popular regularization technique designed to mitigate overfitting and increase the predictive accuracy of neural networks. Models trained with early stopping often provide relatively accurate predictions, but they generally still lack precise statistical guarantees unless they are further calibrated using independent hold-out data. This paper addresses the above limitation with conformalized early stopping: a novel method that combines early stopping with conformal calibration while efficiently recycling the same hold-out data. This leads to models that are both accurate and able to provide exact predictive inferences without multiple data splits nor overly conservative adjustments. Practical implementations are developed for different learning tasks -- outlier detection, multi-class classification, regression -- and their competitive performance is demonstrated on real data.
    I Prefer not to Say: Are Users Penalized for Protecting Personal Data?. (arXiv:2210.13954v3 [cs.LG] UPDATED)
    We examine the problem of obtaining fair outcomes for individuals who choose to share optional information with machine-learned models and those who do not consent and keep their data undisclosed. We find that these non-consenting users receive significantly lower prediction outcomes than justified by their provided information alone. This observation gives rise to the overlooked problem of how to ensure that users, who protect their personal data, are not penalized. While statistical fairness notions focus on fair outcomes between advantaged and disadvantaged groups, these fairness notions fail to protect the non-consenting users. To address this problem, we formalize protection requirements for models which (i) allow users to benefit from sharing optional information and (ii) do not penalize them if they keep their data undisclosed. We offer the first solution to this problem by proposing the notion of Optional Feature Fairness (OFF), which we prove to be loss-optimal under our protection requirements (i) and (ii). To learn OFF-compliant models, we devise a model-agnostic data augmentation strategy with finite sample convergence guarantees. Finally, we extensively analyze OFF on a variety of challenging real-world tasks, models, and data sets with multiple optional features.
    Aleatoric and Epistemic Discrimination in Classification. (arXiv:2301.11781v1 [cs.LG])
    Machine learning (ML) models can underperform on certain population groups due to choices made during model development and bias inherent in the data. We categorize sources of discrimination in the ML pipeline into two classes: aleatoric discrimination, which is inherent in the data distribution, and epistemic discrimination, which is due to decisions during model development. We quantify aleatoric discrimination by determining the performance limits of a model under fairness constraints, assuming perfect knowledge of the data distribution. We demonstrate how to characterize aleatoric discrimination by applying Blackwell's results on comparing statistical experiments. We then quantify epistemic discrimination as the gap between a model's accuracy given fairness constraints and the limit posed by aleatoric discrimination. We apply this approach to benchmark existing interventions and investigate fairness risks in data with missing values. Our results indicate that state-of-the-art fairness interventions are effective at removing epistemic discrimination. However, when data has missing values, there is still significant room for improvement in handling aleatoric discrimination.  ( 2 min )
    Meta-Learning Mini-Batch Risk Functionals. (arXiv:2301.11724v1 [cs.LG])
    Supervised learning typically optimizes the expected value risk functional of the loss, but in many cases, we want to optimize for other risk functionals. In full-batch gradient descent, this is done by taking gradients of a risk functional of interest, such as the Conditional Value at Risk (CVaR) which ignores some quantile of extreme losses. However, deep learning must almost always use mini-batch gradient descent, and lack of unbiased estimators of various risk functionals make the right optimization procedure unclear. In this work, we introduce a meta-learning-based method of learning an interpretable mini-batch risk functional during model training, in a single shot. When optimizing for various risk functionals, the learned mini-batch risk functions lead to risk reduction of up to 10% over hand-engineered mini-batch risk functionals. Then in a setting where the right risk functional is unknown a priori, our method improves over baseline by 14% relative (~9% absolute). We analyze the learned mini-batch risk functionals at different points through training, and find that they learn a curriculum (including warm-up periods), and that their final form can be surprisingly different from the underlying risk functional that they optimize for.
    Diverse Weight Averaging for Out-of-Distribution Generalization. (arXiv:2205.09739v2 [cs.CV] UPDATED)
    Standard neural networks struggle to generalize under distribution shifts in computer vision. Fortunately, combining multiple networks can consistently improve out-of-distribution generalization. In particular, weight averaging (WA) strategies were shown to perform best on the competitive DomainBed benchmark; they directly average the weights of multiple networks despite their nonlinearities. In this paper, we propose Diverse Weight Averaging (DiWA), a new WA strategy whose main motivation is to increase the functional diversity across averaged models. To this end, DiWA averages weights obtained from several independent training runs: indeed, models obtained from different runs are more diverse than those collected along a single run thanks to differences in hyperparameters and training procedures. We motivate the need for diversity by a new bias-variance-covariance-locality decomposition of the expected error, exploiting similarities between WA and standard functional ensembling. Moreover, this decomposition highlights that WA succeeds when the variance term dominates, which we show occurs when the marginal distribution changes at test time. Experimentally, DiWA consistently improves the state of the art on DomainBed without inference overhead.
    Leveraging the Third Dimension in Contrastive Learning. (arXiv:2301.11790v1 [cs.CV])
    Self-Supervised Learning (SSL) methods operate on unlabeled data to learn robust representations useful for downstream tasks. Most SSL methods rely on augmentations obtained by transforming the 2D image pixel map. These augmentations ignore the fact that biological vision takes place in an immersive three-dimensional, temporally contiguous environment, and that low-level biological vision relies heavily on depth cues. Using a signal provided by a pretrained state-of-the-art monocular RGB-to-depth model (the \emph{Depth Prediction Transformer}, Ranftl et al., 2021), we explore two distinct approaches to incorporating depth signals into the SSL framework. First, we evaluate contrastive learning using an RGB+depth input representation. Second, we use the depth signal to generate novel views from slightly different camera positions, thereby producing a 3D augmentation for contrastive learning. We evaluate these two approaches on three different SSL methods -- BYOL, SimSiam, and SwAV -- using ImageNette (10 class subset of ImageNet), ImageNet-100 and ImageNet-1k datasets. We find that both approaches to incorporating depth signals improve the robustness and generalization of the baseline SSL methods, though the first approach (with depth-channel concatenation) is superior. For instance, BYOL with the additional depth channel leads to an increase in downstream classification accuracy from 85.3\% to 88.0\% on ImageNette and 84.1\% to 87.0\% on ImageNet-C.
    Hyperbolic VAE via Latent Gaussian Distributions. (arXiv:2209.15217v2 [cs.LG] UPDATED)
    We propose a Gaussian manifold variational auto-encoder (GM-VAE) whose latent space consists of a set of Gaussian distributions. It is known that the set of the univariate Gaussian distributions with the Fisher information metric form a hyperbolic space, which we call a Gaussian manifold. To learn the VAE endowed with the Gaussian manifolds, we propose a pseudo-Gaussian manifold normal distribution based on the Kullback-Leibler divergence, a local approximation of the squared Fisher-Rao distance, to define a density over the latent space. In experiments, we demonstrate the efficacy of GM-VAE on two different tasks: density estimation of image datasets and environment modeling in model-based reinforcement learning. GM-VAE outperforms the other variants of hyperbolic- and Euclidean-VAEs on density estimation tasks and shows competitive performance in model-based reinforcement learning. We observe that our model provides strong numerical stability, addressing a common limitation reported in previous hyperbolic-VAEs.
    A Deep Learning Method for Comparing Bayesian Hierarchical Models. (arXiv:2301.11873v1 [stat.ML])
    Bayesian model comparison (BMC) offers a principled approach for assessing the relative merits of competing computational models and propagating uncertainty into model selection decisions. However, BMC is often intractable for the popular class of hierarchical models due to their high-dimensional nested parameter structure. To address this intractability, we propose a deep learning method for performing BMC on any set of hierarchical models which can be instantiated as probabilistic programs. Since our method enables amortized inference, it allows efficient re-estimation of posterior model probabilities and fast performance validation prior to any real-data application. In a series of extensive validation studies, we benchmark the performance of our method against the state-of-the-art bridge sampling method and demonstrate excellent amortized inference across all BMC settings. We then use our method to compare four hierarchical evidence accumulation models that have previously been deemed intractable for BMC due to partly implicit likelihoods. In this application, we corroborate evidence for the recently proposed L\'evy flight model of decision-making and show how transfer learning can be leveraged to enhance training efficiency. Reproducible code for all analyses is provided.
    Provably Efficient Causal Model-Based Reinforcement Learning for Systematic Generalization. (arXiv:2202.06545v2 [cs.LG] UPDATED)
    In the sequential decision making setting, an agent aims to achieve systematic generalization over a large, possibly infinite, set of environments. Such environments are modeled as discrete Markov decision processes with both states and actions represented through a feature vector. The underlying structure of the environments allows the transition dynamics to be factored into two components: one that is environment-specific and another that is shared. Consider a set of environments that share the laws of motion as an example. In this setting, the agent can take a finite amount of reward-free interactions from a subset of these environments. The agent then must be able to approximately solve any planning task defined over any environment in the original set, relying on the above interactions only. Can we design a provably efficient algorithm that achieves this ambitious goal of systematic generalization? In this paper, we give a partially positive answer to this question. First, we provide a tractable formulation of systematic generalization by employing a causal viewpoint. Then, under specific structural assumptions, we provide a simple learning algorithm that guarantees any desired planning error up to an unavoidable sub-optimality term, while showcasing a polynomial sample complexity.
    Deep Multi-modal Fusion of Image and Non-image Data in Disease Diagnosis and Prognosis: A Review. (arXiv:2203.15588v3 [cs.LG] UPDATED)
    The rapid development of diagnostic technologies in healthcare is leading to higher requirements for physicians to handle and integrate the heterogeneous, yet complementary data that are produced during routine practice. For instance, the personalized diagnosis and treatment planning for a single cancer patient relies on the various images (e.g., radiological, pathological, and camera images) and non-image data (e.g., clinical data and genomic data). However, such decision-making procedures can be subjective, qualitative, and have large inter-subject variabilities. With the recent advances in multi-modal deep learning technologies, an increasingly large number of efforts have been devoted to a key question: how do we extract and aggregate multi-modal information to ultimately provide more objective, quantitative computer-aided clinical decision making? This paper reviews the recent studies on dealing with such a question. Briefly, this review will include the (1) overview of current multi-modal learning workflows, (2) summarization of multi-modal fusion methods, (3) discussion of the performance, (4) applications in disease diagnosis and prognosis, and (5) challenges and future directions.
    Constrained Clustering: General Pairwise and Cardinality Constraints. (arXiv:1907.10410v2 [cs.LG] UPDATED)
    We study constrained clustering, where constraints guide the clustering process. In existing works, two categories of constraints have been widely explored, namely pairwise and cardinality constraints. Pairwise constraints enforce the cluster labels of two instances to be the same (must-link constraints) or different (cannot-link constraints). Cardinality constraints encourage cluster sizes to satisfy a user-specified distribution. Most existing constrained clustering models can only utilize one category of constraints at a time. We enforce the above two categories into a unified clustering model starting with the integer program formulation of the standard K-means. As the two categories provide different useful information, utilizing both allow for better clustering performance. However, the optimization is difficult due to the binary and quadratic constraints in the unified formulation. To solve this, we utilize two techniques: equivalently replacing the binary constraints by the intersection of two continuous constraints; the other is transforming the quadratic constraints into bi-linear constraints by introducing extra variables. We derive an equivalent continuous reformulation with simple constraints, which can be efficiently solved by Alternating Direction Method of Multipliers. Extensive experiments on both synthetic and real data demonstrate when: (1) utilizing a single category of constraint, the proposed model is superior to or competitive with SOTA constrained clustering models, and (2) utilizing both categories of constraints jointly, the proposed model shows better performance than the case of the single category. The experiments show that the proposed method exploits the constraints to achieve perfect clustering performance with improved clustering to 2%-5% in classical clustering metrics, e.g. Adjusted Random, Mirkin's, and Huber's, indices outerperfomring other methods.
    Deep Clustering Survival Machines with Interpretable Expert Distributions. (arXiv:2301.11826v1 [cs.LG])
    Conventional survival analysis methods are typically ineffective to characterize heterogeneity in the population while such information can be used to assist predictive modeling. In this study, we propose a hybrid survival analysis method, referred to as deep clustering survival machines, that combines the discriminative and generative mechanisms. Similar to the mixture models, we assume that the timing information of survival data is generatively described by a mixture of certain numbers of parametric distributions, i.e., expert distributions. We learn weights of the expert distributions for individual instances according to their features discriminatively such that each instance's survival information can be characterized by a weighted combination of the learned constant expert distributions. This method also facilitates interpretable subgrouping/clustering of all instances according to their associated expert distributions. Extensive experiments on both real and synthetic datasets have demonstrated that the method is capable of obtaining promising clustering results and competitive time-to-event predicting performance.
    A Game-Theoretic Framework for Managing Risk in Multi-Agent Systems. (arXiv:2205.15434v2 [cs.LG] UPDATED)
    In order for agents in multi-agent systems (MAS) to be safe, they need to take into account the risks posed by the actions of other agents. However, the dominant paradigm in game theory (GT) assumes that agents are not affected by risk from other agents and only strive to maximise their expected utility. For example, in hybrid human-AI driving systems, it is necessary to limit large deviations in reward resulting from car crashes. Although there are equilibrium concepts in game theory that take into account risk aversion, they either assume that agents are risk-neutral with respect to the uncertainty caused by the actions of other agents, or they are not guaranteed to exist. We introduce a new GT-based Risk-Averse Equilibrium (RAE) that always produces a solution that minimises the potential variance in reward accounting for the strategy of other agents. Theoretically and empirically, we show RAE shares many properties with a Nash Equilibrium (NE), establishing convergence properties and generalising to risk-dominant NE in certain cases. To tackle large-scale problems, we extend RAE to the PSRO multi-agent reinforcement learning (MARL) framework. We empirically demonstrate the minimum reward variance benefits of RAE in matrix games with high-risk outcomes. Results on MARL experiments show RAE generalises to risk-dominant NE in a trust dilemma game and that it reduces instances of crashing by 7x in an autonomous driving setting versus the best performing baseline.
    DAG Learning on the Permutahedron. (arXiv:2301.11898v1 [cs.LG])
    We propose a continuous optimization framework for discovering a latent directed acyclic graph (DAG) from observational data. Our approach optimizes over the polytope of permutation vectors, the so-called Permutahedron, to learn a topological ordering. Edges can be optimized jointly, or learned conditional on the ordering via a non-differentiable subroutine. Compared to existing continuous optimization approaches our formulation has a number of advantages including: 1. validity: optimizes over exact DAGs as opposed to other relaxations optimizing approximate DAGs; 2. modularity: accommodates any edge-optimization procedure, edge structural parameterization, and optimization loss; 3. end-to-end: either alternately iterates between node-ordering and edge-optimization, or optimizes them jointly. We demonstrate, on real-world data problems in protein-signaling and transcriptional network discovery, that our approach lies on the Pareto frontier of two key metrics, the SID and SHD.
    Robust variance-regularized risk minimization with concomitant scaling. (arXiv:2301.11584v1 [stat.ML])
    Under losses which are potentially heavy-tailed, we consider the task of minimizing sums of the loss mean and standard deviation, without trying to accurately estimate the variance. By modifying a technique for variance-free robust mean estimation to fit our problem setting, we derive a simple learning procedure which can be easily combined with standard gradient-based solvers to be used in traditional machine learning workflows. Empirically, we verify that our proposed approach, despite its simplicity, performs as well or better than even the best-performing candidates derived from alternative criteria such as CVaR or DRO risks on a variety of datasets.  ( 2 min )
    Sparse Mixture-of-Experts are Domain Generalizable Learners. (arXiv:2206.04046v6 [cs.CV] UPDATED)
    Human visual perception can easily generalize to out-of-distributed visual data, which is far beyond the capability of modern machine learning models. Domain generalization (DG) aims to close this gap, with existing DG methods mainly focusing on the loss function design. In this paper, we propose to explore an orthogonal direction, i.e., the design of the backbone architecture. It is motivated by an empirical finding that transformer-based models trained with empirical risk minimization (ERM) outperform CNN-based models employing state-of-the-art (SOTA) DG algorithms on multiple DG datasets. We develop a formal framework to characterize a network's robustness to distribution shifts by studying its architecture's alignment with the correlations in the dataset. This analysis guides us to propose a novel DG model built upon vision transformers, namely Generalizable Mixture-of-Experts (GMoE). Extensive experiments on DomainBed demonstrate that GMoE trained with ERM outperforms SOTA DG baselines by a large margin. Moreover, GMoE is complementary to existing DG methods and its performance is substantially improved when trained with DG algorithms.
    PECAN: A Deterministic Certified Defense Against Backdoor Attacks. (arXiv:2301.11824v1 [cs.CR])
    Neural networks are vulnerable to backdoor poisoning attacks, where the attackers maliciously poison the training set and insert triggers into the test input to change the prediction of the victim model. Existing defenses for backdoor attacks either provide no formal guarantees or come with expensive-to-compute and ineffective probabilistic guarantees. We present PECAN, an efficient and certified approach for defending against backdoor attacks. The key insight powering PECAN is to apply off-the-shelf test-time evasion certification techniques on a set of neural networks trained on disjoint partitions of the data. We evaluate PECAN on image classification and malware detection datasets. Our results demonstrate that PECAN can (1) significantly outperform the state-of-the-art certified backdoor defense, both in defense strength and efficiency, and (2) on real back-door attacks, PECAN can reduce attack success rate by order of magnitude when compared to a range of baselines from the literature.
    Naive Few-Shot Learning: Uncovering the fluid intelligence of machines. (arXiv:2205.12013v3 [cs.AI] UPDATED)
    In this paper, we aimed to help bridge the gap between human fluid intelligence - the ability to solve novel tasks without prior training - and the performance of deep neural networks, which typically require extensive prior training. An essential cognitive component for solving intelligence tests, which in humans are used to measure fluid intelligence, is the ability to identify regularities in sequences. This motivated us to construct a benchmark task, which we term \textit{sequence consistency evaluation} (SCE), whose solution requires the ability to identify regularities in sequences. Given the proven capabilities of deep networks, their ability to solve such tasks after extensive training is expected. Surprisingly, however, we show that naive (randomly initialized) deep learning models that are trained on a \textit{single} SCE with a \textit{single} optimization step can still solve non-trivial versions of the task relatively well. We extend our findings to solve, without any prior training, real-world anomaly detection tasks in the visual and auditory modalities. These results demonstrate the fluid-intelligent computational capabilities of deep networks. We discuss the implications of our work for constructing fluid-intelligent machines.
    Bi-stochastically normalized graph Laplacian: convergence to manifold Laplacian and robustness to outlier noise. (arXiv:2206.11386v2 [math.ST] UPDATED)
    Bi-stochastic normalization provides an alternative normalization of graph Laplacians in graph-based data analysis and can be computed efficiently by Sinkhorn-Knopp (SK) iterations. This paper proves the convergence of bi-stochastically normalized graph Laplacian to manifold (weighted-)Laplacian with rates, when $n$ data points are i.i.d. sampled from a general $d$-dimensional manifold embedded in a possibly high-dimensional space. Under certain joint limit of $n \to \infty$ and kernel bandwidth $\epsilon \to 0$, the point-wise convergence rate of the graph Laplacian operator (under 2-norm) is proved to be $ O( n^{-1/(d/2+3)})$ at finite large $n$ up to log factors, achieved at the scaling of $\epsilon \sim n^{-1/(d/2+3)} $. When the manifold data are corrupted by outlier noise, we theoretically prove the graph Laplacian point-wise consistency which matches the rate for clean manifold data plus an additional term proportional to the boundedness of the inner-products of the noise vectors among themselves and with data vectors. Motivated by our analysis, which suggests that not exact bi-stochastic normalization but an approximate one will achieve the same consistency rate, we propose an approximate and constrained matrix scaling problem that can be solved by SK iterations with early termination. Numerical experiments support our theoretical results and show the robustness of bi-stochastically normalized graph Laplacian to high-dimensional outlier noise.
    Accelerating Domain-aware Deep Learning Models with Distributed Training. (arXiv:2301.11787v1 [cs.LG])
    Recent advances in data-generating techniques led to an explosive growth of geo-spatiotemporal data. In domains such as hydrology, ecology, and transportation, interpreting the complex underlying patterns of spatiotemporal interactions with the help of deep learning techniques hence becomes the need of the hour. However, applying deep learning techniques without domain-specific knowledge tends to provide sub-optimal prediction performance. Secondly, training such models on large-scale data requires extensive computational resources. To eliminate these challenges, we present a novel distributed domain-aware spatiotemporal network that utilizes domain-specific knowledge with improved model performance. Our network consists of a pixel-contribution block, a distributed multiheaded multichannel convolutional (CNN) spatial block, and a recurrent temporal block. We choose flood prediction in hydrology as a use case to test our proposed method. From our analysis, the network effectively predicts high peaks in discharge measurements at watershed outlets with up to 4.1x speedup and increased prediction performance of up to 93\%. Our approach achieved a 12.6x overall speedup and increased the mean prediction performance by 16\%. We perform extensive experiments on a dataset of 23 watersheds in a northern state of the U.S. and present our findings.
    Overparameterized Linear Regression under Adversarial Attacks. (arXiv:2204.06274v2 [stat.ML] UPDATED)
    We study the error of linear regression in the face of adversarial attacks. In this framework, an adversary changes the input to the regression model in order to maximize the prediction error. We provide bounds on the prediction error in the presence of an adversary as a function of the parameter norm and the error in the absence of such an adversary. We show how these bounds make it possible to study the adversarial error using analysis from non-adversarial setups. The obtained results shed light on the robustness of overparameterized linear models to adversarial attacks. Adding features might be either a source of additional robustness or brittleness. On the one hand, we use asymptotic results to illustrate how double-descent curves can be obtained for the adversarial error. On the other hand, we derive conditions under which the adversarial error can grow to infinity as more features are added, while at the same time, the test error goes to zero. We show this behavior is caused by the fact that the norm of the parameter vector grows with the number of features. It is also established that $\ell_\infty$ and $\ell_2$-adversarial attacks might behave fundamentally differently due to how the $\ell_1$ and $\ell_2$-norms of random projections concentrate. We also show how our reformulation allows for solving adversarial training as a convex optimization problem. This fact is then exploited to establish similarities between adversarial training and parameter-shrinking methods and to study how the training might affect the robustness of the estimated models.
    Feature Selection on Quantum Computers. (arXiv:2203.13261v2 [quant-ph] UPDATED)
    In machine learning, fewer features reduce model complexity. Carefully assessing the influence of each input feature on the model quality is therefore a crucial preprocessing step. We propose a novel feature selection algorithm based on a quadratic unconstrained binary optimization (QUBO) problem, which allows to select a specified number of features based on their importance and redundancy. In contrast to iterative or greedy methods, our direct approach yields higherquality solutions. QUBO problems are particularly interesting because they can be solved on quantum hardware. To evaluate our proposed algorithm, we conduct a series of numerical experiments using a classical computer, a quantum gate computer and a quantum annealer. Our evaluation compares our method to a range of standard methods on various benchmark datasets. We observe competitive performance.  ( 2 min )
    Constrained Monotonic Neural Networks. (arXiv:2205.11775v2 [cs.LG] UPDATED)
    Deep neural networks are becoming increasingly popular in approximating arbitrary functions from noisy data. But wider adoption is being hindered by the need to explain such models and to impose additional constraints on them. Monotonicity constraint is one of the most requested properties in real-world scenarios and is the focus of this paper. One of the oldest ways to construct a monotonic fully connected neural network is to constrain its weights to be non-negative while employing a monotonic activation function. Unfortunately, this construction does not work with popular non-saturated activation functions such as ReLU, ELU, SELU etc, as it can only approximate convex functions. We show this shortcoming can be fixed by employing the original activation function for a part of the neurons in the layer, and employing its point reflection for the other part. Our experiments show this approach of building monotonic deep neural networks have matching or better accuracy when compared to other state-of-the-art methods such as deep lattice networks or monotonic networks obtained by heuristic regularization. This method is the simplest one in the sense of having the least number of parameters, not requiring any modifications to the learning procedure or steps post-learning steps.
    Embrace the Gap: VAEs Perform Independent Mechanism Analysis. (arXiv:2206.02416v3 [stat.ML] UPDATED)
    Variational autoencoders (VAEs) are a popular framework for modeling complex data distributions; they can be efficiently trained via variational inference by maximizing the evidence lower bound (ELBO), at the expense of a gap to the exact (log-)marginal likelihood. While VAEs are commonly used for representation learning, it is unclear why ELBO maximization would yield useful representations, since unregularized maximum likelihood estimation cannot invert the data-generating process. Yet, VAEs often succeed at this task. We seek to elucidate this apparent paradox by studying nonlinear VAEs in the limit of near-deterministic decoders. We first prove that, in this regime, the optimal encoder approximately inverts the decoder -- a commonly used but unproven conjecture -- which we refer to as {\em self-consistency}. Leveraging self-consistency, we show that the ELBO converges to a regularized log-likelihood. This allows VAEs to perform what has recently been termed independent mechanism analysis (IMA): it adds an inductive bias towards decoders with column-orthogonal Jacobians, which helps recovering the true latent factors. The gap between ELBO and log-likelihood is therefore welcome, since it bears unanticipated benefits for nonlinear representation learning. In experiments on synthetic and image data, we show that VAEs uncover the true latent factors when the data generating process satisfies the IMA assumption.
    SOBER: Scalable Batch Bayesian Optimization and Quadrature using Recombination Constraints. (arXiv:2301.11832v1 [cs.LG])
    Batch Bayesian optimisation (BO) has shown to be a sample-efficient method of performing optimisation where expensive-to-evaluate objective functions can be queried in parallel. However, current methods do not scale to large batch sizes -- a frequent desideratum in practice (e.g. drug discovery or simulation-based inference). We present a novel algorithm, SOBER, which permits scalable and diversified batch BO with arbitrary acquisition functions, arbitrary input spaces (e.g. graph), and arbitrary kernels. The key to our approach is to reformulate batch selection for BO as a Bayesian quadrature (BQ) problem, which offers computational advantages. This reformulation is beneficial in solving BQ tasks reciprocally, which introduces the exploitative functionality of BO to BQ. We show that SOBER offers substantive performance gains in synthetic and real-world tasks, including drug discovery and simulation-based inference.
    Image Restoration with Mean-Reverting Stochastic Differential Equations. (arXiv:2301.11699v1 [cs.LG])
    This paper presents a stochastic differential equation (SDE) approach for general-purpose image restoration. The key construction consists in a mean-reverting SDE that transforms a high-quality image into a degraded counterpart as a mean state with fixed Gaussian noise. Then, by simulating the corresponding reverse-time SDE, we are able to restore the origin of the low-quality image without relying on any task-specific prior knowledge. Crucially, the proposed mean-reverting SDE has a closed-form solution, allowing us to compute the ground truth time-dependent score and learn it with a neural network. Moreover, we propose a maximum likelihood objective to learn an optimal reverse trajectory which stabilizes the training and improves the restoration results. In the experiments, we show that our proposed method achieves highly competitive performance in quantitative comparisons on image deraining, deblurring, and denoising, setting a new state-of-the-art on two deraining datasets. Finally, the general applicability of our approach is further demonstrated via qualitative results on image super-resolution, inpainting, and dehazing. Code is available at \url{https://github.com/Algolzw/image-restoration-sde}.
    Efficiently predicting high resolution mass spectra with graph neural networks. (arXiv:2301.11419v1 [cs.LG])
    Identifying a small molecule from its mass spectrum is the primary open problem in computational metabolomics. This is typically cast as information retrieval: an unknown spectrum is matched against spectra predicted computationally from a large database of chemical structures. However, current approaches to spectrum prediction model the output space in ways that force a tradeoff between capturing high resolution mass information and tractable learning. We resolve this tradeoff by casting spectrum prediction as a mapping from an input molecular graph to a probability distribution over molecular formulas. We discover that a large corpus of mass spectra can be closely approximated using a fixed vocabulary constituting only 2% of all observed formulas. This enables efficient spectrum prediction using an architecture similar to graph classification - GrAFF-MS - achieving significantly lower prediction error and orders-of-magnitude faster runtime than state-of-the-art methods.
    Constrained Submodular Optimization for Vaccine Design. (arXiv:2206.08336v2 [q-bio.QM] UPDATED)
    Advances in machine learning have enabled the prediction of immune system responses to prophylactic and therapeutic vaccines. However, the engineering task of designing vaccines remains a challenge. In particular, the genetic variability of the human immune system makes it difficult to design peptide vaccines that provide widespread immunity in vaccinated populations. We introduce a framework for evaluating and designing peptide vaccines that uses probabilistic machine learning models, and demonstrate its ability to produce designs for a SARS-CoV-2 vaccine that outperform previous designs. We provide a theoretical analysis of the approximability, scalability, and complexity of our framework.
    Regret Analysis of Learning-Based MPC with Partially-Unknown Cost Function. (arXiv:2108.02307v2 [math.OC] UPDATED)
    The exploration/exploitation trade-off is an inherent challenge in data-driven adaptive control. Though this trade-off has been studied for multi-armed bandits (MAB's) and reinforcement learning for linear systems; it is less well-studied for learning-based control of nonlinear systems. A significant theoretical challenge in the nonlinear setting is that there is no explicit characterization of an optimal controller for a given set of cost and system parameters. We propose the use of a finite-horizon oracle controller with full knowledge of parameters as a reasonable surrogate to optimal controller. This allows us to develop policies in the context of learning-based MPC and MAB's and conduct a control-theoretic analysis using techniques from MPC- and optimization-theory to show these policies achieve low regret with respect to this finite-horizon oracle. Our simulations exhibit the low regret of our policy on a heating, ventilation, and air-conditioning model with partially-unknown cost function.  ( 2 min )
    Achieving Risk Control in Online Learning Settings. (arXiv:2205.09095v7 [cs.LG] UPDATED)
    To provide rigorous uncertainty quantification for online learning models, we develop a framework for constructing uncertainty sets that provably control risk -- such as coverage of confidence intervals, false negative rate, or F1 score -- in the online setting. This extends conformal prediction to apply to a larger class of online learning problems. Our method guarantees risk control at any user-specified level even when the underlying data distribution shifts drastically, even adversarially, over time in an unknown fashion. The technique we propose is highly flexible as it can be applied with any base online learning algorithm (e.g., a deep neural network trained online), requiring minimal implementation effort and essentially zero additional computational cost. We further extend our approach to control multiple risks simultaneously, so the prediction sets we generate are valid for all given risks. To demonstrate the utility of our method, we conduct experiments on real-world tabular time-series data sets showing that the proposed method rigorously controls various natural risks. Furthermore, we show how to construct valid intervals for an online image-depth estimation problem that previous sequential calibration schemes cannot handle.
    Safe Posterior Sampling for Constrained MDPs with Bounded Constraint Violation. (arXiv:2301.11547v1 [cs.LG])
    Constrained Markov decision processes (CMDPs) model scenarios of sequential decision making with multiple objectives that are increasingly important in many applications. However, the model is often unknown and must be learned online while still ensuring the constraint is met, or at least the violation is bounded with time. Some recent papers have made progress on this very challenging problem but either need unsatisfactory assumptions such as knowledge of a safe policy, or have high cumulative regret. We propose the Safe PSRL (posterior sampling-based RL) algorithm that does not need such assumptions and yet performs very well, both in terms of theoretical regret bounds as well as empirically. The algorithm achieves an efficient tradeoff between exploration and exploitation by use of the posterior sampling principle, and provably suffers only bounded constraint violation by leveraging the idea of pessimism. Our approach is based on a primal-dual approach. We establish a sub-linear $\tilde{\mathcal{ O}}\left(H^{2.5} \sqrt{|\mathcal{S}|^2 |\mathcal{A}| K} \right)$ upper bound on the Bayesian reward objective regret along with a bounded, i.e., $\tilde{\mathcal{O}}\left(1\right)$ constraint violation regret over $K$ episodes for an $|\mathcal{S}|$-state, $|\mathcal{A}|$-action and horizon $H$ CMDP.  ( 2 min )
    Targeted Attacks on Timeseries Forecasting. (arXiv:2301.11544v1 [cs.LG])
    Real-world deep learning models developed for Time Series Forecasting are used in several critical applications ranging from medical devices to the security domain. Many previous works have shown how deep learning models are prone to adversarial attacks and studied their vulnerabilities. However, the vulnerabilities of time series models for forecasting due to adversarial inputs are not extensively explored. While the attack on a forecasting model might aim to deteriorate the performance of the model, it is more effective, if the attack is focused on a specific impact on the model's output. In this paper, we propose a novel formulation of Directional, Amplitudinal, and Temporal targeted adversarial attacks on time series forecasting models. These targeted attacks create a specific impact on the amplitude and direction of the output prediction. We use the existing adversarial attack techniques from the computer vision domain and adapt them for time series. Additionally, we propose a modified version of the Auto Projected Gradient Descent attack for targeted attacks. We examine the impact of the proposed targeted attacks versus untargeted attacks. We use KS-Tests to statistically demonstrate the impact of the attack. Our experimental results show how targeted attacks on time series models are viable and are more powerful in terms of statistical similarity. It is, hence difficult to detect through statistical methods. We believe that this work opens a new paradigm in the time series forecasting domain and represents an important consideration for developing better defenses.  ( 2 min )
    Decentralized Online Bandit Optimization on Directed Graphs with Regret Bounds. (arXiv:2301.11802v1 [cs.LG])
    We consider a decentralized multiplayer game, played over $T$ rounds, with a leader-follower hierarchy described by a directed acyclic graph. For each round, the graph structure dictates the order of the players and how players observe the actions of one another. By the end of each round, all players receive a joint bandit-reward based on their joint action that is used to update the player strategies towards the goal of minimizing the joint pseudo-regret. We present a learning algorithm inspired by the single-player multi-armed bandit problem and show that it achieves sub-linear joint pseudo-regret in the number of rounds for both adversarial and stochastic bandit rewards. Furthermore, we quantify the cost incurred due to the decentralized nature of our problem compared to the centralized setting.
    Multimodal and Explainable Internet Meme Classification. (arXiv:2212.05612v2 [cs.AI] UPDATED)
    Warning: this paper contains content that may be offensive or upsetting. In the current context where online platforms have been effectively weaponized in a variety of geo-political events and social issues, Internet memes make fair content moderation at scale even more difficult. Existing work on meme classification and tracking has focused on black-box methods that do not explicitly consider the semantics of the memes or the context of their creation. In this paper, we pursue a modular and explainable architecture for Internet meme understanding. We design and implement multimodal classification methods that perform example- and prototype-based reasoning over training cases, while leveraging both textual and visual SOTA models to represent the individual cases. We study the relevance of our modular and explainable models in detecting harmful memes on two existing tasks: Hate Speech Detection and Misogyny Classification. We compare the performance between example- and prototype-based methods, and between text, vision, and multimodal models, across different categories of harmfulness (e.g., stereotype and objectification). We devise a user-friendly interface that facilitates the comparative analysis of examples retrieved by all of our models for any given meme, informing the community about the strengths and limitations of these explainable methods.  ( 2 min )
    Certified Invertibility in Neural Networks via Mixed-Integer Programming. (arXiv:2301.11783v1 [cs.LG])
    Neural networks are notoriously vulnerable to adversarial attacks -- small imperceptible perturbations that can change the network's output drastically. In the reverse direction, there may exist large, meaningful perturbations that leave the network's decision unchanged (excessive invariance, nonivertibility). We study the latter phenomenon in two contexts: (a) discrete-time dynamical system identification, as well as (b) calibration of the output of one neural network to the output of another (neural network matching). For ReLU networks and $L_p$ norms ($p=1,2,\infty$), we formulate these optimization problems as mixed-integer programs (MIPs) that apply to neural network approximators of dynamical systems. We also discuss the applicability of our results to invertibility certification in transformations between neural networks (e.g. at different levels of pruning).
    Personalised Federated Learning On Heterogeneous Feature Spaces. (arXiv:2301.11447v1 [cs.LG])
    Most personalised federated learning (FL) approaches assume that raw data of all clients are defined in a common subspace i.e. all clients store their data according to the same schema. For real-world applications, this assumption is restrictive as clients, having their own systems to collect and then store data, may use heterogeneous data representations. We aim at filling this gap. To this end, we propose a general framework coined FLIC that maps client's data onto a common feature space via local embedding functions. The common feature space is learnt in a federated manner using Wasserstein barycenters while the local embedding functions are trained on each client via distribution alignment. We integrate this distribution alignement mechanism into a federated learning approach and provide the algorithmics of FLIC. We compare its performances against FL benchmarks involving heterogeneous input features spaces. In addition, we provide theoretical insights supporting the relevance of our methodology.  ( 2 min )
    Distributionally Robust Multi-objective Bayesian Optimization under Uncertain Environments. (arXiv:2301.11588v1 [stat.ML])
    In this study, we address the problem of optimizing multi-output black-box functions under uncertain environments. We formulate this problem as the estimation of the uncertain Pareto-frontier (PF) of a multi-output Bayesian surrogate model with two types of variables: design variables and environmental variables. We consider this problem within the context of Bayesian optimization (BO) under uncertain environments, where the design variables are controllable, whereas the environmental variables are assumed to be random and not controllable. The challenge of this problem is to robustly estimate the PF when the distribution of the environmental variables is unknown, that is, to estimate the PF when the environmental variables are generated from the worst possible distribution. We propose a method for solving the BO problem by appropriately incorporating the uncertainties of the environmental variables and their probability distribution. We demonstrate that the proposed method can find an arbitrarily accurate PF with high probability in a finite number of iterations. We also evaluate the performance of the proposed method through numerical experiments.  ( 2 min )
    Big portfolio selection by graph-based conditional moments method. (arXiv:2301.11697v1 [stat.ML])
    How to do big portfolio selection is very important but challenging for both researchers and practitioners. In this paper, we propose a new graph-based conditional moments (GRACE) method to do portfolio selection based on thousands of stocks or more. The GRACE method first learns the conditional quantiles and mean of stock returns via a factor-augmented temporal graph convolutional network, which guides the learning procedure through a factor-hypergraph built by the set of stock-to-stock relations from the domain knowledge as well as the set of factor-to-stock relations from the asset pricing knowledge. Next, the GRACE method learns the conditional variance, skewness, and kurtosis of stock returns from the learned conditional quantiles by using the quantiled conditional moment (QCM) method. The QCM method is a supervised learning procedure to learn these conditional higher-order moments, so it largely overcomes the computational difficulty from the classical high-dimensional GARCH-type methods. Moreover, the QCM method allows the mis-specification in modeling conditional quantiles to some extent, due to its regression-based nature. Finally, the GRACE method uses the learned conditional mean, variance, skewness, and kurtosis to construct several performance measures, which are criteria to sort the stocks to proceed the portfolio selection in the well-known 10-decile framework. An application to NASDAQ and NYSE stock markets shows that the GRACE method performs much better than its competitors, particularly when the performance measures are comprised of conditional variance, skewness, and kurtosis.
    Collaborative Regret Minimization in Multi-Armed Bandits. (arXiv:2301.11442v1 [cs.LG])
    In this paper, we study the collaborative learning model, which concerns the tradeoff between parallelism and communication overhead in multi-agent reinforcement learning. For a fundamental problem in bandit theory, regret minimization in multi-armed bandits, we present the first and almost tight tradeoffs between the number of rounds of communication between the agents and the regret of the collaborative learning process.  ( 2 min )
    Generalized Munchausen Reinforcement Learning using Tsallis KL Divergence. (arXiv:2301.11476v1 [cs.LG])
    Many policy optimization approaches in reinforcement learning incorporate a Kullback-Leilbler (KL) divergence to the previous policy, to prevent the policy from changing too quickly. This idea was initially proposed in a seminal paper on Conservative Policy Iteration, with approximations given by algorithms like TRPO and Munchausen Value Iteration (MVI). We continue this line of work by investigating a generalized KL divergence -- called the Tsallis KL divergence -- which use the $q$-logarithm in the definition. The approach is a strict generalization, as $q = 1$ corresponds to the standard KL divergence; $q > 1$ provides a range of new options. We characterize the types of policies learned under the Tsallis KL, and motivate when $q >1$ could be beneficial. To obtain a practical algorithm that incorporates Tsallis KL regularization, we extend MVI, which is one of the simplest approaches to incorporate KL regularization. We show that this generalized MVI($q$) obtains significant improvements over the standard MVI($q = 1$) across 35 Atari games.
    Neural Abstractions. (arXiv:2301.11683v1 [cs.LO])
    We present a novel method for the safety verification of nonlinear dynamical models that uses neural networks to represent abstractions of their dynamics. Neural networks have extensively been used before as approximators; in this work, we make a step further and use them for the first time as abstractions. For a given dynamical model, our method synthesises a neural network that overapproximates its dynamics by ensuring an arbitrarily tight, formally certified bound on the approximation error. For this purpose, we employ a counterexample-guided inductive synthesis procedure. We show that this produces a neural ODE with non-deterministic disturbances that constitutes a formal abstraction of the concrete model under analysis. This guarantees a fundamental property: if the abstract model is safe, i.e., free from any initialised trajectory that reaches an undesirable state, then the concrete model is also safe. By using neural ODEs with ReLU activation functions as abstractions, we cast the safety verification problem for nonlinear dynamical models into that of hybrid automata with affine dynamics, which we verify using SpaceEx. We demonstrate that our approach performs comparably to the mature tool Flow* on existing benchmark nonlinear models. We additionally demonstrate and that it is effective on models that do not exhibit local Lipschitz continuity, which are out of reach to the existing technologies.
    Semi-Parametric Video-Grounded Text Generation. (arXiv:2301.11507v1 [cs.CV])
    Efficient video-language modeling should consider the computational cost because of a large, sometimes intractable, number of video frames. Parametric approaches such as the attention mechanism may not be ideal since its computational cost quadratically increases as the video length increases. Rather, previous studies have relied on offline feature extraction or frame sampling to represent the video efficiently, focusing on cross-modal modeling in short video clips. In this paper, we propose a semi-parametric video-grounded text generation model, SeViT, a novel perspective on scalable video-language modeling toward long untrimmed videos. Treating a video as an external data store, SeViT includes a non-parametric frame retriever to select a few query-relevant frames from the data store for a given query and a parametric generator to effectively aggregate the frames with the query via late fusion methods. Experimental results demonstrate our method has a significant advantage in longer videos and causal video understanding. Moreover, our model achieves the new state of the art on four video-language datasets, iVQA (+4.8), Next-QA (+6.9), and Activitynet-QA (+4.8) in accuracy, and MSRVTT-Caption (+3.6) in CIDEr.
    Single-Trajectory Distributionally Robust Reinforcement Learning. (arXiv:2301.11721v1 [stat.ML])
    As a framework for sequential decision-making, Reinforcement Learning (RL) has been regarded as an essential component leading to Artificial General Intelligence (AGI). However, RL is often criticized for having the same training environment as the test one, which also hinders its application in the real world. To mitigate this problem, Distributionally Robust RL (DRRL) is proposed to improve the worst performance in a set of environments that may contain the unknown test environment. Due to the nonlinearity of the robustness goal, most of the previous work resort to the model-based approach, learning with either an empirical distribution learned from the data or a simulator that can be sampled infinitely, which limits their applications in simple dynamics environments. In contrast, we attempt to design a DRRL algorithm that can be trained along a single trajectory, i.e., no repeated sampling from a state. Based on the standard Q-learning, we propose distributionally robust Q-learning with the single trajectory (DRQ) and its average-reward variant named differential DRQ. We provide asymptotic convergence guarantees and experiments for both settings, demonstrating their superiority in the perturbed environments against the non-robust ones.
    Outcome-directed Reinforcement Learning by Uncertainty & Temporal Distance-Aware Curriculum Goal Generation. (arXiv:2301.11741v1 [cs.LG])
    Current reinforcement learning (RL) often suffers when solving a challenging exploration problem where the desired outcomes or high rewards are rarely observed. Even though curriculum RL, a framework that solves complex tasks by proposing a sequence of surrogate tasks, shows reasonable results, most of the previous works still have difficulty in proposing curriculum due to the absence of a mechanism for obtaining calibrated guidance to the desired outcome state without any prior domain knowledge. To alleviate it, we propose an uncertainty & temporal distance-aware curriculum goal generation method for the outcome-directed RL via solving a bipartite matching problem. It could not only provide precisely calibrated guidance of the curriculum to the desired outcome states but also bring much better sample efficiency and geometry-agnostic curriculum goal proposal capability compared to previous curriculum RL methods. We demonstrate that our algorithm significantly outperforms these prior methods in a variety of challenging navigation tasks and robotic manipulation tasks in a quantitative and qualitative way.
    Semi-Supervised Machine Learning: a Homological Approach. (arXiv:2301.11658v1 [cs.LG])
    In this paper we describe the mathematical foundations of a new approach to semi-supervised Machine Learning. Using techniques of Symbolic Computation and Computer Algebra, we apply the concept of persistent homology to obtain a new semi-supervised learning method.
    Modeling human road crossing decisions as reward maximization with visual perception limitations. (arXiv:2301.11737v1 [cs.LG])
    Understanding the interaction between different road users is critical for road safety and automated vehicles (AVs). Existing mathematical models on this topic have been proposed based mostly on either cognitive or machine learning (ML) approaches. However, current cognitive models are incapable of simulating road user trajectories in general scenarios, and ML models lack a focus on the mechanisms generating the behavior and take a high-level perspective which can cause failures to capture important human-like behaviors. Here, we develop a model of human pedestrian crossing decisions based on computational rationality, an approach using deep reinforcement learning (RL) to learn boundedly optimal behavior policies given human constraints, in our case a model of the limited human visual system. We show that the proposed combined cognitive-RL model captures human-like patterns of gap acceptance and crossing initiation time. Interestingly, our model's decisions are sensitive to not only the time gap, but also the speed of the approaching vehicle, something which has been described as a "bias" in human gap acceptance behavior. However, our results suggest that this is instead a rational adaption to human perceptual limitations. Moreover, we demonstrate an approach to accounting for individual differences in computational rationality models, by conditioning the RL policy on the parameters of the human constraints. Our results demonstrate the feasibility of generating more human-like road user behavior by combining RL with cognitive models.
    LegendreTron: Uprising Proper Multiclass Loss Learning. (arXiv:2301.11695v1 [stat.ML])
    Loss functions serve as the foundation of supervised learning and are often chosen prior to model development. To avoid potentially ad hoc choices of losses, statistical decision theory describes a desirable property for losses known as \emph{properness}, which asserts that Bayes' rule is optimal. Recent works have sought to \emph{learn losses} and models jointly. Existing methods do this by fitting an inverse canonical link function which monotonically maps $\mathbb{R}$ to $[0,1]$ to estimate probabilities for binary problems. In this paper, we extend monotonicity to maps between $\mathbb{R}^{C-1}$ and the projected probability simplex $\tilde{\Delta}^{C-1}$ by using monotonicity of gradients of convex functions. We present {\sc LegendreTron} as a novel and practical method that jointly learns \emph{proper canonical losses} and probabilities for multiclass problems. Tested on a benchmark of domains with up to 1,000 classes, our experimental results show that our method consistently outperforms the natural multiclass baseline under a $t$-test at 99% significance on all datasets with greater than 10 classes.
    Synopsis: Sequential Decision Problems with Weak Feedback. (arXiv:2212.11599v2 [cs.LG] UPDATED)
    This thesis considers sequential decision problems, where the loss/reward incurred by selecting an action may not be inferred from observed feedback. A major part of this thesis focuses on the unsupervised sequential selection problem, where one can not infer the loss incurred for selecting an action from observed feedback. We also introduce a new setup named Censored Semi Bandits, where the loss incurred for selecting an action can be observed under certain conditions. Finally, we study the channel selection problem in the communication networks, where the reward for an action is only observed when no other player selects that action to play in the round. These problems find applications in many fields like healthcare, crowd-sourcing, security, adaptive resource allocation, among many others. This thesis aims to address the above-described sequential decision problems by exploiting specific structures these problems exhibit. We develop provably optimal algorithms for each of these setups with weak feedback and validate their empirical performance on different problem instances derived from synthetic and real datasets.  ( 2 min )
    SLCNN: Sentence-Level Convolutional Neural Network for Text Classification. (arXiv:2301.11696v1 [cs.CL])
    Text classification is a fundamental task in natural language processing (NLP). Several recent studies show the success of deep learning on text processing. Convolutional neural network (CNN), as a popular deep learning model, has shown remarkable success in the task of text classification. In this paper, new baseline models have been studied for text classification using CNN. In these models, documents are fed to the network as a three-dimensional tensor representation to provide sentence-level analysis. Applying such a method enables the models to take advantage of the positional information of the sentences in the text. Besides, analysing adjacent sentences allows extracting additional features. The proposed models have been compared with the state-of-the-art models using several datasets. The results have shown that the proposed models have better performance, particularly in the longer documents.
    Uplifting Message Passing Neural Network with Graph Original Information. (arXiv:2210.05382v2 [cs.LG] UPDATED)
    Message passing neural networks (MPNNs) learn the representation of graph-structured data based on graph original information, including node features and graph structures, and have shown astonishing improvement in node classification tasks. However, the expressive power of MPNNs is upper bounded by the first-order Weisfeiler-Leman test and its accuracy still has room for improvement. This work studies how to improve MPNNs' expressiveness and generalizability by fully exploiting graph original information both theoretically and empirically. It further proposes a new GNN model called INGNN (INformation-enhanced Graph Neural Network) that leverages the insights to improve node classification performance. Extensive experiments on both synthetic and real datasets demonstrate the superiority (average rank 1.78) of our INGNN compared with state-of-the-art methods.  ( 2 min )
    Soft Labels for Rapid Satellite Object Detection. (arXiv:2212.00585v3 [cs.CV] UPDATED)
    Soft labels in image classification are vector representations of an image's true classification. In this paper, we investigate soft labels in the context of satellite object detection. We propose using detections as the basis for a new dataset of soft labels. Much of the effort in creating a high-quality model is gathering and annotating the training data. If we could use a model to generate a dataset for us, we could not only rapidly create datasets, but also supplement existing open-source datasets. Using a subset of the xView dataset, we train a YOLOv5 model to detect cars, planes, and ships. We then use that model to generate soft labels for the second training set which we then train and compare to the original model. We show that soft labels can be used to train a model that is almost as accurate as a model trained on the original data.  ( 2 min )
    Feasibility and Transferability of Transfer Learning: A Mathematical Framework. (arXiv:2301.11542v1 [cs.LG])
    Transfer learning is an emerging and popular paradigm for utilizing existing knowledge from previous learning tasks to improve the performance of new ones. Despite its numerous empirical successes, theoretical analysis for transfer learning is limited. In this paper we build for the first time, to the best of our knowledge, a mathematical framework for the general procedure of transfer learning. Our unique reformulation of transfer learning as an optimization problem allows for the first time, analysis of its feasibility. Additionally, we propose a novel concept of transfer risk to evaluate transferability of transfer learning. Our numerical studies using the Office-31 dataset demonstrate the potential and benefits of incorporating transfer risk in the evaluation of transfer learning performance.
    A Robust Optimisation Perspective on Counterexample-Guided Repair of Neural Networks. (arXiv:2301.11342v1 [cs.LG])
    Counterexample-guided repair aims at creating neural networks with mathematical safety guarantees, facilitating the application of neural networks in safety-critical domains. However, whether counterexample-guided repair is guaranteed to terminate remains an open question. We approach this question by showing that counterexample-guided repair can be viewed as a robust optimisation algorithm. While termination guarantees for neural network repair itself remain beyond our reach, we prove termination for more restrained machine learning models and disprove termination in a general setting. We empirically study the practical implications of our theoretical results, demonstrating the suitability of common verifiers and falsifiers for repair despite a disadvantageous theoretical result. Additionally, we use our theoretical insights to devise a novel algorithm for repairing linear regression models, surpassing existing approaches.  ( 2 min )
    MLExchange: A web-based platform enabling exchangeable machine learning workflows for scientific studies. (arXiv:2208.09751v4 [cs.LG] UPDATED)
    Machine learning (ML) algorithms are showing a growing trend in helping the scientific communities across different disciplines and institutions to address large and diverse data problems. However, many available ML tools are programmatically demanding and computationally costly. The MLExchange project aims to build a collaborative platform equipped with enabling tools that allow scientists and facility users who do not have a profound ML background to use ML and computational resources in scientific discovery. At the high level, we are targeting a full user experience where managing and exchanging ML algorithms, workflows, and data are readily available through web applications. Since each component is an independent container, the whole platform or its individual service(s) can be easily deployed at servers of different scales, ranging from a personal device (laptop, smart phone, etc.) to high performance clusters (HPC) accessed (simultaneously) by many users. Thus, MLExchange renders flexible using scenarios -- users could either access the services and resources from a remote server or run the whole platform or its individual service(s) within their local network.  ( 2 min )
    Finite-time analysis of single-timescale actor-critic. (arXiv:2210.09921v2 [cs.LG] UPDATED)
    Actor-critic methods have achieved significant success in many challenging applications. However, its finite-time convergence is still poorly understood in its most practical form. Existing works on analyzing single-timescale actor-critic only focus on the i.i.d. sampling or tabular setting for simplicity. We consider the more practical online single-timescale actor-critic algorithm on continuous state space, where the critic is updated with a single Markovian sample per actor step. Existing analysis cannot conclude the convergence for such a challenging case. We prove that the online single-timescale actor-critic method is guaranteed to find an $\epsilon$-approximate stationary point with $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity under standard assumptions, which can be further improved to $\mathcal{O}(\epsilon^{-2})$ under the i.i.d. sampling. We develop a novel framework that evaluates and controls the error propagation between actor and critic systematically. To our knowledge, this is the first finite-time analysis for the online single-timescale actor-critic method. Our results compare favorably to the existing literature in terms of considering the most practical yet challenging settings and requiring weaker assumptions.  ( 2 min )
    Phy-Q as a measure for physical reasoning intelligence. (arXiv:2108.13696v3 [cs.AI] UPDATED)
    Humans are well-versed in reasoning about the behaviors of physical objects and choosing actions accordingly to accomplish tasks, while it remains a major challenge for AI. To facilitate research addressing this problem, we propose a new testbed that requires an agent to reason about physical scenarios and take an action appropriately. Inspired by the physical knowledge acquired in infancy and the capabilities required for robots to operate in real-world environments, we identify 15 essential physical scenarios. We create a wide variety of distinct task templates, and we ensure all the task templates within the same scenario can be solved by using one specific strategic physical rule. By having such a design, we evaluate two distinct levels of generalization, namely the local generalization and the broad generalization. We conduct an extensive evaluation with human players, learning agents with varying input types and architectures, and heuristic agents with different strategies. Inspired by how human IQ is calculated, we define the physical reasoning quotient (Phy-Q score) that reflects the physical reasoning intelligence of an agent using the physical scenarios we considered. Our evaluation shows that 1) all agents are far below human performance, and 2) learning agents, even with good local generalization ability, struggle to learn the underlying physical reasoning rules and fail to generalize broadly. We encourage the development of intelligent agents that can reach the human level Phy-Q score. Website: https://github.com/phy-q/benchmark  ( 2 min )
    When Do Flat Minima Optimizers Work?. (arXiv:2202.00661v5 [cs.LG] UPDATED)
    Recently, flat-minima optimizers, which seek to find parameters in low-loss neighborhoods, have been shown to improve a neural network's generalization performance over stochastic and adaptive gradient-based optimizers. Two methods have received significant attention due to their scalability: 1. Stochastic Weight Averaging (SWA), and 2. Sharpness-Aware Minimization (SAM). However, there has been limited investigation into their properties and no systematic benchmarking of them across different domains. We fill this gap here by comparing the loss surfaces of the models trained with each method and through broad benchmarking across computer vision, natural language processing, and graph representation learning tasks. We discover several surprising findings from these results, which we hope will help researchers further improve deep learning optimizers, and practitioners identify the right optimizer for their problem.  ( 2 min )
    Differential Privacy has Bounded Impact on Fairness in Classification. (arXiv:2210.16242v2 [cs.LG] UPDATED)
    We theoretically study the impact of differential privacy on fairness in classification. We prove that, given a class of models, popular group fairness measures are pointwise Lipschitz-continuous with respect to the parameters of the model. This result is a consequence of a more general statement on accuracy conditioned on an arbitrary event (such as membership to a sensitive group), which may be of independent interest. We use the aforementioned Lipschitz property to prove a high probability bound showing that, given enough examples, the fairness level of private models is close to the one of their non-private counterparts.  ( 2 min )
    Optimized Sparse Matrix Operations for Reverse Mode Automatic Differentiation. (arXiv:2212.05159v2 [cs.LG] UPDATED)
    Sparse matrix representations are ubiquitous in computational science and machine learning, leading to significant reductions in compute time, in comparison to dense representation, for problems that have local connectivity. The adoption of sparse representation in leading ML frameworks such as PyTorch is incomplete, however, with support for both automatic differentiation and GPU acceleration missing. In this work, we present an implementation of a CSR-based sparse matrix wrapper for PyTorch with CUDA acceleration for basic matrix operations, as well as automatic differentiability. We also present several applications of the resulting sparse kernels to optimization problems, demonstrating ease of implementation and performance measurements versus their dense counterparts.  ( 2 min )
    On Instance-Dependent Bounds for Offline Reinforcement Learning with Linear Function Approximation. (arXiv:2211.13208v2 [cs.LG] UPDATED)
    Sample-efficient offline reinforcement learning (RL) with linear function approximation has recently been studied extensively. Much of prior work has yielded the minimax-optimal bound of $\tilde{\mathcal{O}}(\frac{1}{\sqrt{K}})$, with $K$ being the number of episodes in the offline data. In this work, we seek to understand instance-dependent bounds for offline RL with function approximation. We present an algorithm called Bootstrapped and Constrained Pessimistic Value Iteration (BCP-VI), which leverages data bootstrapping and constrained optimization on top of pessimism. We show that under a partial data coverage assumption, that of \emph{concentrability} with respect to an optimal policy, the proposed algorithm yields a fast rate of $\tilde{\mathcal{O}}(\frac{1}{K})$ for offline RL when there is a positive gap in the optimal Q-value functions, even when the offline data were adaptively collected. Moreover, when the linear features of the optimal actions in the states reachable by an optimal policy span those reachable by the behavior policy and the optimal actions are unique, offline RL achieves absolute zero sub-optimality error when $K$ exceeds a (finite) instance-dependent threshold. To the best of our knowledge, these are the first $\tilde{\mathcal{O}}(\frac{1}{K})$ bound and absolute zero sub-optimality bound respectively for offline RL with linear function approximation from adaptive data with partial coverage. We also provide instance-agnostic and instance-dependent information-theoretical lower bounds to complement our upper bounds.  ( 2 min )
    Fast Bayesian Inference with Batch Bayesian Quadrature via Kernel Recombination. (arXiv:2206.04734v4 [cs.LG] UPDATED)
    Calculation of Bayesian posteriors and model evidences typically requires numerical integration. Bayesian quadrature (BQ), a surrogate-model-based approach to numerical integration, is capable of superb sample efficiency, but its lack of parallelisation has hindered its practical applications. In this work, we propose a parallelised (batch) BQ method, employing techniques from kernel quadrature, that possesses an empirically exponential convergence rate. Additionally, just as with Nested Sampling, our method permits simultaneous inference of both posteriors and model evidence. Samples from our BQ surrogate model are re-selected to give a sparse set of samples, via a kernel recombination algorithm, requiring negligible additional time to increase the batch size. Empirically, we find that our approach significantly outperforms the sampling efficiency of both state-of-the-art BQ techniques and Nested Sampling in various real-world datasets, including lithium-ion battery analytics.  ( 2 min )
    Integrating Random Effects in Deep Neural Networks. (arXiv:2206.03314v3 [stat.ML] UPDATED)
    Modern approaches to supervised learning like deep neural networks (DNNs) typically implicitly assume that observed responses are statistically independent. In contrast, correlated data are prevalent in real-life large-scale applications, with typical sources of correlation including spatial, temporal and clustering structures. These correlations are either ignored by DNNs, or ad-hoc solutions are developed for specific use cases. We propose to use the mixed models framework to handle correlated data in DNNs. By treating the effects underlying the correlation structure as random effects, mixed models are able to avoid overfitted parameter estimates and ultimately yield better predictive performance. The key to combining mixed models and DNNs is using the Gaussian negative log-likelihood (NLL) as a natural loss function that is minimized with DNN machinery including stochastic gradient descent (SGD). Since NLL does not decompose like standard DNN loss functions, the use of SGD with NLL presents some theoretical and implementation challenges, which we address. Our approach which we call LMMNN is demonstrated to improve performance over natural competitors in various correlation scenarios on diverse simulated and real datasets. Our focus is on a regression setting and tabular datasets, but we also show some results for classification. Our code is available at https://github.com/gsimchoni/lmmnn.  ( 2 min )
    CROWDLAB: Supervised learning to infer consensus labels and quality scores for data with multiple annotators. (arXiv:2210.06812v2 [cs.LG] UPDATED)
    Real-world data for classification is often labeled by multiple annotators. For analyzing such data, we introduce CROWDLAB, a straightforward approach to utilize any trained classifier to estimate: (1) A consensus label for each example that aggregates the available annotations; (2) A confidence score for how likely each consensus label is correct; (3) A rating for each annotator quantifying the overall correctness of their labels. Existing algorithms to estimate related quantities in crowdsourcing often rely on sophisticated generative models with iterative inference. CROWDLAB instead uses a straightforward weighted ensemble. Existing algorithms often rely solely on annotator statistics, ignoring the features of the examples from which the annotations derive. CROWDLAB utilizes any classifier model trained on these features, and can thus better generalize between examples with similar features. On real-world multi-annotator image data, our proposed method provides superior estimates for (1)-(3) than existing algorithms like Dawid-Skene/GLAD.  ( 2 min )
    MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields. (arXiv:2206.07697v2 [stat.ML] UPDATED)
    Creating fast and accurate force fields is a long-standing challenge in computational chemistry and materials science. Recently, several equivariant message passing neural networks (MPNNs) have been shown to outperform models built using other approaches in terms of accuracy. However, most MPNNs suffer from high computational cost and poor scalability. We propose that these limitations arise because MPNNs only pass two-body messages leading to a direct relationship between the number of layers and the expressivity of the network. In this work, we introduce MACE, a new equivariant MPNN model that uses higher body order messages. In particular, we show that using four-body messages reduces the required number of message passing iterations to just two, resulting in a fast and highly parallelizable model, reaching or exceeding state-of-the-art accuracy on the rMD17, 3BPA, and AcAc benchmark tasks. We also demonstrate that using higher order messages leads to an improved steepness of the learning curves.  ( 2 min )
    Generalizability of Adversarial Robustness Under Distribution Shifts. (arXiv:2209.15042v2 [cs.LG] UPDATED)
    Recent progress in empirical and certified robustness promises to deliver reliable and deployable Deep Neural Networks (DNNs). Despite that success, most existing evaluations of DNN robustness have been done on images sampled from the same distribution on which the model was trained. However, in the real world, DNNs may be deployed in dynamic environments that exhibit significant distribution shifts. In this work, we take a first step towards thoroughly investigating the interplay between empirical and certified adversarial robustness on one hand and domain generalization on another. To do so, we train robust models on multiple domains and evaluate their accuracy and robustness on an unseen domain. We observe that: (1) both empirical and certified robustness generalize to unseen domains, and (2) the level of generalizability does not correlate well with input visual similarity, measured by the FID between source and target domains. We also extend our study to cover a real-world medical application, in which adversarial augmentation significantly boosts the generalization of robustness with minimal effect on clean data accuracy.  ( 2 min )
    Commonsense Knowledge Salience Evaluation with a Benchmark Dataset in E-commerce. (arXiv:2205.10843v2 [cs.CL] CROSS LISTED)
    In e-commerce, the salience of commonsense knowledge (CSK) is beneficial for widespread applications such as product search and recommendation. For example, when users search for ``running'' in e-commerce, they would like to find products highly related to running, such as ``running shoes'' rather than ``shoes''. Nevertheless, many existing CSK collections rank statements solely by confidence scores, and there is no information about which ones are salient from a human perspective. In this work, we define the task of supervised salience evaluation, where given a CSK triple, the model is required to learn whether the triple is salient or not. In addition to formulating the new task, we also release a new Benchmark dataset of Salience Evaluation in E-commerce (BSEE) and hope to promote related research on commonsense knowledge salience evaluation. We conduct experiments in the dataset with several representative baseline models. The experimental results show that salience evaluation is a challenging task where models perform poorly on our evaluation set. We further propose a simple but effective approach, PMI-tuning, which shows promise for solving this novel problem. Code is available in \url{https://github.com/OpenBGBenchmark/OpenBG-CSK.  ( 2 min )
    Task-Agnostic Graph Neural Network Evaluation via Adversarial Collaboration. (arXiv:2301.11517v1 [cs.LG])
    It has been increasingly demanding to develop reliable Graph Neural Network (GNN) evaluation methods to quantify the progress of the rapidly expanding GNN research. Existing GNN benchmarking methods focus on comparing the GNNs with respect to their performances on some node/graph classification/regression tasks in certain datasets. There lacks a principled, task-agnostic method to directly compare two GNNs. Moreover, most of the existing graph self-supervised learning (SSL) works incorporate handcrafted augmentations to the graph, which has several severe difficulties due to the unique characteristics of graph-structured data. To address the aforementioned issues, we propose GraphAC (Graph Adversarial Collaboration) -- a conceptually novel, principled, task-agnostic, and stable framework for evaluating GNNs through contrastive self-supervision. GraphAC succeeds in distinguishing GNNs of different expressiveness across various aspects, and has been proven to be a principled and reliable GNN evaluation method, eliminating the need for handcrafted augmentations for stable SSL.  ( 2 min )
    Learning and generalization of one-hidden-layer neural networks, going beyond standard Gaussian data. (arXiv:2207.03615v2 [cs.LG] UPDATED)
    This paper analyzes the convergence and generalization of training a one-hidden-layer neural network when the input features follow the Gaussian mixture model consisting of a finite number of Gaussian distributions. Assuming the labels are generated from a teacher model with an unknown ground truth weight, the learning problem is to estimate the underlying teacher model by minimizing a non-convex risk function over a student neural network. With a finite number of training samples, referred to the sample complexity, the iterations are proved to converge linearly to a critical point with guaranteed generalization error. In addition, for the first time, this paper characterizes the impact of the input distributions on the sample complexity and the learning rate.  ( 2 min )
    GATE: Gated Additive Tree Ensemble for Tabular Classification and Regression. (arXiv:2207.08548v4 [cs.LG] UPDATED)
    We propose a novel high-performance, parameter and computationally efficient deep learning architecture for tabular data, Gated Additive Tree Ensemble(GATE). GATE uses a gating mechanism, inspired from GRU, as a feature representation learning unit with an in-built feature selection mechanism. We combine it with an ensemble of differentiable, non-linear decision trees, re-weighted with simple self-attention to predict our desired output. We demonstrate that GATE is a competitive alternative to SOTA approaches like GBDTs, NODE, FT Transformers, etc. by experiments on several public datasets (both classification and regression). We have made available the code at https://github.com/manujosephv/GATE under MIT License.  ( 2 min )
    Explaining Patterns in Data with Language Models via Interpretable Autoprompting. (arXiv:2210.01848v2 [cs.LG] UPDATED)
    Large language models (LLMs) have displayed an impressive ability to harness natural language to perform complex tasks. In this work, we explore whether we can leverage this learned ability to find and explain patterns in data. Specifically, given a pre-trained LLM and data examples, we introduce interpretable autoprompting (iPrompt), an algorithm that generates a natural-language string explaining the data. iPrompt iteratively alternates between generating explanations with an LLM and reranking them based on their performance when used as a prompt. Experiments on a wide range of datasets, from synthetic mathematics to natural-language understanding, show that iPrompt can yield meaningful insights by accurately finding groundtruth dataset descriptions. Moreover, the prompts produced by iPrompt are simultaneously human-interpretable and highly effective for generalization: on real-world sentiment classification datasets, iPrompt produces prompts that match or even improve upon human-written prompts for GPT-3. Finally, experiments with an fMRI dataset show the potential for iPrompt to aid in scientific discovery. All code for using the methods and data here is made available on Github.  ( 2 min )
    Robust Multi-Agent Bandits Over Undirected Graphs. (arXiv:2203.00076v2 [cs.LG] UPDATED)
    We consider a multi-agent multi-armed bandit setting in which $n$ honest agents collaborate over a network to minimize regret but $m$ malicious agents can disrupt learning arbitrarily. Assuming the network is the complete graph, existing algorithms incur $O( (m + K/n) \log (T) / \Delta )$ regret in this setting, where $K$ is the number of arms and $\Delta$ is the arm gap. For $m \ll K$, this improves over the single-agent baseline regret of $O(K\log(T)/\Delta)$. In this work, we show the situation is murkier beyond the case of a complete graph. In particular, we prove that if the state-of-the-art algorithm is used on the undirected line graph, honest agents can suffer (nearly) linear regret until time is doubly exponential in $K$ and $n$. In light of this negative result, we propose a new algorithm for which the $i$-th agent has regret $O( ( d_{\text{mal}}(i) + K/n) \log(T)/\Delta)$ on any connected and undirected graph, where $d_{\text{mal}}(i)$ is the number of $i$'s neighbors who are malicious. Thus, we generalize existing regret bounds beyond the complete graph (where $d_{\text{mal}}(i) = m$), and show the effect of malicious agents is entirely local (in the sense that only the $d_{\text{mal}}(i)$ malicious agents directly connected to $i$ affect its long-term regret).  ( 2 min )
    Statistical Inference for the Dynamic Time Warping Distance, with Application to Abnormal Time-Series Detection. (arXiv:2202.06593v2 [stat.ML] UPDATED)
    We study statistical inference on the similarity/distance between two time-series under uncertain environment by considering a statistical hypothesis test on the distance obtained from Dynamic Time Warping (DTW) algorithm. The sampling distribution of the DTW distance is too difficult to derive because it is obtained based on the solution of the DTW algorithm, which is complicated. To circumvent this difficulty, we propose to employ the conditional selective inference framework, which enables us to derive a valid inference method on the DTW distance. To our knowledge, this is the first method that can provide a valid p-value to quantify the statistical significance of the DTW distance, which is helpful for high-stake decision making such as abnormal time-series detection problems. We evaluate the performance of the proposed inference method on both synthetic and real-world datasets.  ( 2 min )
    Constrained Parameter Inference as a Principle for Learning. (arXiv:2203.13203v5 [cs.NE] UPDATED)
    Learning in neural networks is often framed as a problem in which targeted error signals are directly propagated to parameters and used to produce updates that induce more optimal network behaviour. Backpropagation of error (BP) is an example of such an approach and has proven to be a highly successful application of stochastic gradient descent to deep neural networks. We propose constrained parameter inference (COPI) as a new principle for learning. The COPI approach assumes that learning can be set up in a manner where parameters infer their own values based upon observations of their local neuron activities. We find that this estimation of network parameters is possible under the constraints of decorrelated neural inputs and top-down perturbations of neural states for credit assignment. We show that the decorrelation required for COPI allows learning at extremely high learning rates, competitive with that of adaptive optimizers, as used by BP. We further demonstrate that COPI affords a new approach to feature analysis and network compression. Finally, we argue that COPI may shed new light on learning in biological networks given the evidence for decorrelation in the brain.
    Demystifying Reinforcement Learning in Time-Varying Systems. (arXiv:2201.05560v2 [cs.LG] UPDATED)
    Recent research has turned to Reinforcement Learning (RL) to solve challenging decision problems, as an alternative to hand-tuned heuristics. RL can learn good policies without the need for modeling the environment's dynamics. Despite this promise, RL remains an impractical solution for many real-world systems problems. A particularly challenging case occurs when the environment changes over time, i.e. it exhibits non-stationarity. In this work, we characterize the challenges introduced by non-stationarity, shed light on the range of approaches to them and develop a robust framework for addressing them to train RL agents in live systems. Such agents must explore and learn new environments, without hurting the system's performance, and remember them over time. To this end, our framework (i) identifies different environments encountered by the live system, (ii) triggers exploration when necessary, (iii) takes precautions to retain knowledge from prior environments, and (iv) employs safeguards to protect the system's performance when the RL agent makes mistakes. We apply our framework to two systems problems, straggler mitigation and adaptive video streaming, and evaluate it against a variety of alternative approaches using real-world and synthetic data. We show that all components of the framework are necessary to cope with non-stationarity and provide guidance on alternative design choices for each component.
    AdaBoost is not an Optimal Weak to Strong Learner. (arXiv:2301.11571v1 [cs.LG])
    AdaBoost is a classic boosting algorithm for combining multiple inaccurate classifiers produced by a weak learner, to produce a strong learner with arbitrarily high accuracy when given enough training data. Determining the optimal number of samples necessary to obtain a given accuracy of the strong learner, is a basic learning theoretic question. Larsen and Ritzert (NeurIPS'22) recently presented the first provably optimal weak-to-strong learner. However, their algorithm is somewhat complicated and it remains an intriguing question whether the prototypical boosting algorithm AdaBoost also makes optimal use of training samples. In this work, we answer this question in the negative. Concretely, we show that the sample complexity of AdaBoost, and other classic variations thereof, are sub-optimal by at least one logarithmic factor in the desired accuracy of the strong learner.  ( 2 min )
    A critical look at deep neural network for dynamic system modeling. (arXiv:2301.11604v1 [cs.LG])
    Neural network models become increasingly popular as dynamic modeling tools in the control community. They have many appealing features including nonlinear structures, being able to approximate any functions. While most researchers hold optimistic attitudes towards such models, this paper questions the capability of (deep) neural networks for the modeling of dynamic systems using input-output data. For the identification of linear time-invariant (LTI) dynamic systems, two representative neural network models, Long Short-Term Memory (LSTM) and Cascade Foward Neural Network (CFNN) are compared to the standard Prediction Error Method (PEM) of system identification. In the comparison, four essential aspects of system identification are considered, then several possible defects and neglected issues of neural network based modeling are pointed out. Detailed simulation studies are performed to verify these defects: for the LTI system, both LSTM and CFNN fail to deliver consistent models even in noise-free cases; and they give worse results than PEM in noisy cases.  ( 2 min )
    Neural Episodic Control with State Abstraction. (arXiv:2301.11490v1 [cs.LG])
    Existing Deep Reinforcement Learning (DRL) algorithms suffer from sample inefficiency. Generally, episodic control-based approaches are solutions that leverage highly-rewarded past experiences to improve sample efficiency of DRL algorithms. However, previous episodic control-based approaches fail to utilize the latent information from the historical behaviors (e.g., state transitions, topological similarities, etc.) and lack scalability during DRL training. This work introduces Neural Episodic Control with State Abstraction (NECSA), a simple but effective state abstraction-based episodic control containing a more comprehensive episodic memory, a novel state evaluation, and a multi-step state analysis. We evaluate our approach to the MuJoCo and Atari tasks in OpenAI gym domains. The experimental results indicate that NECSA achieves higher sample efficiency than the state-of-the-art episodic control-based approaches. Our data and code are available at the project website\footnote{\url{https://sites.google.com/view/drl-necsa}}.  ( 2 min )
    Cellular Network Capacity and Coverage Enhancement with MDT Data and Deep Reinforcement Learning. (arXiv:2202.10968v1 [cs.NI] CROSS LISTED)
    Recent years witnessed a remarkable increase in the availability of data and computing resources in communication networks. This contributed to the rise of data-driven over model-driven algorithms for network automation. This paper investigates a Minimization of Drive Tests (MDT)-driven Deep Reinforcement Learning (DRL) algorithm to optimize coverage and capacity by tuning antennas tilts on a cluster of cells from TIM's cellular network. We jointly utilize MDT data, electromagnetic simulations, and network Key Performance indicators (KPIs) to define a simulated network environment for the training of a Deep Q-Network (DQN) agent. Some tweaks have been introduced to the classical DQN formulation to improve the agent's sample efficiency, stability, and performance. In particular, a custom exploration policy is designed to introduce soft constraints at training time. Results show that the proposed algorithm outperforms baseline approaches like DQN and best-fist search in terms of long-term reward and sample efficiency. Our results indicate that MDT-driven approaches constitute a valuable tool for autonomous coverage and capacity optimization of mobile radio networks.
    Fine-tuning Neural-Operator architectures for training and generalization. (arXiv:2301.11509v1 [cs.LG])
    In this work, we present an analysis of the generalization of Neural Operators (NOs) and derived architectures. We proposed a family of networks, which we name (${\textit{s}}{\text{NO}}+\varepsilon$), where we modify the layout of NOs towards an architecture resembling a Transformer; mainly, we substitute the Attention module with the Integral Operator part of NOs. The resulting network preserves universality, has a better generalization to unseen data, and similar number of parameters as NOs. On the one hand, we study numerically the generalization by gradually transforming NOs into ${\textit{s}}{\text{NO}}+\varepsilon$ and verifying a reduction of the test loss considering a time-harmonic wave dataset with different frequencies. We perform the following changes in NOs: (a) we split the Integral Operator (non-local) and the (local) feed-forward network (MLP) into different layers, generating a {\it sequential} structure which we call sequential Neural Operator (${\textit{s}}{\text{NO}}$), (b) we add the skip connection, and layer normalization in ${\textit{s}}{\text{NO}}$, and (c) we incorporate dropout and stochastic depth that allows us to generate deep networks. In each case, we observe a decrease in the test loss in a wide variety of initialization, indicating that our changes outperform the NO. On the other hand, building on infinite-dimensional Statistics, and in particular the Dudley Theorem, we provide bounds of the Rademacher complexity of NOs and ${\textit{s}}{\text{NO}}$, and we find the following relationship: the upper bound of the Rademacher complexity of the ${\textit{s}}{\text{NO}}$ is a lower-bound of the NOs, thereby, the generalization error bound of ${\textit{s}}{\text{NO}}$ is smaller than NO, which further strengthens our numerical results.
    Distributionally Robust Offline Reinforcement Learning with Linear Function Approximation. (arXiv:2209.06620v3 [cs.LG] UPDATED)
    Among the reasons hindering reinforcement learning (RL) applications to real-world problems, two factors are critical: limited data and the mismatch between the testing environment (real environment in which the policy is deployed) and the training environment (e.g., a simulator). This paper attempts to address these issues simultaneously with distributionally robust offline RL, where we learn a distributionally robust policy using historical data obtained from the source environment by optimizing against a worst-case perturbation thereof. In particular, we move beyond tabular settings and consider linear function approximation. More specifically, we consider two settings, one where the dataset is well-explored and the other where the dataset has sufficient coverage of the optimal policy. We propose two algorithms~-- one for each of the two settings~-- that achieve error bounds $\tilde{O}(d^{1/2}/N^{1/2})$ and $\tilde{O}(d^{3/2}/N^{1/2})$ respectively, where $d$ is the dimension in the linear function approximation and $N$ is the number of trajectories in the dataset. To the best of our knowledge, they provide the first non-asymptotic results of the sample complexity in this setting. Diverse experiments are conducted to demonstrate our theoretical findings, showing the superiority of our algorithm against the non-robust one.
    Algorithmic Stability of Heavy-Tailed SGD with General Loss Functions. (arXiv:2301.11885v1 [stat.ML])
    Heavy-tail phenomena in stochastic gradient descent (SGD) have been reported in several empirical studies. Experimental evidence in previous works suggests a strong interplay between the heaviness of the tails and generalization behavior of SGD. To address this empirical phenomena theoretically, several works have made strong topological and statistical assumptions to link the generalization error to heavy tails. Very recently, new generalization bounds have been proven, indicating a non-monotonic relationship between the generalization error and heavy tails, which is more pertinent to the reported empirical observations. While these bounds do not require additional topological assumptions given that SGD can be modeled using a heavy-tailed stochastic differential equation (SDE), they can only apply to simple quadratic problems. In this paper, we build on this line of research and develop generalization bounds for a more general class of objective functions, which includes non-convex functions as well. Our approach is based on developing Wasserstein stability bounds for heavy-tailed SDEs and their discretizations, which we then convert to generalization bounds. Our results do not require any nontrivial assumptions; yet, they shed more light to the empirical observations, thanks to the generality of the loss functions.
    ExplainableFold: Understanding AlphaFold Prediction with Explainable AI. (arXiv:2301.11765v1 [cs.AI])
    This paper presents ExplainableFold, an explainable AI framework for protein structure prediction. Despite the success of AI-based methods such as AlphaFold in this field, the underlying reasons for their predictions remain unclear due to the black-box nature of deep learning models. To address this, we propose a counterfactual learning framework inspired by biological principles to generate counterfactual explanations for protein structure prediction, enabling a dry-lab experimentation approach. Our experimental results demonstrate the ability of ExplainableFold to generate high-quality explanations for AlphaFold's predictions, providing near-experimental understanding of the effects of amino acids on 3D protein structure. This framework has the potential to facilitate a deeper understanding of protein structures.
    Gene Teams are on the Field: Evaluation of Variants in Gene-Networks Using High Dimensional Modelling. (arXiv:2301.11763v1 [cs.LG])
    In medical genetics, each genetic variant is evaluated as an independent entity regarding its clinical importance. However, in most complex diseases, variant combinations in specific gene networks, rather than the presence of a particular single variant, predominates. In the case of complex diseases, disease status can be evaluated by considering the success level of a team of specific variants. We propose a high dimensional modelling based method to analyse all the variants in a gene network together. To evaluate our method, we selected two gene networks, mTOR and TGF-Beta. For each pathway, we generated 400 control and 400 patient group samples. mTOR and TGF-? pathways contain 31 and 93 genes of varying sizes, respectively. We produced Chaos Game Representation images for each gene sequence to obtain 2-D binary patterns. These patterns were arranged in succession, and a 3-D tensor structure was achieved for each gene network. Features for each data sample were acquired by exploiting Enhanced Multivariance Products Representation to 3-D data. Features were split as training and testing vectors. Training vectors were employed to train a Support Vector Machines classification model. We achieved more than 96% and 99% classification accuracies for mTOR and TGF-Beta networks, respectively, using a limited amount of training samples.
    Invariant Meta Learning for Out-of-Distribution Generalization. (arXiv:2301.11779v1 [cs.LG])
    Modern deep learning techniques have illustrated their excellent capabilities in many areas, but relies on large training data. Optimization-based meta-learning train a model on a variety tasks, such that it can solve new learning tasks using only a small number of training samples.However, these methods assumes that training and test dataare identically and independently distributed. To overcome such limitation, in this paper, we propose invariant meta learning for out-of-distribution tasks. Specifically, invariant meta learning find invariant optimal meta-initialization,and fast adapt to out-of-distribution tasks with regularization penalty. Extensive experiments demonstrate the effectiveness of our proposed invariant meta learning on out-of-distribution few-shot tasks.
    Detecting Pump&Dump Stock Market Manipulation from Online Forums. (arXiv:2301.11403v1 [cs.SI])
    The intersection of social media, low-cost trading platforms, and naive investors has created an ideal situation for information-based market manipulations, especially pump&dumps. Manipulators accumulate small-cap stocks, disseminate false information on social media to inflate their price, and sell at the peak. We collect a dataset of stocks whose price and volume profiles have the characteristic shape of a pump&dump, and social media posts for those same stocks that match the timing of the initial price rises. From these we build predictive models for pump&dump events based on the language used in the social media posts. There are multiple difficulties: not every post will cause the intended market reaction, some pump&dump events may be triggered by posts in other forums, and there may be accidental confluences of post timing and market movements. Nevertheless, our best model achieves a prediction accuracy of 85% and an F1-score of 62%. Such a tool can provide early warning to investors and regulators that a pump&dump may be underway.
    Deep Residual Compensation Convolutional Network without Backpropagation. (arXiv:2301.11663v1 [cs.CV])
    PCANet and its variants provided good accuracy results for classification tasks. However, despite the importance of network depth in achieving good classification accuracy, these networks were trained with a maximum of nine layers. In this paper, we introduce a residual compensation convolutional network, which is the first PCANet-like network trained with hundreds of layers while improving classification accuracy. The design of the proposed network consists of several convolutional layers, each followed by post-processing steps and a classifier. To correct the classification errors and significantly increase the network's depth, we train each layer with new labels derived from the residual information of all its preceding layers. This learning mechanism is accomplished by traversing the network's layers in a single forward pass without backpropagation or gradient computations. Our experiments on four distinct classification benchmarks (MNIST, CIFAR-10, CIFAR-100, and TinyImageNet) show that our deep network outperforms all existing PCANet-like networks and is competitive with several traditional gradient-based models.  ( 2 min )
    Convolutional neural networks for valid and efficient causal inference. (arXiv:2301.11732v1 [stat.ML])
    Convolutional neural networks (CNN) have been successful in machine learning applications. Their success relies on their ability to consider space invariant local features. We consider the use of CNN to fit nuisance models in semiparametric estimation of the average causal effect of a treatment. In this setting, nuisance models are functions of pre-treatment covariates that need to be controlled for. In an application where we want to estimate the effect of early retirement on a health outcome, we propose to use CNN to control for time-structured covariates. Thus, CNN is used when fitting nuisance models explaining the treatment and the outcome. These fits are then combined into an augmented inverse probability weighting estimator yielding efficient and uniformly valid inference. Theoretically, we contribute by providing rates of convergence for CNN equipped with the rectified linear unit activation function and compare it to an existing result for feedforward neural networks. We also show when those rates guarantee uniformly valid inference. A Monte Carlo study is provided where the performance of the proposed estimator is evaluated and compared with other strategies. Finally, we give results on a study of the effect of early retirement on hospitalization using data covering the whole Swedish population.  ( 2 min )
    Learning to Unlearn: Instance-wise Unlearning for Pre-trained Classifiers. (arXiv:2301.11578v1 [cs.LG])
    Since the recent advent of regulations for data protection (e.g., the General Data Protection Regulation), there has been increasing demand in deleting information learned from sensitive data in pre-trained models without retraining from scratch. The inherent vulnerability of neural networks towards adversarial attacks and unfairness also calls for a robust method to remove or correct information in an instance-wise fashion, while retaining the predictive performance across remaining data. To this end, we define instance-wise unlearning, of which the goal is to delete information on a set of instances from a pre-trained model, by either misclassifying each instance away from its original prediction or relabeling the instance to a different label. We also propose two methods that reduce forgetting on the remaining data: 1) utilizing adversarial examples to overcome forgetting at the representation-level and 2) leveraging weight importance metrics to pinpoint network parameters guilty of propagating unwanted information. Both methods only require the pre-trained model and data instances to forget, allowing painless application to real-life settings where the entire training set is unavailable. Through extensive experimentation on various image classification benchmarks, we show that our approach effectively preserves knowledge of remaining data while unlearning given instances in both single-task and continual unlearning scenarios.  ( 2 min )
    CAPoW: Context-Aware AI-Assisted Proof of Work based DDoS Defense. (arXiv:2301.11767v1 [cs.CR])
    Critical servers can be secured against distributed denial of service (DDoS) attacks using proof of work (PoW) systems assisted by an Artificial Intelligence (AI) that learns contextual network request patterns. In this work, we introduce CAPoW, a context-aware anti-DDoS framework that injects latency adaptively during communication by utilizing context-aware PoW puzzles. In CAPoW, a security professional can define relevant request context attributes which can be learned by the AI system. These contextual attributes can include information about the user request, such as IP address, time, flow-level information, etc., and are utilized to generate a contextual score for incoming requests that influence the hardness of a PoW puzzle. These puzzles need to be solved by a user before the server begins to process their request. Solving puzzles slow down the volume of incoming adversarial requests. Additionally, the framework compels the adversary to incur a cost per request, hence making it expensive for an adversary to prolong a DDoS attack. We include the theoretical foundations of the CAPoW framework along with a description of its implementation and evaluation.
    A denoting diffusion model for fluid flow prediction. (arXiv:2301.11661v1 [cs.LG])
    We propose a novel denoising diffusion generative model for predicting nonlinear fluid fields named FluidDiff. By performing a diffusion process, the model is able to learn a complex representation of the high-dimensional dynamic system, and then Langevin sampling is used to generate predictions for the flow state under specified initial conditions. The model is trained with finite, discrete fluid simulation data. We demonstrate that our model has the capacity to model the distribution of simulated training data and that it gives accurate predictions on the test data. Without encoded prior knowledge of the underlying physical system, it shares competitive performance with other deep learning models for fluid prediction, which is promising for investigation on new computational fluid dynamics methods.  ( 2 min )
    TransNet: Transferable Neural Networks for Partial Differential Equations. (arXiv:2301.11701v1 [math.NA])
    Transfer learning for partial differential equations (PDEs) is to develop a pre-trained neural network that can be used to solve a wide class of PDEs. Existing transfer learning approaches require much information of the target PDEs such as its formulation and/or data of its solution for pre-training. In this work, we propose to construct transferable neural feature spaces from purely function approximation perspectives without using PDE information. The construction of the feature space involves re-parameterization of the hidden neurons and uses auxiliary functions to tune the resulting feature space. Theoretical analysis shows the high quality of the produced feature space, i.e., uniformly distributed neurons. Extensive numerical experiments verify the outstanding performance of our method, including significantly improved transferability, e.g., using the same feature space for various PDEs with different domains and boundary conditions, and the superior accuracy, e.g., several orders of magnitude smaller mean squared error than the state of the art methods.  ( 2 min )
    Mixed Attention Network for Hyperspectral Image Denoising. (arXiv:2301.11525v1 [cs.CV])
    Hyperspectral image denoising is unique for the highly similar and correlated spectral information that should be properly considered. However, existing methods show limitations in exploring the spectral correlations across different bands and feature interactions within each band. Besides, the low- and high-level features usually exhibit different importance for different spatial-spectral regions, which is not fully explored for current algorithms as well. In this paper, we present a Mixed Attention Network (MAN) that simultaneously considers the inter- and intra-spectral correlations as well as the interactions between low- and high-level spatial-spectral meaningful features. Specifically, we introduce a multi-head recurrent spectral attention that efficiently integrates the inter-spectral features across all the spectral bands. These features are further enhanced with a progressive spectral channel attention by exploring the intra-spectral relationships. Moreover, we propose an attentive skip-connection that adaptively controls the proportion of the low- and high-level spatial-spectral features from the encoder and decoder to better enhance the aggregated features. Extensive experiments show that our MAN outperforms existing state-of-the-art methods on simulated and real noise settings while maintaining a low cost of parameters and running time.
    FedHP: Heterogeneous Federated Learning with Privacy-preserving. (arXiv:2301.11705v1 [cs.LG])
    Federated Learning is a distributed machine learning environment, which ensures that clients complete collaborative training without sharing private data, only by exchanging parameters. However, the data does not satisfy the same distribution and the computing resources of clients are different, which brings challenges to the related research. To better solve the above heterogeneous problems, we designed a novel federated learning method. The local model consists of the pre-trained model as the backbone and fully connected layers as the head. The backbone can extract features for the head, and the embedding vector of classes is shared between clients to optimize the head so that the local model can perform better. By sharing the embedding vector of classes, instead of parameters based on gradient space, clients can better adapt to private data, and it is more efficient in the communication between the server and clients. To better protect privacy, we proposed a privacy-preserving hybrid method to add noise to the embedding vector of classes, which has less impact on the local model performance under the premise of satisfying differential privacy. We conduct a comprehensive evaluation with other federated learning methods on the self-built vehicle dataset under non-independent identically distributed(Non-IID)  ( 2 min )
    Variance, Self-Consistency, and Arbitrariness in Fair Classification. (arXiv:2301.11562v1 [cs.LG])
    In fair classification, it is common to train a model, and to compare and correct subgroup-specific error rates for disparities. However, even if a model's classification decisions satisfy a fairness metric, it is not necessarily the case that these decisions are equally confident. This becomes clear if we measure variance: We can fix everything in the learning process except the subset of training data, train multiple models, measure (dis)agreement in predictions for each test example, and interpret disagreement to mean that the learning process is more unstable with respect to its classification decision. Empirically, some decisions can in fact be so unstable that they are effectively arbitrary. To reduce this arbitrariness, we formalize a notion of self-consistency of a learning process, develop an ensembling algorithm that provably increases self-consistency, and empirically demonstrate its utility to often improve both fairness and accuracy. Further, our evaluation reveals a startling observation: Applying ensembling to common fair classification benchmarks can significantly reduce subgroup error rate disparities, without employing common pre-, in-, or post-processing fairness interventions. Taken together, our results indicate that variance, particularly on small datasets, can muddle the reliability of conclusions about fairness. One solution is to develop larger benchmark tasks. To this end, we release a toolkit that makes the Home Mortgage Disclosure Act datasets easily usable for future research.  ( 2 min )
    Incorporating Knowledge into Document Summarization: an Application of Prefix-Tuning on GPT-2. (arXiv:2301.11719v1 [cs.CL])
    Despite the great development of document summarization techniques nowadays, factual inconsistencies between the generated summaries and the original text still occur from time to time. This paper proposes a prefix-tuning-based approach that uses a set of trainable continuous prefix prompt together with discrete prompts to aid model generation, which makes a significant impact on both CNN/Daily Mail and XSum summaries generated using GPT-2. The improvements on fact preservation in the generated summaries indicates the effectiveness of adopting this prefix-tuning-based method in knowledge-enhanced document summarization, and also shows a great potential on other natural language processing tasks.  ( 2 min )
    Behaviour Discriminator: A Simple Data Filtering Method to Improve Offline Policy Learning. (arXiv:2301.11734v1 [cs.LG])
    This paper studies the problem of learning a control policy without the need for interactions with the environment; instead, learning purely from an existing dataset. Prior work has demonstrated that offline learning algorithms (e.g., behavioural cloning and offline reinforcement learning) are more likely to discover a satisfactory policy when trained using high-quality expert data. However, many real-world/practical datasets can contain significant proportions of examples generated using low-skilled agents. Therefore, we propose a behaviour discriminator (BD) concept, a novel and simple data filtering approach based on semi-supervised learning, which can accurately discern expert data from a mixed-quality dataset. Our BD approach was used to pre-process the mixed-skill-level datasets from the Real Robot Challenge (RRC) III, an open competition requiring participants to solve several dexterous robotic manipulation tasks using offline learning methods; the new BD method allowed a standard behavioural cloning algorithm to outperform other more sophisticated offline learning algorithms. Moreover, we demonstrate that the new BD pre-processing method can be applied to a number of D4RL benchmark problems, improving the performance of multiple state-of-the-art offline reinforcement learning algorithms.  ( 2 min )
    Large-Scale Traffic Data Imputation with Spatiotemporal Semantic Understanding. (arXiv:2301.11691v1 [cs.LG])
    Large-scale data missing is a challenging problem in Intelligent Transportation Systems (ITS). Many studies have been carried out to impute large-scale traffic data by considering their spatiotemporal correlations at a network level. In existing traffic data imputations, however, rich semantic information of a road network has been largely ignored when capturing network-wide spatiotemporal correlations. This study proposes a Graph Transformer for Traffic Data Imputation (GT-TDI) model to impute large-scale traffic data with spatiotemporal semantic understanding of a road network. Specifically, the proposed model introduces semantic descriptions consisting of network-wide spatial and temporal information of traffic data to help the GT-TDI model capture spatiotemporal correlations at a network level. The proposed model takes incomplete data, the social connectivity of sensors, and semantic descriptions as input to perform imputation tasks with the help of Graph Neural Networks (GNN) and Transformer. On the PeMS freeway dataset, extensive experiments are conducted to compare the proposed GT-TDI model with conventional methods, tensor factorization methods, and deep learning-based methods. The results show that the proposed GT-TDI outperforms existing methods in complex missing patterns and diverse missing rates. The code of the GT-TDI model will be available at https://github.com/KP-Zhang/GT-TDI.  ( 2 min )
    A Green(er) World for A.I. (arXiv:2301.11581v1 [cs.AI])
    As research and practice in artificial intelligence (A.I.) grow in leaps and bounds, the resources necessary to sustain and support their operations also grow at an increasing pace. While innovations and applications from A.I. have brought significant advances, from applications to vision and natural language to improvements to fields like medical imaging and materials engineering, their costs should not be neglected. As we embrace a world with ever-increasing amounts of data as well as research and development of A.I. applications, we are sure to face an ever-mounting energy footprint to sustain these computational budgets, data storage needs, and more. But, is this sustainable and, more importantly, what kind of setting is best positioned to nurture such sustainable A.I. in both research and practice? In this paper, we outline our outlook for Green A.I. -- a more sustainable, energy-efficient and energy-aware ecosystem for developing A.I. across the research, computing, and practitioner communities alike -- and the steps required to arrive there. We present a bird's eye view of various areas for potential changes and improvements from the ground floor of AI's operational and hardware optimizations for datacenters/HPCs to the current incentive structures in the world of A.I. research and practice, and more. We hope these points will spur further discussion, and action, on some of these issues and their potential solutions.  ( 2 min )
    Solving Constrained Reinforcement Learning through Augmented State and Reward Penalties. (arXiv:2301.11592v1 [cs.LG])
    Constrained Reinforcement Learning has been employed to enforce safety constraints on policy through the use of expected cost constraints. The key challenge is in handling expected cost accumulated using the policy and not just in a single step. Existing methods have developed innovative ways of converting this cost constraint over entire policy to constraints over local decisions (at each time step). While such approaches have provided good solutions with regards to objective, they can either be overly aggressive or conservative with respect to costs. This is owing to use of estimates for "future" or "backward" costs in local cost constraints. To that end, we provide an equivalent unconstrained formulation to constrained RL that has an augmented state space and reward penalties. This intuitive formulation is general and has interesting theoretical properties. More importantly, this provides a new paradigm for solving constrained RL problems effectively. As we show in our experimental results, we are able to outperform leading approaches on multiple benchmark problems from literature.
    Learning to Generate All Feasible Actions. (arXiv:2301.11461v1 [cs.LG])
    Several machine learning (ML) applications are characterized by searching for an optimal solution to a complex task. The search space for this optimal solution is often very large, so large in fact that this optimal solution is often not computable. Part of the problem is that many candidate solutions found via ML are actually infeasible and have to be discarded. Restricting the search space to only the feasible solution candidates simplifies finding an optimal solution for the tasks. Further, the set of feasible solutions could be re-used in multiple problems characterized by different tasks. In particular, we observe that complex tasks can be decomposed into subtasks and corresponding skills. We propose to learn a reusable and transferable skill by training an actor to generate all feasible actions. The trained actor can then propose feasible actions, among which an optimal one can be chosen according to a specific task. The actor is trained by interpreting the feasibility of each action as a target distribution. The training procedure minimizes a divergence of the actor's output distribution to this target. We derive the general optimization target for arbitrary f-divergences using a combination of kernel density estimates, resampling, and importance sampling. We further utilize an auxiliary critic to reduce the interactions with the environment. A preliminary comparison to related strategies shows that our approach learns to visit all the modes in the feasible action space, demonstrating the framework's potential for learning skills that can be used in various downstream tasks.
    Down the Rabbit Hole: Detecting Online Extremism, Radicalisation, and Politicised Hate Speech. (arXiv:2301.11579v1 [cs.SI])
    Social media is a modern person's digital voice to project and engage with new ideas and mobilise communities $\unicode{x2013}$ a power shared with extremists. Given the societal risks of unvetted content-moderating algorithms for Extremism, Radicalisation, and Hate speech (ERH) detection, responsible software engineering must understand the who, what, when, where, and why such models are necessary to protect user safety and free expression. Hence, we propose and examine the unique research field of ERH context mining to unify disjoint studies. Specifically, we evaluate the start-to-finish design process from socio-technical definition-building and dataset collection strategies to technical algorithm design and performance. Our 2015-2021 51-study Systematic Literature Review (SLR) provides the first cross-examination of textual, network, and visual approaches to detecting extremist affiliation, hateful content, and radicalisation towards groups and movements. We identify consensus-driven ERH definitions and propose solutions to existing ideological and geographic biases, particularly due to the lack of research in Oceania/Australasia. Our hybridised investigation on Natural Language Processing, Community Detection, and visual-text models demonstrates the dominating performance of textual transformer-based algorithms. We conclude with vital recommendations for ERH context mining researchers and propose an uptake roadmap with guidelines for researchers, industries, and governments to enable a safer cyberspace.  ( 2 min )
    Adversarial Learning for Implicit Semantic-Aware Communications. (arXiv:2301.11589v1 [cs.LG])
    Semantic communication is a novel communication paradigm that focuses on recognizing and delivering the desired meaning of messages to the destination users. Most existing works in this area focus on delivering explicit semantics, labels or signal features that can be directly identified from the source signals. In this paper, we consider the implicit semantic communication problem in which hidden relations and closely related semantic terms that cannot be recognized from the source signals need to also be delivered to the destination user. We develop a novel adversarial learning-based implicit semantic-aware communication (iSAC) architecture in which the source user, instead of maximizing the total amount of information transmitted to the channel, aims to help the recipient learn an inference rule that can automatically generate implicit semantics based on limited clue information. We prove that by applying iSAC, the destination user can always learn an inference rule that matches the true inference rule of the source messages. Experimental results show that the proposed iSAC can offer up to a 19.69 dB improvement over existing non-inferential communication solutions, in terms of symbol error rate at the destination user.  ( 2 min )
    Can We Use Probing to Better Understand Fine-tuning and Knowledge Distillation of the BERT NLU?. (arXiv:2301.11688v1 [cs.CL])
    In this article, we use probing to investigate phenomena that occur during fine-tuning and knowledge distillation of a BERT-based natural language understanding (NLU) model. Our ultimate purpose was to use probing to better understand practical production problems and consequently to build better NLU models. We designed experiments to see how fine-tuning changes the linguistic capabilities of BERT, what the optimal size of the fine-tuning dataset is, and what amount of information is contained in a distilled NLU based on a tiny Transformer. The results of the experiments show that the probing paradigm in its current form is not well suited to answer such questions. Structural, Edge and Conditional probes do not take into account how easy it is to decode probed information. Consequently, we conclude that quantification of information decodability is critical for many practical applications of the probing paradigm.  ( 2 min )
    Online Learning in Stackelberg Games with an Omniscient Follower. (arXiv:2301.11518v1 [cs.LG])
    We study the problem of online learning in a two-player decentralized cooperative Stackelberg game. In each round, the leader first takes an action, followed by the follower who takes their action after observing the leader's move. The goal of the leader is to learn to minimize the cumulative regret based on the history of interactions. Differing from the traditional formulation of repeated Stackelberg games, we assume the follower is omniscient, with full knowledge of the true reward, and that they always best-respond to the leader's actions. We analyze the sample complexity of regret minimization in this repeated Stackelberg game. We show that depending on the reward structure, the existence of the omniscient follower may change the sample complexity drastically, from constant to exponential, even for linear cooperative Stackelberg games. This poses unique challenges for the learning process of the leader and the subsequent regret analysis.
    Neural Wasserstein Gradient Flows for Maximum Mean Discrepancies with Riesz Kernels. (arXiv:2301.11624v1 [cs.LG])
    Wasserstein gradient flows of maximum mean discrepancy (MMD) functionals with non-smooth Riesz kernels show a rich structure as singular measures can become absolutely continuous ones and conversely. In this paper we contribute to the understanding of such flows. We propose to approximate the backward scheme of Jordan, Kinderlehrer and Otto for computing such Wasserstein gradient flows as well as a forward scheme for so-called Wasserstein steepest descent flows by neural networks (NNs). Since we cannot restrict ourselves to absolutely continuous measures, we have to deal with transport plans and velocity plans instead of usual transport maps and velocity fields. Indeed, we approximate the disintegration of both plans by generative NNs which are learned with respect to appropriate loss functions. In order to evaluate the quality of both neural schemes, we benchmark them on the interaction energy. Here we provide analytic formulas for Wasserstein schemes starting at a Dirac measure and show their convergence as the time step size tends to zero. Finally, we illustrate our neural MMD flows by numerical examples.
    SNeRL: Semantic-aware Neural Radiance Fields for Reinforcement Learning. (arXiv:2301.11520v1 [cs.LG])
    As previous representations for reinforcement learning cannot effectively incorporate a human-intuitive understanding of the 3D environment, they usually suffer from sub-optimal performances. In this paper, we present Semantic-aware Neural Radiance Fields for Reinforcement Learning (SNeRL), which jointly optimizes semantic-aware neural radiance fields (NeRF) with a convolutional encoder to learn 3D-aware neural implicit representation from multi-view images. We introduce 3D semantic and distilled feature fields in parallel to the RGB radiance fields in NeRF to learn semantic and object-centric representation for reinforcement learning. SNeRL outperforms not only previous pixel-based representations but also recent 3D-aware representations both in model-free and model-based reinforcement learning.  ( 2 min )
    PLay: Parametrically Conditioned Layout Generation using Latent Diffusion. (arXiv:2301.11529v1 [cs.LG])
    Layout design is an important task in various design fields, including user interfaces, document, and graphic design. As this task requires tedious manual effort by designers, prior works have attempted to automate this process using generative models, but commonly fell short of providing intuitive user controls and achieving design objectives. In this paper, we build a conditional latent diffusion model, PLay, that generates parametrically conditioned layouts in vector graphic space from user-specified guidelines, which are commonly used by designers for representing their design intents in current practices. Our method outperforms prior works across three datasets on metrics including FID and FD-VG, and in user test. Moreover, it brings a novel and interactive experience to professional layout design processes.  ( 2 min )
    Improving deep learning precipitation nowcasting by using prior knowledge. (arXiv:2301.11707v1 [cs.LG])
    Deep learning methods dominate short-term high-resolution precipitation nowcasting in terms of prediction error. However, their operational usability is limited by difficulties explaining dynamics behind the predictions, which are smoothed out and missing the high-frequency features due to optimizing for mean error loss functions. We experiment with hand-engineering of the advection-diffusion differential equation into a PhyCell to introduce more accurate physical prior to a PhyDNet model that disentangles physical and residual dynamics. Results indicate that while PhyCell can learn the intended dynamics, training of PhyDNet remains driven by loss optimization, resulting in a model with the same prediction capabilities.
    Class-Incremental Learning with Repetition. (arXiv:2301.11396v1 [cs.LG])
    Real-world data streams naturally include the repetition of previous concepts. From a Continual Learning (CL) perspective, repetition is a property of the environment and, unlike replay, cannot be controlled by the user. Nowadays, Class-Incremental scenarios represent the leading test-bed for assessing and comparing CL strategies. This family of scenarios is very easy to use, but it never allows revisiting previously seen classes, thus completely disregarding the role of repetition. We focus on the family of Class-Incremental with Repetition (CIR) scenarios, where repetition is embedded in the definition of the stream. We propose two stochastic scenario generators that produce a wide range of CIR scenarios starting from a single dataset and a few control parameters. We conduct the first comprehensive evaluation of repetition in CL by studying the behavior of existing CL strategies under different CIR scenarios. We then present a novel replay strategy that exploits repetition and counteracts the natural imbalance present in the stream. On both CIFAR100 and TinyImageNet, our strategy outperforms other replay approaches, which are not designed for environments with repetition.  ( 2 min )
    PhysGraph: Physics-Based Integration Using Graph Neural Networks. (arXiv:2301.11841v1 [cs.GR])
    Physics-based simulation of mesh based domains remains a challenging task. State-of-the-art techniques can produce realistic results but require expert knowledge. A major bottleneck in many approaches is the step of integrating a potential energy in order to compute velocities or displacements. Recently, learning based method for physics-based simulation have sparked interest with graph based approaches being a promising research direction. One of the challenges for these methods is to generate models that are mesh independent and generalize to different material properties. Moreover, the model should also be able to react to unforeseen external forces like ubiquitous collisions. Our contribution is based on a simple observation: evaluating forces is computationally relatively cheap for traditional simulation methods and can be computed in parallel in contrast to their integration. If we learn how a system reacts to forces in general, irrespective of their origin, we can learn an integrator that can predict state changes due to the total forces with high generalization power. We effectively factor out the physical model behind resulting forces by relying on an opaque force module. We demonstrate that this idea leads to a learnable module that can be trained on basic internal forces of small mesh patches and generalizes to different mesh typologies, resolutions, material parameters and unseen forces like collisions at inference time. Our proposed paradigm is general and can be used to model a variety of physical phenomena. We focus our exposition on the detail enhancement of coarse clothing geometry which has many applications including computer games, virtual reality and virtual try-on.  ( 2 min )
    BOMP-NAS: Bayesian Optimization Mixed Precision NAS. (arXiv:2301.11810v1 [cs.LG])
    Bayesian Optimization Mixed-Precision Neural Architecture Search (BOMP-NAS) is an approach to quantization-aware neural architecture search (QA-NAS) that leverages both Bayesian optimization (BO) and mixed-precision quantization (MP) to efficiently search for compact, high performance deep neural networks. The results show that integrating quantization-aware fine-tuning (QAFT) into the NAS loop is a necessary step to find networks that perform well under low-precision quantization: integrating it allows a model size reduction of nearly 50\% on the CIFAR-10 dataset. BOMP-NAS is able to find neural networks that achieve state of the art performance at much lower design costs. This study shows that BOMP-NAS can find these neural networks at a 6x shorter search time compared to the closest related work.  ( 2 min )
    Feature space exploration as an alternative for design space exploration beyond the parametric space. (arXiv:2301.11416v1 [cs.LG])
    This paper compares the parametric design space with a feature space generated by the extraction of design features using deep learning (DL) as an alternative way for design space exploration. In this comparison, the parametric design space is constructed by creating a synthetic dataset of 15.000 elements using a parametric algorithm and reducing its dimensions for visualization. The feature space - reduced-dimensionality vector space of embedded data features - is constructed by training a DL model on the same dataset. We analyze and compare the extracted design features by reducing their dimension and visualizing the results. We demonstrate that parametric design space is narrow in how it describes the design solutions because it is based on the combination of individual parameters. In comparison, we observed that the feature design space can intuitively represent design solutions according to complex parameter relationships. Based on our results, we discuss the potential of translating the features learned by DL models to provide a mechanism for intuitive design exploration space and visualization of possible design solutions.
    Learning the Dynamics of Sparsely Observed Interacting Systems. (arXiv:2301.11647v1 [stat.ML])
    We address the problem of learning the dynamics of an unknown non-parametric system linking a target and a feature time series. The feature time series is measured on a sparse and irregular grid, while we have access to only a few points of the target time series. Once learned, we can use these dynamics to predict values of the target from the previous values of the feature time series. We frame this task as learning the solution map of a controlled differential equation (CDE). By leveraging the rich theory of signatures, we are able to cast this non-linear problem as a high-dimensional linear regression. We provide an oracle bound on the prediction error which exhibits explicit dependencies on the individual-specific sampling schemes. Our theoretical results are illustrated by simulations which show that our method outperforms existing algorithms for recovering the full time series while being computationally cheap. We conclude by demonstrating its potential on real-world epidemiological data.  ( 2 min )
    DBGDGM: Dynamic Brain Graph Deep Generative Model. (arXiv:2301.11408v1 [cs.LG])
    Graphs are a natural representation of brain activity derived from functional magnetic imaging (fMRI) data. It is well known that clusters of anatomical brain regions, known as functional connectivity networks (FCNs), encode temporal relationships which can serve as useful biomarkers for understanding brain function and dysfunction. Previous works, however, ignore the temporal dynamics of the brain and focus on static graphs. In this paper, we propose a dynamic brain graph deep generative model (DBGDGM) which simultaneously clusters brain regions into temporally evolving communities and learns dynamic unsupervised node embeddings. Specifically, DBGDGM represents brain graph nodes as embeddings sampled from a distribution over communities that evolve over time. We parameterise this community distribution using neural networks that learn from subject and node embeddings as well as past community assignments. Experiments demonstrate DBGDGM outperforms baselines in graph generation, dynamic link prediction, and is comparable for graph classification. Finally, an analysis of the learnt community distributions reveals overlap with known FCNs reported in neuroscience literature.  ( 2 min )
    Nik Defense: An Artificial Intelligence Based Defense Mechanism against Selfish Mining in Bitcoin. (arXiv:2301.11463v1 [cs.CR])
    The Bitcoin cryptocurrency has received much attention recently. In the network of Bitcoin, transactions are recorded in a ledger. In this network, the process of recording transactions depends on some nodes called miners that execute a protocol known as mining protocol. One of the significant aspects of mining protocol is incentive compatibility. However, literature has shown that Bitcoin mining's protocol is not incentive-compatible. Some nodes with high computational power can obtain more revenue than their fair share by adopting a type of attack called the selfish mining attack. In this paper, we propose an artificial intelligence-based defense against selfish mining attacks by applying the theory of learning automata. The proposed defense mechanism ignores private blocks by assigning weight based on block discovery time and changes current Bitcoin's fork resolving policy by evaluating branches' height difference in a self-adaptive manner utilizing learning automata. To the best of our knowledge, the proposed protocol is the literature's first learning-based defense mechanism. Simulation results have shown the superiority of the proposed mechanism against tie-breaking mechanism, which is a well-known defense. The simulation results have shown that the suggested defense mechanism increases the profit threshold up to 40\% and decreases the revenue of selfish attackers.  ( 2 min )
    Limitless stability for Graph Convolutional Networks. (arXiv:2301.11443v1 [cs.LG])
    This work establishes rigorous, novel and widely applicable stability guarantees and transferability bounds for graph convolutional networks -- without reference to any underlying limit object or statistical distribution. Crucially, utilized graph-shift operators (GSOs) are not necessarily assumed to be normal, allowing for the treatment of networks on both directed- and for the first time also undirected graphs. Stability to node-level perturbations is related to an 'adequate (spectral) covering' property of the filters in each layer. Stability to edge-level perturbations is related to Lipschitz constants and newly introduced semi-norms of filters. Results on stability to topological perturbations are obtained through recently developed mathematical-physics based tools. As an important and novel example, it is showcased that graph convolutional networks are stable under graph-coarse-graining procedures (replacing strongly-connected sub-graphs by single nodes) precisely if the GSO is the graph Laplacian and filters are regular at infinity. These new theoretical results are supported by corresponding numerical investigations.  ( 2 min )
    Learning Informative Representation for Fairness-aware Multivariate Time-series Forecasting: A Group-based Perspective. (arXiv:2301.11535v1 [cs.LG])
    Multivariate time series (MTS) forecasting has penetrated and benefited our daily life. However, the unfair forecasting of MTSs not only degrades their practical benefit but even brings about serious potential risk. Such unfair MTS forecasting may be attributed to variable disparity leading to advantaged and disadvantaged variables. This issue has rarely been studied in the existing MTS forecasting models. To address this significant gap, we formulate the MTS fairness modeling problem as learning informative representations attending to both advantaged and disadvantaged variables. Accordingly, we propose a novel framework, named FairFor, for fairness-aware MTS forecasting. FairFor is based on adversarial learning to generate both group-irrelevant and -relevant representations for the downstream forecasting. FairFor first adopts the recurrent graph convolution to capture spatio-temporal variable correlations and to group variables by leveraging a spectral relaxation of the K-means objective. Then, it utilizes a novel filtering & fusion module to filter the group-relevant information and generate group-irrelevant representations by orthogonality regularization. The group-irrelevant and -relevant representations form highly informative representations, facilitating to share the knowledge from advantaged variables to disadvantaged variables and guarantee fairness. Extensive experiments on four public datasets demonstrate the FairFor effectiveness for fair forecasting and significant performance improvement.  ( 2 min )
    Policy Optimization with Robustness Certificates. (arXiv:2301.11374v1 [cs.LG])
    We present a policy optimization framework in which the learned policy comes with a machine-checkable certificate of adversarial robustness. Our approach, called CAROL, learns a model of the environment. In each learning iteration, it uses the current version of this model and an external abstract interpreter to construct a differentiable signal for provable robustness. This signal is used to guide policy learning, and the abstract interpretation used to construct it directly leads to the robustness certificate returned at convergence. We give a theoretical analysis that bounds the worst-case accumulative reward of CAROL. We also experimentally evaluate CAROL on four MuJoCo environments. On these tasks, which involve continuous state and action spaces, CAROL learns certified policies that have performance comparable to the (non-certified) policies learned using state-of-the-art robust RL methods.  ( 2 min )
    Diffusion Denoising for Low-Dose-CT Model. (arXiv:2301.11482v1 [eess.IV])
    Low-dose Computed Tomography (LDCT) reconstruction is an important task in medical image analysis. Recent years have seen many deep learning based methods, proved to be effective in this area. However, these methods mostly follow a supervised architecture, which needs paired CT image of full dose and quarter dose, and the solution is highly dependent on specific measurements. In this work, we introduce Denoising Diffusion LDCT Model, dubbed as DDLM, generating noise-free CT image using conditioned sampling. DDLM uses pretrained model, and need no training nor tuning process, thus our proposal is in unsupervised manner. Experiments on LDCT images have shown comparable performance of DDLM using less inference time, surpassing other state-of-the-art methods, proving both accurate and efficient. Implementation code will be set to public soon.  ( 2 min )
    Rigid body flows for sampling molecular crystal structures. (arXiv:2301.11355v1 [cs.LG])
    Normalizing flows (NF) are a class of powerful generative models that have gained popularity in recent years due to their ability to model complex distributions with high flexibility and expressiveness. In this work, we introduce a new type of normalizing flow that is tailored for modeling positions and orientations of multiple objects in three-dimensional space, such as molecules in a crystal. Our approach is based on two key ideas: first, we define smooth and expressive flows on the group of unit quaternions, which allows us to capture the continuous rotational motion of rigid bodies; second, we use the double cover property of unit quaternions to define a proper density on the rotation group. This ensures that our model can be trained using standard likelihood-based methods or variational inference with respect to a thermodynamic target density. We evaluate the method by training Boltzmann generators for two molecular examples, namely the multi-modal density of a tetrahedral system in an external field and the ice XI phase in the TIP4P-Ew water model. Our flows can be combined with flows operating on the internal degrees of freedom of molecules, and constitute an important step towards the modeling of distributions of many interacting molecules.  ( 2 min )
    Alien Coding. (arXiv:2301.11479v1 [cs.AI])
    We introduce a self-learning algorithm for synthesizing programs for OEIS sequences. The algorithm starts from scratch initially generating programs at random. Then it runs many iterations of a self-learning loop that interleaves (i) training neural machine translation to learn the correspondence between sequences and the programs discovered so far, and (ii) proposing many new programs for each OEIS sequence by the trained neural machine translator. The algorithm discovers on its own programs for more than 78000 OEIS sequences, sometimes developing unusual programming methods. We analyze its behavior and the invented programs in several experiments.  ( 2 min )
    Exploring Deep Reinforcement Learning for Holistic Smart Building Control. (arXiv:2301.11510v1 [cs.LG])
    In this paper, we take a holistic approach to deal with the tradeoffs between energy use and comfort in commercial buildings. We developed a system called OCTOPUS, which employs a novel deep reinforcement learning (DRL) framework that uses a data-driven approach to find the optimal control sequences of all building's subsystems, including HVAC, lighting, blind and window systems. The DRL architecture includes a novel reward function that allows the framework to explore the tradeoffs between energy use and users' comfort, while at the same time enabling the solution of the high-dimensional control problem due to the interactions of four different building subsystems. In order to cope with OCTOPUS's data training requirements, we argue that calibrated simulations that match the target building operational points are the vehicle to generate enough data to be able to train our DRL framework to find the control solution for the target building. In our work, we trained OCTOPUS with 10-year weather data and a building model that is implemented in the EnergyPlus building simulator, which was calibrated using data from a real production building. Through extensive simulations, we demonstrate that OCTOPUS can achieve 14.26% and 8.1% energy savings compared with the state-of-the-art rule-based method in a LEED Gold Certified building and the latest DRL-based method available in the literature respectively, while maintaining human comfort within a desired range.  ( 2 min )
    Machine Learning Approach and Extreme Value Theory to Correlated Stochastic Time Series with Application to Tree Ring Data. (arXiv:2301.11488v1 [stat.ML])
    The main goal of machine learning (ML) is to study and improve mathematical models which can be trained with data provided by the environment to infer the future and to make decisions without necessarily having complete knowledge of all influencing elements. In this work, we describe how ML can be a powerful tool in studying climate modeling. Tree ring growth was used as an implementation in different aspects, for example, studying the history of buildings and environment. By growing and via the time, a new layer of wood to beneath its bark by the tree. After years of growing, time series can be applied via a sequence of tree ring widths. The purpose of this paper is to use ML algorithms and Extreme Value Theory in order to analyse a set of tree ring widths data from nine trees growing in Nottinghamshire. Initially, we start by exploring the data through a variety of descriptive statistical approaches. Transforming data is important at this stage to find out any problem in modelling algorithm. We then use algorithm tuning and ensemble methods to improve the k-nearest neighbors (KNN) algorithm. A comparison between the developed method in this study ad other methods are applied. Also, extreme value of the dataset will be more investigated. The results of the analysis study show that the ML algorithms in the Random Forest method would give accurate results in the analysis of tree ring widths data from nine trees growing in Nottinghamshire with the lowest Root Mean Square Error value. Also, we notice that as the assumed ARMA model parameters increased, the probability of selecting the true model also increased. In terms of the Extreme Value Theory, the Weibull distribution would be a good choice to model tree ring data.  ( 2 min )
    Neural networks learn to magnify areas near decision boundaries. (arXiv:2301.11375v1 [cs.LG])
    We study how training molds the Riemannian geometry induced by neural network feature maps. At infinite width, neural networks with random parameters induce highly symmetric metrics on input space. Feature learning in networks trained to perform classification tasks magnifies local areas along decision boundaries. These changes are consistent with previously proposed geometric approaches for hand-tuning of kernel methods to improve generalization.  ( 2 min )
    Voting from Nearest Tasks: Meta-Vote Pruning of Pre-trained Models for Downstream Tasks. (arXiv:2301.11560v1 [cs.LG])
    As a few large-scale pre-trained models become the major choices of various applications, new challenges arise for model pruning, e.g., can we avoid pruning the same model from scratch for every downstream task? How to reuse the pruning results of previous tasks to accelerate the pruning for a new task? To address these challenges, we create a small model for a new task from the pruned models of similar tasks. We show that a few fine-tuning steps on this model suffice to produce a promising pruned-model for the new task. We study this ''meta-pruning'' from nearest tasks on two major classes of pre-trained models, convolutional neural network (CNN) and vision transformer (ViT), under a limited budget of pruning iterations. Our study begins by investigating the overlap of pruned models for similar tasks and how the overlap changes over different layers and blocks. Inspired by these discoveries, we develop a simple but effective ''Meta-Vote Pruning (MVP)'' method that significantly reduces the pruning iterations for a new task by initializing a sub-network from the pruned models of its nearest tasks. In experiments, we demonstrate MVP's advantages in accuracy, efficiency, and generalization through extensive empirical studies and comparisons with popular pruning methods over several datasets.  ( 2 min )
    Model-based Offline Reinforcement Learning with Local Misspecification. (arXiv:2301.11426v1 [cs.LG])
    We present a model-based offline reinforcement learning policy performance lower bound that explicitly captures dynamics model misspecification and distribution mismatch and we propose an empirical algorithm for optimal offline policy selection. Theoretically, we prove a novel safe policy improvement theorem by establishing pessimism approximations to the value function. Our key insight is to jointly consider selecting over dynamics models and policies: as long as a dynamics model can accurately represent the dynamics of the state-action pairs visited by a given policy, it is possible to approximate the value of that particular policy. We analyze our lower bound in the LQR setting and also show competitive performance to previous lower bounds on policy selection across a set of D4RL tasks.  ( 2 min )
    Direct Parameterization of Lipschitz-Bounded Deep Networks. (arXiv:2301.11526v1 [cs.LG])
    This paper introduces a new parameterization of deep neural networks (both fully-connected and convolutional) with guaranteed Lipschitz bounds, i.e. limited sensitivity to perturbations. The Lipschitz guarantees are equivalent to the tightest-known bounds based on certification via a semidefinite program (SDP), which does not scale to large models. In contrast to the SDP approach, we provide a ``direct'' parameterization, i.e. a smooth mapping from $\mathbb R^N$ onto the set of weights of Lipschitz-bounded networks. This enables training via standard gradient methods, without any computationally intensive projections or barrier terms. The new parameterization can equivalently be thought of as either a new layer type (the \textit{sandwich layer}), or a novel parameterization of standard feedforward networks with parameter sharing between neighbouring layers. We illustrate the method with some applications in image classification (MNIST and CIFAR-10).  ( 2 min )
    Graph Scattering beyond Wavelet Shackles. (arXiv:2301.11456v1 [cs.LG])
    This work develops a flexible and mathematically sound framework for the design and analysis of graph scattering networks with variable branching ratios and generic functional calculus filters. Spectrally-agnostic stability guarantees for node- and graph-level perturbations are derived; the vertex-set non-preserving case is treated by utilizing recently developed mathematical-physics based tools. Energy propagation through the network layers is investigated and related to truncation stability. New methods of graph-level feature aggregation are introduced and stability of the resulting composite scattering architectures is established. Finally, scattering transforms are extended to edge- and higher order tensorial input. Theoretical results are complemented by numerical investigations: Suitably chosen cattering networks conforming to the developed theory perform better than traditional graph-wavelet based scattering approaches in social network graph classification tasks and significantly outperform other graph-based learning approaches to regression of quantum-chemical energies on QM7.  ( 2 min )
    MG-GNN: Multigrid Graph Neural Networks for Learning Multilevel Domain Decomposition Methods. (arXiv:2301.11378v1 [cs.LG])
    Domain decomposition methods (DDMs) are popular solvers for discretized systems of partial differential equations (PDEs), with one-level and multilevel variants. These solvers rely on several algorithmic and mathematical parameters, prescribing overlap, subdomain boundary conditions, and other properties of the DDM. While some work has been done on optimizing these parameters, it has mostly focused on the one-level setting or special cases such as structured-grid discretizations with regular subdomain construction. In this paper, we propose multigrid graph neural networks (MG-GNN), a novel GNN architecture for learning optimized parameters in two-level DDMs\@. We train MG-GNN using a new unsupervised loss function, enabling effective training on small problems that yields robust performance on unstructured grids that are orders of magnitude larger than those in the training set. We show that MG-GNN outperforms popular hierarchical graph network architectures for this optimization and that our proposed loss function is critical to achieving this improved performance.  ( 2 min )
    Rethinking 1x1 Convolutions: Can we train CNNs with Frozen Random Filters?. (arXiv:2301.11360v1 [cs.CV])
    Modern CNNs are learning the weights of vast numbers of convolutional operators. In this paper, we raise the fundamental question if this is actually necessary. We show that even in the extreme case of only randomly initializing and never updating spatial filters, certain CNN architectures can be trained to surpass the accuracy of standard training. By reinterpreting the notion of pointwise ($1\times 1$) convolutions as an operator to learn linear combinations (LC) of frozen (random) spatial filters, we are able to analyze these effects and propose a generic LC convolution block that allows tuning of the linear combination rate. Empirically, we show that this approach not only allows us to reach high test accuracies on CIFAR and ImageNet but also has favorable properties regarding model robustness, generalization, sparsity, and the total number of necessary weights. Additionally, we propose a novel weight sharing mechanism, which allows sharing of a single weight tensor between all spatial convolution layers to massively reduce the number of weights.  ( 2 min )
    Estimating Causal Effects using a Multi-task Deep Ensemble. (arXiv:2301.11351v1 [cs.LG])
    Over the past few decades, a number of methods have been proposed for causal effect estimation, yet few have been demonstrated to be effective in handling data with complex structures, such as images. To fill this gap, we propose a Causal Multi-task Deep Ensemble (CMDE) framework to learn both shared and group-specific information from the study population and prove its equivalence to a multi-task Gaussian process (GP) with coregionalization kernel a priori. Compared to multi-task GP, CMDE efficiently handles high-dimensional and multi-modal covariates and provides pointwise uncertainty estimates of causal effects. We evaluate our method across various types of datasets and tasks and find that CMDE outperforms state-of-the-art methods on a majority of these tasks.  ( 2 min )
    Learning Vortex Dynamics for Fluid Inference and Prediction. (arXiv:2301.11494v1 [cs.LG])
    We propose a novel machine learning method based on differentiable vortex particles to infer and predict fluid dynamics from a single video. The key design of our system is a particle-based latent space to encapsulate the hidden, Lagrangian vortical evolution underpinning the observable, Eulerian flow phenomena. We devise a novel differentiable vortex particle system in conjunction with their learnable, vortex-to-velocity dynamics mapping to effectively capture and represent the complex flow features in a reduced space. We further design an end-to-end training pipeline to directly learn and synthesize simulators from data, that can reliably deliver future video rollouts based on limited observation. The value of our method is twofold: first, our learned simulator enables the inference of hidden physics quantities (e.g. velocity field) purely from visual observation, to be used for motion analysis; secondly, it also supports future prediction, constructing the input video's sequel along with its future dynamics evolution. We demonstrate our method's efficacy by comparing quantitatively and qualitatively with a range of existing methods on both synthetic and real-world videos, displaying improved data correspondence, visual plausibility, and physical integrity.  ( 2 min )
    Revisiting Discriminative Entropy Clustering and its relation to K-means. (arXiv:2301.11405v1 [cs.LG])
    Maximization of mutual information between the model's input and output is formally related to "decisiveness" and "fairness" of the softmax predictions, motivating such unsupervised entropy-based losses for discriminative neural networks. Recent self-labeling methods based on such losses represent the state of the art in deep clustering. However, some important properties of entropy clustering are not well-known, or even misunderstood. For example, we provide a counterexample to prior claims about equivalence to variance clustering (K-means) and point out technical mistakes in such theories. We discuss the fundamental differences between these discriminative and generative clustering approaches. Moreover, we show the susceptibility of standard entropy clustering to narrow margins and motivate an explicit margin maximization term. We also propose an improved self-labeling loss; it is robust to pseudo-labeling errors and enforces stronger fairness. We develop an EM algorithm for our loss that is significantly faster than the standard alternatives. Our results improve the state-of-the-art on standard benchmarks.  ( 2 min )
    Multi-limb Split Learning for Tumor Classification on Vertically Distributed Data. (arXiv:2301.11468v1 [eess.IV])
    Brain tumors are one of the life-threatening forms of cancer. Previous studies have classified brain tumors using deep neural networks. In this paper, we perform the later task using a collaborative deep learning technique, more specifically split learning. Split learning allows collaborative learning via neural networks splitting into two (or more) parts, a client-side network and a server-side network. The client-side is trained to a certain layer called the cut layer. Then, the rest of the training is resumed on the server-side network. Vertical distribution, a method for distributing data among organizations, was implemented where several hospitals hold different attributes of information for the same set of patients. To the best of our knowledge this paper will be the first paper to implement both split learning and vertical distribution for brain tumor classification. Using both techniques, we were able to achieve train and test accuracy greater than 90\% and 70\%, respectively.  ( 2 min )
    Are Equivariant Equilibrium Approximators Beneficial?. (arXiv:2301.11481v1 [cs.GT])
    Recently, remarkable progress has been made by approximating Nash equilibrium (NE), correlated equilibrium (CE), and coarse correlated equilibrium (CCE) through function approximation that trains a neural network to predict equilibria from game representations. Furthermore, equivariant architectures are widely adopted in designing such equilibrium approximators in normal-form games. In this paper, we theoretically characterize benefits and limitations of equivariant equilibrium approximators. For the benefits, we show that they enjoy better generalizability than general ones and can achieve better approximations when the payoff distribution is permutation-invariant. For the limitations, we discuss their drawbacks in terms of equilibrium selection and social welfare. Together, our results help to understand the role of equivariance in equilibrium approximators.  ( 2 min )
    Understanding Incremental Learning of Gradient Descent: A Fine-grained Analysis of Matrix Sensing. (arXiv:2301.11500v1 [cs.LG])
    It is believed that Gradient Descent (GD) induces an implicit bias towards good generalization in training machine learning models. This paper provides a fine-grained analysis of the dynamics of GD for the matrix sensing problem, whose goal is to recover a low-rank ground-truth matrix from near-isotropic linear measurements. It is shown that GD with small initialization behaves similarly to the greedy low-rank learning heuristics (Li et al., 2020) and follows an incremental learning procedure (Gissin et al., 2019): GD sequentially learns solutions with increasing ranks until it recovers the ground truth matrix. Compared to existing works which only analyze the first learning phase for rank-1 solutions, our result provides characterizations for the whole learning process. Moreover, besides the over-parameterized regime that many prior works focused on, our analysis of the incremental learning procedure also applies to the under-parameterized regime. Finally, we conduct numerical experiments to confirm our theoretical findings.  ( 2 min )
    Learning Modulo Theories. (arXiv:2301.11435v1 [cs.LG])
    Recent techniques that integrate \emph{solver layers} into Deep Neural Networks (DNNs) have shown promise in bridging a long-standing gap between inductive learning and symbolic reasoning techniques. In this paper we present a set of techniques for integrating \emph{Satisfiability Modulo Theories} (SMT) solvers into the forward and backward passes of a deep network layer, called SMTLayer. Using this approach, one can encode rich domain knowledge into the network in the form of mathematical formulas. In the forward pass, the solver uses symbols produced by prior layers, along with these formulas, to construct inferences; in the backward pass, the solver informs updates to the network, driving it towards representations that are compatible with the solver's theory. Notably, the solver need not be differentiable. We implement \layername as a Pytorch module, and our empirical results show that it leads to models that \emph{1)} require fewer training samples than conventional models, \emph{2)} that are robust to certain types of covariate shift, and \emph{3)} that ultimately learn representations that are consistent with symbolic knowledge, and thus naturally interpretable.  ( 2 min )
    Causal Structural Learning from Time Series: A Convex Optimization Approach. (arXiv:2301.11336v1 [cs.LG])
    Structural learning, which aims to learn directed acyclic graphs (DAGs) from observational data, is foundational to causal reasoning and scientific discovery. Recent advancements formulate structural learning into a continuous optimization problem; however, DAG learning remains a highly non-convex problem, and there has not been much work on leveraging well-developed convex optimization techniques for causal structural learning. We fill this gap by proposing a data-adaptive linear approach for causal structural learning from time series data, which can be conveniently cast into a convex optimization problem using a recently developed monotone operator variational inequality (VI) formulation. Furthermore, we establish non-asymptotic recovery guarantee of the VI-based approach and show the superior performance of our proposed method on structure recovery over existing methods via extensive numerical experiments.  ( 2 min )
    A Simple Algorithm For Scaling Up Kernel Methods. (arXiv:2301.11414v1 [cs.LG])
    The recent discovery of the equivalence between infinitely wide neural networks (NNs) in the lazy training regime and Neural Tangent Kernels (NTKs) (Jacot et al., 2018) has revived interest in kernel methods. However, conventional wisdom suggests kernel methods are unsuitable for large samples due to their computational complexity and memory requirements. We introduce a novel random feature regression algorithm that allows us (when necessary) to scale to virtually infinite numbers of random features. We illustrate the performance of our method on the CIFAR-10 dataset.  ( 2 min )
    Coincident Learning for Unsupervised Anomaly Detection. (arXiv:2301.11368v1 [cs.LG])
    Anomaly detection is an important task for complex systems (e.g., industrial facilities, manufacturing, large-scale science experiments), where failures in a sub-system can lead to low yield, faulty products, or even damage to components. While complex systems often have a wealth of data, labeled anomalies are typically rare (or even nonexistent) and expensive to acquire. In this paper, we introduce a new method, called CoAD, for training anomaly detection models on unlabeled data, based on the expectation that anomalous behavior in one sub-system will produce coincident anomalies in downstream sub-systems and products. Given data split into two streams $s$ and $q$ (i.e., subsystem diagnostics and final product quality), we define an unsupervised metric, $\hat{F}_\beta$, out of analogy to the supervised classification $F_\beta$ statistic, which quantifies the performance of the independent anomaly detection algorithms on s and q based on their coincidence rate. We demonstrate our method in four cases: a synthetic time-series data set, a synthetic imaging data set generated from MNIST, a metal milling data set, and a data set taken from a particle accelerator.  ( 2 min )
    Causal Bandits without Graph Learning. (arXiv:2301.11401v1 [stat.ML])
    We study the causal bandit problem when the causal graph is unknown and develop an efficient algorithm for finding the parent node of the reward node using atomic interventions. We derive the exact equation for the expected number of interventions performed by the algorithm and show that under certain graphical conditions it could perform either logarithmically fast or, under more general assumptions, slower but still sublinearly in the number of variables. We formally show that our algorithm is optimal as it meets the universal lower bound we establish for any algorithm that performs atomic interventions. Finally, we extend our algorithm to the case when the reward node has multiple parents. Using this algorithm together with a standard algorithm from bandit literature leads to improved regret bounds.  ( 2 min )
    A Hybrid Deep Neural Operator/Finite Element Method for Ice-Sheet Modeling. (arXiv:2301.11402v1 [physics.comp-ph])
    One of the most challenging and consequential problems in climate modeling is to provide probabilistic projections of sea level rise. A large part of the uncertainty of sea level projections is due to uncertainty in ice sheet dynamics. At the moment, accurate quantification of the uncertainty is hindered by the cost of ice sheet computational models. In this work, we develop a hybrid approach to approximate existing ice sheet computational models at a fraction of their cost. Our approach consists of replacing the finite element model for the momentum equations for the ice velocity, the most expensive part of an ice sheet model, with a Deep Operator Network, while retaining a classic finite element discretization for the evolution of the ice thickness. We show that the resulting hybrid model is very accurate and it is an order of magnitude faster than the traditional finite element model. Further, a distinctive feature of the proposed model compared to other neural network approaches, is that it can handle high-dimensional parameter spaces (parameter fields) such as the basal friction at the bed of the glacier, and can therefore be used for generating samples for uncertainty quantification. We study the impact of hyper-parameters, number of unknowns and correlation length of the parameter distribution on the training and accuracy of the Deep Operator Network on a synthetic ice sheet model. We then target the evolution of the Humboldt glacier in Greenland and show that our hybrid model can provide accurate statistics of the glacier mass loss and can be effectively used to accelerate the quantification of uncertainty.  ( 2 min )
  • Open

    Differential Privacy has Bounded Impact on Fairness in Classification. (arXiv:2210.16242v2 [cs.LG] UPDATED)
    We theoretically study the impact of differential privacy on fairness in classification. We prove that, given a class of models, popular group fairness measures are pointwise Lipschitz-continuous with respect to the parameters of the model. This result is a consequence of a more general statement on accuracy conditioned on an arbitrary event (such as membership to a sensitive group), which may be of independent interest. We use the aforementioned Lipschitz property to prove a high probability bound showing that, given enough examples, the fairness level of private models is close to the one of their non-private counterparts.
    Achieving Risk Control in Online Learning Settings. (arXiv:2205.09095v7 [cs.LG] UPDATED)
    To provide rigorous uncertainty quantification for online learning models, we develop a framework for constructing uncertainty sets that provably control risk -- such as coverage of confidence intervals, false negative rate, or F1 score -- in the online setting. This extends conformal prediction to apply to a larger class of online learning problems. Our method guarantees risk control at any user-specified level even when the underlying data distribution shifts drastically, even adversarially, over time in an unknown fashion. The technique we propose is highly flexible as it can be applied with any base online learning algorithm (e.g., a deep neural network trained online), requiring minimal implementation effort and essentially zero additional computational cost. We further extend our approach to control multiple risks simultaneously, so the prediction sets we generate are valid for all given risks. To demonstrate the utility of our method, we conduct experiments on real-world tabular time-series data sets showing that the proposed method rigorously controls various natural risks. Furthermore, we show how to construct valid intervals for an online image-depth estimation problem that previous sequential calibration schemes cannot handle.  ( 2 min )
    The Stochastic Proximal Distance Algorithm. (arXiv:2210.12277v3 [stat.ML] UPDATED)
    Stochastic versions of proximal methods have gained much attention in statistics and machine learning. These algorithms tend to admit simple, scalable forms, and enjoy numerical stability via implicit updates. In this work, we propose and analyze a stochastic version of the recently proposed proximal distance algorithm, a class of iterative optimization methods that recover a desired constrained estimation problem as a penalty parameter $\rho \rightarrow \infty$. By uncovering connections to related stochastic proximal methods and interpreting the penalty parameter as the learning rate, we justify heuristics used in practical manifestations of the proximal distance method, establishing their convergence guarantees for the first time. Moreover, we extend recent theoretical devices to establish finite error bounds and a complete characterization of convergence rates regimes. We validate our analysis via a thorough empirical study, also showing that unsurprisingly, the proposed method outpaces batch versions on popular learning tasks.
    Big portfolio selection by graph-based conditional moments method. (arXiv:2301.11697v1 [stat.ML])
    How to do big portfolio selection is very important but challenging for both researchers and practitioners. In this paper, we propose a new graph-based conditional moments (GRACE) method to do portfolio selection based on thousands of stocks or more. The GRACE method first learns the conditional quantiles and mean of stock returns via a factor-augmented temporal graph convolutional network, which guides the learning procedure through a factor-hypergraph built by the set of stock-to-stock relations from the domain knowledge as well as the set of factor-to-stock relations from the asset pricing knowledge. Next, the GRACE method learns the conditional variance, skewness, and kurtosis of stock returns from the learned conditional quantiles by using the quantiled conditional moment (QCM) method. The QCM method is a supervised learning procedure to learn these conditional higher-order moments, so it largely overcomes the computational difficulty from the classical high-dimensional GARCH-type methods. Moreover, the QCM method allows the mis-specification in modeling conditional quantiles to some extent, due to its regression-based nature. Finally, the GRACE method uses the learned conditional mean, variance, skewness, and kurtosis to construct several performance measures, which are criteria to sort the stocks to proceed the portfolio selection in the well-known 10-decile framework. An application to NASDAQ and NYSE stock markets shows that the GRACE method performs much better than its competitors, particularly when the performance measures are comprised of conditional variance, skewness, and kurtosis.  ( 2 min )
    Personalised Federated Learning On Heterogeneous Feature Spaces. (arXiv:2301.11447v1 [cs.LG])
    Most personalised federated learning (FL) approaches assume that raw data of all clients are defined in a common subspace i.e. all clients store their data according to the same schema. For real-world applications, this assumption is restrictive as clients, having their own systems to collect and then store data, may use heterogeneous data representations. We aim at filling this gap. To this end, we propose a general framework coined FLIC that maps client's data onto a common feature space via local embedding functions. The common feature space is learnt in a federated manner using Wasserstein barycenters while the local embedding functions are trained on each client via distribution alignment. We integrate this distribution alignement mechanism into a federated learning approach and provide the algorithmics of FLIC. We compare its performances against FL benchmarks involving heterogeneous input features spaces. In addition, we provide theoretical insights supporting the relevance of our methodology.  ( 2 min )
    Fine-tuning Neural-Operator architectures for training and generalization. (arXiv:2301.11509v1 [cs.LG])
    In this work, we present an analysis of the generalization of Neural Operators (NOs) and derived architectures. We proposed a family of networks, which we name (${\textit{s}}{\text{NO}}+\varepsilon$), where we modify the layout of NOs towards an architecture resembling a Transformer; mainly, we substitute the Attention module with the Integral Operator part of NOs. The resulting network preserves universality, has a better generalization to unseen data, and similar number of parameters as NOs. On the one hand, we study numerically the generalization by gradually transforming NOs into ${\textit{s}}{\text{NO}}+\varepsilon$ and verifying a reduction of the test loss considering a time-harmonic wave dataset with different frequencies. We perform the following changes in NOs: (a) we split the Integral Operator (non-local) and the (local) feed-forward network (MLP) into different layers, generating a {\it sequential} structure which we call sequential Neural Operator (${\textit{s}}{\text{NO}}$), (b) we add the skip connection, and layer normalization in ${\textit{s}}{\text{NO}}$, and (c) we incorporate dropout and stochastic depth that allows us to generate deep networks. In each case, we observe a decrease in the test loss in a wide variety of initialization, indicating that our changes outperform the NO. On the other hand, building on infinite-dimensional Statistics, and in particular the Dudley Theorem, we provide bounds of the Rademacher complexity of NOs and ${\textit{s}}{\text{NO}}$, and we find the following relationship: the upper bound of the Rademacher complexity of the ${\textit{s}}{\text{NO}}$ is a lower-bound of the NOs, thereby, the generalization error bound of ${\textit{s}}{\text{NO}}$ is smaller than NO, which further strengthens our numerical results.  ( 2 min )
    On the Relationship Between Explanation and Prediction: A Causal View. (arXiv:2212.06925v3 [cs.LG] UPDATED)
    Explainability has become a central requirement for the development, deployment, and adoption of machine learning (ML) models and we are yet to understand what explanation methods can and cannot do. Several factors such as data, model prediction, hyperparameters used in training the model, and random initialization can all influence downstream explanations. While previous work empirically hinted that explanations (E) may have little relationship with the prediction (Y), there is a lack of conclusive study to quantify this relationship. Our work borrows tools from causal inference to systematically assay this relationship. More specifically, we measure the relationship between E and Y by measuring the treatment effect when intervening on their causal ancestors (hyperparameters) (inputs to generate saliency-based Es or Ys). We discover that Y's relative direct influence on E follows an odd pattern; the influence is higher in the lowest-performing models than in mid-performing models, and it then decreases in the top-performing models. We believe our work is a promising first step towards providing better guidance for practitioners who can make more informed decisions in utilizing these explanations by knowing what factors are at play and how they relate to their end task.  ( 2 min )
    SOBER: Scalable Batch Bayesian Optimization and Quadrature using Recombination Constraints. (arXiv:2301.11832v1 [cs.LG])
    Batch Bayesian optimisation (BO) has shown to be a sample-efficient method of performing optimisation where expensive-to-evaluate objective functions can be queried in parallel. However, current methods do not scale to large batch sizes -- a frequent desideratum in practice (e.g. drug discovery or simulation-based inference). We present a novel algorithm, SOBER, which permits scalable and diversified batch BO with arbitrary acquisition functions, arbitrary input spaces (e.g. graph), and arbitrary kernels. The key to our approach is to reformulate batch selection for BO as a Bayesian quadrature (BQ) problem, which offers computational advantages. This reformulation is beneficial in solving BQ tasks reciprocally, which introduces the exploitative functionality of BO to BQ. We show that SOBER offers substantive performance gains in synthetic and real-world tasks, including drug discovery and simulation-based inference.  ( 2 min )
    Integrating Random Effects in Deep Neural Networks. (arXiv:2206.03314v3 [stat.ML] UPDATED)
    Modern approaches to supervised learning like deep neural networks (DNNs) typically implicitly assume that observed responses are statistically independent. In contrast, correlated data are prevalent in real-life large-scale applications, with typical sources of correlation including spatial, temporal and clustering structures. These correlations are either ignored by DNNs, or ad-hoc solutions are developed for specific use cases. We propose to use the mixed models framework to handle correlated data in DNNs. By treating the effects underlying the correlation structure as random effects, mixed models are able to avoid overfitted parameter estimates and ultimately yield better predictive performance. The key to combining mixed models and DNNs is using the Gaussian negative log-likelihood (NLL) as a natural loss function that is minimized with DNN machinery including stochastic gradient descent (SGD). Since NLL does not decompose like standard DNN loss functions, the use of SGD with NLL presents some theoretical and implementation challenges, which we address. Our approach which we call LMMNN is demonstrated to improve performance over natural competitors in various correlation scenarios on diverse simulated and real datasets. Our focus is on a regression setting and tabular datasets, but we also show some results for classification. Our code is available at https://github.com/gsimchoni/lmmnn.  ( 2 min )
    Rethinking Assumptions in Deep Anomaly Detection. (arXiv:2006.00339v3 [cs.LG] UPDATED)
    Though anomaly detection (AD) can be viewed as a classification problem (nominal vs. anomalous) it is usually treated in an unsupervised manner since one typically does not have access to, or it is infeasible to utilize, a dataset that sufficiently characterizes what it means to be "anomalous." In this paper we present results demonstrating that this intuition surprisingly seems not to extend to deep AD on images. For a recent AD benchmark on ImageNet, classifiers trained to discern between normal samples and just a few (64) random natural images are able to outperform the current state of the art in deep AD. Experimentally we discover that the multiscale structure of image data makes example anomalies exceptionally informative.  ( 2 min )
    Finite-time analysis of single-timescale actor-critic. (arXiv:2210.09921v2 [cs.LG] UPDATED)
    Actor-critic methods have achieved significant success in many challenging applications. However, its finite-time convergence is still poorly understood in its most practical form. Existing works on analyzing single-timescale actor-critic only focus on the i.i.d. sampling or tabular setting for simplicity. We consider the more practical online single-timescale actor-critic algorithm on continuous state space, where the critic is updated with a single Markovian sample per actor step. Existing analysis cannot conclude the convergence for such a challenging case. We prove that the online single-timescale actor-critic method is guaranteed to find an $\epsilon$-approximate stationary point with $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity under standard assumptions, which can be further improved to $\mathcal{O}(\epsilon^{-2})$ under the i.i.d. sampling. We develop a novel framework that evaluates and controls the error propagation between actor and critic systematically. To our knowledge, this is the first finite-time analysis for the online single-timescale actor-critic method. Our results compare favorably to the existing literature in terms of considering the most practical yet challenging settings and requiring weaker assumptions.  ( 2 min )
    Algorithmic Stability of Heavy-Tailed SGD with General Loss Functions. (arXiv:2301.11885v1 [stat.ML])
    Heavy-tail phenomena in stochastic gradient descent (SGD) have been reported in several empirical studies. Experimental evidence in previous works suggests a strong interplay between the heaviness of the tails and generalization behavior of SGD. To address this empirical phenomena theoretically, several works have made strong topological and statistical assumptions to link the generalization error to heavy tails. Very recently, new generalization bounds have been proven, indicating a non-monotonic relationship between the generalization error and heavy tails, which is more pertinent to the reported empirical observations. While these bounds do not require additional topological assumptions given that SGD can be modeled using a heavy-tailed stochastic differential equation (SDE), they can only apply to simple quadratic problems. In this paper, we build on this line of research and develop generalization bounds for a more general class of objective functions, which includes non-convex functions as well. Our approach is based on developing Wasserstein stability bounds for heavy-tailed SDEs and their discretizations, which we then convert to generalization bounds. Our results do not require any nontrivial assumptions; yet, they shed more light to the empirical observations, thanks to the generality of the loss functions.  ( 2 min )
    Conformal inference is (almost) free for neural networks trained with early stopping. (arXiv:2301.11556v1 [stat.ML])
    Early stopping based on hold-out data is a popular regularization technique designed to mitigate overfitting and increase the predictive accuracy of neural networks. Models trained with early stopping often provide relatively accurate predictions, but they generally still lack precise statistical guarantees unless they are further calibrated using independent hold-out data. This paper addresses the above limitation with conformalized early stopping: a novel method that combines early stopping with conformal calibration while efficiently recycling the same hold-out data. This leads to models that are both accurate and able to provide exact predictive inferences without multiple data splits nor overly conservative adjustments. Practical implementations are developed for different learning tasks -- outlier detection, multi-class classification, regression -- and their competitive performance is demonstrated on real data.  ( 2 min )
    Robust Multi-Agent Bandits Over Undirected Graphs. (arXiv:2203.00076v2 [cs.LG] UPDATED)
    We consider a multi-agent multi-armed bandit setting in which $n$ honest agents collaborate over a network to minimize regret but $m$ malicious agents can disrupt learning arbitrarily. Assuming the network is the complete graph, existing algorithms incur $O( (m + K/n) \log (T) / \Delta )$ regret in this setting, where $K$ is the number of arms and $\Delta$ is the arm gap. For $m \ll K$, this improves over the single-agent baseline regret of $O(K\log(T)/\Delta)$. In this work, we show the situation is murkier beyond the case of a complete graph. In particular, we prove that if the state-of-the-art algorithm is used on the undirected line graph, honest agents can suffer (nearly) linear regret until time is doubly exponential in $K$ and $n$. In light of this negative result, we propose a new algorithm for which the $i$-th agent has regret $O( ( d_{\text{mal}}(i) + K/n) \log(T)/\Delta)$ on any connected and undirected graph, where $d_{\text{mal}}(i)$ is the number of $i$'s neighbors who are malicious. Thus, we generalize existing regret bounds beyond the complete graph (where $d_{\text{mal}}(i) = m$), and show the effect of malicious agents is entirely local (in the sense that only the $d_{\text{mal}}(i)$ malicious agents directly connected to $i$ affect its long-term regret).  ( 2 min )
    Distributionally Robust Offline Reinforcement Learning with Linear Function Approximation. (arXiv:2209.06620v3 [cs.LG] UPDATED)
    Among the reasons hindering reinforcement learning (RL) applications to real-world problems, two factors are critical: limited data and the mismatch between the testing environment (real environment in which the policy is deployed) and the training environment (e.g., a simulator). This paper attempts to address these issues simultaneously with distributionally robust offline RL, where we learn a distributionally robust policy using historical data obtained from the source environment by optimizing against a worst-case perturbation thereof. In particular, we move beyond tabular settings and consider linear function approximation. More specifically, we consider two settings, one where the dataset is well-explored and the other where the dataset has sufficient coverage of the optimal policy. We propose two algorithms~-- one for each of the two settings~-- that achieve error bounds $\tilde{O}(d^{1/2}/N^{1/2})$ and $\tilde{O}(d^{3/2}/N^{1/2})$ respectively, where $d$ is the dimension in the linear function approximation and $N$ is the number of trajectories in the dataset. To the best of our knowledge, they provide the first non-asymptotic results of the sample complexity in this setting. Diverse experiments are conducted to demonstrate our theoretical findings, showing the superiority of our algorithm against the non-robust one.  ( 2 min )
    Statistical Inference for the Dynamic Time Warping Distance, with Application to Abnormal Time-Series Detection. (arXiv:2202.06593v2 [stat.ML] UPDATED)
    We study statistical inference on the similarity/distance between two time-series under uncertain environment by considering a statistical hypothesis test on the distance obtained from Dynamic Time Warping (DTW) algorithm. The sampling distribution of the DTW distance is too difficult to derive because it is obtained based on the solution of the DTW algorithm, which is complicated. To circumvent this difficulty, we propose to employ the conditional selective inference framework, which enables us to derive a valid inference method on the DTW distance. To our knowledge, this is the first method that can provide a valid p-value to quantify the statistical significance of the DTW distance, which is helpful for high-stake decision making such as abnormal time-series detection problems. We evaluate the performance of the proposed inference method on both synthetic and real-world datasets.  ( 2 min )
    Learning the Dynamics of Sparsely Observed Interacting Systems. (arXiv:2301.11647v1 [stat.ML])
    We address the problem of learning the dynamics of an unknown non-parametric system linking a target and a feature time series. The feature time series is measured on a sparse and irregular grid, while we have access to only a few points of the target time series. Once learned, we can use these dynamics to predict values of the target from the previous values of the feature time series. We frame this task as learning the solution map of a controlled differential equation (CDE). By leveraging the rich theory of signatures, we are able to cast this non-linear problem as a high-dimensional linear regression. We provide an oracle bound on the prediction error which exhibits explicit dependencies on the individual-specific sampling schemes. Our theoretical results are illustrated by simulations which show that our method outperforms existing algorithms for recovering the full time series while being computationally cheap. We conclude by demonstrating its potential on real-world epidemiological data.  ( 2 min )
    Feasibility and Transferability of Transfer Learning: A Mathematical Framework. (arXiv:2301.11542v1 [cs.LG])
    Transfer learning is an emerging and popular paradigm for utilizing existing knowledge from previous learning tasks to improve the performance of new ones. Despite its numerous empirical successes, theoretical analysis for transfer learning is limited. In this paper we build for the first time, to the best of our knowledge, a mathematical framework for the general procedure of transfer learning. Our unique reformulation of transfer learning as an optimization problem allows for the first time, analysis of its feasibility. Additionally, we propose a novel concept of transfer risk to evaluate transferability of transfer learning. Our numerical studies using the Office-31 dataset demonstrate the potential and benefits of incorporating transfer risk in the evaluation of transfer learning performance.  ( 2 min )
    Single-Trajectory Distributionally Robust Reinforcement Learning. (arXiv:2301.11721v1 [stat.ML])
    As a framework for sequential decision-making, Reinforcement Learning (RL) has been regarded as an essential component leading to Artificial General Intelligence (AGI). However, RL is often criticized for having the same training environment as the test one, which also hinders its application in the real world. To mitigate this problem, Distributionally Robust RL (DRRL) is proposed to improve the worst performance in a set of environments that may contain the unknown test environment. Due to the nonlinearity of the robustness goal, most of the previous work resort to the model-based approach, learning with either an empirical distribution learned from the data or a simulator that can be sampled infinitely, which limits their applications in simple dynamics environments. In contrast, we attempt to design a DRRL algorithm that can be trained along a single trajectory, i.e., no repeated sampling from a state. Based on the standard Q-learning, we propose distributionally robust Q-learning with the single trajectory (DRQ) and its average-reward variant named differential DRQ. We provide asymptotic convergence guarantees and experiments for both settings, demonstrating their superiority in the perturbed environments against the non-robust ones.  ( 2 min )
    I Prefer not to Say: Are Users Penalized for Protecting Personal Data?. (arXiv:2210.13954v3 [cs.LG] UPDATED)
    We examine the problem of obtaining fair outcomes for individuals who choose to share optional information with machine-learned models and those who do not consent and keep their data undisclosed. We find that these non-consenting users receive significantly lower prediction outcomes than justified by their provided information alone. This observation gives rise to the overlooked problem of how to ensure that users, who protect their personal data, are not penalized. While statistical fairness notions focus on fair outcomes between advantaged and disadvantaged groups, these fairness notions fail to protect the non-consenting users. To address this problem, we formalize protection requirements for models which (i) allow users to benefit from sharing optional information and (ii) do not penalize them if they keep their data undisclosed. We offer the first solution to this problem by proposing the notion of Optional Feature Fairness (OFF), which we prove to be loss-optimal under our protection requirements (i) and (ii). To learn OFF-compliant models, we devise a model-agnostic data augmentation strategy with finite sample convergence guarantees. Finally, we extensively analyze OFF on a variety of challenging real-world tasks, models, and data sets with multiple optional features.  ( 2 min )
    Bi-stochastically normalized graph Laplacian: convergence to manifold Laplacian and robustness to outlier noise. (arXiv:2206.11386v2 [math.ST] UPDATED)
    Bi-stochastic normalization provides an alternative normalization of graph Laplacians in graph-based data analysis and can be computed efficiently by Sinkhorn-Knopp (SK) iterations. This paper proves the convergence of bi-stochastically normalized graph Laplacian to manifold (weighted-)Laplacian with rates, when $n$ data points are i.i.d. sampled from a general $d$-dimensional manifold embedded in a possibly high-dimensional space. Under certain joint limit of $n \to \infty$ and kernel bandwidth $\epsilon \to 0$, the point-wise convergence rate of the graph Laplacian operator (under 2-norm) is proved to be $ O( n^{-1/(d/2+3)})$ at finite large $n$ up to log factors, achieved at the scaling of $\epsilon \sim n^{-1/(d/2+3)} $. When the manifold data are corrupted by outlier noise, we theoretically prove the graph Laplacian point-wise consistency which matches the rate for clean manifold data plus an additional term proportional to the boundedness of the inner-products of the noise vectors among themselves and with data vectors. Motivated by our analysis, which suggests that not exact bi-stochastic normalization but an approximate one will achieve the same consistency rate, we propose an approximate and constrained matrix scaling problem that can be solved by SK iterations with early termination. Numerical experiments support our theoretical results and show the robustness of bi-stochastically normalized graph Laplacian to high-dimensional outlier noise.  ( 2 min )
    Embrace the Gap: VAEs Perform Independent Mechanism Analysis. (arXiv:2206.02416v3 [stat.ML] UPDATED)
    Variational autoencoders (VAEs) are a popular framework for modeling complex data distributions; they can be efficiently trained via variational inference by maximizing the evidence lower bound (ELBO), at the expense of a gap to the exact (log-)marginal likelihood. While VAEs are commonly used for representation learning, it is unclear why ELBO maximization would yield useful representations, since unregularized maximum likelihood estimation cannot invert the data-generating process. Yet, VAEs often succeed at this task. We seek to elucidate this apparent paradox by studying nonlinear VAEs in the limit of near-deterministic decoders. We first prove that, in this regime, the optimal encoder approximately inverts the decoder -- a commonly used but unproven conjecture -- which we refer to as {\em self-consistency}. Leveraging self-consistency, we show that the ELBO converges to a regularized log-likelihood. This allows VAEs to perform what has recently been termed independent mechanism analysis (IMA): it adds an inductive bias towards decoders with column-orthogonal Jacobians, which helps recovering the true latent factors. The gap between ELBO and log-likelihood is therefore welcome, since it bears unanticipated benefits for nonlinear representation learning. In experiments on synthetic and image data, we show that VAEs uncover the true latent factors when the data generating process satisfies the IMA assumption.  ( 2 min )
    Convergence of Batch Updating Methods with Approximate Gradients and/or Noisy Measurements: Theory and Computational Results. (arXiv:2209.05372v2 [math.OC] UPDATED)
    In this paper, we present a unified and general framework for analyzing the batch updating approach to nonlinear, high-dimensional optimization. The framework encompasses all the currently used batch updating approaches, and is applicable to nonconvex as well as convex functions. Moreover, the framework permits the use of noise-corrupted gradients, as well as first-order approximations to the gradient (sometimes referred to as "gradient-free" approaches). By viewing the analysis of the iterations as a problem in the convergence of stochastic processes, we are able to establish a very general theorem, which includes most known convergence results for zeroth-order and first-order methods. The analysis of "second-order" or momentum-based methods is not a part of this paper, and will be studied elsewhere. However, numerical experiments indicate that momentum-based methods can fail if the true gradient is replaced by its first-order approximation. This requires further theoretical analysis.  ( 2 min )
    Explaining Patterns in Data with Language Models via Interpretable Autoprompting. (arXiv:2210.01848v2 [cs.LG] UPDATED)
    Large language models (LLMs) have displayed an impressive ability to harness natural language to perform complex tasks. In this work, we explore whether we can leverage this learned ability to find and explain patterns in data. Specifically, given a pre-trained LLM and data examples, we introduce interpretable autoprompting (iPrompt), an algorithm that generates a natural-language string explaining the data. iPrompt iteratively alternates between generating explanations with an LLM and reranking them based on their performance when used as a prompt. Experiments on a wide range of datasets, from synthetic mathematics to natural-language understanding, show that iPrompt can yield meaningful insights by accurately finding groundtruth dataset descriptions. Moreover, the prompts produced by iPrompt are simultaneously human-interpretable and highly effective for generalization: on real-world sentiment classification datasets, iPrompt produces prompts that match or even improve upon human-written prompts for GPT-3. Finally, experiments with an fMRI dataset show the potential for iPrompt to aid in scientific discovery. All code for using the methods and data here is made available on Github.  ( 2 min )
    A Deep Learning Method for Comparing Bayesian Hierarchical Models. (arXiv:2301.11873v1 [stat.ML])
    Bayesian model comparison (BMC) offers a principled approach for assessing the relative merits of competing computational models and propagating uncertainty into model selection decisions. However, BMC is often intractable for the popular class of hierarchical models due to their high-dimensional nested parameter structure. To address this intractability, we propose a deep learning method for performing BMC on any set of hierarchical models which can be instantiated as probabilistic programs. Since our method enables amortized inference, it allows efficient re-estimation of posterior model probabilities and fast performance validation prior to any real-data application. In a series of extensive validation studies, we benchmark the performance of our method against the state-of-the-art bridge sampling method and demonstrate excellent amortized inference across all BMC settings. We then use our method to compare four hierarchical evidence accumulation models that have previously been deemed intractable for BMC due to partly implicit likelihoods. In this application, we corroborate evidence for the recently proposed L\'evy flight model of decision-making and show how transfer learning can be leveraged to enhance training efficiency. Reproducible code for all analyses is provided.  ( 2 min )
    CROWDLAB: Supervised learning to infer consensus labels and quality scores for data with multiple annotators. (arXiv:2210.06812v2 [cs.LG] UPDATED)
    Real-world data for classification is often labeled by multiple annotators. For analyzing such data, we introduce CROWDLAB, a straightforward approach to utilize any trained classifier to estimate: (1) A consensus label for each example that aggregates the available annotations; (2) A confidence score for how likely each consensus label is correct; (3) A rating for each annotator quantifying the overall correctness of their labels. Existing algorithms to estimate related quantities in crowdsourcing often rely on sophisticated generative models with iterative inference. CROWDLAB instead uses a straightforward weighted ensemble. Existing algorithms often rely solely on annotator statistics, ignoring the features of the examples from which the annotations derive. CROWDLAB utilizes any classifier model trained on these features, and can thus better generalize between examples with similar features. On real-world multi-annotator image data, our proposed method provides superior estimates for (1)-(3) than existing algorithms like Dawid-Skene/GLAD.  ( 2 min )
    Synthetic A/B Testing using Synthetic Interventions. (arXiv:2006.07691v5 [econ.EM] UPDATED)
    Suppose there are $N$ units and $D$ interventions. We aim to learn the average potential outcome associated with every unit-intervention pair, i.e., $N \times D$ causal parameters. While running $N \times D$ experiments is conceivable, it can be expensive or infeasible. This work introduces an experiment design, synthetic A/B testing, and the synthetic interventions (SI) estimator to recover all $N \times D$ causal parameters while observing each unit under at most two interventions, independent of $D$. Under a novel tensor factor model for potential outcomes across units, measurements, and interventions, we establish the identification of each parameter. Further, we show the SI estimator is finite-sample consistent and asymptotically normal. Collectively, these also lead to novel results for panel data settings, particularly for synthetic controls. We empirically validate our experiment design using real e-commerce data from a large-scale A/B test.  ( 2 min )
    A kernel Stein test of goodness of fit for sequential models. (arXiv:2210.10741v2 [stat.ML] UPDATED)
    We propose a goodness-of-fit measure for probability densities modeling observations with varying dimensionality, such as text documents of differing lengths or variable-length sequences. The proposed measure is an instance of the kernel Stein discrepancy (KSD), which has been used to construct goodness-of-fit tests for unnormalized densities. The KSD is defined by its Stein operator: current operators used in testing apply to fixed-dimensional spaces. As our main contribution, we extend the KSD to the variable-dimension setting by identifying appropriate Stein operators, and propose a novel KSD goodness-of-fit test. As with the previous variants, the proposed KSD does not require the density to be normalized, allowing the evaluation of a large class of models. Our test is shown to perform well in practice on discrete sequential data benchmarks.  ( 2 min )
    FedPop: A Bayesian Approach for Personalised Federated Learning. (arXiv:2206.03611v2 [cs.LG] UPDATED)
    Personalised federated learning (FL) aims at collaboratively learning a machine learning model taylored for each client. Albeit promising advances have been made in this direction, most of existing approaches works do not allow for uncertainty quantification which is crucial in many applications. In addition, personalisation in the cross-device setting still involves important issues, especially for new clients or those having small number of observations. This paper aims at filling these gaps. To this end, we propose a novel methodology coined FedPop by recasting personalised FL into the population modeling paradigm where clients' models involve fixed common population parameters and random effects, aiming at explaining data heterogeneity. To derive convergence guarantees for our scheme, we introduce a new class of federated stochastic optimisation algorithms which relies on Markov chain Monte Carlo methods. Compared to existing personalised FL methods, the proposed methodology has important benefits: it is robust to client drift, practical for inference on new clients, and above all, enables uncertainty quantification under mild computational and memory overheads. We provide non-asymptotic convergence guarantees for the proposed algorithms and illustrate their performances on various personalised federated learning tasks.  ( 2 min )
    Constrained Clustering: General Pairwise and Cardinality Constraints. (arXiv:1907.10410v2 [cs.LG] UPDATED)
    We study constrained clustering, where constraints guide the clustering process. In existing works, two categories of constraints have been widely explored, namely pairwise and cardinality constraints. Pairwise constraints enforce the cluster labels of two instances to be the same (must-link constraints) or different (cannot-link constraints). Cardinality constraints encourage cluster sizes to satisfy a user-specified distribution. Most existing constrained clustering models can only utilize one category of constraints at a time. We enforce the above two categories into a unified clustering model starting with the integer program formulation of the standard K-means. As the two categories provide different useful information, utilizing both allow for better clustering performance. However, the optimization is difficult due to the binary and quadratic constraints in the unified formulation. To solve this, we utilize two techniques: equivalently replacing the binary constraints by the intersection of two continuous constraints; the other is transforming the quadratic constraints into bi-linear constraints by introducing extra variables. We derive an equivalent continuous reformulation with simple constraints, which can be efficiently solved by Alternating Direction Method of Multipliers. Extensive experiments on both synthetic and real data demonstrate when: (1) utilizing a single category of constraint, the proposed model is superior to or competitive with SOTA constrained clustering models, and (2) utilizing both categories of constraints jointly, the proposed model shows better performance than the case of the single category. The experiments show that the proposed method exploits the constraints to achieve perfect clustering performance with improved clustering to 2%-5% in classical clustering metrics, e.g. Adjusted Random, Mirkin's, and Huber's, indices outerperfomring other methods.  ( 2 min )
    Fast Bayesian Inference with Batch Bayesian Quadrature via Kernel Recombination. (arXiv:2206.04734v4 [cs.LG] UPDATED)
    Calculation of Bayesian posteriors and model evidences typically requires numerical integration. Bayesian quadrature (BQ), a surrogate-model-based approach to numerical integration, is capable of superb sample efficiency, but its lack of parallelisation has hindered its practical applications. In this work, we propose a parallelised (batch) BQ method, employing techniques from kernel quadrature, that possesses an empirically exponential convergence rate. Additionally, just as with Nested Sampling, our method permits simultaneous inference of both posteriors and model evidence. Samples from our BQ surrogate model are re-selected to give a sparse set of samples, via a kernel recombination algorithm, requiring negligible additional time to increase the batch size. Empirically, we find that our approach significantly outperforms the sampling efficiency of both state-of-the-art BQ techniques and Nested Sampling in various real-world datasets, including lithium-ion battery analytics.  ( 2 min )
    Lifelong Reinforcement Learning with Modulating Masks. (arXiv:2212.11110v2 [cs.LG] UPDATED)
    Lifelong learning aims to create AI systems that continuously and incrementally learn during a lifetime, similar to biological learning. Attempts so far have met problems, including catastrophic forgetting, interference among tasks, and the inability to exploit previous knowledge. While considerable research has focused on learning multiple input distributions, typically in classification, lifelong reinforcement learning (LRL) must also deal with variations in the state and transition distributions, and in the reward functions. Modulating masks, recently developed for classification, are particularly suitable to deal with such a large spectrum of task variations. In this paper, we adapted modulating masks to work with deep LRL, specifically PPO and IMPALA agents. The comparison with LRL baselines in both discrete and continuous RL tasks shows superior performance. We further investigated the use of a linear combination of previously learned masks to exploit previous knowledge when learning new tasks: not only is learning faster, the algorithm solves tasks that we could not otherwise solve from scratch due to extremely sparse rewards. The results suggest that RL with modulating masks is a promising approach to lifelong learning, to the composition of knowledge to learn increasingly complex tasks, and to knowledge reuse for efficient and faster learning.  ( 2 min )
    DAG Learning on the Permutahedron. (arXiv:2301.11898v1 [cs.LG])
    We propose a continuous optimization framework for discovering a latent directed acyclic graph (DAG) from observational data. Our approach optimizes over the polytope of permutation vectors, the so-called Permutahedron, to learn a topological ordering. Edges can be optimized jointly, or learned conditional on the ordering via a non-differentiable subroutine. Compared to existing continuous optimization approaches our formulation has a number of advantages including: 1. validity: optimizes over exact DAGs as opposed to other relaxations optimizing approximate DAGs; 2. modularity: accommodates any edge-optimization procedure, edge structural parameterization, and optimization loss; 3. end-to-end: either alternately iterates between node-ordering and edge-optimization, or optimizes them jointly. We demonstrate, on real-world data problems in protein-signaling and transcriptional network discovery, that our approach lies on the Pareto frontier of two key metrics, the SID and SHD.  ( 2 min )
    Aleatoric and Epistemic Discrimination in Classification. (arXiv:2301.11781v1 [cs.LG])
    Machine learning (ML) models can underperform on certain population groups due to choices made during model development and bias inherent in the data. We categorize sources of discrimination in the ML pipeline into two classes: aleatoric discrimination, which is inherent in the data distribution, and epistemic discrimination, which is due to decisions during model development. We quantify aleatoric discrimination by determining the performance limits of a model under fairness constraints, assuming perfect knowledge of the data distribution. We demonstrate how to characterize aleatoric discrimination by applying Blackwell's results on comparing statistical experiments. We then quantify epistemic discrimination as the gap between a model's accuracy given fairness constraints and the limit posed by aleatoric discrimination. We apply this approach to benchmark existing interventions and investigate fairness risks in data with missing values. Our results indicate that state-of-the-art fairness interventions are effective at removing epistemic discrimination. However, when data has missing values, there is still significant room for improvement in handling aleatoric discrimination.  ( 2 min )
    From Classification Accuracy to Proper Scoring Rules: Elicitability of Probabilistic Top List Predictions. (arXiv:2301.11797v1 [stat.ME])
    In the face of uncertainty, the need for probabilistic assessments has long been recognized in the literature on forecasting. In classification, however, comparative evaluation of classifiers often focuses on predictions specifying a single class through the use of simple accuracy measures, which disregard any probabilistic uncertainty quantification. I propose probabilistic top lists as a novel type of prediction in classification, which bridges the gap between single-class predictions and predictive distributions. The probabilistic top list functional is elicitable through the use of strictly consistent evaluation metrics. The proposed evaluation metrics are based on symmetric proper scoring rules and admit comparison of various types of predictions ranging from single-class point predictions to fully specified predictive distributions. The Brier score yields a metric that is particularly well suited for this kind of comparison.  ( 2 min )
    Understanding Incremental Learning of Gradient Descent: A Fine-grained Analysis of Matrix Sensing. (arXiv:2301.11500v1 [cs.LG])
    It is believed that Gradient Descent (GD) induces an implicit bias towards good generalization in training machine learning models. This paper provides a fine-grained analysis of the dynamics of GD for the matrix sensing problem, whose goal is to recover a low-rank ground-truth matrix from near-isotropic linear measurements. It is shown that GD with small initialization behaves similarly to the greedy low-rank learning heuristics (Li et al., 2020) and follows an incremental learning procedure (Gissin et al., 2019): GD sequentially learns solutions with increasing ranks until it recovers the ground truth matrix. Compared to existing works which only analyze the first learning phase for rank-1 solutions, our result provides characterizations for the whole learning process. Moreover, besides the over-parameterized regime that many prior works focused on, our analysis of the incremental learning procedure also applies to the under-parameterized regime. Finally, we conduct numerical experiments to confirm our theoretical findings.  ( 2 min )
    DBGSL: Dynamic Brain Graph Structure Learning. (arXiv:2209.13513v2 [cs.LG] UPDATED)
    Recently, graph neural networks (GNNs) have shown success at learning representations of brain graphs derived from functional magnetic resonance imaging (fMRI) data. The majority of existing GNN methods, however, assume brain graphs are static over time and the graph adjacency matrix is known prior to model training. These assumptions are at odds with neuroscientific evidence that brain graphs are time-varying with a connectivity structure that depends on the choice of functional connectivity measure. Noisy brain graphs that do not truly represent the underling fMRI data can have a detrimental impact on the performance of GNNs. As a solution, we propose Dynamic Brain Graph Structure Learning (DBGSL), a novel method for learning the optimal time-varying dependency structure of fMRI data induced by a downstream prediction task. Experiments demonstrate DBGSL achieves state-of-the-art performance for sex classification using real-world resting-state and task fMRI data. Moreover, analysis of the learnt dynamic graphs highlights prediction-related brain regions which align with existing neuroscience literature.  ( 2 min )
    Neural Additive Models for Location Scale and Shape: A Framework for Interpretable Neural Regression Beyond the Mean. (arXiv:2301.11862v1 [stat.ML])
    Deep neural networks (DNNs) have proven to be highly effective in a variety of tasks, making them the go-to method for problems requiring high-level predictive power. Despite this success, the inner workings of DNNs are often not transparent, making them difficult to interpret or understand. This lack of interpretability has led to increased research on inherently interpretable neural networks in recent years. Models such as Neural Additive Models (NAMs) achieve visual interpretability through the combination of classical statistical methods with DNNs. However, these approaches only concentrate on mean response predictions, leaving out other properties of the response distribution of the underlying data. We propose Neural Additive Models for Location Scale and Shape (NAMLSS), a modelling framework that combines the predictive power of classical deep learning models with the inherent advantages of distributional regression while maintaining the interpretability of additive models.  ( 2 min )
    Myriad: a real-world testbed to bridge trajectory optimization and deep learning. (arXiv:2202.10600v2 [cs.LG] UPDATED)
    We present Myriad, a testbed written in JAX for learning and planning in real-world continuous environments. The primary contributions of Myriad are threefold. First, Myriad provides machine learning practitioners access to trajectory optimization techniques for application within a typical automatic differentiation workflow. Second, Myriad presents many real-world optimal control problems, ranging from biology to medicine to engineering, for use by the machine learning community. Formulated in continuous space and time, these environments retain some of the complexity of real-world systems often abstracted away by standard benchmarks. As such, Myriad strives to serve as a stepping stone towards application of modern machine learning techniques for impactful real-world tasks. Finally, we use the Myriad repository to showcase a novel approach for learning and control tasks. Trained in a fully end-to-end fashion, our model leverages an implicit planning module over neural ordinary differential equations, enabling simultaneous learning and planning with complex environment dynamics.  ( 2 min )
    Artificial Replay: A Meta-Algorithm for Harnessing Historical Data in Bandits. (arXiv:2210.00025v2 [cs.LG] UPDATED)
    How best to incorporate historical data to "warm start" bandit algorithms is an open question: naively initializing reward estimates using all historical samples can suffer from spurious data and imbalanced data coverage, leading to computational and storage issues $\unicode{x2014}$ particularly salient in continuous action spaces. We propose Artificial Replay, a meta-algorithm for incorporating historical data into any arbitrary base bandit algorithm. Artificial Replay uses only a fraction of the historical data compared to a full warm-start approach, while still achieving identical regret for base algorithms that satisfy independence of irrelevant data (IIData), a novel and broadly applicable property that we introduce. We complement these theoretical results with experiments on $K$-armed and continuous combinatorial bandit algorithms, including a green security domain using real poaching data. We show the practical benefits of Artificial Replay, including for base algorithms that do not satisfy IIData.  ( 2 min )
    Overparameterized Linear Regression under Adversarial Attacks. (arXiv:2204.06274v2 [stat.ML] UPDATED)
    We study the error of linear regression in the face of adversarial attacks. In this framework, an adversary changes the input to the regression model in order to maximize the prediction error. We provide bounds on the prediction error in the presence of an adversary as a function of the parameter norm and the error in the absence of such an adversary. We show how these bounds make it possible to study the adversarial error using analysis from non-adversarial setups. The obtained results shed light on the robustness of overparameterized linear models to adversarial attacks. Adding features might be either a source of additional robustness or brittleness. On the one hand, we use asymptotic results to illustrate how double-descent curves can be obtained for the adversarial error. On the other hand, we derive conditions under which the adversarial error can grow to infinity as more features are added, while at the same time, the test error goes to zero. We show this behavior is caused by the fact that the norm of the parameter vector grows with the number of features. It is also established that $\ell_\infty$ and $\ell_2$-adversarial attacks might behave fundamentally differently due to how the $\ell_1$ and $\ell_2$-norms of random projections concentrate. We also show how our reformulation allows for solving adversarial training as a convex optimization problem. This fact is then exploited to establish similarities between adversarial training and parameter-shrinking methods and to study how the training might affect the robustness of the estimated models.  ( 2 min )
    ActiveLab: Active Learning with Re-Labeling by Multiple Annotators. (arXiv:2301.11856v1 [cs.LG])
    In real-world data labeling applications, annotators often provide imperfect labels. It is thus common to employ multiple annotators to label data with some overlap between their examples. We study active learning in such settings, aiming to train an accurate classifier by collecting a dataset with the fewest total annotations. Here we propose ActiveLab, a practical method to decide what to label next that works with any classifier model and can be used in pool-based batch active learning with one or multiple annotators. ActiveLab automatically estimates when it is more informative to re-label examples vs. labeling entirely new ones. This is a key aspect of producing high quality labels and trained models within a limited annotation budget. In experiments on image and tabular data, ActiveLab reliably trains more accurate classifiers with far fewer annotations than a wide variety of popular active learning methods.  ( 2 min )
    Exponential tail bounds and Large Deviation Principle for Heavy-Tailed U-Statistics. (arXiv:2301.11563v1 [math.PR])
    We study deviation of U-statistics when samples have heavy-tailed distribution so the kernel of the U-statistic does not have bounded exponential moments at any positive point. We obtain an exponential upper bound for the tail of the U-statistics which clearly denotes two regions of tail decay, the first is a Gaussian decay and the second behaves like the tail of the kernel. For several common U-statistics, we also show the upper bound has the right rate of decay as well as sharp constants by obtaining rough logarithmic limits which in turn can be used to develop LDP for U-statistics. In spite of usual LDP results in the literature, processes we consider in this work have LDP speed slower than their sample size $n$.  ( 2 min )
    Leveraging the Third Dimension in Contrastive Learning. (arXiv:2301.11790v1 [cs.CV])
    Self-Supervised Learning (SSL) methods operate on unlabeled data to learn robust representations useful for downstream tasks. Most SSL methods rely on augmentations obtained by transforming the 2D image pixel map. These augmentations ignore the fact that biological vision takes place in an immersive three-dimensional, temporally contiguous environment, and that low-level biological vision relies heavily on depth cues. Using a signal provided by a pretrained state-of-the-art monocular RGB-to-depth model (the \emph{Depth Prediction Transformer}, Ranftl et al., 2021), we explore two distinct approaches to incorporating depth signals into the SSL framework. First, we evaluate contrastive learning using an RGB+depth input representation. Second, we use the depth signal to generate novel views from slightly different camera positions, thereby producing a 3D augmentation for contrastive learning. We evaluate these two approaches on three different SSL methods -- BYOL, SimSiam, and SwAV -- using ImageNette (10 class subset of ImageNet), ImageNet-100 and ImageNet-1k datasets. We find that both approaches to incorporating depth signals improve the robustness and generalization of the baseline SSL methods, though the first approach (with depth-channel concatenation) is superior. For instance, BYOL with the additional depth channel leads to an increase in downstream classification accuracy from 85.3\% to 88.0\% on ImageNette and 84.1\% to 87.0\% on ImageNet-C.  ( 2 min )
    Variance, Self-Consistency, and Arbitrariness in Fair Classification. (arXiv:2301.11562v1 [cs.LG])
    In fair classification, it is common to train a model, and to compare and correct subgroup-specific error rates for disparities. However, even if a model's classification decisions satisfy a fairness metric, it is not necessarily the case that these decisions are equally confident. This becomes clear if we measure variance: We can fix everything in the learning process except the subset of training data, train multiple models, measure (dis)agreement in predictions for each test example, and interpret disagreement to mean that the learning process is more unstable with respect to its classification decision. Empirically, some decisions can in fact be so unstable that they are effectively arbitrary. To reduce this arbitrariness, we formalize a notion of self-consistency of a learning process, develop an ensembling algorithm that provably increases self-consistency, and empirically demonstrate its utility to often improve both fairness and accuracy. Further, our evaluation reveals a startling observation: Applying ensembling to common fair classification benchmarks can significantly reduce subgroup error rate disparities, without employing common pre-, in-, or post-processing fairness interventions. Taken together, our results indicate that variance, particularly on small datasets, can muddle the reliability of conclusions about fairness. One solution is to develop larger benchmark tasks. To this end, we release a toolkit that makes the Home Mortgage Disclosure Act datasets easily usable for future research.  ( 2 min )
    Causal Bandits without Graph Learning. (arXiv:2301.11401v1 [stat.ML])
    We study the causal bandit problem when the causal graph is unknown and develop an efficient algorithm for finding the parent node of the reward node using atomic interventions. We derive the exact equation for the expected number of interventions performed by the algorithm and show that under certain graphical conditions it could perform either logarithmically fast or, under more general assumptions, slower but still sublinearly in the number of variables. We formally show that our algorithm is optimal as it meets the universal lower bound we establish for any algorithm that performs atomic interventions. Finally, we extend our algorithm to the case when the reward node has multiple parents. Using this algorithm together with a standard algorithm from bandit literature leads to improved regret bounds.  ( 2 min )
    When Do Flat Minima Optimizers Work?. (arXiv:2202.00661v5 [cs.LG] UPDATED)
    Recently, flat-minima optimizers, which seek to find parameters in low-loss neighborhoods, have been shown to improve a neural network's generalization performance over stochastic and adaptive gradient-based optimizers. Two methods have received significant attention due to their scalability: 1. Stochastic Weight Averaging (SWA), and 2. Sharpness-Aware Minimization (SAM). However, there has been limited investigation into their properties and no systematic benchmarking of them across different domains. We fill this gap here by comparing the loss surfaces of the models trained with each method and through broad benchmarking across computer vision, natural language processing, and graph representation learning tasks. We discover several surprising findings from these results, which we hope will help researchers further improve deep learning optimizers, and practitioners identify the right optimizer for their problem.  ( 2 min )
    Multi-dimensional concept discovery (MCD): A unifying framework with completeness guarantees. (arXiv:2301.11911v1 [cs.LG])
    The completeness axiom renders the explanation of a post-hoc XAI method only locally faithful to the model, i.e. for a single decision. For the trustworthy application of XAI, in particular for high-stake decisions, a more global model understanding is required. Recently, concept-based methods have been proposed, which are however not guaranteed to be bound to the actual model reasoning. To circumvent this problem, we propose Multi-dimensional Concept Discovery (MCD) as an extension of previous approaches that fulfills a completeness relation on the level of concepts. Our method starts from general linear subspaces as concepts and does neither require reinforcing concept interpretability nor re-training of model parts. We propose sparse subspace clustering to discover improved concepts and fully leverage the potential of multi-dimensional subspaces. MCD offers two complementary analysis tools for concepts in input space: (1) concept activation maps, that show where a concept is expressed within a sample, allowing for concept characterization through prototypical samples, and (2) concept relevance heatmaps, that decompose the model decision into concept contributions. Both tools together enable a detailed understanding of the model reasoning, which is guaranteed to relate to the model via a completeness relation. This paves the way towards more trustworthy concept-based XAI. We empirically demonstrate the superiority of MCD against more constrained concept definitions.  ( 2 min )
    Convolutional neural networks for valid and efficient causal inference. (arXiv:2301.11732v1 [stat.ML])
    Convolutional neural networks (CNN) have been successful in machine learning applications. Their success relies on their ability to consider space invariant local features. We consider the use of CNN to fit nuisance models in semiparametric estimation of the average causal effect of a treatment. In this setting, nuisance models are functions of pre-treatment covariates that need to be controlled for. In an application where we want to estimate the effect of early retirement on a health outcome, we propose to use CNN to control for time-structured covariates. Thus, CNN is used when fitting nuisance models explaining the treatment and the outcome. These fits are then combined into an augmented inverse probability weighting estimator yielding efficient and uniformly valid inference. Theoretically, we contribute by providing rates of convergence for CNN equipped with the rectified linear unit activation function and compare it to an existing result for feedforward neural networks. We also show when those rates guarantee uniformly valid inference. A Monte Carlo study is provided where the performance of the proposed estimator is evaluated and compared with other strategies. Finally, we give results on a study of the effect of early retirement on hospitalization using data covering the whole Swedish population.  ( 2 min )
    MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields. (arXiv:2206.07697v2 [stat.ML] UPDATED)
    Creating fast and accurate force fields is a long-standing challenge in computational chemistry and materials science. Recently, several equivariant message passing neural networks (MPNNs) have been shown to outperform models built using other approaches in terms of accuracy. However, most MPNNs suffer from high computational cost and poor scalability. We propose that these limitations arise because MPNNs only pass two-body messages leading to a direct relationship between the number of layers and the expressivity of the network. In this work, we introduce MACE, a new equivariant MPNN model that uses higher body order messages. In particular, we show that using four-body messages reduces the required number of message passing iterations to just two, resulting in a fast and highly parallelizable model, reaching or exceeding state-of-the-art accuracy on the rMD17, 3BPA, and AcAc benchmark tasks. We also demonstrate that using higher order messages leads to an improved steepness of the learning curves.  ( 2 min )
    Distributionally Robust Multi-objective Bayesian Optimization under Uncertain Environments. (arXiv:2301.11588v1 [stat.ML])
    In this study, we address the problem of optimizing multi-output black-box functions under uncertain environments. We formulate this problem as the estimation of the uncertain Pareto-frontier (PF) of a multi-output Bayesian surrogate model with two types of variables: design variables and environmental variables. We consider this problem within the context of Bayesian optimization (BO) under uncertain environments, where the design variables are controllable, whereas the environmental variables are assumed to be random and not controllable. The challenge of this problem is to robustly estimate the PF when the distribution of the environmental variables is unknown, that is, to estimate the PF when the environmental variables are generated from the worst possible distribution. We propose a method for solving the BO problem by appropriately incorporating the uncertainties of the environmental variables and their probability distribution. We demonstrate that the proposed method can find an arbitrarily accurate PF with high probability in a finite number of iterations. We also evaluate the performance of the proposed method through numerical experiments.  ( 2 min )
    Distilling Importance Sampling for Likelihood Free Inference. (arXiv:1910.03632v6 [stat.CO] UPDATED)
    Likelihood-free inference involves inferring parameter values given observed data and a simulator model. The simulator is computer code which takes parameters, performs stochastic calculations, and outputs simulated data. In this work, we view the simulator as a function whose inputs are (1) the parameters and (2) a vector of pseudo-random draws. We attempt to infer all these inputs conditional on the observations. This is challenging as the resulting posterior can be high dimensional and involve strong dependence. We approximate the posterior using normalizing flows, a flexible parametric family of densities. Training data is generated by likelihood-free importance sampling with a large bandwidth value epsilon, which makes the target similar to the prior. The training data is "distilled" by using it to train an updated normalizing flow. The process is iterated, using the updated flow as the importance sampling proposal, and slowly reducing epsilon so the target becomes closer to the posterior. Unlike most other likelihood-free methods, we avoid the need to reduce data to low dimensional summary statistics, and hence can achieve more accurate results. We illustrate our method in two challenging examples, on queuing and epidemiology.  ( 2 min )
    Multilayer hypergraph clustering using the aggregate similarity matrix. (arXiv:2301.11657v1 [math.ST])
    We consider the community recovery problem on a multilayer variant of the hypergraph stochastic block model (HSBM). Each layer is associated with an independent realization of a d-uniform HSBM on N vertices. Given the aggregated number of hyperedges incident to each pair of vertices, represented using a similarity matrix, the goal is to obtain a partition of the N vertices into disjoint communities. In this work, we investigate a semidefinite programming (SDP) approach and obtain information-theoretic conditions on the model parameters that guarantee exact recovery both in the assortative and the disassortative cases.  ( 2 min )
    Machine Learning Approach and Extreme Value Theory to Correlated Stochastic Time Series with Application to Tree Ring Data. (arXiv:2301.11488v1 [stat.ML])
    The main goal of machine learning (ML) is to study and improve mathematical models which can be trained with data provided by the environment to infer the future and to make decisions without necessarily having complete knowledge of all influencing elements. In this work, we describe how ML can be a powerful tool in studying climate modeling. Tree ring growth was used as an implementation in different aspects, for example, studying the history of buildings and environment. By growing and via the time, a new layer of wood to beneath its bark by the tree. After years of growing, time series can be applied via a sequence of tree ring widths. The purpose of this paper is to use ML algorithms and Extreme Value Theory in order to analyse a set of tree ring widths data from nine trees growing in Nottinghamshire. Initially, we start by exploring the data through a variety of descriptive statistical approaches. Transforming data is important at this stage to find out any problem in modelling algorithm. We then use algorithm tuning and ensemble methods to improve the k-nearest neighbors (KNN) algorithm. A comparison between the developed method in this study ad other methods are applied. Also, extreme value of the dataset will be more investigated. The results of the analysis study show that the ML algorithms in the Random Forest method would give accurate results in the analysis of tree ring widths data from nine trees growing in Nottinghamshire with the lowest Root Mean Square Error value. Also, we notice that as the assumed ARMA model parameters increased, the probability of selecting the true model also increased. In terms of the Extreme Value Theory, the Weibull distribution would be a good choice to model tree ring data.  ( 2 min )
    LegendreTron: Uprising Proper Multiclass Loss Learning. (arXiv:2301.11695v1 [stat.ML])
    Loss functions serve as the foundation of supervised learning and are often chosen prior to model development. To avoid potentially ad hoc choices of losses, statistical decision theory describes a desirable property for losses known as \emph{properness}, which asserts that Bayes' rule is optimal. Recent works have sought to \emph{learn losses} and models jointly. Existing methods do this by fitting an inverse canonical link function which monotonically maps $\mathbb{R}$ to $[0,1]$ to estimate probabilities for binary problems. In this paper, we extend monotonicity to maps between $\mathbb{R}^{C-1}$ and the projected probability simplex $\tilde{\Delta}^{C-1}$ by using monotonicity of gradients of convex functions. We present {\sc LegendreTron} as a novel and practical method that jointly learns \emph{proper canonical losses} and probabilities for multiclass problems. Tested on a benchmark of domains with up to 1,000 classes, our experimental results show that our method consistently outperforms the natural multiclass baseline under a $t$-test at 99% significance on all datasets with greater than 10 classes.  ( 2 min )
    Rigid body flows for sampling molecular crystal structures. (arXiv:2301.11355v1 [cs.LG])
    Normalizing flows (NF) are a class of powerful generative models that have gained popularity in recent years due to their ability to model complex distributions with high flexibility and expressiveness. In this work, we introduce a new type of normalizing flow that is tailored for modeling positions and orientations of multiple objects in three-dimensional space, such as molecules in a crystal. Our approach is based on two key ideas: first, we define smooth and expressive flows on the group of unit quaternions, which allows us to capture the continuous rotational motion of rigid bodies; second, we use the double cover property of unit quaternions to define a proper density on the rotation group. This ensures that our model can be trained using standard likelihood-based methods or variational inference with respect to a thermodynamic target density. We evaluate the method by training Boltzmann generators for two molecular examples, namely the multi-modal density of a tetrahedral system in an external field and the ice XI phase in the TIP4P-Ew water model. Our flows can be combined with flows operating on the internal degrees of freedom of molecules, and constitute an important step towards the modeling of distributions of many interacting molecules.  ( 2 min )
    Optimally-Weighted Estimators of the Maximum Mean Discrepancy for Likelihood-Free Inference. (arXiv:2301.11674v1 [stat.ME])
    Likelihood-free inference methods typically make use of a distance between simulated and real data. A common example is the maximum mean discrepancy (MMD), which has previously been used for approximate Bayesian computation, minimum distance estimation, generalised Bayesian inference, and within the nonparametric learning framework. The MMD is commonly estimated at a root-$m$ rate, where $m$ is the number of simulated samples. This can lead to significant computational challenges since a large $m$ is required to obtain an accurate estimate, which is crucial for parameter estimation. In this paper, we propose a novel estimator for the MMD with significantly improved sample complexity. The estimator is particularly well suited for computationally expensive smooth simulators with low- to mid-dimensional inputs. This claim is supported through both theoretical results and an extensive simulation study on benchmark simulators.  ( 2 min )
    Neural networks learn to magnify areas near decision boundaries. (arXiv:2301.11375v1 [cs.LG])
    We study how training molds the Riemannian geometry induced by neural network feature maps. At infinite width, neural networks with random parameters induce highly symmetric metrics on input space. Feature learning in networks trained to perform classification tasks magnifies local areas along decision boundaries. These changes are consistent with previously proposed geometric approaches for hand-tuning of kernel methods to improve generalization.  ( 2 min )
    DBGDGM: Dynamic Brain Graph Deep Generative Model. (arXiv:2301.11408v1 [cs.LG])
    Graphs are a natural representation of brain activity derived from functional magnetic imaging (fMRI) data. It is well known that clusters of anatomical brain regions, known as functional connectivity networks (FCNs), encode temporal relationships which can serve as useful biomarkers for understanding brain function and dysfunction. Previous works, however, ignore the temporal dynamics of the brain and focus on static graphs. In this paper, we propose a dynamic brain graph deep generative model (DBGDGM) which simultaneously clusters brain regions into temporally evolving communities and learns dynamic unsupervised node embeddings. Specifically, DBGDGM represents brain graph nodes as embeddings sampled from a distribution over communities that evolve over time. We parameterise this community distribution using neural networks that learn from subject and node embeddings as well as past community assignments. Experiments demonstrate DBGDGM outperforms baselines in graph generation, dynamic link prediction, and is comparable for graph classification. Finally, an analysis of the learnt community distributions reveals overlap with known FCNs reported in neuroscience literature.  ( 2 min )
    Estimating Causal Effects using a Multi-task Deep Ensemble. (arXiv:2301.11351v1 [cs.LG])
    Over the past few decades, a number of methods have been proposed for causal effect estimation, yet few have been demonstrated to be effective in handling data with complex structures, such as images. To fill this gap, we propose a Causal Multi-task Deep Ensemble (CMDE) framework to learn both shared and group-specific information from the study population and prove its equivalence to a multi-task Gaussian process (GP) with coregionalization kernel a priori. Compared to multi-task GP, CMDE efficiently handles high-dimensional and multi-modal covariates and provides pointwise uncertainty estimates of causal effects. We evaluate our method across various types of datasets and tasks and find that CMDE outperforms state-of-the-art methods on a majority of these tasks.  ( 2 min )
    Robust variance-regularized risk minimization with concomitant scaling. (arXiv:2301.11584v1 [stat.ML])
    Under losses which are potentially heavy-tailed, we consider the task of minimizing sums of the loss mean and standard deviation, without trying to accurately estimate the variance. By modifying a technique for variance-free robust mean estimation to fit our problem setting, we derive a simple learning procedure which can be easily combined with standard gradient-based solvers to be used in traditional machine learning workflows. Empirically, we verify that our proposed approach, despite its simplicity, performs as well or better than even the best-performing candidates derived from alternative criteria such as CVaR or DRO risks on a variety of datasets.  ( 2 min )

  • Open

    College classes you guys recommend, to help me build and use neural networks.
    For some context, I plan on eventually majoring in neuroscience. I am enrolling in a local state college soon. I want to get my associates out of the way, and then transferring to a college with a good neuroscience program. It seems like neural networks, are an exciting avenue for research. There's a lot more neuroscience related research being published using neural networks. What courses would you guys recommend me enrolling in? I feel like the obvious one is software development, but maybe there are some others as well. I'd appreciate any insight, thanks! submitted by /u/daddydilly694-20 [link] [comments]  ( 41 min )
  • Open

    Weekly China AI News: China's Master Plan for Robots; Robots Transform into Liquid to Escape Jail; Li Auto's Goal to Become AI Leader by 2030
    submitted by /u/trcytony [link] [comments]  ( 40 min )
    A Video Made using CHATGPT + MidJourney and Text to Speech
    Hey All! I wrote this story and would appreciate your support on my channel and reviews. You can also give me stories for the channel. Thank You for your time :) https://youtu.be/-aM7cSbFFFY The sky was a deep shade of black as the group of friends made their way to the cabin, the only light coming from the beams of their headlights piercing through the heavy rain. The cabin, located deep in the woods, was said to be haunted, but the friends brushed off the rumors as nothing more than tales meant to scare. As they settled in for the night, they laughed and joked, telling ghost stories and watching horror movies. They passed jokes about how they would run if they saw a ghost, unaware of the danger lurking outside. But their laughter was cut short when Dave stumbled across a letter in the…  ( 46 min )
    AI Dream 150 - MINDBLOW MONDAY - AI Video - FINAL MASTERPIECE
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    The Year of AI Breakthroughs 2022
    submitted by /u/pmz [link] [comments]  ( 40 min )
    What media format will ChatGPT and AI bring back that was previously obsolete?
    I'v been thinking a lot about Marshall McLuhan and his 4 laws of media. Specifically, the one that states that all new forms of media cause something to be retrieved from the past. What will ChatGPT and AI revive and retrieve? I put some more thoughts in my blog. Would love to hear your thoughts on it. https://bobhutchins.substack.com/p/what-media-format-will-chatgpt-and submitted by /u/Interesting_Status64 [link] [comments]  ( 41 min )
    Easy-To-Use Voice Models for Fun Round of Trivia
    I'm putting together a trivia night, and one idea I have for a round is "The Pen*s Game" where I replace a word from a famous movie quote with the word "pen*s". I could simply replace it with my own voice, but I don't think it would sound as good/be as funny, so I think it would be amazing to instead use a tool (presumably AI-powered) to achieve this. Is there a reasonably user-friendly way of doing this?Ideally it would be some magical tool that has a nice GUI and celebrity voices already trained, though I doubt this exists... I am well-versed in Python, and am always looking to learn new tools, but I don't have the time to scrape voice data for each actor and train a model (not that I even know where I'd start on that front). As an example, imagine this clip from Shrek except "parfait" is replaced with "pen*s". https://youtu.be/-FtCTW2rVFM?t=86 I don't know this threads rules, so I censor "pen*s" just in case. Any suggestions would be appreciated! Thank you. submitted by /u/wendeborn8 [link] [comments]  ( 41 min )
    AI APIs Implementation - Stack and recommendations
    Hi everyone ! I had this weird idea lately, with the rise of AI to the mainstream, to implement some kind of "Simulator", where one could input a webcam picture, some basic info through a form, & get an alternate small narrated video narrated of one's future. I'm looking for suggestions for open source AI APIs I could use for it. My idea, very simplistically put, is to combine them through input, outputs and some basic logic manipulation. I'm also looking for recommendations for the stack I could use. I've worked with NodeJS and Java w/ Spring Boot on backend, & Angular for frontend. Any recommendations based in the APIs? submitted by /u/WhereIsBryan [link] [comments]  ( 41 min )
    RecolorNeRF is like a basic Photoshop for NeRFs
    submitted by /u/Number_5_alive [link] [comments]  ( 40 min )
    Google's AI Tool "MusicLM" creates Music based on text descriptions
    submitted by /u/qptbook [link] [comments]  ( 40 min )
    Which AI chat apps can have phone conversations with me?
    Hi. I tried to voice chat with Kuki on Telegram but she still doesn't have this ability. Replika has voice calls. Do you know other AI apps that can already speak using voice? submitted by /u/Trainer_Red99 [link] [comments]  ( 40 min )
    You against the machine: Can you spot which art was created by A.I.?
    submitted by /u/robbinpetertopaypaul [link] [comments]  ( 41 min )
    📌[Searchcolab] "8K Nature" Links in comment
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    With all of this heated arguing over AI, It's time for a more realistic, balanced analysis. As a filmmaker/writer and tech entrepreneur, here's a perspective that hardly anyone is considering and perhaps one that could help bridge the gap between the haters and lovers of AI.
    submitted by /u/CyborgWriter [link] [comments]  ( 41 min )
    📌[Searchcolab] Generative AI is climbing the *Dimensional Ladder*. I made a figure to show the milestones!
    1D: MusicLM, VALL-E 2D: Stable Diffusion, DALL-E, MidJourney 3D (or 2+1D): Imagen-video, Phenaki 3D: Magic3D, DreamFusion, Point-E 4D (or 3+1D): Make-A-Video-3D [Searchcolab] What’s next? 🤔 https://reddit.com/link/10p6vw9/video/gqbnrsaxh7fa1/player submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 41 min )
    Is there a AI that you can feed books into, so it learns, then communicates using that info?
    Similar to chatgpt, etc. but it doesn't have all the knowledge, but just from a very special field. E.g. if you would like to create an AI for kids, you feed 1000 kids books into it, and it would respond exclusively in a child-friendly or goofy way. It's simply not possible that it responds in any other way, because it doesn't know about the rest of the world. submitted by /u/EndlessSenseless [link] [comments]  ( 41 min )
    AI Has Successfully Imitated Human Evolution—and Might Do It Even Better
    submitted by /u/Itchy0101 [link] [comments]  ( 40 min )
    Google AI music samples/copywrite problems.
    https://ainewsbase.com/google-musiclm-copyright-issues-not-releasing/ The samples they do show might just sound weird because of the stored file or whatever but the sound definitely sounds kinda weird. submitted by /u/SPEEDYFISHY2000 [link] [comments]  ( 40 min )
    3 Books About Artificial Intelligence Everyone Should Read
    I spend about 5 hours a day reading about AI and the news is a lot to keep up with - however there's a lot of big picture AI information that is a good foundation for understanding the future implications of AI. I think these three books offer a good variety of perspective from the philosophical, business, and strategy side of AI. Philosophy: Superintelligence by Nick Bostrom - This S-tier book explores the potential risks and benefits of creating artificial intelligence beyond human levels. The book argues that if the development of superintelligence proceeds rapidly, it could pose an existential threat to humanity. Bostrom paints an incredibly articulate perspective on why we must consider the possibility of creating AI in a way that aligns with human values, and that we must prepare …  ( 44 min )
    Report Says AI Could Potentially Replace 85 Million Jobs Worldwide By 2025 — Are Interns On The List?
    submitted by /u/Mental_Character7367 [link] [comments]  ( 41 min )
    New Artificial Intelligence Subreddit?
    I suggest the need for a new subreddit, where people can announce their creations. You know: "I've created an AI to make s'mores..." "I've created an AI to sort my comic books..." "I've created an AI to find me a girlfriend..." ...ad nauseum. submitted by /u/PredictorX1 [link] [comments]  ( 40 min )
    AI and construction industry
    Hello guys, Have you found any application of AI in the construction industry (notwithstanding the design/modelization)? I've been following the industry for ages and I believe there is so much to be done particularly with AI as it can manage dependancies.. I've quite a few ideas too; get in touch if you are interested to discuss over this :) submitted by /u/MexsEU [link] [comments]  ( 41 min )
    “Nothing Forever”, — AI-generated, always streaming parody of ‘90s sitcoms
    submitted by /u/tinylobsta [link] [comments]  ( 41 min )
    ChatGPT Surpasses Instagram With 10 Million Daily Users In Just 40 Days
    submitted by /u/liquidocelotYT [link] [comments]  ( 42 min )
    Vector animals bundle
    submitted by /u/annal201 [link] [comments]  ( 40 min )
    AI is becoming a commodity
    submitted by /u/_utisz_ [link] [comments]  ( 40 min )
  • Open

    [D] Is the YoloR paper worth looking into?
    Doing a survey of object detection papers with plausible application to pose-estimation tasks. Came across the paper "You Only Learn One Representation" and, while the theory seems interesting, I want to hear people's opinions before doing a deep dive into the theory. submitted by /u/answersareallyouneed [link] [comments]  ( 42 min )
    [D] Towards A Token-Free Future In NLP
    https://peltarion.com/blog/data-science/towards-a-token-free-future-in-nlp submitted by /u/EducationalCicada [link] [comments]  ( 42 min )
    [P] I launched “CatchGPT”, a supervised model trained with millions of text examples, to detect GPT created content
    I’m an ML Engineer at Hive AI and I’ve been working on a ChatGPT Detector. Here is a free demo we have up: https://hivemoderation.com/ai-generated-content-detection From our benchmarks it’s significantly better than similar solutions like GPTZero and OpenAI’s GPT2 Output Detector. On our internal datasets, we’re seeing balanced accuracies of >99% for our own model compared to around 60% for GPTZero and 84% for OpenAI’s GPT2 Detector. Feel free to try it out and let us know if you have any feedback! submitted by /u/qthai912 [link] [comments]  ( 56 min )
    [D] Are there neural net plugins to assist audio editing of Youtube screencasts?
    In order to improve my talking skills, I am doing a little series on how to setup Stable Diffusion on Paperspace, and I am astounded how much time it takes to do the audio editing. Well, part of the reason is that I've only been doing this for 3 days and my process is very inefficient, but it feels that in the current time, neural nets should be able to do things like remove uhms, lip smacking and breath intakes. I've looked around, and this post from 9 years ago says the only choice is to edit it by hand. Is that still true? submitted by /u/abstractcontrol [link] [comments]  ( 43 min )
    [D] What's stopping you from working on speech and voice?
    I've been working in the speech and voice space for a while now and am now building out some tooling in the space to make it easier for researchers/engineers/developers to build speech processing systems and features; I'd love to hear what people in ML struggle with when you're trying to build or work with speech processing for your projects/products (beyond speech-to-text APIs) submitted by /u/jiamengial [link] [comments]  ( 47 min )
    [D] DL university research PC suggestions?
    I am a researcher at a US university and have a budget of 25k to build a PC for training various ML algorithms (e.g. DRL, neuromorphic computing, VAE, etc). I'm trying to decide between going for prebuilds (like https://lambdalabs.com/gpu-workstations/vector) or building with consumer cards like 4090s. Any advice on which is the most bang for the price? Im not sure how much Im giving up by going for consumer 24g cards vs a6000, 6000 ada but prebuild prices go up quick. Warrantee vs building it myself isn't an issue submitted by /u/seanrescs [link] [comments]  ( 44 min )
    [R] Parsel: A (De-)compositional Framework for Algorithmic Reasoning with Language Models - Stanford University Eric Zelikman et al - Beats prior code generation sota by over 75%!
    Paper: https://arxiv.org/abs/2212.10561 Github: https://github.com/ezelikman/parsel Twitter: https://twitter.com/ericzelikman/status/1618426056163356675?s=20 Website: https://zelikman.me/parselpaper/ Code Generation on APPS Leaderboard: https://paperswithcode.com/sota/code-generation-on-apps Abstract: Despite recent success in large language model (LLM) reasoning, LLMs struggle with hierarchical multi-step reasoning tasks like generating complex programs. For these tasks, humans often start with a high-level algorithmic design and implement each part gradually. We introduce Parsel, a framework enabling automatic implementation and validation of complex algorithms with code LLMs, taking hierarchical function descriptions in natural language as input. We show that Parsel can be …  ( 45 min )
    [P] Keras model production deployment
    Hi guys. It's been some time since I started developing my Keras models, but now is the first time I am trying to push it to production. My Keras model looks like this: model = Sequential() model.add(Bidirectional(LSTM(256, return_sequences=True))) model.add(Bidirectional(LSTM(256, return_sequences=True))) model.add(TimeDistributed(Dense(1, activation='sigmoid'))) model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) My problem is I need to run through about 25 of these for every written sentence. There is going to be an online editor, where users can paste text for my analysis. That means up to about 300 words or about 20 sentences at once. With the current time to run each network (about 0.2s), that means 25 * 0,2 * 20 or about 100s per user input. I am going for 30 seconds at most with potentially dozens of users at once. Ideally on a Raspberry Pi 4. The internet is surely gonna back me up I thought to myself and started googling. If only I know what kind of a rabbit hole I was about to fall into. First I converted my Keras model into a TensorFlow frozen graph model. 10x time improvement on CPU, but still at 0.2s on average. Another thing I think may boost the performance is retraining the models for variable input shape (currently I always feed in 50 values). With the average sentence size of 16 words this may, from what I understand, lead to a 3 times boost? My question is: now what? What can I do to make it faster? Is it even possible to run it on a Raspberry Pi 4 and get reasonable response times? If not, what is my best option on a tight budget? submitted by /u/ProfessionalOverall8 [link] [comments]  ( 45 min )
    [D] I want to understand the broad steps for building something like Adept.AI
    From the given link!, I gather that it is a large-scale Transformer trained to use digital tools like a web browser. Right now, it’s hooked up to a Chrome extension which allows it to observe what’s happening in the browser and take certain actions, like clicking, typing, and scrolling, etc. I am interested in knowing the broad steps involved in building something like this. submitted by /u/smred123 [link] [comments]  ( 43 min )
    [Discussion] ChatGPT and language understanding benchmarks
    The general consensus seems to be that large language models, and ChatGPT in particular, have a problem with accuracy and hallucination. As compared to what, is often unclear, but let's say as compared to other NLP methods of question answering, language understanding or as compared to Google Search. I haven't really been able to find any reliable sources documenting this accuracy problem, though. The SuperGLUE benchmark has GPT-3 ranked #24, not terrible, but outperformed by old models like T5, which seems odd. GLUE nothing. SQUAD nothing. So, I'm curious: Is there any benchmark or metric reflecting the seeming step-function made by ChatGPT that's got everyone so excited? I definitely feel like there's a difference between gpt-3 and chatGPT, but is it measurable or is it just vibes? Is there any metric showing ChatGPT's problem with fact hallucination and accuracy? Am I off the mark here looking at question-answering benchmarks as an assessment of LLMs? Thanks submitted by /u/mettle [link] [comments]  ( 46 min )
    [D]Are There Studies on text-davinci-003's Zero/Few-shot Performance on Various Academic Benchmarks?
    Has anyone come across studies on GPT3 text-davinci-003's zero/few-shot performance over various NLP benchmarks and how they compare to current SoTA? E.g GLUE, SuperGLUE and over more classic ones like CoNLL 2003 NER. I thought it would be pretty interesting to see how far zero/few-shot learning with LLM has progressed with RLHF and instruction tuning. Am surprised that nobody has done such a benchmark yet. submitted by /u/gamerx88 [link] [comments]  ( 42 min )
    [D] Sparse Ridge Regression
    Hi all! Given X ∈ ℝ Nx, Y ∈ ℝ Ny, β ∈ ℝ+, so W = YXT(XXT+βI)-1 (with the Moore–Penrose pseudoinverse) where A = YXT and B = XXT+βI. If we consider an arbitrary number of indices/units < Nx, and so we consider only some columns of matrix A and some columns and rows (crosses) of B. The rest of A and B are zeros. The approach above of sparsify A and B will break the ridge regression solution when W=AB-1? If yes, there are ways to avoid it? Many thanks! submitted by /u/antodima [link] [comments]  ( 43 min )
    [R] A Robust Hypothesis Test for Tree Ensemble Pruning
    I'm looking for help/feedback with this paper. Please let me know if the method is interesting and if there's ways to improve it! https://arxiv.org/abs/2301.10115 Abstract: Gradient boosted decision trees are some of the most popular algorithms in applied machine learning. They are a flexible and powerful tool that can robustly fit to any tabular dataset in a scalable and computationally efficient way. One of the most critical parameters to tune when fitting these models are the various penalty terms used to distinguish signal from noise in the current model. These penalties are effective in practice, but are lacking in robust theoretical justifications. In this paper we develop and present a novel theoretically justified hypothesis test of split quality for gradient boosted tree ensembles and demonstrate that using this method instead of the common penalty terms leads to a significant reduction in out of sample loss. Additionally, this method provides a theoretically well-justified stopping condition for the tree growing algorithm. We also present several innovative extensions to the method, opening the door for a wide variety of novel tree pruning algorithms. submitted by /u/asi_dm [link] [comments]  ( 43 min )
    [R] Train CIFAR10 in under 10 seconds on an A100 (new world record!)
    https://github.com/tysam-code/hlb-CIFAR10 submitted by /u/tysam_and_co [link] [comments]  ( 53 min )
  • Open

    Top Android App Development Trends in 2023
    The global mobile app development revenue is $526 billion, making it one of the most flourishing industries worldwide. Android controls 73% of the market share. So, if you have plans to build an Android app, there’s no better time. However, to ensure your Android app stands out, you must keep an eye on the latest… Read More »Top Android App Development Trends in 2023 The post Top Android App Development Trends in 2023 appeared first on Data Science Central.  ( 22 min )
    Enabling contextual computing in today’s enterprise information fabrics
    During the 1970s, Ethernet pioneer and 3Com Internet equipment company founder Bob Metcalfe was working on something called the “Data Reconfiguration Service” for the early Internet. “It was an effort to write a special purpose programming language to convert data formats, Metcalfe said during a 2021 OriginTrail.io panel session. “And the goal was so that… Read More »Enabling contextual computing in today’s enterprise information fabrics The post Enabling contextual computing in today’s enterprise information fabrics appeared first on Data Science Central.  ( 21 min )
  • Open

    Amazon SageMaker built-in LightGBM now offers distributed training using Dask
    Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning (ML) practitioners get started on training and deploying ML models quickly. You can use these algorithms and models for both supervised and unsupervised learning. They can process various types of input data, including tabular, […]  ( 12 min )
    Build a water consumption forecasting solution for a water utility agency using Amazon Forecast
    Amazon Forecast is a fully managed service that uses machine learning (ML) to generate highly accurate forecasts, without requiring any prior ML experience. Forecast is applicable in a wide variety of use cases, including estimating supply and demand for inventory management, travel demand forecasting, workforce planning, and computing cloud infrastructure usage. You can use Forecast […]  ( 10 min )
  • Open

    Good autocomplete
    I’m not sure whether automatic text completion on a mobile device is a net good. It sometimes saves a few taps, but it seems like it’s at least as likely to cause extra work. Although I’m ambivalent about autocomplete on my phone, I like it in my text editor. The difference is that in my […] Good autocomplete first appeared on John D. Cook.  ( 6 min )
  • Open

    Best recurrent RL library?
    Does anyone know which library has the best support for recurrent RL algorithms? It seems like many implement recurrent PPO and maybe one other recurrent algorithm and that's it. Most implementations treat recurrence as an afterthought, making it buggy and hard to extend. I'd like a library with first-class recurrent support for: DQN/Rainbow SAC PPO etc. Is anyone familiar with such a library? submitted by /u/smorad [link] [comments]  ( 42 min )
    Generative Meta-Learning for Robust Quality-Diversity Ensemble under Stochastic Rewards
    Gen-Meta is a learning-to-learn method for evolutionary illumination that is competitive against SotA methods in Nevergrad, with a much superior scalability for large-scale optimization. The key to out-of-sample robustness in portfolio optimization is quality-diversity optimization, where one aims to obtain multiple diverse solutions of high quality, rather than one. Generative meta-learning is the only portfolio optimization method that performs QD optimization to obtain a robust ensemble portfolio consisting of several de-correlated sub-portfolios. In the below image, the red line is the index to be tracked, and the blue line is the sparse portfolio ensembled from a thousand behaviorally-diverse sub-portfolios co-optimized (other lines). ​ Red Line: Tracked Index, Blue Line: Sparse Ensemble, Others: Diverse Subportfolios In Gen-Meta portfolio optimization, a Monte-Carlo optimization is performed over those portfolio candidates to reward each individual separately in randomly selected historical periods. To further optimize the portfolio robustness, the portfolio weights of the candidates are heavily corrupted first by adding noise and then dropping out the vast majority of their weights. I previously open-sourced the application of Gen-Meta in sparse index-tracking. Hence, I invite you to do your ablation study to see how each technique affects the out-of-sample robustness. The following repository includes comments on those critical techniques performed to obtain a robust ensemble from behaviorally-diverse high-quality portfolios co-optimized with Gen-Meta. The codes for Gen-Meta in sparse index-tracking The comparison in-between Gen-Meta & Nevergrad submitted by /u/k_yuksel [link] [comments]  ( 42 min )

  • Open

    We built a browser extension that unlocks browser mode capabilities using ChatGPT: MULTI·ON: AI Web Co-Pilot powered by ChatGPT
    submitted by /u/DragonLord9 [link] [comments]  ( 41 min )
    AI Dream 150 - MY HEAVENLY DREAM BY AI - Part6 TEASER - AI Video vqgan ...
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Just published my new edu-tainment video on A.I. Tried to make it funny but also drop some mind-blasting facts!
    Things the video covers: What is intelligence? What is A.I.? What is the best currently available and what are the benefits? How does it work? What are the downsides? The increasing speed of human technological advancement Why A.I. actually terrifies me! (Some scenarios) I hope you enjoy it! submitted by /u/casualbob_uk [link] [comments]  ( 41 min )
    AI will drastically reduce developer jobs.
    Farmers still exist today but they exist in drastically fewer numbers than two centuries ago. The modern farming machinery and techniques did not replace farmers but made the industry much less labor intensive. Nowadays programming is a labor intensive activity with relative high salaries. AI is introducing the possibility to do this activity, that worldwide cost companies billions of dollars in programmers salaries, much more efficiently. In my opinion, this is the goal of companies like OpenAI. They know that they can’t remove humans out of the loop because current AI is not able to substitute all human cognitive capabilities that intervine in a software developer daily job; like talking with the client, figuring out what he wants and translating it to functional requirements. But nonetheless, they think they have a clear shot to make programming a non labor intensive activity like farming is today. Of course, this is a compelling multibillion business opportunity that is attracting increasing capital from the tech and the financial sectors. submitted by /u/masterile [link] [comments]  ( 43 min )
    Found this list of AI tools. It was nice discovering some ai video editing tools i have not heard before.
    submitted by /u/lshic [link] [comments]  ( 40 min )
    Elon Musk Say AI Will be Able to Simulate Conciousness, for me thats very difficult to happen but idk
    https://youtu.be/Y6gXZ61NnOE submitted by /u/sigmabruuh [link] [comments]  ( 42 min )
    Best image generators for graphic Novel?
    I'm trying to see if I can use an image generator to illustrate a story I am writing. The thing is the story itself is action-packed and dramatic so I thought I could illustrate it instead as a graphic novel generated by AI but I am worried about the consistency an quality of the images. Also, I can't really use passages in the story as a prompt because the generators don't seem to illustrate the scene well. Any suggestions? submitted by /u/swagonflyyyy [link] [comments]  ( 41 min )
    📌[Searchcolab] Site to share and find Colab Notebooks, CKPT and SafeTensors
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    What ChatGPT Could Mean for the Metaverse
    submitted by /u/Zurevu [link] [comments]  ( 40 min )
    AI (GPT) where you can ask data questions in English and automatically generate the answer - as if you have your own personal automated data analyst
    submitted by /u/lfogliantis [link] [comments]  ( 46 min )
    Big Tech was moving cautiously on AI. Then came ChatGPT.
    submitted by /u/nikko_fan [link] [comments]  ( 40 min )
    AI Dream 150 - COSMIC STRUCTURES Part5 TEASER - AI Video vqgan clip
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    📌[Searchcolab] Disco Diffusion v5.6. Link in comments
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    Using Ai to CLONE my Voice!
    https://www.youtube.com/watch?v=4RUQ_vZ-og0 This Ai can instantly clone your voice and sound exactly like you! It's both amazing and terrifying. The Ai is called ElevenLabs - it has features like text to speech and voice cloning! submitted by /u/Peter3tv33 [link] [comments]  ( 41 min )
    Google AI can create music in any genre from a text description
    submitted by /u/Mental_Character7367 [link] [comments]  ( 40 min )
    Knight Rider game. Midjourney, ChatGPT, Figma.
    submitted by /u/sidianmsjones [link] [comments]  ( 40 min )
    AI generated music video
    submitted by /u/aquin1313 [link] [comments]  ( 40 min )
    What if Patrick Bateman was a Data Scientist? An AI-Generated Video
    submitted by /u/SupPandaHugger [link] [comments]  ( 40 min )
  • Open

    Why did the original ResNet paper not use dropout?
    The ResNet paper by Kaiming He et al. does not use dropout for the models. A lot of models prior to ResNets, such as AlexNet and VGGNet gained from using dropout. Why did the authors choose not to use dropout for ResNets ? Is it because they use L2 regularization(weight decay) and batch normalization which are forms of regularization which can substitute dropout regularization ? submitted by /u/V1bicycle [link] [comments]  ( 41 min )
    Answer this please
    so, I have been learning what DL is and how NN learns to do stuff. From what I understand is the repeated iteration will take random weights and at some point those weights will be kinda perfect for the given task (plz correct me if i'm wrong) Ok, so lets take an example of a task like path finding AI, so we make a NN and train it to go from point A to point B, now it is trained and doing nice and goes to point b perfectly, SO here the weights are set to go from point A to point B right? What if we give the point B somewhere else, How will the AI get perfect weights as the current weights are only perfect for current point B What if we put an obstacle in between point A and B, how will the NN set weights, or is it something like a range of weights which are perfect for any given task for NN ​ IDK if I explained it right, plz comment if you have question about my question, and answer also💕 submitted by /u/Severe-Improvement32 [link] [comments]  ( 42 min )
  • Open

    [R] Industrial Case Study of GNNs with PyTorch Geometric for Document Understanding
    submitted by /u/how-it-is- [link] [comments]  ( 42 min )
    [R] Incorrect Ranking of Vessel Segmentation Algorithms
    In a recent article, we reviewed dozens of image segmentation algorithms and pointed out mathematically that in many cases the reported performance scores could not be the results of the evaluation methods claimed by the authors. The scores are primary indicators of value and serve as measures of the state-of-the-art to be outperformed by new algorithms. Unfortunately, algorithm rankings turned out to be incorrect in many of 100 papers and the problem is systematic. The pressure to outperform flawed performance scores to get published keeps the trend on-going. How should the community deal with a phenomenon like this: flaws uncovered, factual, undeniable yet on-going? Is the 258th algorithm proposed for a problem more valuable than reproducing a highly cited article? Should it be mandatory to share source code? Is there a merit in developing consistency checks like the ones we did? Any comments are welcome! https://arxiv.org/abs/2111.03853 submitted by /u/AttilaFazekas [link] [comments]  ( 43 min )
    [D] Remote PhD
    Hi all, During the pandemic many software companies transitioned their workforce to "fully-remote" or "partially-remote"; therefore, I was wondering if any reputable institutions offer a remote CS PhD? For context, I know of several individuals who have sorted out remote work with their PIs on a per-person basis (typically after the first 1-2 years of study), but I am not aware of any labs or programs that advertise remote study. Thank you in advance for the responses. Cheers, Matt submitted by /u/TheRealMrMatt [link] [comments]  ( 45 min )
    [N][R] Compiling and running GLM-130B on a local machine (4x 3090s, int4 quantization) - Author: Alex J. Champandard
    Twitter link to his post: https://twitter.com/alexjc/status/1617152800571416577?s=46&t=CMQT9rK4F1Lt7g7aX2vTJA also important in that regard: The case for 4-bit precision: k-bit Inference Scaling Laws - Tim Dettmers Paper: https://arxiv.org/abs/2212.09720 https://preview.redd.it/7nn0pfhn81fa1.jpg?width=585&format=pjpg&auto=webp&s=2d05998c32fb1eacf56c45e830047381d544f51f https://preview.redd.it/0084vhhn81fa1.jpg?width=598&format=pjpg&auto=webp&s=c9512275714964faa312e8fb2d96ab8ded7dd127 submitted by /u/Singularian2501 [link] [comments]  ( 42 min )
    [P] AI Content Detector
    Have you tried ChatGPT? It's super cool but some users are also using it to create automated content submissions and resulting in an increase in AI-generated plagiarism. I have made a tool as a college project to detect content generated using AI. Go ahead and validate your content on AI Content Detector If you are an educator worried about automated content submissions or developers worried about search engine penalties, this tool will help everyone to efficiently detect content generated using AI. submitted by /u/YoutubeStruggle [link] [comments]  ( 44 min )
    [D] what is roughly the cost of human-annotation vs compute to adapt a LLM?
    Let's say I pull a pre-trained LLM off of huggingface. In broad strokes (making whatever assumptions appropriate), what is the relative cost of getting human annotation data versus actually incorporating those data in through training? I've been trying to get this stats and so far the ratio seems to be 2:1, meaning if you spent 10k dollars collecting human annotations, you should expect to spend 5k on compute (finetune, RLHF, ect) but I'd be happy if someone with more experience can chime in. submitted by /u/evanthebouncy [link] [comments]  ( 42 min )
    [D] AI Theory - Signal Processing?
    On This page of Meta AI research where they mention AI theory as a topic, they mention that they use techniques from Signal Processing. As someone with an Electrical Engineering background, and interests in Mathematics and AI, I found this very intriguing. Can someone tell me some of the ways signal processing has been used in AI theory? Some papers or some work done? submitted by /u/a_khalid1999 [link] [comments]  ( 47 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 43 min )
    [P] Automating a Youtube Shorts channel with Huggingface Transformers and After Effects
    I’ll try to get into detail about the implementation and difficulties in case it is useful for anyone else trying to do something similar with an applied ML project, so there’s a TLDR at the end if you’d like the short version/result. At the end of last year I convinced myself to start 2023 by creating a side-project that I'd actually finish and deploy and perhaps earn some “passive” income (spoiler, not so passive after all :P), and after some brainstorming I settled on making an automated Youtube channel about finance news since I had just gotten into investing. Shorts seemed to be more manageable and monetization is changing in February so I went with that. My rough initial idea was to get online articles, summarize them, make a basic compilation with some combination of pymovie, open…  ( 47 min )
    [D] GPT-Index vs Langchain
    Someone I work with wrote the below for our internal team (shared with permission) and I thought some here may find it helpful. Recently, I built an app that uses GPT-Index & LangChain to provide an answer to a question based on a piece of text as context. I found GPT-Index to be much easier and straightforward to integrate, but it seems like LangChain has more features and is more powerful. Here's my experience integrating both of them. GPT-Index First thing I did was review their docs to make sure I understood what GPT-Index was, what it could do, and how I was going to use it I went back and forth a couple times figuring out how I was going to use it. Then I found the quickstart guide It seemed like the quickstart guide would work so I followed the guide and after a few tries, …  ( 45 min )
    [P] Targeted Summarization - A tool for information extraction
    submitted by /u/helliun [link] [comments]  ( 44 min )
    [D] How do people keep up with ML news that is not NLP related?
    Lately, NLP is taking up most of the public space, much of AI news is focused on LLM after Chat-GPT took the spotlight. How do non-NLP people keep up with news? I recently saw a post on reddit where tree models are still being improved. There are other topics too, like the recent trend in Model Explainability which feels to have slowed down. I'd guess this all gets into the more categorical questions which I am wrapping up with 'How do YOU get your ML news'? How does information gathering differ between those in Applied ML and AI researchers (or even further, between those in Business Analytics and those in more 'AI' fields) What sort of interesting things are out there in the world of ML now? (model or non-model related) In looking for Use Cases, does this partially come down to your field? (Finance reads finance news, pharma reads pharma news) ​ Many of the AI/ML Newsletters which I subscribed to when I was less experienced seemed to be full of variety, but as they are all converging to NLP recently maybe it is time to cleanse the subscriptions, or find some new resources. submitted by /u/shaner92 [link] [comments]  ( 44 min )
    [R] InstructPix2Pix: Learning to Follow Image Editing Instructions
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 45 min )
    [P] We Built an ML Search Engine that can find exact timestamps for anything on Youtube using OpenAI Whisper and UKPLab's SBERT Sentence Transformers
    submitted by /u/tomiwa1a [link] [comments]  ( 44 min )
    [Discussion] Stable Diffusion Models with Subject/Keyword References
    I was looking for research that centers around Stable Diffusion models but can be trained with seed images of a specific subject, so that if someone refers to a keyword like "Me" or "I" it would then produce images relative to the keyword of interest. Something like "I am swimming in a beautiful ocean with mountains in the background and wearing a speedo". Then the person subject in the photo would be myself since I referenced "I". submitted by /u/sambrojangles [link] [comments]  ( 43 min )
  • Open

    Small-scale automation
    Saving keystrokes is overrated, but maintaining concentration is underrated. This post is going to look at automating small tasks in order to maintain concentration, not to save time. If a script lets you easily carry out some ancillary task without taking your concentration off your main task, that’s a big win. Maybe the script only […] Small-scale automation first appeared on John D. Cook.  ( 5 min )
  • Open

    how to do when the simulation fails
    Hi, When my agent succeeds to achieve the goal, it receives done = True. But, when it fails in the simulation, for example, it crashes onto the obstacle, should I end the episode and set the done to "True" or is it okay if I just give it to penalty like reward -= 5 at the specific steps? Thanks submitted by /u/sonlightinn [link] [comments]  ( 42 min )

  • Open

    InstructPix2Pix Video: One Vase, Many Artists
    submitted by /u/anitakirkovska [link] [comments]  ( 40 min )
    Make-a-Video3D: Meta generates 3D scenes from text
    submitted by /u/henlo_there_fren [link] [comments]  ( 40 min )
    AI Characters Become Self-Aware Ft. MrBeast - (Funny Moments)
    submitted by /u/SnowDustHD [link] [comments]  ( 40 min )
    AI Dream 150 - INTERSTELLAR PORTAL Part4 TEASER - AI Video vqgan clip
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    AI will also help you find a job: job offers related to this area grow by 31%
    submitted by /u/nikko_fan [link] [comments]  ( 40 min )
    What are some AI models out there that are not owned by large businesses that require my personal info to use?
    submitted by /u/temporaryAMA [link] [comments]  ( 40 min )
    Explain Cinderella as if you are Donald Trump
    submitted by /u/Imagine-your-success [link] [comments]  ( 41 min )
    Should I be learning data analysis now if my goal to become an AI engineer/researcher?
    Hello, I'm currently self-studying math with the goal of learning up to liner-algebra and then returning to school for a BS in mathematics followed by a graduate and then post-graduate degree in a subjecting relating to AI research. Given this goal I was wondering if I should also be self-studying data analysis along with the math. I reason that this might be good since I'd be getting hands-on experience with data and I might even be able to do it as a job while I study. In my math studies I'm currently on Algebra 2 so if I should be learning data analysis, can I start learning it now or would it be better to start after I've learned more advanced math like calculus? submitted by /u/multioutletplug [link] [comments]  ( 42 min )
    Hello guys. Could you please share experience with learning and working with Neural networks? I have already done with some of the tutorials, but want to know which language is best suitable to create AI (Python, Java).
    submitted by /u/Your_mag [link] [comments]  ( 42 min )
    Google's MusicLM is Astoundingly Good at Making AI-Generated Music, But They're Not Releasing it Due to Copyright Concerns
    submitted by /u/HODLTID [link] [comments]  ( 45 min )
    Research
    Hello everyone, I am currently doing a college research on artificial intelligence and ethics, looking to specify in the field of Bias and fairness in AI. I am looking for someone with some knowledge in the field to conduct a brief interview with in which I ask a couple of questions related to the topic. If you are interested, kindly @ me. Thanks :) submitted by /u/AyanT_01 [link] [comments]  ( 40 min )
    ChatGPT is not all that good
    I get inaccurate information form ChatGPT when I ask questions in areas where I have a good bit of knowledge. I couple of times it has cited specific books or papers by title and author and, as far as I can tell, the books/papers don't exists. The second time it did this, I ask it why it makes this kind of odd mistakes and it said: "As a language model, my information is based on the text that I have been trained on, and I try to provide the most accurate and relevant information based on that data. However, I am not infallible and sometimes make mistakes or provide information that is inaccurate or out-of-date. I apologize for any inaccuracies in my previous responses, and for any confusion that they may have caused. It's important to know that the information provided by AI models like me should be taken as a starting point for further research and not as the final answer. It's always a good idea to consult with experts or verify information with multiple sources. Again, I apologize for any inaccuracies in my previous statements and any confusion they may have caused." But it is still weird that it would make up specific books and papers, I don't see why a language model would do that. Edit: Maybe the text that it is trained on references specific books and paper that don't exist. And I guess that would be text that are not indexed by Google and other places where I am doing my fact checking such that I cannot figure out where it is getting this stuff. submitted by /u/facinabush [link] [comments]  ( 46 min )
    Made a website to teach you how to make passive/active income using AI!
    I recently created a website called https://cashwithai.com that is dedicated to helping people learn how to make money using AI like ChatGPT. The website offers a variety of resources, including a QuickStart guide, case studies, and tips and tricks for monetizing AI-generated content. Additionally, I'm offering free 1-on-1 consultations to anyone who is looking for personalized advice and guidance on how to make money with AI. I'm not running ads or charging; I run purely off donations. Let me know if you have any questions! submitted by /u/Chadcash [link] [comments]  ( 41 min )
    New Incredible Text-To-Music Generation Model By Google
    submitted by /u/SupPandaHugger [link] [comments]  ( 40 min )
    What is the information content in a logic statement?
    I recently had a lot of fun working with my talented peers at IBM Research putting together a point of view to this fascinating question. Wanted to share with you our work and to get feedback from you. ibm.biz/logic_and_information submitted by /u/Consistent_Listen127 [link] [comments]  ( 40 min )
    AI will not replace marketers, but marketers who use AI will replace those who don’t.
    AI will not replace marketers, but marketers who use AI will replace those who don’t. submitted by /u/TheVellerShow [link] [comments]  ( 40 min )
    OpenAI Hired Developers to Train its AI to Replace Developers
    submitted by /u/lambolifeofficial [link] [comments]  ( 40 min )
    META presents MAV3D — text to 3D video
    submitted by /u/SpatialComputing [link] [comments]  ( 24 min )
    image to voice ai stable image to voice
    submitted by /u/VNKT-FOREVER [link] [comments]  ( 40 min )
    Real world applications of AI
    There has been many advancements in especially in CV, NLP etc. But, I don't see many new AI models that are solving real world problems. There has been a lot of advancements in AI in RL, DL etc but I rarely see new applications in real world. All i see are text to image models, advanced chat bots, better game playing AIs etc. These are definitely amazing, but, I wanna see new stuff which is making a good impact on the real world. Alpha-fold, driverless cars etc are sorta things I am looking for. I don't know if I am just bluntly unaware of new stuff in AI which solves practical problems, or whether not much new stuff is happening in those areas. So, I would be glad if I can know more about how AI is being used in new ways to solve real-wprld problems, or any new AI research trying to tackle a real world problem? Sry if it's a stupid question, and sry for spamming "real world" too much. submitted by /u/Unhappy_Version7565 [link] [comments]  ( 44 min )
    Want to catch up on the breakthroughs in AI the last 10 years. What should I read?
    Hi guys, I have a technical background and have studied some AI in the past. I'd like to catch up on the latest developments in AI over the last ~10 years. Do you have any recommendations on what to read? I was thinking of maybe trying to get the top most cited papers in the last 10 years and reading them. But I'm not sure where would be the best place to find that. Any suggestions? Thanks in advance. submitted by /u/AlexWD [link] [comments]  ( 41 min )
    spooky season 💅
    submitted by /u/Rich_Dragon17 [link] [comments]  ( 40 min )
    BuzzFeed Plans to 'Hire' ChatGPT as Newest Content Creator
    submitted by /u/lambolifeofficial [link] [comments]  ( 40 min )
  • Open

    [P] tiny-diffusion: a minimal PyTorch implementation of probabilistic diffusion models for 2D datasets
    submitted by /u/tanelai [link] [comments]  ( 43 min )
    [D] Running a small Bloom checkpoint on a mini pc.
    Has anyone been able to run a small LLM checkpoint in commodity hardware, like a mini PC? If so, what were your specs? submitted by /u/onedjscream [link] [comments]  ( 42 min )
    [D] could multiple-input transformers reduce the pain of the training data acquisition problem?
    so it's big pop-sci news that we are running out of quality textual training data (soft-paywalled article, but you get the idea) to produce chinchilla-optimal language models, and they appear to continue learning new abilities as data and parameter size increase. when an infant learns what a cat is, it is not only described, but the infant can see it and understand its form and behavior in a way it can then go on to describe and extrapolate from (even if they are blind, they can touch it and understand its shape and feel its fur). LLMs have to do this the hard way: their generalized understanding of the shape and behaviour of a cat comes from textual descriptions of them (and they would need quite a lot in order to understand!) most of the research i have seen into multiple input transformer models has been with the purpose of task completion (google's embodied language model robot butlers etc, which often use textual descriptions fed to a normal LLM, see https://innermonologue.github.io/ ) or image recognition and understanding (such as in CLIP) but not necessarily applying it to textual completion, which seems like it could benefit from a more visual understanding of the world so, in the medium or short term, to improve performance on text-completion tasks, what are your thoughts on using image training as well as textual to improve generalization for LLMs with fewer text tokens on a new architecture? (also, please excuse any ignorance i may posess: i'm a bit of an armchair ai enjoyer) submitted by /u/Dankmemexplorer [link] [comments]  ( 44 min )
    GAN Training Gradient questions [D]
    The main reason why this is not in the simple questions thread is the need to includee an image. Here is an image of my generator's and discriminator's gradients being logged onto wandb. As you can see, they have these weird hasms, where the distribution of the gradients becomes very close to zero. These chasms are correlated for the discriminator and generator and seem very regular. Anyone experienced anything like this and maybe has a hunch of what might be the cause? https://preview.redd.it/3qipyez6stea1.png?width=624&format=png&auto=webp&s=f5fb39803082add2fd78979e880e6e784b9e4c0c submitted by /u/Hub_Pli [link] [comments]  ( 43 min )
    [D] Goodness of fit question
    For regressions, R-squared and Adj. R-Squared are typically considered the primary goodness-of-fit measures. But in many supervised machine-learning models, RMSE is the main measure that I keep running across. For example, decision tree models that I create in R using Rpart do that. So, my question is how to compare the predictive accuracy of OLS regression models that report R-sq to equivalent Rpart regression trees that report RMSE. submitted by /u/bridgeton_man [link] [comments]  ( 43 min )
    [N] OpenAI has 1000s of contractors to fine-tune codex
    submitted by /u/yazriel0 [link] [comments]  ( 45 min )
    [P] GPT JupyterLab - JupyterLab extension to use OpenAI’s GPT models for code and text completion on your notebook cells.
    Hi everyone, I made a JupyterLab extension to use OpenAI’s GPT models for code and text completion on your notebook cells. This extension passes your current notebook cell to the GPT API and completes your code/text for you. You can customize the GPT parameters in the Advanced Settings menu. I made this extension when I couldn't find any Copilot/Codex extensions for JupyterLab. It doesn't make sense that ML folks don't have an easy way to use AI generated code in their own tools. VS Code does allow you use Copilot, but I've gotten used to Jupyter and a lot of ML/DS folks I know still prefer using Jupyter over VS code. Installation pip install gpt_jupyterlab GitHub Repo: https://github.com/henshinger/gpt-jupyterlab/ Demo GPT JupyterLab Demo Note: You will need your own OpenAI API Key to use this extension. Would love to get your feedback! submitted by /u/henshinger [link] [comments]  ( 44 min )
    [P] Launching my first ever open-source project and it might make your ChatGPT answers better
    I am building an open-source ML observability and refinement toolkit which recently got investment from YCombinator. The tool helps ML practitioners to: 1. Understand how their models are performing in production 2. Catch edge-cases and outliers to help them refine their models 3. Allow them to customise the tool according to their needs (hence, open-source) 4. Bring data-security at the forefront (hence, self hosted) You can check out the project https://github.com/uptrain-ai/uptrain and would love to hear feedback from the community submitted by /u/Vegetable-Skill-9700 [link] [comments]  ( 43 min )
    [P] Parse research papers into structured data
    ​ https://preview.redd.it/1t7spoqxdsea1.png?width=1920&format=png&auto=webp&s=b4643e418d942260b16019ee250edf56a4336b4b paperai is a semantic search and workflow application for medical/scientific papers. It can be used to take a set of research papers, parse the content and turn it into structured data. paperetl is the underlying library that parses basic metadata such as title, publication and date out of the papers. https://preview.redd.it/5jywynarfsea1.png?width=1084&format=png&auto=webp&s=437b6cdb65fbf12d6d238f84fd94ec4b85dec93b In addition to standard metadata, paperai can also run extractive queries to build additional columns. https://preview.redd.it/8rzfj016esea1.png?width=1138&format=png&auto=webp&s=ccd73c34f0f3001dfe05a0f8c480cf73451dd447 Example notebooks and Docker files can be found on GitHub. paperai | paperetl submitted by /u/davidmezzetti [link] [comments]  ( 43 min )
    [P] Annotating text with sparse human annotations and different length chunks
    I want to automate the annotation of a domain-specific text (complicated contracts) by finetuning a BERT model. However the annotated text I've been provided by the domain experts has been sparsely annotated (i.e. paragraph 40-48 has been fully annotated, while 15 other paragraphs only have certain classes annotated for certain words. Most paragraphs have nothing annotated (like 70% of the entire corpus) Another complication is that for 1 class, the entire paragraph should be annotated in this class, while for others it's a single word or a sentence. There are 7 classes in total and in the end, all tokens should be annotated to one of the 7 classes. 6 out of 7 classes are also pretty domain-specific and not something like 'location' or 'person' or POS. I've been thinking about using an annotation framework (i.e. LabelStudio or Prodigy) that supports custom models (i.e. finetuned BERT) and active learning to rapidly increase annotated texts by speeding up the work of human annotators (which are domain experts and usually don't have a lot of time for this). However, it's pretty unclear whether my use case would be support by this, especially with the issue of sparse text. I've also considered making the problem easier by using another model for the specific class that applies to an entire paragraph. And/or by using the fully annotated paragraphs for finetuning and using the sparse paragraphs for validation. A final consideration is using GPT-3, but I'm not sure how/if it is able to classify entire sentences/paragraphs with multiple classes and how the prompt should be formatted as. Any suggestions/ideas? submitted by /u/Background_Claim7907 [link] [comments]  ( 44 min )
    [P] Framework agnostic python package for running RWKV, RNN based models.
    https://pypi.org/project/rwkvstic/ Currently supports tensorflow, pytorch, jax Also has support for tensor streaming, 8bit jit-quant and multi-gpu. Run RWKV 7B on 8GB of vram or 14B on 16GB of vram. submitted by /u/hazardous1222 [link] [comments]  ( 42 min )
    [R] META presents MAV3D — text to 3D video
    submitted by /u/SpatialComputing [link] [comments]  ( 45 min )
    [D] Could forward-forward learning enable training large models with distributed computing?
    One problem with distributed learning with backprop is that the first layer can't update their weights until the computation has travelled all the way down to the last layer and then backpropagated back up. If all your layers are on different machines connected by a high-latency internet connection, this will take a long time. In forward-forward learning, learning is local - each layer operates independently and only needs to communicate with the layers above and below it. The results are almost-but-not-quite as good as backprop. But each layer can immediately update their weights based only on the information they received from the previous layer. Network latency no longer matters; the limit is just the bandwidth of the slowest machine. submitted by /u/currentscurrents [link] [comments]  ( 46 min )
  • Open

    Laptop Recommendations for RL
    I am looking to buy a laptop for my rl projects and I wanted to know what people in the industry recommended for training models locally and how significant OS, CPU and GPUs really are. submitted by /u/PleasantBase6967 [link] [comments]  ( 43 min )
    The value of RL feedback on language models: "[Character.ai] engagement rose by more than 30 percent." --Noam Shazeer
    submitted by /u/gwern [link] [comments]  ( 40 min )
  • Open

    Optimal machine to run 350M language model on?
    Trying to set up a discord bot. submitted by /u/Ah_Books [link] [comments]  ( 40 min )
  • Open

    Don't overfit the history -- Recursive time series data augmentation. (arXiv:2207.02891v2 [cs.LG] UPDATED)
    Time series observations can be seen as realizations of an underlying dynamical system governed by rules that we typically do not know. For time series learning tasks, we need to understand that we fit our model on available data, which is a unique realized history. Training on a single realization often induces severe overfitting lacking generalization. To address this issue, we introduce a general recursive framework for time series augmentation, which we call Recursive Interpolation Method, denoted as RIM. New samples are generated using a recursive interpolation function of all previous values in such a way that the enhanced samples preserve the original inherent time series dynamics. We perform theoretical analysis to characterize the proposed RIM and to guarantee its test performance. We apply RIM to diverse real world time series cases to achieve strong performance over non-augmented data on regression, classification, and reinforcement learning tasks.  ( 2 min )
    Interaction Decompositions for Tensor Network Regression. (arXiv:2208.06029v2 [cs.LG] UPDATED)
    It is well known that tensor network regression models operate on an exponentially large feature space, but questions remain as to how effectively they are able to utilize this space. Using a polynomial featurization, we propose the interaction decomposition as a tool that can assess the relative importance of different regressors as a function of their polynomial degree. We apply this decomposition to tensor ring and tree tensor network models trained on the MNIST and Fashion MNIST datasets, and find that up to 75% of interaction degrees are contributing meaningfully to these models. We also introduce a new type of tensor network model that is explicitly trained on only a small subset of interaction degrees, and find that these models are able to match or even outperform the full models using only a fraction of the exponential feature space. This suggests that standard tensor network models utilize their polynomial regressors in an inefficient manner, with the lower degree terms being vastly under-utilized.  ( 2 min )
    MusicLM: Generating Music From Text. (arXiv:2301.11325v1 [cs.SD])
    We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.  ( 2 min )
    f-divergences and their applications in lossy compression and bounding generalization error. (arXiv:2206.11042v3 [cs.IT] UPDATED)
    In this paper, we provide three applications for $f$-divergences: (i) we introduce Sanov's upper bound on the tail probability of the sum of independent random variables based on super-modular $f$-divergence and show that our generalized Sanov's bound strictly improves over ordinary one, (ii) we consider the lossy compression problem which studies the set of achievable rates for a given distortion and code length. We extend the rate-distortion function using mutual $f$-information and provide new and strictly better bounds on achievable rates in the finite blocklength regime using super-modular $f$-divergences, and (iii) we provide a connection between the generalization error of algorithms with bounded input/output mutual $f$-information and a generalized rate-distortion problem. This connection allows us to bound the generalization error of learning algorithms using lower bounds on the $f$-rate-distortion function. Our bound is based on a new lower bound on the rate-distortion function that (for some examples) strictly improves over previously best-known bounds.  ( 2 min )
    Overcoming the Pitfalls of Prediction Error in Operator Learning for Bilevel Planning. (arXiv:2208.07737v2 [cs.AI] UPDATED)
    Bilevel planning, in which a high-level search over an abstraction of an environment is used to guide low-level decision-making, is an effective approach to solving long-horizon tasks in continuous state and action spaces. Recent work has shown how to enable such bilevel planning by learning action and transition model abstractions in the form of symbolic operators and neural samplers. In this work, we show that existing symbolic operator learning approaches fall short in many natural environments where agent actions tend to cause a large number of irrelevant propositions to change. This is primarily because they attempt to learn operators that optimize the prediction error with respect to observed changes in the propositions. To overcome this issue, we propose to learn operators that only model changes necessary for abstract planning to achieve the specified goal. Experimentally, we show that our approach learns operators that lead to efficient planning across 10 different hybrid robotics domains, including 4 from the challenging BEHAVIOR-100 benchmark, with generalization to novel initial states, goals, and objects.  ( 2 min )
    Efficient Aggregated Kernel Tests using Incomplete $U$-statistics. (arXiv:2206.09194v3 [stat.ML] UPDATED)
    We propose a series of computationally efficient nonparametric tests for the two-sample, independence, and goodness-of-fit problems, using the Maximum Mean Discrepancy (MMD), Hilbert Schmidt Independence Criterion (HSIC), and Kernel Stein Discrepancy (KSD), respectively. Our test statistics are incomplete $U$-statistics, with a computational cost that interpolates between linear time in the number of samples, and quadratic time, as associated with classical $U$-statistic tests. The three proposed tests aggregate over several kernel bandwidths to detect departures from the null on various scales: we call the resulting tests MMDAggInc, HSICAggInc and KSDAggInc. This procedure provides a solution to the fundamental kernel selection problem as we can aggregate a large number of kernels with several bandwidths without incurring a significant loss of test power. For the test thresholds, we derive a quantile bound for wild bootstrapped incomplete $U$-statistics, which is of independent interest. We derive non-asymptotic uniform separation rates for MMDAggInc and HSICAggInc, and quantify exactly the trade-off between computational efficiency and the attainable rates: this result is novel for tests based on incomplete $U$-statistics, to our knowledge. We further show that in the quadratic-time case, the wild bootstrap incurs no penalty to test power over the more widespread permutation-based approach, since both attain the same minimax optimal rates (which in turn match the rates that use oracle quantiles). We support our claims with numerical experiments on the trade-off between computational efficiency and test power. In all three testing frameworks, the linear-time versions of our proposed tests perform at least as well as the current linear-time state-of-the-art tests.  ( 2 min )
    BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning. (arXiv:2206.08657v3 [cs.CV] UPDATED)
    Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from the deep pre-trained uni-modal encoders into the top cross-modal encoder. Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose Bridge-Tower, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder. This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations of different semantic levels of pre-trained uni-modal encoders in the cross-modal encoder. Pre-trained with only 4M images, Bridge-Tower achieves state-of-the-art performance on various downstream vision-language tasks. In particular, on the VQAv2 test-std set, Bridge-Tower achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaling the model, Bridge-Tower achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets. Code and checkpoints are available at \url{https://github.com/microsoft/BridgeTower}.  ( 2 min )
    Ontology-enhanced Prompt-tuning for Few-shot Learning. (arXiv:2201.11332v1 [cs.CL] CROSS LISTED)
    Few-shot Learning (FSL) is aimed to make predictions based on a limited number of samples. Structured data such as knowledge graphs and ontology libraries has been leveraged to benefit the few-shot setting in various tasks. However, the priors adopted by the existing methods suffer from challenging knowledge missing, knowledge noise, and knowledge heterogeneity, which hinder the performance for few-shot learning. In this study, we explore knowledge injection for FSL with pre-trained language models and propose ontology-enhanced prompt-tuning (OntoPrompt). Specifically, we develop the ontology transformation based on the external knowledge graph to address the knowledge missing issue, which fulfills and converts structure knowledge to text. We further introduce span-sensitive knowledge injection via a visible matrix to select informative knowledge to handle the knowledge noise issue. To bridge the gap between knowledge and text, we propose a collective training algorithm to optimize representations jointly. We evaluate our proposed OntoPrompt in three tasks, including relation extraction, event extraction, and knowledge graph completion, with eight datasets. Experimental results demonstrate that our approach can obtain better few-shot performance than baselines.  ( 2 min )
    Predicting Wind-Driven Spatial Deposition through Simulated Color Images using Deep Autoencoders. (arXiv:2202.01762v3 [cs.LG] UPDATED)
    For centuries, scientists have observed nature to understand the laws that govern the physical world. The traditional process of turning observations into physical understanding is slow. Imperfect models are constructed and tested to explain relationships in data. Powerful new algorithms can enable computers to learn physics by observing images and videos. Inspired by this idea, instead of training machine learning models using physical quantities, we used images, that is, pixel information. For this work, and as a proof of concept, the physics of interest are wind-driven spatial patterns. These phenomena include features in Aeolian dunes and volcanic ash deposition, wildfire smoke, and air pollution plumes. We use computer model simulations of spatial deposition patterns to approximate images from a hypothetical imaging device whose outputs are red, green, and blue (RGB) color images with channel values ranging from 0 to 255. In this paper, we explore deep convolutional neural network-based autoencoders to exploit relationships in wind-driven spatial patterns, which commonly occur in geosciences, and reduce their dimensionality. Reducing the data dimension size with an encoder enables training deep, fully connected neural network models linking geographic and meteorological scalar input quantities to the encoded space. Once this is achieved, full spatial patterns are reconstructed using the decoder. We demonstrate this approach on images of spatial deposition from a pollution source, where the encoder compresses the dimensionality to 0.02% of the original size, and the full predictive model performance on test data achieves a normalized root mean squared error of 8%, a figure of merit in space of 94% and a precision-recall area under the curve of 0.93.  ( 3 min )
    Parsel: A (De-)compositional Framework for Algorithmic Reasoning with Language Models. (arXiv:2212.10561v2 [cs.CL] UPDATED)
    Despite recent success in large language model (LLM) reasoning, LLMs struggle with hierarchical multi-step reasoning tasks like generating complex programs. For these tasks, humans often start with a high-level algorithmic design and implement each part gradually. We introduce Parsel, a framework enabling automatic implementation and validation of complex algorithms with code LLMs, taking hierarchical function descriptions in natural language as input. We show that Parsel can be used across domains requiring hierarchical reasoning, including program synthesis, robotic planning, and theorem proving. We show that LLMs generating Parsel solve more competition-level problems in the APPS dataset, resulting in pass rates that are over 75% higher than prior results from directly sampling AlphaCode and Codex, while often using a smaller sample budget. We also find that LLM-generated robotic plans using Parsel as an intermediate language are more than twice as likely to be considered accurate than directly generated plans. Lastly, we explore how Parsel addresses LLM limitations and discuss how Parsel may be useful for human programmers.  ( 2 min )
    Extending Adversarial Attacks to Produce Adversarial Class Probability Distributions. (arXiv:2004.06383v3 [cs.LG] UPDATED)
    Despite the remarkable performance and generalization levels of deep learning models in a wide range of artificial intelligence tasks, it has been demonstrated that these models can be easily fooled by the addition of imperceptible yet malicious perturbations to natural inputs. These altered inputs are known in the literature as adversarial examples. In this paper, we propose a novel probabilistic framework to generalize and extend adversarial attacks in order to produce a desired probability distribution for the classes when we apply the attack method to a large number of inputs. This novel attack paradigm provides the adversary with greater control over the target model, thereby exposing, in a wide range of scenarios, threats against deep learning models that cannot be conducted by the conventional paradigms. We introduce four different strategies to efficiently generate such attacks, and illustrate our approach by extending multiple adversarial attack algorithms. We also experimentally validate our approach for the spoken command classification task and the Tweet emotion classification task, two exemplary machine learning problems in the audio and text domain, respectively. Our results demonstrate that we can closely approximate any probability distribution for the classes while maintaining a high fooling rate and even prevent the attacks from being detected by label-shift detection methods.  ( 2 min )
    Adaptive Gradient Methods with Local Guarantees. (arXiv:2203.01400v3 [cs.LG] UPDATED)
    Adaptive gradient methods are the method of choice for optimization in machine learning and used to train the largest deep models. In this paper we study the problem of learning a local preconditioner, that can change as the data is changing along the optimization trajectory. We propose an adaptive gradient method that has provable adaptive regret guarantees vs. the best local preconditioner. To derive this guarantee, we prove a new adaptive regret bound in online learning that improves upon previous adaptive online learning methods. We demonstrate the robustness of our method in automatically choosing the optimal learning rate schedule for popular benchmarking tasks in vision and language domains. Without the need to manually tune a learning rate schedule, our method can, in a single run, achieve comparable and stable task accuracy as a fine-tuned optimizer.  ( 2 min )
    Flowification: Everything is a Normalizing Flow. (arXiv:2205.15209v3 [cs.LG] UPDATED)
    The two key characteristics of a normalizing flow is that it is invertible (in particular, dimension preserving) and that it monitors the amount by which it changes the likelihood of data points as samples are propagated along the network. Recently, multiple generalizations of normalizing flows have been introduced that relax these two conditions. On the other hand, neural networks only perform a forward pass on the input, there is neither a notion of an inverse of a neural network nor is there one of its likelihood contribution. In this paper we argue that certain neural network architectures can be enriched with a stochastic inverse pass and that their likelihood contribution can be monitored in a way that they fall under the generalized notion of a normalizing flow mentioned above. We term this enrichment flowification. We prove that neural networks only containing linear layers, convolutional layers and invertible activations such as LeakyReLU can be flowified and evaluate them in the generative setting on image datasets.  ( 2 min )
    SoftMatch: Addressing the Quantity-Quality Trade-off in Semi-supervised Learning. (arXiv:2301.10921v1 [cs.LG])
    The critical challenge of Semi-Supervised Learning (SSL) is how to effectively leverage the limited labeled data and massive unlabeled data to improve the model's generalization performance. In this paper, we first revisit the popular pseudo-labeling methods via a unified sample weighting formulation and demonstrate the inherent quantity-quality trade-off problem of pseudo-labeling with thresholding, which may prohibit learning. To this end, we propose SoftMatch to overcome the trade-off by maintaining both high quantity and high quality of pseudo-labels during training, effectively exploiting the unlabeled data. We derive a truncated Gaussian function to weight samples based on their confidence, which can be viewed as a soft version of the confidence threshold. We further enhance the utilization of weakly-learned classes by proposing a uniform alignment approach. In experiments, SoftMatch shows substantial improvements across a wide variety of benchmarks, including image, text, and imbalanced classification.  ( 2 min )
    Explaining Visual Biases as Words by Generating Captions. (arXiv:2301.11104v1 [cs.LG])
    We aim to diagnose the potential biases in image classifiers. To this end, prior works manually labeled biased attributes or visualized biased features, which need high annotation costs or are often ambiguous to interpret. Instead, we leverage two types (generative and discriminative) of pre-trained vision-language models to describe the visual bias as a word. Specifically, we propose bias-to-text (B2T), which generates captions of the mispredicted images using a pre-trained captioning model to extract the common keywords that may describe visual biases. Then, we categorize the bias type as spurious correlation or majority bias by checking if it is specific or agnostic to the class, based on the similarity of class-wise mispredicted images and the keyword upon a pre-trained vision-language joint embedding space, e.g., CLIP. We demonstrate that the proposed simple and intuitive scheme can recover well-known gender and background biases, and discover novel ones in real-world datasets. Moreover, we utilize B2T to compare the classifiers using different architectures or training methods. Finally, we show that one can obtain debiased classifiers using the B2T bias keywords and CLIP, in both zero-shot and full-shot manners, without using any human annotation on the bias.  ( 2 min )
    On the Convergence of No-Regret Learning Dynamics in Time-Varying Games. (arXiv:2301.11241v1 [cs.LG])
    Most of the literature on learning in games has focused on the restrictive setting where the underlying repeated game does not change over time. Much less is known about the convergence of no-regret learning algorithms in dynamic multiagent settings. In this paper, we characterize the convergence of \emph{optimistic gradient descent (OGD)} in time-varying games by drawing a strong connection with \emph{dynamic regret}. Our framework yields sharp convergence bounds for the equilibrium gap of OGD in zero-sum games parameterized on the \emph{minimal} first-order variation of the Nash equilibria and the second-order variation of the payoff matrices, subsuming known results for static games. Furthermore, we establish improved \emph{second-order} variation bounds under strong convexity-concavity, as long as each game is repeated multiple times. Our results also apply to time-varying \emph{general-sum} multi-player games via a bilinear formulation of correlated equilibria, which has novel implications for meta-learning and for obtaining refined variation-dependent regret bounds, addressing questions left open in prior papers. Finally, we leverage our framework to also provide new insights on dynamic regret guarantees in static games.  ( 2 min )
    Box$^2$EL: Concept and Role Box Embeddings for the Description Logic EL++. (arXiv:2301.11118v1 [cs.AI])
    Representation learning in the form of semantic embeddings has been successfully applied to a variety of tasks in natural language processing and knowledge graphs. Recently, there has been growing interest in developing similar methods for learning embeddings of entire ontologies. We propose Box$^2$EL, a novel method for representation learning of ontologies in the Description Logic EL++, which represents both concepts and roles as boxes (i.e. axis-aligned hyperrectangles), such that the logical structure of the ontology is preserved. We theoretically prove the soundness of our model and conduct an extensive empirical evaluation, in which we achieve state-of-the-art results in subsumption prediction, link prediction, and deductive reasoning. As part of our evaluation, we introduce a novel benchmark for evaluating EL++ embedding models on predicting subsumptions involving both atomic and complex concepts.  ( 2 min )
    Finding Regions of Counterfactual Explanations via Robust Optimization. (arXiv:2301.11113v1 [cs.LG])
    Counterfactual explanations play an important role in detecting bias and improving the explainability of data-driven classification models. A counterfactual explanation (CE) is a minimal perturbed data point for which the decision of the model changes. Most of the existing methods can only provide one CE, which may not be achievable for the user. In this work we derive an iterative method to calculate robust CEs, i.e. CEs that remain valid even after the features are slightly perturbed. To this end, our method provides a whole region of CEs allowing the user to choose a suitable recourse to obtain a desired outcome. We use algorithmic ideas from robust optimization and prove convergence results for the most common machine learning methods including logistic regression, decision trees, random forests, and neural networks. Our experiments show that our method can efficiently generate globally optimal robust CEs for a variety of common data sets and classification models.  ( 2 min )
    Neural Inverse Operators for Solving PDE Inverse Problems. (arXiv:2301.11167v1 [cs.LG])
    A large class of inverse problems for PDEs are only well-defined as mappings from operators to functions. Existing operator learning frameworks map functions to functions and need to be modified to learn inverse maps from data. We propose a novel architecture termed Neural Inverse Operators (NIOs) to solve these PDE inverse problems. Motivated by the underlying mathematical structure, NIO is based on a suitable composition of DeepONets and FNOs to approximate mappings from operators to functions. A variety of experiments are presented to demonstrate that NIOs significantly outperform baselines and solve PDE inverse problems robustly, accurately and are several orders of magnitude faster than existing direct and PDE-constrained optimization methods.  ( 2 min )
    On the Mathematics of Diffusion Models. (arXiv:2301.11108v1 [cs.LG])
    This paper attempts to present the stochastic differential equations of diffusion models in a manner that is accessible to a broad audience. The diffusion process is defined over a population density in R^d. Of particular interest is a population of images. In a diffusion model one first defines a diffusion process that takes a sample from the population and gradually adds noise until only noise remains. The fundamental idea is to sample from the population by a reverse-diffusion process mapping pure noise to a population sample. The diffusion process is defined independent of any ``interpretation'' but can be analyzed using the mathematics of variational auto-encoders (the ``VAE interpretation'') or the Fokker-Planck equation (the ``score-matching intgerpretation''). Both analyses yield reverse-diffusion methods involving the score function. The Fokker-Planck analysis yields a family of reverse-diffusion SDEs parameterized by any desired level of reverse-diffusion noise including zero (deterministic reverse-diffusion). The VAE analysis yields the reverse-diffusion SDE at the same noise level as the diffusion SDE. The VAE analysis also yields a useful expression for computing the population probabilities of a given point (image). This formula for the probability of a given point does not seem to follow naturally from the Fokker-Planck analysis. Much, but apparently not all, of the mathematics presented here can be found in the literature. Attributions are given at the end of the paper.  ( 2 min )
    Causal Counterfactuals for Improving the Robustness of Reinforcement Learning. (arXiv:2211.05551v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) is applied in a wide variety of fields. RL enables agents to learn tasks autonomously by interacting with the environment. The more critical the tasks are, the higher the demand for the robustness of the RL systems. Causal RL combines RL and causal inference to make RL more robust. Causal RL agents use a causal representation to capture the invariant causal mechanisms that can be transferred from one task to another. Currently, there is limited research in Causal RL, and existing solutions are usually not complete or feasible for real-world applications. In this work, we propose CausalCF, the first complete Causal RL solution incorporating ideas from Causal Curiosity and CoPhy. Causal Curiosity provides an approach for using interventions, and CoPhy is modified to enable the RL agent to perform counterfactuals. We apply CausalCF to complex robotic tasks and show that it improves the RL agent's robustness using a realistic simulation environment called CausalWorld.  ( 2 min )
    New Approach to Malware Detection Using Optimized Convolutional Neural Network. (arXiv:2301.11161v1 [cs.CR])
    Cyber-crimes have become a multi-billion-dollar industry in the recent years. Most cybercrimes/attacks involve deploying some type of malware. Malware that viciously targets every industry, every sector, every enterprise and even individuals has shown its capabilities to take entire business organizations offline and cause significant financial damage in billions of dollars annually. Malware authors are constantly evolving in their attack strategies and sophistication and are developing malware that is difficult to detect and can lay dormant in the background for quite some time in order to evade security controls. Given the above argument, Traditional approaches to malware detection are no longer effective. As a result, deep learning models have become an emerging trend to detect and classify malware. This paper proposes a new convolutional deep learning neural network to accurately and effectively detect malware with high precision. This paper is different than most other papers in the literature in that it uses an expert data science approach by developing a convolutional neural network from scratch to establish a baseline of the performance model first, explores and implements an improvement model from the baseline model, and finally it evaluates the performance of the final model. The baseline model initially achieves 98% accurate rate but after increasing the depth of the CNN model, its accuracy reaches 99.183 which outperforms most of the CNN models in the literature. Finally, to further solidify the effectiveness of this CNN model, we use the improved model to make predictions on new malware samples within our dataset.  ( 2 min )
    Planning Automated Driving with Accident Experience Referencing and Common-sense Inferencing. (arXiv:2301.10892v1 [cs.RO])
    Although a typical autopilot system far surpasses humans in term of sensing accuracy, performance stability and response agility, such a system is still far behind humans in the wisdom of understanding an unfamiliar environment with creativity, adaptivity and resiliency. Current AD brains are basically expert systems featuring logical computations, which resemble the thinking flow of a left brain working at tactical level. A right brain is needed to upgrade the safety of automated driving vehicle onto next generation by making intuitive strategical judgements that can supervise the tactical action planning. In this work, we present the concept of an Automated Driving Strategical Brain (ADSB): a framework of a scene perception and scene safety evaluation system that works at a higher abstraction level, incorporating experience referencing, common-sense inferring and goal-and-value judging capabilities, to provide a contextual perspective for decision making within automated driving planning. The ADSB brain architecture is made up of the Experience Referencing Engine (ERE), the Common-sense Referencing Engine (CIE) and the Goal and Value Keeper (GVK). 1,614,748 cases from FARS/CRSS database of NHTSA in the period 1975 to 2018 are used for the training of ERE model. The kernel of CIE is a trained model, COMET-BART by ATOMIC, which can be used to provide directional advice when tactical-level environmental perception conclusions are ambiguous; it can also use future scenario models to remind tactical-level decision systems to plan ahead of a perceived hazard scene. GVK can take in any additional expert-hand-written rules that are of qualitative nature. Moreover, we believe that with good scalability, the ADSB approach provides a potential solution to the problem of long-tail corner cases encountered in the validation of a rule-based planning algorithm.  ( 2 min )
    Proximal Causal Learning of Heterogeneous Treatment Effects. (arXiv:2301.10913v1 [stat.ML])
    Efficiently and flexibly estimating treatment effect heterogeneity is an important task in a wide variety of settings ranging from medicine to marketing, and there are a considerable number of promising conditional average treatment effect estimators currently available. These, however, typically rely on the assumption that the measured covariates are enough to justify conditional exchangeability. We propose the P-learner, motivated by the R-learner, a tailored two-stage loss function for learning heterogeneous treatment effects in settings where exchangeability given observed covariates is an implausible assumption, and we wish to rely on proxy variables for causal inference. Our proposed estimator can be implemented by off-the-shelf loss-minimizing machine learning methods, which in the case of kernel regression satisfies an oracle bound on the estimated error as long as the nuisance components are estimated reasonably well.  ( 2 min )
    Efficient Hyperdimensional Computing. (arXiv:2301.10902v1 [cs.LG])
    Hyperdimensional computing (HDC) uses binary vectors of high dimensions to perform classification. Due to its simplicity and massive parallelism, HDC can be highly energy-efficient and well-suited for resource-constrained platforms. However, in trading off orthogonality with efficiency, hypervectors may use tens of thousands of dimensions. In this paper, we will examine the necessity for such high dimensions. In particular, we give a detailed theoretical analysis of the relationship among dimensions of hypervectors, accuracy, and orthogonality. The main conclusion of this study is that a much lower dimension, typically less than 100, can also achieve similar or even higher detecting accuracy compared with other state-of-the-art HDC models. Based on this insight, we propose a suite of novel techniques to build HDC models that use binary hypervectors of dimensions that are orders of magnitude smaller than those found in the state-of-the-art HDC models, yet yield equivalent or even improved accuracy and efficiency. For image classification, we achieved an HDC accuracy of 96.88\% with a dimension of only 32 on the MNIST dataset. We further explore our methods on more complex datasets like CIFAR-10 and show the limits of HDC computing.  ( 2 min )
    Incorporating Prior Knowledge into Neural Networks through an Implicit Composite Kernel. (arXiv:2205.07384v5 [cs.LG] UPDATED)
    It is challenging to guide neural network (NN) learning with prior knowledge. In contrast, many known properties, such as spatial smoothness or seasonality, are straightforward to model by choosing an appropriate kernel in a Gaussian process (GP). Many deep learning applications could be enhanced by modeling such known properties. For example, convolutional neural networks (CNNs) are frequently used in remote sensing, which is subject to strong seasonal effects. We propose to blend the strengths of deep learning and the clear modeling capabilities of GPs by using a composite kernel that combines a kernel implicitly defined by a neural network with a second kernel function chosen to model known properties (e.g., seasonality). We implement this idea by combining a deep network and an efficient mapping based on the Nystrom approximation, which we call Implicit Composite Kernel (ICK). We then adopt a sample-then-optimize approach to approximate the full GP posterior distribution. We demonstrate that ICK has superior performance and flexibility on both synthetic and real-world data sets. We believe that ICK framework can be used to include prior information into neural networks in many applications.  ( 2 min )
    Re-embedding data to strengthen recovery guarantees of clustering. (arXiv:2301.10901v1 [cs.LG])
    We propose a clustering method that involves chaining four known techniques into a pipeline yielding an algorithm with stronger recovery guarantees than any of the four components separately. Given $n$ points in $\mathbb R^d$, the first component of our pipeline, which we call leapfrog distances, is reminiscent of density-based clustering, yielding an $n\times n$ distance matrix. The leapfrog distances are then translated to new embeddings using multidimensional scaling and spectral methods, two other known techniques, yielding new embeddings of the $n$ points in $\mathbb R^{d'}$, where $d'$ satisfies $d'\ll d$ in general. Finally, sum-of-norms (SON) clustering is applied to the re-embedded points. Although the fourth step (SON clustering) can in principle be replaced by any other clustering method, our focus is on provable guarantees of recovery of underlying structure. Therefore, we establish that the re-embedding improves recovery SON clustering, since SON clustering is a well-studied method that already has provable guarantees.  ( 2 min )
    Improving Graph Generation by Restricting Graph Bandwidth. (arXiv:2301.10857v1 [cs.LG])
    Deep graph generative modeling has proven capable of learning the distribution of complex, multi-scale structures characterizing real-world graphs. However, one of the main limitations of existing methods is their large output space, which limits generation scalability and hinders accurate modeling of the underlying distribution. To overcome these limitations, we propose a novel approach that significantly reduces the output space of existing graph generative models. Specifically, starting from the observation that many real-world graphs have low graph bandwidth, we restrict graph bandwidth during training and generation. Our strategy improves both generation scalability and quality without increasing architectural complexity or reducing expressiveness. Our approach is compatible with existing graph generative methods, and we describe its application to both autoregressive and one-shot models. We extensively validate our strategy on synthetic and real datasets, including molecular graphs. Our experiments show that, in addition to improving generation efficiency, our approach consistently improves generation quality and reconstruction accuracy. The implementation is made available.  ( 2 min )
    A Practical Influence Approximation for Privacy-Preserving Data Filtering in Federated Learning. (arXiv:2205.11518v2 [cs.CR] UPDATED)
    Federated Learning by nature is susceptible to low-quality, corrupted, or even malicious data that can severely degrade the quality of the learned model. Traditional techniques for data valuation cannot be applied as the data is never revealed. We present a novel technique for filtering, and scoring data based on a practical influence approximation (`lazy' influence) that can be implemented in a privacy-preserving manner. Each participant uses his own data to evaluate the influence of another participant's batch, and reports to the center an obfuscated score using differential privacy. Our technique allows for highly effective filtering of corrupted data in a variety of applications. Importantly, we show that most of the corrupted data can be filtered out (recall of $>90\%$, and even up to $100\%$), even under really strong privacy guarantees ($\varepsilon \leq 1$).  ( 2 min )
    Qualitative Analysis of a Graph Transformer Approach to Addressing Hate Speech: Adapting to Dynamically Changing Content. (arXiv:2301.10871v1 [cs.LG])
    Our work advances an approach for predicting hate speech in social media, drawing out the critical need to consider the discussions that follow a post to successfully detect when hateful discourse may arise. Using graph transformer networks, coupled with modelling attention and BERT-level natural language processing, our approach can capture context and anticipate upcoming anti-social behaviour. In this paper, we offer a detailed qualitative analysis of this solution for hate speech detection in social networks, leading to insights into where the method has the most impressive outcomes in comparison with competitors and identifying scenarios where there are challenges to achieving ideal performance. Included is an exploration of the kinds of posts that permeate social media today, including the use of hateful images. This suggests avenues for extending our model to be more comprehensive. A key insight is that the focus on reasoning about the concept of context positions us well to be able to support multi-modal analysis of online posts. We conclude with a reflection on how the problem we are addressing relates especially well to the theme of dynamic change, a critical concern for all AI solutions for social impact. We also comment briefly on how mental health well-being can be advanced with our work, through curated content attuned to the extent of hate in posts.  ( 2 min )
    Counterfactual Analysis in Dynamic Latent State Models. (arXiv:2205.13832v2 [cs.LG] UPDATED)
    We provide an optimization-based framework to perform counterfactual analysis in a dynamic model with hidden states. Our framework is grounded in the "abduction, action, and prediction" approach to answer counterfactual queries and handles two key challenges where (1) the states are hidden and (2) the model is dynamic. Recognizing the lack of knowledge on the underlying causal mechanism and the possibility of infinitely many such mechanisms, we optimize over this space and compute upper and lower bounds on the counterfactual quantity of interest. Our work brings together ideas from causality, state-space models, simulation, and optimization, and we apply it on a breast cancer case study. To the best of our knowledge, we are the first to compute lower and upper bounds on a counterfactual query in a dynamic latent-state model.  ( 2 min )
    Assistive Recipe Editing through Critiquing. (arXiv:2205.02454v2 [cs.CL] UPDATED)
    There has recently been growing interest in the automatic generation of cooking recipes that satisfy some form of dietary restrictions, thanks in part to the availability of online recipe data. Prior studies have used pre-trained language models, or relied on small paired recipe data (e.g., a recipe paired with a similar one that satisfies a dietary constraint). However, pre-trained language models generate inconsistent or incoherent recipes, and paired datasets are not available at scale. We address these deficiencies with RecipeCrit, a hierarchical denoising auto-encoder that edits recipes given ingredient-level critiques. The model is trained for recipe completion to learn semantic relationships within recipes. Our work's main innovation is our unsupervised critiquing module that allows users to edit recipes by interacting with the predicted ingredients; the system iteratively rewrites recipes to satisfy users' feedback. Experiments on the Recipe1M recipe dataset show that our model can more effectively edit recipes compared to strong language-modeling baselines, creating recipes that satisfy user constraints and are more correct, serendipitous, coherent, and relevant as measured by human judges.  ( 2 min )
    Scale-Free Adversarial Multi-Armed Bandit with Arbitrary Feedback Delays. (arXiv:2110.13400v3 [cs.LG] UPDATED)
    We consider the Scale-Free Adversarial Multi-Armed Bandit (MAB) problem with unrestricted feedback delays. In contrast to the standard assumption that all losses are $[0,1]$-bounded, in our setting, losses can fall in a general bounded interval $[-L, L]$, unknown to the agent beforehand. Furthermore, the feedback of each arm pull can experience arbitrary delays. We propose a novel approach named Scale-Free Delayed INF (SFD-INF) for this novel setting, which combines a recent "convex combination trick" together with a novel doubling and skipping technique. We then present two instances of SFD-INF, each with carefully designed delay-adapted learning scales. The first one SFD-TINF uses $\frac 12$-Tsallis entropy regularizer and can achieve $\widetilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret when the losses are non-negative, where $K$ is the number of actions, $T$ is the number of steps, and $D$ is the total feedback delay. This bound nearly matches the $\Omega((\sqrt{KT}+\sqrt{D\log K})L)$ lower-bound when regarding $K$ as a constant independent of $T$. The second one, SFD-LBINF, works for general scale-free losses and achieves a small-loss style adaptive regret bound $\widetilde{\mathcal O}(\sqrt{K\mathbb{E}[\tilde{\mathfrak L}_T^2]}+\sqrt{KDL})$, which falls to the $\widetilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret in the worst case and is thus more general than SFD-TINF despite a more complicated analysis and several extra logarithmic dependencies. Moreover, both instances also outperform the existing algorithms for non-delayed (i.e., $D=0$) scale-free adversarial MAB problems, which can be of independent interest.  ( 2 min )
    Stochastic Online Fisher Markets: Static Pricing Limits and Adaptive Enhancements. (arXiv:2205.00825v3 [cs.GT] UPDATED)
    In a Fisher market, agents (users) spend a budget of (artificial) currency to buy goods that maximize their utilities while a central planner sets prices on capacity-constrained goods such that the market clears. However, the efficacy of pricing schemes in achieving an equilibrium outcome in Fisher markets typically relies on complete knowledge of users' budgets and utilities and requires that transactions happen in a static market wherein all users are present simultaneously. As a result, we study an online variant of Fisher markets, wherein budget-constrained users with privately known utility and budget parameters, drawn i.i.d. from a distribution $\mathcal{D}$, enter the market sequentially. In this setting, we develop an algorithm that adjusts prices solely based on observations of user consumption, i.e., revealed preference feedback, and achieves a regret and capacity violation of $O(\sqrt{n})$, where $n$ is the number of users and the good capacities scale as $O(n)$. Here, our regret measure is the optimality gap in the objective of the Eisenberg-Gale program between an online algorithm and an offline oracle with complete information on users' budgets and utilities. To establish the efficacy of our approach, we show that any uniform (static) pricing algorithm, including one that sets expected equilibrium prices with complete knowledge of the distribution $\mathcal{D}$, cannot achieve both a regret and constraint violation of less than $\Omega(\sqrt{n})$. While our revealed preference algorithm requires no knowledge of the distribution $\mathcal{D}$, we show that if $\mathcal{D}$ is known, then an adaptive variant of expected equilibrium pricing achieves $O(\log(n))$ regret and constant capacity violation for discrete distributions. Finally, we present numerical experiments to demonstrate the performance of our revealed preference algorithm relative to several benchmarks.  ( 3 min )
    Revisiting the Adversarial Robustness-Accuracy Tradeoff in Robot Learning. (arXiv:2204.07373v2 [cs.RO] UPDATED)
    Adversarial training (i.e., training on adversarially perturbed input data) is a well-studied method for making neural networks robust to potential adversarial attacks during inference. However, the improved robustness does not come for free but rather is accompanied by a decrease in overall model accuracy and performance. Recent work has shown that, in practical robot learning applications, the effects of adversarial training do not pose a fair trade-off but inflict a net loss when measured in holistic robot performance. This work revisits the robustness-accuracy trade-off in robot learning by systematically analyzing if recent advances in robust training methods and theory in conjunction with adversarial robot learning, are capable of making adversarial training suitable for real-world robot applications. We evaluate three different robot learning tasks ranging from autonomous driving in a high-fidelity environment amenable to sim-to-real deployment to mobile robot navigation and gesture recognition. Our results demonstrate that, while these techniques make incremental improvements on the trade-off on a relative scale, the negative impact on the nominal accuracy caused by adversarial training still outweighs the improved robustness by an order of magnitude. We conclude that although progress is happening, further advances in robust learning methods are necessary before they can benefit robot learning tasks in practice.  ( 2 min )
    Privacy preserving n-party scalar product protocol. (arXiv:2112.09436v4 [cs.CR] UPDATED)
    Privacy-preserving machine learning enables the training of models on decentralized datasets without the need to reveal the data, both on horizontal and vertically partitioned data. However, it relies on specialized techniques and algorithms to perform the necessary computations. The privacy preserving scalar product protocol, which enables the dot product of vectors without revealing them, is one popular example for its versatility. Unfortunately, the solutions currently proposed in the literature focus mainly on two-party scenarios, even though scenarios with a higher number of data parties are becoming more relevant. For example when performing analyses that require counting the number of samples which fulfill certain criteria defined across various sites, such as calculating the information gain at a node in a decision tree. In this paper we propose a generalization of the protocol for an arbitrary number of parties, based on an existing two-party method. Our proposed solution relies on a recursive resolution of smaller scalar products. After describing our proposed method, we discuss potential scalability issues. Finally, we describe the privacy guarantees and identify any concerns, as well as comparing the proposed method to the original solution in this aspect.  ( 2 min )
    A Light-weight Deep Human Activity Recognition Algorithm Using Multi-knowledge Distillation. (arXiv:2107.07331v4 [cs.LG] UPDATED)
    Inertial sensor-based human activity recognition (HAR) is the base of many human-centered mobile applications. Deep learning-based fine-grained HAR models enable accurate classification in various complex application scenarios. Nevertheless, the large storage and computational overhead of the existing fine-grained deep HAR models hinder their widespread deployment on resource-limited platforms. Inspired by the knowledge distillation's reasonable model compression and potential performance improvement capability, we design a multi-level HAR modeling pipeline called Stage-Logits-Memory Distillation (SMLDist) based on the widely-used MobileNet. By paying more attention to the frequency-related features during the distillation process, the SMLDist improves the HAR classification robustness of the students. We also propose an auto-search mechanism in the heterogeneous classifiers to improve classification performance. Extensive simulation results demonstrate that SMLDist outperforms various state-of-the-art HAR frameworks in accuracy and F1 macro score. The practical evaluation of the Jetson Xavier AGX platform shows that the SMLDist model is both energy-efficient and computation-efficient. These experiments validate the reasonable balance between the robustness and efficiency of the proposed model. The comparative experiments of knowledge distillation on six public datasets also demonstrate that the SMLDist outperforms other advanced knowledge distillation methods of students' performance, which verifies the good generalization of the SMLDist on other classification tasks, including but not limited to HAR.  ( 2 min )
    Textual Explanations and Critiques in Recommendation Systems. (arXiv:2205.07268v2 [cs.LG] UPDATED)
    Artificial intelligence and machine learning algorithms have become ubiquitous. Although they offer a wide range of benefits, their adoption in decision-critical fields is limited by their lack of interpretability, particularly with textual data. Moreover, with more data available than ever before, it has become increasingly important to explain automated predictions. Generally, users find it difficult to understand the underlying computational processes and interact with the models, especially when the models fail to generate the outcomes or explanations, or both, correctly. This problem highlights the growing need for users to better understand the models' inner workings and gain control over their actions. This dissertation focuses on two fundamental challenges of addressing this need. The first involves explanation generation: inferring high-quality explanations from text documents in a scalable and data-driven manner. The second challenge consists in making explanations actionable, and we refer to it as critiquing. This dissertation examines two important applications in natural language processing and recommendation tasks. Overall, we demonstrate that interpretability does not come at the cost of reduced performance in two consequential applications. Our framework is applicable to other fields as well. This dissertation presents an effective means of closing the gap between promise and practice in artificial intelligence.  ( 2 min )
    RoFL: Robustness of Secure Federated Learning. (arXiv:2107.03311v4 [cs.CR] UPDATED)
    Even though recent years have seen many attacks exposing severe vulnerabilities in Federated Learning (FL), a holistic understanding of what enables these attacks and how they can be mitigated effectively is still lacking. In this work, we demystify the inner workings of existing (targeted) attacks. We provide new insights into why these attacks are possible and why a definitive solution to FL robustness is challenging. We show that the need for ML algorithms to memorize tail data has significant implications for FL integrity. This phenomenon has largely been studied in the context of privacy; our analysis sheds light on its implications for ML integrity. We show that certain classes of severe attacks can be mitigated effectively by enforcing constraints such as norm bounds on clients' updates. We investigate how to efficiently incorporate these constraints into secure FL protocols in the single-server setting. Based on this, we propose RoFL, a new secure FL system that extends secure aggregation with privacy-preserving input validation. Specifically, RoFL can enforce constraints such as $L_2$ and $L_\infty$ bounds on high-dimensional encrypted model updates.  ( 2 min )
    Telling Stories from Computational Notebooks: AI-Assisted Presentation Slides Creation for Presenting Data Science Work. (arXiv:2203.11085v3 [cs.HC] UPDATED)
    Creating presentation slides is a critical but time-consuming task for data scientists. While researchers have proposed many AI techniques to lift data scientists' burden on data preparation and model selection, few have targeted the presentation creation task. Based on the needs identified from a formative study, this paper presents NB2Slides, an AI system that facilitates users to compose presentations of their data science work. NB2Slides uses deep learning methods as well as example-based prompts to generate slides from computational notebooks, and take users' input (e.g., audience background) to structure the slides. NB2Slides also provides an interactive visualization that links the slides with the notebook to help users further edit the slides. A follow-up user evaluation with 12 data scientists shows that participants believed NB2Slides can improve efficiency and reduces the complexity of creating slides. Yet, participants questioned the future of full automation and suggested a human-AI collaboration paradigm.  ( 2 min )
    Constant matters: Fine-grained Complexity of Differentially Private Continual Observation. (arXiv:2202.11205v5 [cs.DS] UPDATED)
    We study fine-grained error bounds for differentially private algorithms for counting under continual observation. Our main insight is that the matrix mechanism when using lower-triangular matrices can be used in the continual observation model. More specifically, we give an explicit factorization for the counting matrix $M_\mathsf{count}$ and upper bound the error explicitly. We also give a fine-grained analysis, specifying the exact constant in the upper bound. Our analysis is based on upper and lower bounds of the {\em completely bounded norm} (cb-norm) of $M_\mathsf{count}$. Along the way, we improve the best-known bound of 28 years by Mathias (SIAM Journal on Matrix Analysis and Applications, 1993) on the cb-norm of $M_\mathsf{count}$ for a large range of the dimension of $M_\mathsf{count}$. Furthermore, we are the first to give concrete error bounds for various problems under continual observation such as binary counting, maintaining a histogram, releasing an approximately cut-preserving synthetic graph, many graph-based statistics, and substring and episode counting. Finally, we note that our result can be used to get a fine-grained error bound for non-interactive local learning {and the first lower bounds on the additive error for $(\epsilon,\delta)$-differentially-private counting under continual observation.} Subsequent to this work, Henzinger et al. (SODA2023) showed that our factorization also achieves fine-grained mean-squared error.  ( 2 min )
    Banker Online Mirror Descent. (arXiv:2106.08943v2 [cs.LG] UPDATED)
    We propose Banker-OMD, a novel framework generalizing the classical Online Mirror Descent (OMD) technique in online learning algorithm design. Banker-OMD allows algorithms to robustly handle delayed feedback, and offers a general methodology for achieving $\tilde{O}(\sqrt{T} + \sqrt{D})$-style regret bounds in various delayed-feedback online learning tasks, where $T$ is the time horizon length and $D$ is the total feedback delay. We demonstrate the power of Banker-OMD with applications to three important bandit scenarios with delayed feedback, including delayed adversarial Multi-armed bandits (MAB), delayed adversarial linear bandits, and a novel delayed best-of-both-worlds MAB setting. Banker-OMD achieves nearly-optimal performance in all the three settings. In particular, it leads to the first delayed adversarial linear bandit algorithm achieving $\tilde{O}(\text{poly}(n)(\sqrt{T} + \sqrt{D}))$ regret.  ( 2 min )
    Learning to Act Safely with Limited Exposure and Almost Sure Certainty. (arXiv:2105.08748v3 [eess.SY] UPDATED)
    This paper puts forward the concept that learning to take safe actions in unknown environments, even with probability one guarantees, can be achieved without the need for an unbounded number of exploratory trials. This is indeed possible, provided that one is willing to navigate trade-offs between optimality, level of exposure to unsafe events, and the maximum detection time of unsafe actions. We illustrate this concept in two complementary settings. We first focus on the canonical multi-armed bandit problem and study the intrinsic trade-offs of learning safety in the presence of uncertainty. Under mild assumptions on sufficient exploration, we provide an algorithm that provably detects all unsafe machines in an (expected) finite number of rounds. The analysis also unveils a trade-off between the number of rounds needed to secure the environment and the probability of discarding safe machines. We then consider the problem of finding optimal policies for a Markov Decision Process (MDP) with almost sure constraints. We show that the action-value function satisfies a barrier-based decomposition which allows for the identification of feasible policies independently of the reward process. Using this decomposition, we develop a Barrier-learning algorithm, that identifies such unsafe state-action pairs in a finite expected number of steps. Our analysis further highlights a trade-off between the time lag for the underlying MDP necessary to detect unsafe actions, and the level of exposure to unsafe events. Simulations corroborate our theoretical findings, further illustrating the aforementioned trade-offs, and suggesting that safety constraints can speed up the learning process.  ( 2 min )
    Joint Training of Deep Ensembles Fails Due to Learner Collusion. (arXiv:2301.11323v1 [cs.LG])
    Ensembles of machine learning models have been well established as a powerful method of improving performance over a single model. Traditionally, ensembling algorithms train their base learners independently or sequentially with the goal of optimizing their joint performance. In the case of deep ensembles of neural networks, we are provided with the opportunity to directly optimize the true objective: the joint performance of the ensemble as a whole. Surprisingly, however, directly minimizing the loss of the ensemble appears to rarely be applied in practice. Instead, most previous research trains individual models independently with ensembling performed post hoc. In this work, we show that this is for good reason - joint optimization of ensemble loss results in degenerate behavior. We approach this problem by decomposing the ensemble objective into the strength of the base learners and the diversity between them. We discover that joint optimization results in a phenomenon in which base learners collude to artificially inflate their apparent diversity. This pseudo-diversity fails to generalize beyond the training data, causing a larger generalization gap. We proceed to demonstrate the practical implications of this effect finding that, in some cases, a balance between independent training and joint optimization can improve performance over the former while avoiding the degeneracies of the latter.  ( 2 min )
    A Context-based Multi-task Hierarchical Inverse Reinforcement Learning Algorithm. (arXiv:2210.01969v2 [cs.LG] UPDATED)
    Multi-task Imitation Learning (MIL) aims to train a policy capable of performing a distribution of tasks, which is essential for general-purpose robots, based on multi-task expert demonstrations. Existing MIL algorithms suffer from low data efficiency and poor performance on complex long-horizontal tasks. We develop Multi-task Hierarchical Adversarial Inverse Reinforcement Learning (MH-AIRL) to learn hierarchically-structured multi-task policies, which is more beneficial for compositional tasks with long horizons and has higher expert data efficiency through identifying and transferring reusable basic skills across tasks. To realize this, MH-AIRL effectively synthesizes context-based multi-task learning, AIRL (an IL approach), and hierarchical policy learning. Further, MH-AIRL can be adopted to demonstrations without the task or skill annotations (i.e., state-action pairs only) which are more accessible in practice. Theoretical justifications are provided for each module of MH-AIRL, and evaluations on challenging multi-task settings demonstrate superior performance and transferability of the multi-task policies learned with MH-AIRL as compared to SOTA MIL baselines.  ( 2 min )
    Your diffusion model secretly knows the dimension of the data manifold. (arXiv:2212.12611v2 [cs.LG] UPDATED)
    In this work, we propose a novel framework for estimating the dimension of the data manifold using a trained diffusion model. A diffusion model approximates the score function i.e. the gradient of the log density of a noise-corrupted version of the target distribution for varying levels of corruption. If the data concentrates around a manifold embedded in the high-dimensional ambient space, then as the level of corruption decreases, the score function points towards the manifold, as this direction becomes the direction of maximal likelihood increase. Therefore, for small levels of corruption, the diffusion model provides us with access to an approximation of the normal bundle of the data manifold. This allows us to estimate the dimension of the tangent space, thus, the intrinsic dimension of the data manifold. To the best of our knowledge, our method is the first deep-learning based estimator of the data manifold dimension and it outperforms well established statistical estimators in controlled experiments on both Euclidean and image data.  ( 2 min )
    Recursive deep learning framework for forecasting the decadal world economic outlook. (arXiv:2301.10874v1 [cs.LG])
    Gross domestic product (GDP) is the most widely used indicator in macroeconomics and the main tool for measuring a country's economic ouput. Due to the diversity and complexity of the world economy, a wide range of models have been used, but there are challenges in making decadal GDP forecasts given unexpected changes such as pandemics and wars. Deep learning models are well suited for modeling temporal sequences have been applied for time series forecasting. In this paper, we develop a deep learning framework to forecast the GDP growth rate of the world economy over a decade. We use Penn World Table as the source of our data, taking data from 1980 to 2019, across 13 countries, such as Australia, China, India, the United States and so on. We test multiple deep learning models, LSTM, BD-LSTM, ED-LSTM and CNN, and compared their results with the traditional time series model (ARIMA,VAR). Our results indicate that ED-LSTM is the best performing model. We present a recursive deep learning framework to predict the GDP growth rate in the next ten years. We predict that most countries will experience economic growth slowdown, stagnation or even recession within five years; only China, France and India are predicted to experience stable, or increasing, GDP growth.  ( 2 min )
    Robust Vocal Quality Feature Embeddings for Dysphonic Voice Detection. (arXiv:2211.09858v2 [cs.SD] UPDATED)
    Approximately 1.2% of the world's population has impaired voice production. As a result, automatic dysphonic voice detection has attracted considerable academic and clinical interest. However, existing methods for automated voice assessment often fail to generalize outside the training conditions or to other related applications. In this paper, we propose a deep learning framework for generating acoustic feature embeddings sensitive to vocal quality and robust across different corpora. A contrastive loss is combined with a classification loss to train our deep learning model jointly. Data warping methods are used on input voice samples to improve the robustness of our method. Empirical results demonstrate that our method not only achieves high in-corpus and cross-corpus classification accuracy but also generates good embeddings sensitive to voice quality and robust across different corpora. We also compare our results against three baseline methods on clean and three variations of deteriorated in-corpus and cross-corpus datasets and demonstrate that the proposed model consistently outperforms the baseline methods.  ( 2 min )
    Reliable Decision from Multiple Subtasks through Threshold Optimization: Content Moderation in the Wild. (arXiv:2208.07522v5 [cs.LG] UPDATED)
    Social media platforms struggle to protect users from harmful content through content moderation. These platforms have recently leveraged machine learning models to cope with the vast amount of user-generated content daily. Since moderation policies vary depending on countries and types of products, it is common to train and deploy the models per policy. However, this approach is highly inefficient, especially when the policies change, requiring dataset re-labeling and model re-training on the shifted data distribution. To alleviate this cost inefficiency, social media platforms often employ third-party content moderation services that provide prediction scores of multiple subtasks, such as predicting the existence of underage personnel, rude gestures, or weapons, instead of directly providing final moderation decisions. However, making a reliable automated moderation decision from the prediction scores of the multiple subtasks for a specific target policy has not been widely explored yet. In this study, we formulate real-world scenarios of content moderation and introduce a simple yet effective threshold optimization method that searches the optimal thresholds of the multiple subtasks to make a reliable moderation decision in a cost-effective way. Extensive experiments demonstrate that our approach shows better performance in content moderation compared to existing threshold optimization methods and heuristics.  ( 2 min )
    Iterative Teaching by Label Synthesis. (arXiv:2110.14432v5 [cs.LG] UPDATED)
    In this paper, we consider the problem of iterative machine teaching, where a teacher provides examples sequentially based on the current iterative learner. In contrast to previous methods that have to scan over the entire pool and select teaching examples from it in each iteration, we propose a label synthesis teaching framework where the teacher randomly selects input teaching examples (e.g., images) and then synthesizes suitable outputs (e.g., labels) for them. We show that this framework can avoid costly example selection while still provably achieving exponential teachability. We propose multiple novel teaching algorithms in this framework. Finally, we empirically demonstrate the value of our framework.  ( 2 min )
    VAuLT: Augmenting the Vision-and-Language Transformer for Sentiment Classification on Social Media. (arXiv:2208.09021v3 [cs.CV] UPDATED)
    We propose the Vision-and-Augmented-Language Transformer (VAuLT). VAuLT is an extension of the popular Vision-and-Language Transformer (ViLT), and improves performance on vision-and-language (VL) tasks that involve more complex text inputs than image captions while having minimal impact on training and inference efficiency. ViLT, importantly, enables efficient training and inference in VL tasks, achieved by encoding images using a linear projection of patches instead of an object detector. However, it is pretrained on captioning datasets, where the language input is simple, literal, and descriptive, therefore lacking linguistic diversity. So, when working with multimedia data in the wild, such as multimodal social media data, there is a notable shift from captioning language data, as well as diversity of tasks. We indeed find evidence that the language capacity of ViLT is lacking. The key insight and novelty of VAuLT is to propagate the output representations of a large language model (LM) like BERT to the language input of ViLT. We show that joint training of the LM and ViLT can yield relative improvements up to 20% over ViLT and achieve state-of-the-art or comparable performance on VL tasks involving richer language inputs and affective constructs, such as for Target-Oriented Sentiment Classification in TWITTER-2015 and TWITTER-2017, and Sentiment Classification in MVSA-Single and MVSA-Multiple. Our code is available at https://github.com/gchochla/VAuLT.  ( 2 min )
    KSD Aggregated Goodness-of-fit Test. (arXiv:2202.00824v5 [stat.ML] UPDATED)
    We investigate properties of goodness-of-fit tests based on the Kernel Stein Discrepancy (KSD). We introduce a strategy to construct a test, called KSDAgg, which aggregates multiple tests with different kernels. KSDAgg avoids splitting the data to perform kernel selection (which leads to a loss in test power), and rather maximises the test power over a collection of kernels. We provide non-asymptotic guarantees on the power of KSDAgg: we show it achieves the smallest uniform separation rate of the collection, up to a logarithmic term. For compactly supported densities with bounded model score function, we derive the rate for KSDAgg over restricted Sobolev balls; this rate corresponds to the minimax optimal rate over unrestricted Sobolev balls, up to an iterated logarithmic term. KSDAgg can be computed exactly in practice as it relies either on a parametric bootstrap or on a wild bootstrap to estimate the quantiles and the level corrections. In particular, for the crucial choice of bandwidth of a fixed kernel, it avoids resorting to arbitrary heuristics (such as median or standard deviation) or to data splitting. We find on both synthetic and real-world data that KSDAgg outperforms other state-of-the-art quadratic-time adaptive KSD-based goodness-of-fit testing procedures.  ( 2 min )
    Characterizing the Influence of Graph Elements. (arXiv:2210.07441v2 [cs.LG] UPDATED)
    Influence function, a method from robust statistics, measures the changes of model parameters or some functions about model parameters concerning the removal or modification of training instances. It is an efficient and useful post-hoc method for studying the interpretability of machine learning models without the need for expensive model re-training. Recently, graph convolution networks (GCNs), which operate on graph data, have attracted a great deal of attention. However, there is no preceding research on the influence functions of GCNs to shed light on the effects of removing training nodes/edges from an input graph. Since the nodes/edges in a graph are interdependent in GCNs, it is challenging to derive influence functions for GCNs. To fill this gap, we started with the simple graph convolution (SGC) model that operates on an attributed graph and formulated an influence function to approximate the changes in model parameters when a node or an edge is removed from an attributed graph. Moreover, we theoretically analyzed the error bound of the estimated influence of removing an edge. We experimentally validated the accuracy and effectiveness of our influence estimation function. In addition, we showed that the influence function of an SGC model could be used to estimate the impact of removing training nodes/edges on the test performance of the SGC without re-training the model. Finally, we demonstrated how to use influence functions to guide the adversarial attacks on GCNs effectively.  ( 2 min )
    Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. (arXiv:2210.13382v3 [cs.LG] UPDATED)
    Language models show a surprising range of capabilities, but the source of their apparent competence is unclear. Do these networks just memorize a collection of surface statistics, or do they rely on internal representations of the process that generates the sequences they see? We investigate this question by applying a variant of the GPT model to the task of predicting legal moves in a simple board game, Othello. Although the network has no a priori knowledge of the game or its rules, we uncover evidence of an emergent nonlinear internal representation of the board state. Interventional experiments indicate this representation can be used to control the output of the network and create "latent saliency maps" that can help explain predictions in human terms.  ( 2 min )
    Review of Natural Language Processing in Pharmacology. (arXiv:2208.10228v2 [cs.CL] UPDATED)
    Natural language processing (NLP) is an area of artificial intelligence that applies information technologies to process the human language, understand it to a certain degree, and use it in various applications. This area has rapidly developed in the last few years and now employs modern variants of deep neural networks to extract relevant patterns from large text corpora. The main objective of this work is to survey the recent use of NLP in the field of pharmacology. As our work shows, NLP is a highly relevant information extraction and processing approach for pharmacology. It has been used extensively, from intelligent searches through thousands of medical documents to finding traces of adversarial drug interactions in social media. We split our coverage into five categories to survey modern NLP methodology, commonly addressed tasks, relevant textual data, knowledge bases, and useful programming libraries. We split each of the five categories into appropriate subcategories, describe their main properties and ideas, and summarize them in a tabular form. The resulting survey presents a comprehensive overview of the area, useful to practitioners and interested observers.  ( 2 min )
    Cut and Learn for Unsupervised Object Detection and Instance Segmentation. (arXiv:2301.11320v1 [cs.CV])
    We propose Cut-and-LEaRn (CutLER), a simple approach for training unsupervised object detection and segmentation models. We leverage the property of self-supervised models to 'discover' objects without supervision and amplify it to train a state-of-the-art localization model without any human labels. CutLER first uses our proposed MaskCut approach to generate coarse masks for multiple objects in an image and then learns a detector on these masks using our robust loss function. We further improve the performance by self-training the model on its predictions. Compared to prior work, CutLER is simpler, compatible with different detection architectures, and detects multiple objects. CutLER is also a zero-shot unsupervised detector and improves detection performance AP50 by over 2.7 times on 11 benchmarks across domains like video frames, paintings, sketches, etc. With finetuning, CutLER serves as a low-shot detector surpassing MoCo-v2 by 7.3% APbox and 6.6% APmask on COCO when training with 5% labels.  ( 2 min )
    Smoothed Online Learning for Prediction in Piecewise Affine Systems. (arXiv:2301.11187v1 [stat.ML])
    The problem of piecewise affine (PWA) regression and planning is of foundational importance to the study of online learning, control, and robotics, where it provides a theoretically and empirically tractable setting to study systems undergoing sharp changes in the dynamics. Unfortunately, due to the discontinuities that arise when crossing into different ``pieces,'' learning in general sequential settings is impossible and practical algorithms are forced to resort to heuristic approaches. This paper builds on the recently developed smoothed online learning framework and provides the first algorithms for prediction and simulation in PWA systems whose regret is polynomial in all relevant problem parameters under a weak smoothness assumption; moreover, our algorithms are efficient in the number of calls to an optimization oracle. We further apply our results to the problems of one-step prediction and multi-step simulation regret in piecewise affine dynamical systems, where the learner is tasked with simulating trajectories and regret is measured in terms of the Wasserstein distance between simulated and true data. Along the way, we develop several technical tools of more general interest.  ( 2 min )
    Uncertain Evidence in Probabilistic Models and Stochastic Simulators. (arXiv:2210.12236v2 [stat.ML] UPDATED)
    We consider the problem of performing Bayesian inference in probabilistic models where observations are accompanied by uncertainty, referred to as "uncertain evidence." We explore how to interpret uncertain evidence, and by extension the importance of proper interpretation as it pertains to inference about latent variables. We consider a recently-proposed method "distributional evidence" as well as revisit two older methods: Jeffrey's rule and virtual evidence. We devise guidelines on how to account for uncertain evidence and we provide new insights, particularly regarding consistency. To showcase the impact of different interpretations of the same uncertain evidence, we carry out experiments in which one interpretation is defined as "correct." We then compare inference results from each different interpretation illustrating the importance of careful consideration of uncertain evidence.  ( 2 min )
    Certified Interpretability Robustness for Class Activation Mapping. (arXiv:2301.11324v1 [cs.LG])
    Interpreting machine learning models is challenging but crucial for ensuring the safety of deep networks in autonomous driving systems. Due to the prevalence of deep learning based perception models in autonomous vehicles, accurately interpreting their predictions is crucial. While a variety of such methods have been proposed, most are shown to lack robustness. Yet, little has been done to provide certificates for interpretability robustness. Taking a step in this direction, we present CORGI, short for Certifiably prOvable Robustness Guarantees for Interpretability mapping. CORGI is an algorithm that takes in an input image and gives a certifiable lower bound for the robustness of the top k pixels of its CAM interpretability map. We show the effectiveness of CORGI via a case study on traffic sign data, certifying lower bounds on the minimum adversarial perturbation not far from (4-5x) state-of-the-art attack methods.  ( 2 min )
    Predictive Crypto-Asset Automated Market Making Architecture for Decentralized Finance using Deep Reinforcement Learning. (arXiv:2211.01346v2 [q-fin.TR] UPDATED)
    The study proposes a quote-driven predictive automated market maker (AMM) platform with on-chain custody and settlement functions, alongside off-chain predictive reinforcement learning capabilities to improve liquidity provision of real-world AMMs. The proposed AMM architecture is an augmentation to the Uniswap V3, a cryptocurrency AMM protocol, by utilizing a novel market equilibrium pricing for reduced divergence and slippage loss. Further, the proposed architecture involves a predictive AMM capability, utilizing a deep hybrid Long Short-Term Memory (LSTM) and Q-learning reinforcement learning framework that looks to improve market efficiency through better forecasts of liquidity concentration ranges, so liquidity starts moving to expected concentration ranges, prior to asset price movement, so that liquidity utilization is improved. The augmented protocol framework is expected have practical real-world implications, by (i) reducing divergence loss for liquidity providers, (ii) reducing slippage for crypto-asset traders, while (iii) improving capital efficiency for liquidity provision for the AMM protocol. To our best knowledge, there are no known protocol or literature that are proposing similar deep learning-augmented AMM that achieves similar capital efficiency and loss minimization objectives for practical real-world applications.  ( 2 min )
    What you need to know to train recurrent neural networks to make Flip Flops memories and more. (arXiv:2010.07858v3 [cs.LG] UPDATED)
    Training neural networks to perform different tasks is relevant across various disciplines beyond Machine Learning. In particular, Recurrent Neural Networks (RNNs) are of great interest to different scientific communities. Open-source frameworks dedicated to Machine Learning, such as Tensorflow [1] and Keras [2] have produced significant changes in the development of technologies that we currently use. One relevant problem that can be approached with them is how to build the models to study dynamical systems and the brain. Specifically, how to extract the relevant information to answer the scientific questions of interest. The purpose of the present work is to contribute to this aim by analyzing a temporal processing task, in this case, a 3-bit Flip Flop memory. The modelling procedure in every step is shown: from equations to the software development. The networks obtained were analyzed to describe the dynamics and to show different visualization and analysis tools. The code developed in this premier is also provided to be used for modelling other tasks or systems.  ( 2 min )
    Neural Continuous-Discrete State Space Models for Irregularly-Sampled Time Series. (arXiv:2301.11308v1 [cs.LG])
    Learning accurate predictive models of real-world dynamic phenomena (e.g., climate, biological) remains a challenging task. One key issue is that the data generated by both natural and artificial processes often comprise time series that are irregularly sampled and/or contain missing observations. In this work, we propose the Neural Continuous-Discrete State Space Model (NCDSSM) for continuous-time modeling of time series through discrete-time observations. NCDSSM employs auxiliary variables to disentangle recognition from dynamics, thus requiring amortized inference only for the auxiliary variables. Leveraging techniques from continuous-discrete filtering theory, we demonstrate how to perform accurate Bayesian inference for the dynamic states. We propose three flexible parameterizations of the latent dynamics and an efficient training objective that marginalizes the dynamic states during inference. Empirical results on multiple benchmark datasets across various domains show improved imputation and forecasting performance of NCDSSM over existing models.  ( 2 min )
    Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning. (arXiv:2301.11321v1 [cs.LG])
    Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, but counteracting off-policy bias without exacerbating variance is challenging. Classically, off-policy bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio after each action via eligibility traces. Many off-policy algorithms rely on this mechanism, along with differing protocols for cutting the IS ratios to combat the variance of the IS estimator. Unfortunately, once a trace has been fully cut, the effect cannot be reversed. This has led to the development of credit-assignment strategies that account for multiple past experiences at a time. These trajectory-aware methods have not been extensively analyzed, and their theoretical justification remains uncertain. In this paper, we propose a multistep operator that can express both per-decision and trajectory-aware methods. We prove convergence conditions for our operator in the tabular setting, establishing the first guarantees for several existing methods as well as many new ones. Finally, we introduce Recency-Bounded Importance Sampling (RBIS), which leverages trajectory awareness to perform robustly across $\lambda$-values in an off-policy control task.  ( 2 min )
    Anatomy-aware and acquisition-agnostic joint registration with SynthMorph. (arXiv:2301.11329v1 [eess.IV])
    Affine image registration is a cornerstone of medical-image processing and analysis. While classical algorithms can achieve excellent accuracy, they solve a time-consuming optimization for every new image pair. Deep-learning (DL) methods learn a function that maps an image pair to an output transform. Evaluating the functions is fast, but capturing large transforms can be challenging, and networks tend to struggle if a test-image characteristic shifts from the training domain, such as the contrast or resolution. A majority of affine methods are also agnostic to the anatomy the user wishes to align; the registration will be inaccurate if algorithms consider all structures in the image. We address these shortcomings with a fast, robust, and easy-to-use DL tool for affine and deformable registration of any brain image without preprocessing, right off the MRI scanner. First, we rigorously analyze how competing architectures learn affine transforms across a diverse set of neuroimaging data, aiming to truly capture the behavior of methods in the real world. Second, we leverage a recent strategy to train networks with wildly varying images synthesized from label maps, yielding robust performance across acquisition specifics. Third, we optimize the spatial overlap of select anatomical labels, which enables networks to distinguish between anatomy of interest and irrelevant structures, removing the need for preprocessing that excludes content that would otherwise reduce the accuracy of anatomy-specific registration. We combine the affine model with prior work on deformable registration and test brain-specific registration across a landscape of MRI protocols unseen at training, demonstrating consistent and improved accuracy compared to existing tools. We distribute our code and tool at https://w3id.org/synthmorph, providing a single complete end-to-end solution for registration of brain MRI.  ( 3 min )
    Coin Sampling: Gradient-Based Bayesian Inference without Learning Rates. (arXiv:2301.11294v1 [stat.ML])
    In recent years, particle-based variational inference (ParVI) methods such as Stein variational gradient descent (SVGD) have grown in popularity as scalable methods for Bayesian inference. Unfortunately, the properties of such methods invariably depend on hyperparameters such as the learning rate, which must be carefully tuned by the practitioner in order to ensure convergence to the target measure at a suitable rate. In this paper, we introduce a suite of new particle-based methods for scalable Bayesian inference based on coin betting, which are entirely learning-rate free. We illustrate the performance of our approach on a range of numerical examples, including several high-dimensional models and datasets, demonstrating comparable performance to other ParVI algorithms.  ( 2 min )
    ZiCo: Zero-shot NAS via Inverse Coefficient of Variation on Gradients. (arXiv:2301.11300v1 [cs.LG])
    Neural Architecture Search (NAS) is widely used to automatically design the neural network with the best performance among a large number of candidate architectures. To reduce the search time, zero-shot NAS aims at designing training-free proxies that can predict the test performance of a given architecture. However, as shown recently, none of the zero-shot proxies proposed to date can actually work consistently better than a naive proxy, namely, the number of network parameters (#Params). To improve this state of affairs, as the main theoretical contribution, we first reveal how some specific gradient properties across different samples impact the convergence rate and generalization capacity of neural networks. Based on this theoretical analysis, we propose a new zero-shot proxy, ZiCo, the first proxy that works consistently better than #Params. We demonstrate that ZiCo works better than State-Of-The-Art (SOTA) proxies on several popular NAS-Benchmarks (NASBench101, NATSBench-SSS/TSS, TransNASBench-101) for multiple applications (e.g., image classification/reconstruction and pixel-level prediction). Finally, we demonstrate that the optimal architectures found via ZiCo are as competitive as the ones found by one-shot and multi-shot NAS methods, but with much less search time. For example, ZiCo-based NAS can find optimal architectures with 78.1%, 79.4%, and 80.4% test accuracy under inference budgets of 450M, 600M, and 1000M FLOPs on ImageNet within 0.4 GPU days.  ( 2 min )
    Open Problems in Applied Deep Learning. (arXiv:2301.11316v1 [cs.LG])
    This work formulates the machine learning mechanism as a bi-level optimization problem. The inner level optimization loop entails minimizing a properly chosen loss function evaluated on the training data. This is nothing but the well-studied training process in pursuit of optimal model parameters. The outer level optimization loop is less well-studied and involves maximizing a properly chosen performance metric evaluated on the validation data. This is what we call the "iteration process", pursuing optimal model hyper-parameters. Among many other degrees of freedom, this process entails model engineering (e.g., neural network architecture design) and management, experiment tracking, dataset versioning and augmentation. The iteration process could be automated via Automatic Machine Learning (AutoML) or left to the intuitions of machine learning students, engineers, and researchers. Regardless of the route we take, there is a need to reduce the computational cost of the iteration step and as a direct consequence reduce the carbon footprint of developing artificial intelligence algorithms. Despite the clean and unified mathematical formulation of the iteration step as a bi-level optimization problem, its solutions are case specific and complex. This work will consider such cases while increasing the level of complexity from supervised learning to semi-supervised, self-supervised, unsupervised, few-shot, federated, reinforcement, and physics-informed learning. As a consequence of this exercise, this proposal surfaces a plethora of open problems in the field, many of which can be addressed in parallel.  ( 2 min )
    BayesSpeech: A Bayesian Transformer Network for Automatic Speech Recognition. (arXiv:2301.11276v1 [eess.AS])
    Recent developments using End-to-End Deep Learning models have been shown to have near or better performance than state of the art Recurrent Neural Networks (RNNs) on Automatic Speech Recognition tasks. These models tend to be lighter weight and require less training time than traditional RNN-based approaches. However, these models take frequentist approach to weight training. In theory, network weights are drawn from a latent, intractable probability distribution. We introduce BayesSpeech for end-to-end Automatic Speech Recognition. BayesSpeech is a Bayesian Transformer Network where these intractable posteriors are learned through variational inference and the local reparameterization trick without recurrence. We show how the introduction of variance in the weights leads to faster training time and near state-of-the-art performance on LibriSpeech-960.  ( 2 min )
    Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons. (arXiv:2301.11270v1 [cs.LG])
    We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). Our analysis shows that when the true reward function is linear, the widely used maximum likelihood estimator (MLE) converges under both the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. However, we show that when training a policy based on the learned reward model, MLE fails while a pessimistic MLE provides policies with improved performance under certain coverage assumptions. Additionally, we demonstrate that under the PL model, the true MLE and an alternative MLE that splits the $K$-wise comparison into pairwise comparisons both converge. Moreover, the true MLE is asymptotically more efficient. Our results validate the empirical success of existing RLHF algorithms in InstructGPT and provide new insights for algorithm design. Furthermore, our results unify the problem of RLHF and Max Entropy Inverse Reinforcement Learning, and provide the first sample complexity bound for both problems.  ( 2 min )
    Graph Encoder Ensemble for Simultaneous Vertex Embedding and Community Detection. (arXiv:2301.11290v1 [cs.SI])
    In this paper we propose a novel and computationally efficient method to simultaneously achieve vertex embedding, community detection, and community size determination. By utilizing a normalized one-hot graph encoder and a new rank-based cluster size measure, the proposed graph encoder ensemble algorithm achieves excellent numerical performance throughout a variety of simulations and real data experiments.  ( 2 min )
    Understanding Finetuning for Factual Knowledge Extraction from Language Models. (arXiv:2301.11293v1 [cs.CL])
    Language models (LMs) pretrained on large corpora of text from the web have been observed to contain large amounts of various types of knowledge about the world. This observation has led to a new and exciting paradigm in knowledge graph construction where, instead of manual curation or text mining, one extracts knowledge from the parameters of an LM. Recently, it has been shown that finetuning LMs on a set of factual knowledge makes them produce better answers to queries from a different set, thus making finetuned LMs a good candidate for knowledge extraction and, consequently, knowledge graph construction. In this paper, we analyze finetuned LMs for factual knowledge extraction. We show that along with its previously known positive effects, finetuning also leads to a (potentially harmful) phenomenon which we call Frequency Shock, where at the test time the model over-predicts rare entities that appear in the training set and under-predicts common entities that do not appear in the training set enough times. We show that Frequency Shock leads to a degradation in the predictions of the model and beyond a point, the harm from Frequency Shock can even outweigh the positive effects of finetuning, making finetuning harmful overall. We then consider two solutions to remedy the identified negative effect: 1- model mixing and 2- mixture finetuning with the LM's pre-training task. The two solutions combined lead to significant improvements compared to vanilla finetuning.  ( 2 min )
    Text-To-4D Dynamic Scene Generation. (arXiv:2301.11280v1 [cs.CV])
    We present MAV3D (Make-A-Video3D), a method for generating three-dimensional dynamic scenes from text descriptions. Our approach uses a 4D dynamic Neural Radiance Field (NeRF), which is optimized for scene appearance, density, and motion consistency by querying a Text-to-Video (T2V) diffusion-based model. The dynamic video output generated from the provided text can be viewed from any camera location and angle, and can be composited into any 3D environment. MAV3D does not require any 3D or 4D data and the T2V model is trained only on Text-Image pairs and unlabeled videos. We demonstrate the effectiveness of our approach using comprehensive quantitative and qualitative experiments and show an improvement over previously established internal baselines. To the best of our knowledge, our method is the first to generate 3D dynamic scenes given a text description.  ( 2 min )
    Classification of vertices on social networks by multiple approaches. (arXiv:2301.11288v1 [cs.SI])
    Due to the advent of the expressions of data other than tabular formats, the topological compositions which make samples interrelated came into prominence. Analogically, those networks can be interpreted as social connections, dataflow maps, citation influence graphs, protein bindings, etc. However, in the case of social networks, it is highly crucial to evaluate the labels of discrete communities. The reason underneath for such a study is the non-negligible importance of analyzing graph networks to partition the vertices by using the topological features of network graphs, solely. For each of these interaction-based entities, a social graph, a mailing dataset, and two citation sets are selected as the testbench repositories. This paper, it was not only assessed the most valuable method but also determined how graph neural networks work and the need to improve against non-neural network approaches which are faster and computationally cost-effective. Also, this paper showed a limit to be excesses by prospective graph neural network variations by using the topological features of networks trialed.  ( 2 min )
    Gaussian process regression and conditional Karhunen-Lo\'{e}ve models for data assimilation in inverse problems. (arXiv:2301.11279v1 [cs.LG])
    We present a model inversion algorithm, CKLEMAP, for data assimilation and parameter estimation in partial differential equation models of physical systems with spatially heterogeneous parameter fields. These fields are approximated using low-dimensional conditional Karhunen-Lo\'{e}ve expansions, which are constructed using Gaussian process regression models of these fields trained on the parameters' measurements. We then assimilate measurements of the state of the system and compute the maximum a posteriori estimate of the CKLE coefficients by solving a nonlinear least-squares problem. When solving this optimization problem, we efficiently compute the Jacobian of the vector objective by exploiting the sparsity structure of the linear system of equations associated with the forward solution of the physics problem. The CKLEMAP method provides better scalability compared to the standard MAP method. In the MAP method, the number of unknowns to be estimated is equal to the number of elements in the numerical forward model. On the other hand, in CKLEMAP, the number of unknowns (CKLE coefficients) is controlled by the smoothness of the parameter field and the number of measurements, and is in general much smaller than the number of discretization nodes, which leads to a significant reduction of computational cost with respect to the standard MAP method. To show its advantage in scalability, we apply CKLEMAP to estimate the transmissivity field in a two-dimensional steady-state subsurface flow model of the Hanford Site by assimilating synthetic measurements of transmissivity and hydraulic head. We find that the execution time of CKLEMAP scales nearly linearly as $N^{1.33}$, where $N$ is the number of discretization nodes, while the execution time of standard MAP scales as $N^{2.91}$. The CKLEMAP method improved execution time without sacrificing accuracy when compared to the standard MAP.  ( 3 min )
    Real-Time Digital Twins: Vision and Research Directions for 6G and Beyond. (arXiv:2301.11283v1 [eess.SP])
    This article presents a vision where \textit{real-time} digital twins of the physical wireless environments are continuously updated using multi-modal sensing data from the distributed infrastructure and user devices, and are used to make communication and sensing decisions. This vision is mainly enabled by the advances in precise 3D maps, multi-modal sensing, ray-tracing computations, and machine/deep learning. This article details this vision, explains the different approaches for constructing and utilizing these real-time digital twins, discusses the applications and open problems, and presents a research platform that can be used to investigate various digital twin research directions.  ( 2 min )
    Online Convex Optimization with Stochastic Constraints: Zero Constraint Violation and Bandit Feedback. (arXiv:2301.11267v1 [math.OC])
    This paper studies online convex optimization with stochastic constraints. We propose a variant of the drift-plus-penalty algorithm that guarantees $O(\sqrt{T})$ expected regret and zero constraint violation, after a fixed number of iterations, which improves the vanilla drift-plus-penalty method with $O(\sqrt{T})$ constraint violation. Our algorithm is oblivious to the length of the time horizon $T$, in contrast to the vanilla drift-plus-penalty method. This is based on our novel drift lemma that provides time-varying bounds on the virtual queue drift and, as a result, leads to time-varying bounds on the expected virtual queue length. Moreover, we extend our framework to stochastic-constrained online convex optimization under two-point bandit feedback. We show that by adapting our algorithmic framework to the bandit feedback setting, we may still achieve $O(\sqrt{T})$ expected regret and zero constraint violation, improving upon the previous work for the case of identical constraint functions. Numerical results demonstrate our theoretical results.  ( 2 min )
    AlignGraph: A Group of Generative Models for Graphs. (arXiv:2301.11273v1 [cs.SI])
    It is challenging for generative models to learn a distribution over graphs because of the lack of permutation invariance: nodes may be ordered arbitrarily across graphs, and standard graph alignment is combinatorial and notoriously expensive. We propose AlignGraph, a group of generative models that combine fast and efficient graph alignment methods with a family of deep generative models that are invariant to node permutations. Our experiments demonstrate that our framework successfully learns graph distributions, outperforming competitors by 25% -560% in relevant performance scores.  ( 2 min )
    Maximum Optimality Margin: A Unified Approach for Contextual Linear Programming and Inverse Linear Programming. (arXiv:2301.11260v1 [cs.LG])
    In this paper, we study the predict-then-optimize problem where the output of a machine learning prediction task is used as the input of some downstream optimization problem, say, the objective coefficient vector of a linear program. The problem is also known as predictive analytics or contextual linear programming. The existing approaches largely suffer from either (i) optimization intractability (a non-convex objective function)/statistical inefficiency (a suboptimal generalization bound) or (ii) requiring strong condition(s) such as no constraint or loss calibration. We develop a new approach to the problem called \textit{maximum optimality margin} which designs the machine learning loss function by the optimality condition of the downstream optimization. The max-margin formulation enjoys both computational efficiency and good theoretical properties for the learning procedure. More importantly, our new approach only needs the observations of the optimal solution in the training data rather than the objective function, which makes it a new and natural approach to the inverse linear programming problem under both contextual and context-free settings; we also analyze the proposed method under both offline and online settings, and demonstrate its performance using numerical experiments.  ( 2 min )
    A Benchmark Study by using various Machine Learning Models for Predicting Covid-19 trends. (arXiv:2301.11257v1 [cs.LG])
    Machine learning and deep learning play vital roles in predicting diseases in the medical field. Machine learning algorithms are widely classified as supervised, unsupervised, and reinforcement learning. This paper contains a detailed description of our experimental research work in that we used a supervised machine-learning algorithm to build our model for outbreaks of the novel Coronavirus that has spread over the whole world and caused many deaths, which is one of the most disastrous Pandemics in the history of the world. The people suffered physically and economically to survive in this lockdown. This work aims to understand better how machine learning, ensemble, and deep learning models work and are implemented in the real dataset. In our work, we are going to analyze the current trend or pattern of the coronavirus and then predict the further future of the covid-19 confirmed cases or new cases by training the past Covid-19 dataset by using the machine learning algorithm such as Linear Regression, Polynomial Regression, K-nearest neighbor, Decision Tree, Support Vector Machine and Random forest algorithm are used to train the model. The decision tree and the Random Forest algorithm perform better than SVR in this work. The performance of SVR and lasso regression are low in all prediction areas Because the SVR is challenging to separate the data using the hyperplane for this type of problem. So SVR mostly gives a lower performance in this problem. Ensemble (Voting, Bagging, and Stacking) and deep learning models(ANN) also predict well. After the prediction, we evaluated the model using MAE, MSE, RMSE, and MAPE. This work aims to find the trend/pattern of the covid-19.  ( 3 min )
    Molecular Language Model as Multi-task Generator. (arXiv:2301.11259v1 [cs.LG])
    Molecule generation with desired properties has grown immensely in popularity by disruptively changing the way scientists design molecular structures and providing support for chemical and materials design. However, despite the promising outcome, previous machine learning-based deep generative models suffer from a reliance on complex, task-specific fine-tuning, limited dimensional latent spaces, or the quality of expert rules. In this work, we propose MolGen, a pre-trained molecular language model that effectively learns and shares knowledge across multiple generation tasks and domains. Specifically, we pre-train MolGen with the chemical language SELFIES on more than 100 million unlabelled molecules. We further propose multi-task molecular prefix tuning across several molecular generation tasks and different molecular domains (synthetic & natural products) with a self-feedback mechanism. Extensive experiments show that MolGen can obtain superior performances on well-known molecular generation benchmark datasets. The further analysis illustrates that MolGen can accurately capture the distribution of molecules, implicitly learn their structural characteristics, and efficiently explore the chemical space with the guidance of multi-task molecular prefix tuning. Codes, datasets, and the pre-trained model will be available in https://github.com/zjunlp/MolGen.  ( 2 min )
    BiBench: Benchmarking and Analyzing Network Binarization. (arXiv:2301.11233v1 [cs.CV])
    Network binarization emerges as one of the most promising compression approaches offering extraordinary computation and memory savings by minimizing the bit-width. However, recent research has shown that applying existing binarization algorithms to diverse tasks, architectures, and hardware in realistic scenarios is still not straightforward. Common challenges of binarization, such as accuracy degradation and efficiency limitation, suggest that its attributes are not fully understood. To close this gap, we present BiBench, a rigorously designed benchmark with in-depth analysis for network binarization. We first carefully scrutinize the requirements of binarization in the actual production and define evaluation tracks and metrics for a comprehensive and fair investigation. Then, we evaluate and analyze a series of milestone binarization algorithms that function at the operator level and with extensive influence. Our benchmark reveals that 1) the binarized operator has a crucial impact on the performance and deployability of binarized networks; 2) the accuracy of binarization varies significantly across different learning tasks and neural architectures; 3) binarization has demonstrated promising efficiency potential on edge devices despite the limited hardware support. The results and analysis also lead to a promising paradigm for accurate and efficient binarization. We believe that BiBench will contribute to the broader adoption of binarization and serve as a foundation for future research.  ( 2 min )
    Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data. (arXiv:2301.11174v1 [cs.CV])
    We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models. Constructing a large-scale labeled image captioning dataset is an expensive task in terms of labor, time, and cost. In contrast to manually annotating all the training samples, separately collecting uni-modal datasets is immensely easier, e.g., a large-scale image dataset and a sentence dataset. We leverage such massive unpaired image and caption data upon standard paired data by learning to associate them. To this end, our proposed semi-supervised learning method assigns pseudo-labels to unpaired samples in an adversarial learning fashion, where the joint distribution of image and caption is learned. Our method trains a captioner to learn from a paired data and to progressively associate unpaired data. This approach shows noticeable performance improvement even in challenging scenarios including out-of-task data (i.e., relational captioning, where the target task is different from the unpaired data) and web-crawled data. We also show that our proposed method is theoretically well-motivated and has a favorable global optimal property. Our extensive and comprehensive empirical results both on (1) image-based and (2) dense region-based captioning datasets followed by comprehensive analysis on the scarcely-paired COCO dataset demonstrate the consistent effectiveness of our semisupervised learning method with unpaired data compared to competing methods.  ( 2 min )
    Causal Graph Discovery from Self and Mutually Exciting Time Series. (arXiv:2301.11197v1 [cs.LG])
    We present a generalized linear structural causal model, coupled with a novel data-adaptive linear regularization, to recover causal directed acyclic graphs (DAGs) from time series. By leveraging a recently developed stochastic monotone Variational Inequality (VI) formulation, we cast the causal discovery problem as a general convex optimization. Furthermore, we develop a non-asymptotic recovery guarantee and quantifiable uncertainty by solving a linear program to establish confidence intervals for a wide range of non-linear monotone link functions. We validate our theoretical results and show the competitive performance of our method via extensive numerical experiments. Most importantly, we demonstrate the effectiveness of our approach in recovering highly interpretable causal DAGs over Sepsis Associated Derangements (SADs) while achieving comparable prediction performance to powerful ``black-box'' models such as XGBoost. Thus, the future adoption of our proposed method to conduct continuous surveillance of high-risk patients by clinicians is much more likely.  ( 2 min )
    Returning The Favour: When Regression Benefits From Probabilistic Causal Knowledge. (arXiv:2301.11214v1 [stat.ML])
    A directed acyclic graph (DAG) provides valuable prior knowledge that is often discarded in regression tasks in machine learning. We show that the independences arising from the presence of collider structures in DAGs provide meaningful inductive biases, which constrain the regression hypothesis space and improve predictive performance. We introduce collider regression, a framework to incorporate probabilistic causal knowledge from a collider in a regression problem. When the hypothesis space is a reproducing kernel Hilbert space, we prove a strictly positive generalisation benefit under mild assumptions and provide closed-form estimators of the empirical risk minimiser. Experiments on synthetic and climate model data demonstrate performance gains of the proposed methodology.  ( 2 min )
    A Graph Neural Network with Negative Message Passing for Graph Coloring. (arXiv:2301.11164v1 [cs.LG])
    Graph neural networks have received increased attention over the past years due to their promising ability to handle graph-structured data, which can be found in many real-world problems such as recommended systems and drug synthesis. Most existing research focuses on using graph neural networks to solve homophilous problems, but little attention has been paid to heterophily-type problems. In this paper, we propose a graph network model for graph coloring, which is a class of representative heterophilous problems. Different from the conventional graph networks, we introduce negative message passing into the proposed graph neural network for more effective information exchange in handling graph coloring problems. Moreover, a new loss function taking into account the self-information of the nodes is suggested to accelerate the learning process. Experimental studies are carried out to compare the proposed graph model with five state-of-the-art algorithms on ten publicly available graph coloring problems and one real-world application. Numerical results demonstrate the effectiveness of the proposed graph neural network.  ( 2 min )
    Deep Laplacian-based Options for Temporally-Extended Exploration. (arXiv:2301.11181v1 [cs.LG])
    Selecting exploratory actions that generate a rich stream of experience for better learning is a fundamental challenge in reinforcement learning (RL). An approach to tackle this problem consists in selecting actions according to specific policies for an extended period of time, also known as options. A recent line of work to derive such exploratory options builds upon the eigenfunctions of the graph Laplacian. Importantly, until now these methods have been mostly limited to tabular domains where (1) the graph Laplacian matrix was either given or could be fully estimated, (2) performing eigendecomposition on this matrix was computationally tractable, and (3) value functions could be learned exactly. Additionally, these methods required a separate option discovery phase. These assumptions are fundamentally not scalable. In this paper we address these limitations and show how recent results for directly approximating the eigenfunctions of the Laplacian can be leveraged to truly scale up options-based exploration. To do so, we introduce a fully online deep RL algorithm for discovering Laplacian-based options and evaluate our approach on a variety of pixel-based tasks. We compare to several state-of-the-art exploration methods and show that our approach is effective, general, and especially promising in non-stationary settings.  ( 2 min )
    Flex-Net: A Graph Neural Network Approach to Resource Management in Flexible Duplex Networks. (arXiv:2301.11166v1 [cs.NI])
    Flexible duplex networks allow users to dynamically employ uplink and downlink channels without static time scheduling, thereby utilizing the network resources efficiently. This work investigates the sum-rate maximization of flexible duplex networks. In particular, we consider a network with pairwise-fixed communication links. Corresponding combinatorial optimization is a non-deterministic polynomial (NP)-hard without a closed-form solution. In this respect, the existing heuristics entail high computational complexity, raising a scalability issue in large networks. Motivated by the recent success of Graph Neural Networks (GNNs) in solving NP-hard wireless resource management problems, we propose a novel GNN architecture, named Flex-Net, to jointly optimize the communication direction and transmission power. The proposed GNN produces near-optimal performance meanwhile maintaining a low computational complexity compared to the most commonly used techniques. Furthermore, our numerical results shed light on the advantages of using GNNs in terms of sample complexity, scalability, and generalization capability.  ( 2 min )
    Train Hard, Fight Easy: Robust Meta Reinforcement Learning. (arXiv:2301.11147v1 [cs.LG])
    A major challenge of reinforcement learning (RL) in real-world applications is the variation between environments, tasks or clients. Meta-RL (MRL) addresses this issue by learning a meta-policy that adapts to new tasks. Standard MRL methods optimize the average return over tasks, but often suffer from poor results in tasks of high risk or difficulty. This limits system reliability whenever test tasks are not known in advance. In this work, we propose a robust MRL objective with a controlled robustness level. Optimization of analogous robust objectives in RL often leads to both biased gradients and data inefficiency. We prove that the former disappears in MRL, and address the latter via the novel Robust Meta RL algorithm (RoML). RoML is a meta-algorithm that generates a robust version of any given MRL algorithm, by identifying and over-sampling harder tasks throughout training. We demonstrate that RoML learns substantially different meta-policies and achieves robust returns on several navigation and continuous control benchmarks.  ( 2 min )
    Which Experiences Are Influential for Your Agent? Policy Iteration with Turn-over Dropout. (arXiv:2301.11168v1 [cs.LG])
    In reinforcement learning (RL) with experience replay, experiences stored in a replay buffer influence the RL agent's performance. Information about the influence is valuable for various purposes, including experience cleansing and analysis. One method for estimating the influence of individual experiences is agent comparison, but it is prohibitively expensive when there is a large number of experiences. In this paper, we present PI+ToD as a method for efficiently estimating the influence of experiences. PI+ToD is a policy iteration that efficiently estimates the influence of experiences by utilizing turn-over dropout. We demonstrate the efficiency of PI+ToD with experiments in MuJoCo environments.  ( 2 min )
    Convolutional Learning on Simplicial Complexes. (arXiv:2301.11163v1 [cs.LG])
    We propose a simplicial complex convolutional neural network (SCCNN) to learn data representations on simplicial complexes. It performs convolutions based on the multi-hop simplicial adjacencies via common faces and cofaces independently and captures the inter-simplicial couplings, generalizing state-of-the-art. Upon studying symmetries of the simplicial domain and the data space, it is shown to be permutation and orientation equivariant, thus, incorporating such inductive biases. Based on the Hodge theory, we perform a spectral analysis to understand how SCCNNs regulate data in different frequencies, showing that the convolutions via faces and cofaces operate in two orthogonal data spaces. Lastly, we study the stability of SCCNNs to domain deformations and examine the effects of various factors. Empirical results show the benefits of higher-order convolutions and inter-simplicial couplings in simplex prediction and trajectory prediction.  ( 2 min )
    simple diffusion: End-to-end diffusion for high resolution images. (arXiv:2301.11093v1 [cs.CV])
    Currently, applying diffusion models in pixel space of high resolution images is difficult. Instead, existing approaches focus on diffusion in lower dimensional spaces (latent diffusion), or have multiple super-resolution levels of generation referred to as cascades. The downside is that these approaches add additional complexity to the diffusion framework. This paper aims to improve denoising diffusion for high resolution images while keeping the model as simple as possible. The paper is centered around the research question: How can one train a standard denoising diffusion models on high resolution images, and still obtain performance comparable to these alternate approaches? The four main findings are: 1) the noise schedule should be adjusted for high resolution images, 2) It is sufficient to scale only a particular part of the architecture, 3) dropout should be added at specific locations in the architecture, and 4) downsampling is an effective strategy to avoid high resolution feature maps. Combining these simple yet effective techniques, we achieve state-of-the-art on image generation among diffusion models without sampling modifiers on ImageNet.  ( 2 min )
    Federated Learning over Coupled Graphs. (arXiv:2301.11099v1 [cs.LG])
    Graphs are widely used to represent the relations among entities. When one owns the complete data, an entire graph can be easily built, therefore performing analysis on the graph is straightforward. However, in many scenarios, it is impractical to centralize the data due to data privacy concerns. An organization or party only keeps a part of the whole graph data, i.e., graph data is isolated from different parties. Recently, Federated Learning (FL) has been proposed to solve the data isolation issue, mainly for Euclidean data. It is still a challenge to apply FL on graph data because graphs contain topological information which is notorious for its non-IID nature and is hard to partition. In this work, we propose a novel FL framework for graph data, FedCog, to efficiently handle coupled graphs that are a kind of distributed graph data, but widely exist in a variety of real-world applications such as mobile carriers' communication networks and banks' transaction networks. We theoretically prove the correctness and security of FedCog. Experimental results demonstrate that our method FedCog significantly outperforms traditional FL methods on graphs. Remarkably, our FedCog improves the accuracy of node classification tasks by up to 14.7%.  ( 2 min )
    FedHQL: Federated Heterogeneous Q-Learning. (arXiv:2301.11135v1 [cs.LG])
    Federated Reinforcement Learning (FedRL) encourages distributed agents to learn collectively from each other's experience to improve their performance without exchanging their raw trajectories. The existing work on FedRL assumes that all participating agents are homogeneous, which requires all agents to share the same policy parameterization (e.g., network architectures and training configurations). However, in real-world applications, agents are often in disagreement about the architecture and the parameters, possibly also because of disparate computational budgets. Because homogeneity is not given in practice, we introduce the problem setting of Federated Reinforcement Learning with Heterogeneous And bLack-box agEnts (FedRL-HALE). We present the unique challenges this new setting poses and propose the Federated Heterogeneous Q-Learning (FedHQL) algorithm that principally addresses these challenges. We empirically demonstrate the efficacy of FedHQL in boosting the sample efficiency of heterogeneous agents with distinct policy parameterization using standard RL tasks.  ( 2 min )
    Learning from Multiple Independent Advisors in Multi-agent Reinforcement Learning. (arXiv:2301.11153v1 [cs.LG])
    Multi-agent reinforcement learning typically suffers from the problem of sample inefficiency, where learning suitable policies involves the use of many data samples. Learning from external demonstrators is a possible solution that mitigates this problem. However, most prior approaches in this area assume the presence of a single demonstrator. Leveraging multiple knowledge sources (i.e., advisors) with expertise in distinct aspects of the environment could substantially speed up learning in complex environments. This paper considers the problem of simultaneously learning from multiple independent advisors in multi-agent reinforcement learning. The approach leverages a two-level Q-learning architecture, and extends this framework from single-agent to multi-agent settings. We provide principled algorithms that incorporate a set of advisors by both evaluating the advisors at each state and subsequently using the advisors to guide action selection. We also provide theoretical convergence and sample complexity guarantees. Experimentally, we validate our approach in three different test-beds and show that our algorithms give better performances than baselines, can effectively integrate the combined expertise of different advisors, and learn to ignore bad advice.  ( 2 min )
    SQ Lower Bounds for Random Sparse Planted Vector Problem. (arXiv:2301.11124v1 [cs.LG])
    Consider the setting where a $\rho$-sparse Rademacher vector is planted in a random $d$-dimensional subspace of $R^n$. A classical question is how to recover this planted vector given a random basis in this subspace. A recent result by [ZSWB21] showed that the Lattice basis reduction algorithm can recover the planted vector when $n\geq d+1$. Although the algorithm is not expected to tolerate inverse polynomial amount of noise, it is surprising because it was previously shown that recovery cannot be achieved by low degree polynomials when $n\ll \rho^2 d^{2}$ [MW21]. A natural question is whether we can derive an Statistical Query (SQ) lower bound matching the previous low degree lower bound in [MW21]. This will - imply that the SQ lower bound can be surpassed by lattice based algorithms; - predict the computational hardness when the planted vector is perturbed by inverse polynomial amount of noise. In this paper, we prove such an SQ lower bound. In particular, we show that super-polynomial number of VSTAT queries is needed to solve the easier statistical testing problem when $n\ll \rho^2 d^{2}$ and $\rho\gg \frac{1}{\sqrt{d}}$. The most notable technique we used to derive the SQ lower bound is the almost equivalence relationship between SQ lower bound and low degree lower bound [BBH+20, MW21].  ( 2 min )
    Bayesian Detection of Mesoscale Structures in Pathway Data on Graphs. (arXiv:2301.11120v1 [stat.ME])
    Mesoscale structures are an integral part of the abstraction and analysis of complex systems. They reveal a node's function in the network, and facilitate our understanding of the network dynamics. For example, they can represent communities in social or citation networks, roles in corporate interactions, or core-periphery structures in transportation networks. We usually detect mesoscale structures under the assumption of independence of interactions. Still, in many cases, the interactions invalidate this assumption by occurring in a specific order. Such patterns emerge in pathway data; to capture them, we have to model the dependencies between interactions using higher-order network models. However, the detection of mesoscale structures in higher-order networks is still under-researched. In this work, we derive a Bayesian approach that simultaneously models the optimal partitioning of nodes in groups and the optimal higher-order network dynamics between the groups. In synthetic data we demonstrate that our method can recover both standard proximity-based communities and role-based groupings of nodes. In synthetic and real world data we show that it can compete with baseline techniques, while additionally providing interpretable abstractions of network dynamics.  ( 2 min )
    Minerva: A File-Based Ransomware Detector. (arXiv:2301.11050v1 [cs.CR])
    Ransomware is a rapidly evolving type of malware designed to encrypt user files on a device, making them inaccessible in order to exact a ransom. Ransomware attacks resulted in billions of dollars in damages in recent years and are expected to cause hundreds of billions more in the next decade. With current state-of-the-art process-based detectors being heavily susceptible to evasion attacks, no comprehensive solution to this problem is available today. This paper presents Minerva, a new approach to ransomware detection. Unlike current methods focused on identifying ransomware based on process-level behavioral modeling, Minerva detects ransomware by building behavioral profiles of files based on all the operations they receive in a time window. Minerva addresses some of the critical challenges associated with process-based approaches, specifically their vulnerability to complex evasion attacks. Our evaluation of Minerva demonstrates its effectiveness in detecting ransomware attacks, including those that are able to bypass existing defenses. Our results show that Minerva identifies ransomware activity with an average accuracy of 99.45% and an average recall of 99.66%, with 99.97% of ransomware detected within 1 second.  ( 2 min )
    Incomplete Multi-view Clustering via Prototype-based Imputation. (arXiv:2301.11045v1 [cs.LG])
    In this paper, we study how to achieve two characteristics highly-expected by incomplete multi-view clustering (IMvC). Namely, i) instance commonality refers to that within-cluster instances should share a common pattern, and ii) view versatility refers to that cross-view samples should own view-specific patterns. To this end, we design a novel dual-stream model which employs a dual attention layer and a dual contrastive learning loss to learn view-specific prototypes and model the sample-prototype relationship. When the view is missed, our model performs data recovery using the prototypes in the missing view and the sample-prototype relationship inherited from the observed view. Thanks to our dual-stream model, both cluster- and view-specific information could be captured, and thus the instance commonality and view versatility could be preserved to facilitate IMvC. Extensive experiments demonstrate the superiority of our method on six challenging benchmarks compared with 11 approaches. The code will be released.  ( 2 min )
    Random Grid Neural Processes for Parametric Partial Differential Equations. (arXiv:2301.11040v1 [cs.LG])
    We introduce a new class of spatially stochastic physics and data informed deep latent models for parametric partial differential equations (PDEs) which operate through scalable variational neural processes. We achieve this by assigning probability measures to the spatial domain, which allows us to treat collocation grids probabilistically as random variables to be marginalised out. Adapting this spatial statistics view, we solve forward and inverse problems for parametric PDEs in a way that leads to the construction of Gaussian process models of solution fields. The implementation of these random grids poses a unique set of challenges for inverse physics informed deep learning frameworks and we propose a new architecture called Grid Invariant Convolutional Networks (GICNets) to overcome these challenges. We further show how to incorporate noisy data in a principled manner into our physics informed model to improve predictions for problems where data may be available but whose measurement location does not coincide with any fixed mesh or grid. The proposed method is tested on a nonlinear Poisson problem, Burgers equation, and Navier-Stokes equations, and we provide extensive numerical comparisons. We demonstrate significant computational advantages over current physics informed neural learning methods for parametric PDEs while improving the predictive capabilities and flexibility of these models.  ( 2 min )
    Inspecting class hierarchies in classification-based metric learning models. (arXiv:2301.11065v1 [cs.LG])
    Most classification models treat all misclassifications equally. However, different classes may be related, and these hierarchical relationships must be considered in some classification problems. These problems can be addressed by using hierarchical information during training. Unfortunately, this information is not available for all datasets. Many classification-based metric learning methods use class representatives in embedding space to represent different classes. The relationships among the learned class representatives can then be used to estimate class hierarchical structures. If we have a predefined class hierarchy, the learned class representatives can be assessed to determine whether the metric learning model learned semantic distances that match our prior knowledge. In this work, we train a softmax classifier and three metric learning models with several training options on benchmark and real-world datasets. In addition to the standard classification accuracy, we evaluate the hierarchical inference performance by inspecting learned class representatives and the hierarchy-informed performance, i.e., the classification performance, and the metric learning performance by considering predefined hierarchical structures. Furthermore, we investigate how the considered measures are affected by various models and training options. When our proposed ProxyDR model is trained without using predefined hierarchical structures, the hierarchical inference performance is significantly better than that of the popular NormFace model. Additionally, our model enhances some hierarchy-informed performance measures under the same training options. We also found that convolutional neural networks (CNNs) with random weights correspond to the predefined hierarchies better than random chance.  ( 2 min )
    PerfSAGE: Generalized Inference Performance Predictor for Arbitrary Deep Learning Models on Edge Devices. (arXiv:2301.10999v1 [cs.LG])
    The ability to accurately predict deep neural network (DNN) inference performance metrics, such as latency, power, and memory footprint, for an arbitrary DNN on a target hardware platform is essential to the design of DNN based models. This ability is critical for the (manual or automatic) design, optimization, and deployment of practical DNNs for a specific hardware deployment platform. Unfortunately, these metrics are slow to evaluate using simulators (where available) and typically require measurement on the target hardware. This work describes PerfSAGE, a novel graph neural network (GNN) that predicts inference latency, energy, and memory footprint on an arbitrary DNN TFlite graph (TFL, 2017). In contrast, previously published performance predictors can only predict latency and are restricted to pre-defined construction rules or search spaces. This paper also describes the EdgeDLPerf dataset of 134,912 DNNs randomly sampled from four task search spaces and annotated with inference performance metrics from three edge hardware platforms. Using this dataset, we train PerfSAGE and provide experimental results that demonstrate state-of-the-art prediction accuracy with a Mean Absolute Percentage Error of <5% across all targets and model search spaces. These results: (1) Outperform previous state-of-art GNN-based predictors (Dudziak et al., 2020), (2) Accurately predict performance on accelerators (a shortfall of non-GNN-based predictors (Zhang et al., 2021)), and (3) Demonstrate predictions on arbitrary input graphs without modifications to the feature extractor.  ( 2 min )
    WL meet VC. (arXiv:2301.11039v1 [cs.LG])
    Recently, many works studied the expressive power of graph neural networks (GNNs) by linking it to the $1$-dimensional Weisfeiler--Leman algorithm ($1\text{-}\mathsf{WL}$). Here, the $1\text{-}\mathsf{WL}$ is a well-studied heuristic for the graph isomorphism problem, which iteratively colors or partitions a graph's vertex set. While this connection has led to significant advances in understanding and enhancing GNNs' expressive power, it does not provide insights into their generalization performance, i.e., their ability to make meaningful predictions beyond the training set. In this paper, we study GNNs' generalization ability through the lens of Vapnik--Chervonenkis (VC) dimension theory in two settings, focusing on graph-level predictions. First, when no upper bound on the graphs' order is known, we show that the bitlength of GNNs' weights tightly bounds their VC dimension. Further, we derive an upper bound for GNNs' VC dimension using the number of colors produced by the $1\text{-}\mathsf{WL}$. Secondly, when an upper bound on the graphs' order is known, we show a tight connection between the number of graphs distinguishable by the $1\text{-}\mathsf{WL}$ and GNNs' VC dimension. Our empirical study confirms the validity of our theoretical findings.  ( 2 min )
    Multi-Agent congestion cost minimization with linear function approximation. (arXiv:2301.10993v1 [cs.LG])
    This work considers multiple agents traversing a network from a source node to the goal node. The cost to an agent for traveling a link has a private as well as a congestion component. The agent's objective is to find a path to the goal node with minimum overall cost in a decentralized way. We model this as a fully decentralized multi-agent reinforcement learning problem and propose a novel multi-agent congestion cost minimization (MACCM) algorithm. Our MACCM algorithm uses linear function approximations of transition probabilities and the global cost function. In the absence of a central controller and to preserve privacy, agents communicate the cost function parameters to their neighbors via a time-varying communication network. Moreover, each agent maintains its estimate of the global state-action value, which is updated via a multi-agent extended value iteration (MAEVI) sub-routine. We show that our MACCM algorithm achieves a sub-linear regret. The proof requires the convergence of cost function parameters, the MAEVI algorithm, and analysis of the regret bounds induced by the MAEVI triggering condition for each agent. We implement our algorithm on a two node network with multiple links to validate it. We first identify the optimal policy, the optimal number of agents going to the goal node in each period. We observe that the average regret is close to zero for 2 and 3 agents. The optimal policy captures the trade-off between the minimum cost of staying at a node and the congestion cost of going to the goal node. Our work is a generalization of learning the stochastic shortest path problem.  ( 2 min )
    Time-sensitive Learning for Heterogeneous Federated Edge Intelligence. (arXiv:2301.10977v1 [cs.LG])
    Real-time machine learning has recently attracted significant interest due to its potential to support instantaneous learning, adaptation, and decision making in a wide range of application domains, including self-driving vehicles, intelligent transportation, and industry automation. We investigate real-time ML in a federated edge intelligence (FEI) system, an edge computing system that implements federated learning (FL) solutions based on data samples collected and uploaded from decentralized data networks. FEI systems often exhibit heterogenous communication and computational resource distribution, as well as non-i.i.d. data samples, resulting in long model training time and inefficient resource utilization. Motivated by this fact, we propose a time-sensitive federated learning (TS-FL) framework to minimize the overall run-time for collaboratively training a shared ML model. Training acceleration solutions for both TS-FL with synchronous coordination (TS-FL-SC) and asynchronous coordination (TS-FL-ASC) are investigated. To address straggler effect in TS-FL-SC, we develop an analytical solution to characterize the impact of selecting different subsets of edge servers on the overall model training time. A server dropping-based solution is proposed to allow slow-performance edge servers to be removed from participating in model training if their impact on the resulting model accuracy is limited. A joint optimization algorithm is proposed to minimize the overall time consumption of model training by selecting participating edge servers, local epoch number. We develop an analytical expression to characterize the impact of staleness effect of asynchronous coordination and straggler effect of FL on the time consumption of TS-FL-ASC. Experimental results show that TS-FL-SC and TS-FL-ASC can provide up to 63% and 28% of reduction, in the overall model training time, respectively.  ( 2 min )
    A Fully First-Order Method for Stochastic Bilevel Optimization. (arXiv:2301.10945v1 [math.OC])
    We consider stochastic unconstrained bilevel optimization problems when only the first-order gradient oracles are available. While numerous optimization methods have been proposed for tackling bilevel problems, existing methods either tend to require possibly expensive calculations regarding Hessians of lower-level objectives, or lack rigorous finite-time performance guarantees. In this work, we propose a Fully First-order Stochastic Approximation (F2SA) method, and study its non-asymptotic convergence properties. Specifically, we show that F2SA converges to an $\epsilon$-stationary solution of the bilevel problem after $\epsilon^{-7/2}, \epsilon^{-5/2}$, and $\epsilon^{-3/2}$ iterations (each iteration using $O(1)$ samples) when stochastic noises are in both level objectives, only in the upper-level objective, and not present (deterministic settings), respectively. We further show that if we employ momentum-assisted gradient estimators, the iteration complexities can be improved to $\epsilon^{-5/2}, \epsilon^{-4/2}$, and $\epsilon^{-3/2}$, respectively. We demonstrate even superior practical performance of the proposed method over existing second-order based approaches on MNIST data-hypercleaning experiments.  ( 2 min )
    Graph Neural Networks can Recover the Hidden Features Solely from the Graph Structure. (arXiv:2301.10956v1 [cs.LG])
    Graph Neural Networks (GNNs) are popular models for graph learning problems. GNNs show strong empirical performance in many practical tasks. However, the theoretical properties have not been completely elucidated. In this paper, we investigate whether GNNs can exploit the graph structure from the perspective of the expressive power of GNNs. In our analysis, we consider graph generation processes that are controlled by hidden node features, which contain all information about the graph structure. A typical example of this framework is kNN graphs constructed from the hidden features. In our main results, we show that GNNs can recover the hidden node features from the input graph alone, even when all node features, including the hidden features themselves and any indirect hints, are unavailable. GNNs can further use the recovered node features for downstream tasks. These results show that GNNs can fully exploit the graph structure by themselves, and in effect, GNNs can use both the hidden and explicit node features for downstream tasks. In the experiments, we confirm the validity of our results by showing that GNNs can accurately recover the hidden features using a GNN architecture built based on our theoretical analysis.  ( 2 min )
    Visiting Distant Neighbors in Graph Convolutional Networks. (arXiv:2301.10960v1 [cs.LG])
    We extend the graph convolutional network method for deep learning on graph data to higher order in terms of neighboring nodes. In order to construct representations for a node in a graph, in addition to the features of the node and its immediate neighboring nodes, we also include more distant nodes in the calculations. In experimenting with a number of publicly available citation graph datasets, we show that this higher order neighbor visiting pays off by outperforming the original model especially when we have a limited number of available labeled data points for the training of the model.  ( 2 min )
    Privacy-Preserving Joint Edge Association and Power Optimization for the Internet of Vehicles via Federated Multi-Agent Reinforcement Learning. (arXiv:2301.11014v1 [cs.LG])
    Proactive edge association is capable of improving wireless connectivity at the cost of increased handover (HO) frequency and energy consumption, while relying on a large amount of private information sharing required for decision making. In order to improve the connectivity-cost trade-off without privacy leakage, we investigate the privacy-preserving joint edge association and power allocation (JEAPA) problem in the face of the environmental uncertainty and the infeasibility of individual learning. Upon modelling the problem by a decentralized partially observable Markov Decision Process (Dec-POMDP), it is solved by federated multi-agent reinforcement learning (FMARL) through only sharing encrypted training data for federatively learning the policy sought. Our simulation results show that the proposed solution strikes a compelling trade-off, while preserving a higher privacy level than the state-of-the-art solutions.  ( 2 min )
    Neural Dynamic Focused Topic Model. (arXiv:2301.10988v1 [cs.CL])
    Topic models and all their variants analyse text by learning meaningful representations through word co-occurrences. As pointed out by Williamson et al. (2010), such models implicitly assume that the probability of a topic to be active and its proportion within each document are positively correlated. This correlation can be strongly detrimental in the case of documents created over time, simply because recent documents are likely better described by new and hence rare topics. In this work we leverage recent advances in neural variational inference and present an alternative neural approach to the dynamic Focused Topic Model. Indeed, we develop a neural model for topic evolution which exploits sequences of Bernoulli random variables in order to track the appearances of topics, thereby decoupling their activities from their proportions. We evaluate our model on three different datasets (the UN general debates, the collection of NeurIPS papers, and the ACL Anthology dataset) and show that it (i) outperforms state-of-the-art topic models in generalization tasks and (ii) performs comparably to them on prediction tasks, while employing roughly the same number of parameters, and converging about two times faster. Source code to reproduce our experiments is available online.  ( 2 min )
    On the Importance of Noise Scheduling for Diffusion Models. (arXiv:2301.10972v1 [cs.CV])
    We empirically study the effect of noise scheduling strategies for denoising diffusion generative models. There are three findings: (1) the noise scheduling is crucial for the performance, and the optimal one depends on the task (e.g., image sizes), (2) when increasing the image size, the optimal noise scheduling shifts towards a noisier one (due to increased redundancy in pixels), and (3) simply scaling the input data by a factor of $b$ while keeping the noise schedule function fixed (equivalent to shifting the logSNR by $\log b$) is a good strategy across image sizes. This simple recipe, when combined with recently proposed Recurrent Interface Network (RIN), yields state-of-the-art pixel-based diffusion models for high-resolution images on ImageNet, enabling single-stage, end-to-end generation of diverse and high-fidelity images at 1024$\times$1024 resolution for the first time (without upsampling/cascades).  ( 2 min )
    Learning Large Scale Sparse Models. (arXiv:2301.10958v1 [stat.ML])
    In this work, we consider learning sparse models in large scale settings, where the number of samples and the feature dimension can grow as large as millions or billions. Two immediate issues occur under such challenging scenario: (i) computational cost; (ii) memory overhead. In particular, the memory issue precludes a large volume of prior algorithms that are based on batch optimization technique. To remedy the problem, we propose to learn sparse models such as Lasso in an online manner where in each iteration, only one randomly chosen sample is revealed to update a sparse iterate. Thereby, the memory cost is independent of the sample size and gradient evaluation for one sample is efficient. Perhaps amazingly, we find that with the same parameter, sparsity promoted by batch methods is not preserved in online fashion. We analyze such interesting phenomenon and illustrate some effective variants including mini-batch methods and a hard thresholding based stochastic gradient algorithm. Extensive experiments are carried out on a public dataset which supports our findings and algorithms.  ( 2 min )
    SparDA: Accelerating Dynamic Sparse Deep Neural Networks via Sparse-Dense Transformation. (arXiv:2301.10936v1 [cs.LG])
    Due to its high cost-effectiveness, sparsity has become the most important approach for building efficient deep-learning models. However, commodity accelerators are built mainly for efficient dense computation, creating a huge gap for general sparse computation to leverage. Existing solutions have to use time-consuming compiling to improve the efficiency of sparse kernels in an ahead-of-time manner and thus are limited to static sparsity. A wide range of dynamic sparsity opportunities is missed because their sparsity patterns are only known at runtime. This limits the future of building more biological brain-like neural networks that should be dynamically and sparsely activated. In this paper, we bridge the gap between sparse computation and commodity accelerators by proposing a system, called Spider, for efficiently executing deep learning models with dynamic sparsity. We identify an important property called permutation invariant that applies to most deep-learning computations. The property enables Spider (1) to extract dynamic sparsity patterns of tensors that are only known at runtime with little overhead; and (2) to transform the dynamic sparse computation into an equivalent dense computation which has been extremely optimized on commodity accelerators. Extensive evaluation on diverse models shows Spider can extract and transform dynamic sparsity with negligible overhead but brings up to 9.4x speedup over state-of-art solutions.  ( 2 min )
    Affective Faces for Goal-Driven Dyadic Communication. (arXiv:2301.10939v1 [cs.CV])
    We introduce a video framework for modeling the association between verbal and non-verbal communication during dyadic conversation. Given the input speech of a speaker, our approach retrieves a video of a listener, who has facial expressions that would be socially appropriate given the context. Our approach further allows the listener to be conditioned on their own goals, personalities, or backgrounds. Our approach models conversations through a composition of large language models and vision-language models, creating internal representations that are interpretable and controllable. To study multimodal communication, we propose a new video dataset of unscripted conversations covering diverse topics and demographics. Experiments and visualizations show our approach is able to output listeners that are significantly more socially appropriate than baselines. However, many challenges remain, and we release our dataset publicly to spur further progress. See our website for video results, data, and code: https://realtalk.cs.columbia.edu.  ( 2 min )
    Super-Resolution Analysis via Machine Learning: A Survey for Fluid Flows. (arXiv:2301.10937v1 [physics.flu-dyn])
    This paper surveys machine-learning-based super-resolution reconstruction for vortical flows. Super resolution aims to find the high-resolution flow fields from low-resolution data and is generally an approach used in image reconstruction. In addition to surveying a variety of recent super-resolution applications, we provide case studies of super-resolution analysis for an example of two-dimensional decaying isotropic turbulence. We demonstrate that physics-inspired model designs enable successful reconstruction of vortical flows from spatially limited measurements. We also discuss the challenges and outlooks of machine-learning-based super-resolution analysis for fluid flow applications. The insights gained from this study can be leveraged for super-resolution analysis of numerical and experimental flow data.  ( 2 min )
    On the Global Convergence of Risk-Averse Policy Gradient Methods with Dynamic Time-Consistent Risk Measures. (arXiv:2301.10932v1 [cs.LG])
    Risk-sensitive reinforcement learning (RL) has become a popular tool to control the risk of uncertain outcomes and ensure reliable performance in various sequential decision-making problems. While policy gradient methods have been developed for risk-sensitive RL, it remains unclear if these methods enjoy the same global convergence guarantees as in the risk-neutral case. In this paper, we consider a class of dynamic time-consistent risk measures, called Expected Conditional Risk Measures (ECRMs), and derive policy gradient updates for ECRM-based objective functions. Under both constrained direct parameterization and unconstrained softmax parameterization, we provide global convergence of the corresponding risk-averse policy gradient algorithms. We further test a risk-averse variant of REINFORCE algorithm on a stochastic Cliffwalk environment to demonstrate the efficacy of our algorithm and the importance of risk control.  ( 2 min )
    Efficient Trust Region-Based Safe Reinforcement Learning with Low-Bias Distributional Actor-Critic. (arXiv:2301.10923v1 [cs.LG])
    To apply reinforcement learning (RL) to real-world applications, agents are required to adhere to the safety guidelines of their respective domains. Safe RL can effectively handle the guidelines by converting them into constraints of the RL problem. In this paper, we develop a safe distributional RL method based on the trust region method, which can satisfy constraints consistently. However, policies may not meet the safety guidelines due to the estimation bias of distributional critics, and importance sampling required for the trust region method can hinder performance due to its significant variance. Hence, we enhance safety performance through the following approaches. First, we train distributional critics to have low estimation biases using proposed target distributions where bias-variance can be traded off. Second, we propose novel surrogates for the trust region method expressed with Q-functions using the reparameterization trick. Additionally, depending on initial policy settings, there can be no policy satisfying constraints within a trust region. To handle this infeasible issue, we propose a gradient integration method which guarantees to find a policy satisfying all constraints from an unsafe initial policy. From extensive experiments, the proposed method with risk-averse constraints shows minimal constraint violations while achieving high returns compared to existing safe RL methods.  ( 2 min )
    Partial advantage estimator for proximal policy optimization. (arXiv:2301.10920v1 [cs.LG])
    Estimation of value in policy gradient methods is a fundamental problem. Generalized Advantage Estimation (GAE) is an exponentially-weighted estimator of an advantage function similar to $\lambda$-return. It substantially reduces the variance of policy gradient estimates at the expense of bias. In practical applications, a truncated GAE is used due to the incompleteness of the trajectory, which results in a large bias during estimation. To address this challenge, instead of using the entire truncated GAE, we propose to take a part of it when calculating updates, which significantly reduces the bias resulting from the incomplete trajectory. We perform experiments in MuJoCo and $\mu$RTS to investigate the effect of different partial coefficient and sampling lengths. We show that our partial GAE approach yields better empirical results in both environments.  ( 2 min )
    SuperFed: Weight Shared Federated Learning. (arXiv:2301.10879v1 [cs.LG])
    Federated Learning (FL) is a well-established technique for privacy preserving distributed training. Much attention has been given to various aspects of FL training. A growing number of applications that consume FL-trained models, however, increasingly operate under dynamically and unpredictably variable conditions, rendering a single model insufficient. We argue for training a global family of models cost efficiently in a federated fashion. Training them independently for different tradeoff points incurs $O(k)$ cost for any k architectures of interest, however. Straightforward applications of FL techniques to recent weight-shared training approaches is either infeasible or prohibitively expensive. We propose SuperFed - an architectural framework that incurs $O(1)$ cost to co-train a large family of models in a federated fashion by leveraging weight-shared learning. We achieve an order of magnitude cost savings on both communication and computation by proposing two novel training mechanisms: (a) distribution of weight-shared models to federated clients, (b) central aggregation of arbitrarily overlapping weight-shared model parameters. The combination of these mechanisms is shown to reach an order of magnitude (9.43x) reduction in computation and communication cost for training a $5*10^{18}$-sized family of models, compared to independently training as few as $k = 9$ DNNs without any accuracy loss.  ( 2 min )
    Unsupervised Protein-Ligand Binding Energy Prediction via Neural Euler's Rotation Equation. (arXiv:2301.10814v1 [q-bio.BM])
    Protein-ligand binding prediction is a fundamental problem in AI-driven drug discovery. Prior work focused on supervised learning methods using a large set of binding affinity data for small molecules, but it is hard to apply the same strategy to other drug classes like antibodies as labelled data is limited. In this paper, we explore unsupervised approaches and reformulate binding energy prediction as a generative modeling task. Specifically, we train an energy-based model on a set of unlabelled protein-ligand complexes using SE(3) denoising score matching and interpret its log-likelihood as binding affinity. Our key contribution is a new equivariant rotation prediction network called Neural Euler's Rotation Equations (NERE) for SE(3) score matching. It predicts a rotation by modeling the force and torque between protein and ligand atoms, where the force is defined as the gradient of an energy function with respect to atom coordinates. We evaluate NERE on protein-ligand and antibody-antigen binding affinity prediction benchmarks. Our model outperforms all unsupervised baselines (physics-based and statistical potentials) and matches supervised learning methods in the antibody case.  ( 2 min )
    Joint action loss for proximal policy optimization. (arXiv:2301.10919v1 [cs.LG])
    PPO (Proximal Policy Optimization) is a state-of-the-art policy gradient algorithm that has been successfully applied to complex computer games such as Dota 2 and Honor of Kings. In these environments, an agent makes compound actions consisting of multiple sub-actions. PPO uses clipping to restrict policy updates. Although clipping is simple and effective, it is not efficient in its sample use. For compound actions, most PPO implementations consider the joint probability (density) of sub-actions, which means that if the ratio of a sample (state compound-action pair) exceeds the range, the gradient the sample produces is zero. Instead, for each sub-action we calculate the loss separately, which is less prone to clipping during updates thereby making better use of samples. Further, we propose a multi-action mixed loss that combines joint and separate probabilities. We perform experiments in Gym-$\mu$RTS and MuJoCo. Our hybrid model improves performance by more than 50\% in different MuJoCo environments compared to OpenAI's PPO benchmark results. And in Gym-$\mu$RTS, we find the sub-action loss outperforms the standard PPO approach, especially when the clip range is large. Our findings suggest this method can better balance the use-efficiency and quality of samples.  ( 2 min )
    Distilling Cognitive Backdoor Patterns within an Image. (arXiv:2301.10908v1 [cs.LG])
    This paper proposes a simple method to distill and detect backdoor patterns within an image: \emph{Cognitive Distillation} (CD). The idea is to extract the "minimal essence" from an input image responsible for the model's prediction. CD optimizes an input mask to extract a small pattern from the input image that can lead to the same model output (i.e., logits or deep features). The extracted pattern can help understand the cognitive mechanism of a model on clean vs. backdoor images and is thus called a \emph{Cognitive Pattern} (CP). Using CD and the distilled CPs, we uncover an interesting phenomenon of backdoor attacks: despite the various forms and sizes of trigger patterns used by different attacks, the CPs of backdoor samples are all surprisingly and suspiciously small. One thus can leverage the learned mask to detect and remove backdoor examples from poisoned training datasets. We conduct extensive experiments to show that CD can robustly detect a wide range of advanced backdoor attacks. We also show that CD can potentially be applied to help detect potential biases from face datasets. Code is available at \url{https://github.com/HanxunH/CognitiveDistillation}.  ( 2 min )
    Experimenting with an Evaluation Framework for Imbalanced Data Learning (EFIDL). (arXiv:2301.10888v1 [cs.LG])
    Introduction Data imbalance is one of the crucial issues in big data analysis with fewer labels. For example, in real-world healthcare data, spam detection labels, and financial fraud detection datasets. Many data balance methods were introduced to improve machine learning algorithms' performance. Research claims SMOTE and SMOTE-based data-augmentation (generate new data points) methods could improve algorithm performance. However, we found in many online tutorials, the valuation methods were applied based on synthesized datasets that introduced bias into the evaluation, and the performance got a false improvement. In this study, we proposed, a new evaluation framework for imbalanced data learning methods. We have experimented on five data balance methods and whether the performance of algorithms will improve or not. Methods We collected 8 imbalanced healthcare datasets with different imbalanced rates from different domains. Applied 6 data augmentation methods with 11 machine learning methods testing if the data augmentation will help with improving machine learning performance. We compared the traditional data augmentation evaluation methods with our proposed cross-validation evaluation framework Results Using traditional data augmentation evaluation meta hods will give a false impression of improving the performance. However, our proposed evaluation method shows data augmentation has limited ability to improve the results. Conclusion EFIDL is more suitable for evaluating the prediction performance of an ML method when data are augmented. Using an unsuitable evaluation framework will give false results. Future researchers should consider the evaluation framework we proposed when dealing with augmented datasets. Our experiments showed data augmentation does not help improve ML prediction performance.  ( 2 min )
    GPU-based Private Information Retrieval for On-Device Machine Learning Inference. (arXiv:2301.10904v1 [cs.CR])
    On-device machine learning (ML) inference can enable the use of private user data on user devices without remote servers. However, a pure on-device solution to private ML inference is impractical for many applications that rely on embedding tables that are too large to be stored on-device. To overcome this barrier, we propose the use of private information retrieval (PIR) to efficiently and privately retrieve embeddings from servers without sharing any private information during on-device ML inference. As off-the-shelf PIR algorithms are usually too computationally intensive to directly use for latency-sensitive inference tasks, we 1) develop a novel algorithm for accelerating PIR on GPUs, and 2) co-design PIR with the downstream ML application to obtain further speedup. Our GPU acceleration strategy improves system throughput by more than $20 \times$ over an optimized CPU PIR implementation, and our co-design techniques obtain over $5 \times$ additional throughput improvement at fixed model quality. Together, on various on-device ML applications such as recommendation and language modeling, our system on a single V100 GPU can serve up to $100,000$ queries per second -- a $>100 \times$ throughput improvement over a naively implemented system -- while maintaining model accuracy, and limiting inference communication and response latency to within $300$KB and $<100$ms respectively.  ( 2 min )
    Learning Gradients of Convex Functions with Monotone Gradient Networks. (arXiv:2301.10862v1 [cs.LG])
    While much effort has been devoted to deriving and studying effective convex formulations of signal processing problems, the gradients of convex functions also have critical applications ranging from gradient-based optimization to optimal transport. Recent works have explored data-driven methods for learning convex objectives, but learning their monotone gradients is seldom studied. In this work, we propose Cascaded and Modular Monotone Gradient Networks (C-MGN and M-MGN respectively), two monotone gradient neural network architectures for directly learning the gradients of convex functions. We show that our networks are simpler to train, learn monotone gradient fields more accurately, and use significantly fewer parameters than state of the art methods. We further demonstrate their ability to learn optimal transport mappings to augment driving image data.  ( 2 min )
    Reef-insight: A framework for reef habitat mapping with clustering methods via remote sensing. (arXiv:2301.10876v1 [cs.LG])
    Environmental damage has been of much concern, particularly coastal areas and the oceans given climate change and drastic effects of pollution and extreme climate events. Our present day analytical capabilities along with the advancements in information acquisition techniques such as remote sensing can be utilized for the management and study of coral reef ecosystems. In this paper, we present Reef-insight, an unsupervised machine learning framework that features advanced clustering methods and remote sensing for reef community mapping. Our framework compares different clustering methods to evaluate them for reef community mapping using remote sensing data. We evaluate four major clustering approaches such as k- means, hierarchical clustering, Gaussian mixture model, and density-based clustering based on qualitative and visual assessment. We utilise remote sensing data featuring Heron reef island region in the Great Barrier Reef of Australia. Our results indicate that clustering methods using remote sensing data can well identify benthic and geomorphic clusters that are found in reefs when compared to other studies. Our results indicate that Reef-insight can generate detailed reef community maps outlining distinct reef habitats and has the potential to enable further insights for reef restoration projects. We release our framework as open source software to enable its extension to different parts of the world  ( 2 min )
    Partial Mobilization: Tracking Multilingual Information Flows Amongst Russian Media Outlets and Telegram. (arXiv:2301.10856v1 [cs.CY])
    In response to disinformation and propaganda from Russian online media following the Russian invasion of Ukraine, Russian outlets including Russia Today and Sputnik News were banned throughout Europe. Many of these Russian outlets, in order to reach their audiences, began to heavily promote their content on messaging services like Telegram. In this work, to understand this phenomenon, we study how 16 Russian media outlets have interacted with and utilized 732 Telegram channels throughout 2022. To do this, we utilize a multilingual version of the foundational model MPNet to embed articles and Telegram messages in a shared embedding space and semantically compare content. Leveraging a parallelized version of DP-Means clustering, we perform paragraph-level topic/narrative extraction and time-series analysis with Hawkes Processes. With this approach, across our websites, we find between 2.3% (ura.news) and 26.7% (ukraina.ru) of their content originated/resulted from activity on Telegram. Finally, tracking the spread of individual narratives, we measure the rate at which these websites and channels disseminate content within the Russian media ecosystem.  ( 2 min )
    Automatic Intrinsic Reward Shaping for Exploration in Deep Reinforcement Learning. (arXiv:2301.10886v1 [cs.LG])
    We present AIRS: Automatic Intrinsic Reward Shaping that intelligently and adaptively provides high-quality intrinsic rewards to enhance exploration in reinforcement learning (RL). More specifically, AIRS selects shaping function from a predefined set based on the estimated task return in real-time, providing reliable exploration incentives and alleviating the biased objective problem. Moreover, we develop an intrinsic reward toolkit to provide efficient and reliable implementations of diverse intrinsic reward approaches. We test AIRS on various tasks of Procgen games and DeepMind Control Suite. Extensive simulation demonstrates that AIRS can outperform the benchmarking schemes and achieve superior performance with simple architecture.  ( 2 min )
    When Layers Play the Lottery, all Tickets Win at Initialization. (arXiv:2301.10835v1 [cs.LG])
    Pruning is a standard technique for reducing the computational cost of deep networks. Many advances in pruning leverage concepts from the Lottery Ticket Hypothesis (LTH). LTH reveals that inside a trained dense network exists sparse subnetworks (tickets) able to achieve similar accuracy (i.e., win the lottery - winning tickets). Pruning at initialization focuses on finding winning tickets without training a dense network. Studies on these concepts share the trend that subnetworks come from weight or filter pruning. In this work, we investigate LTH and pruning at initialization from the lens of layer pruning. First, we confirm the existence of winning tickets when the pruning process removes layers. Leveraged by this observation, we propose to discover these winning tickets at initialization, eliminating the requirement of heavy computational resources for training the initial (over-parameterized) dense network. Extensive experiments show that our winning tickets notably speed up the training phase and reduce up to 51% of carbon emission, an important step towards democratization and green Artificial Intelligence. Beyond computational benefits, our winning tickets exhibit robustness against adversarial and out-of-distribution examples. Finally, we show that our subnetworks easily win the lottery at initialization while tickets from filter removal (the standard structured LTH) hardly become winning tickets.  ( 2 min )
    RobustPdM: Designing Robust Predictive Maintenance against Adversarial Attacks. (arXiv:2301.10822v1 [cs.CR])
    The state-of-the-art predictive maintenance (PdM) techniques have shown great success in reducing maintenance costs and downtime of complicated machines while increasing overall productivity through extensive utilization of Internet-of-Things (IoT) and Deep Learning (DL). Unfortunately, IoT sensors and DL algorithms are both prone to cyber-attacks. For instance, DL algorithms are known for their susceptibility to adversarial examples. Such adversarial attacks are vastly under-explored in the PdM domain. This is because the adversarial attacks in the computer vision domain for classification tasks cannot be directly applied to the PdM domain for multivariate time series (MTS) regression tasks. In this work, we propose an end-to-end methodology to design adversarially robust PdM systems by extensively analyzing the effect of different types of adversarial attacks and proposing a novel adversarial defense technique for DL-enabled PdM models. First, we propose novel MTS Projected Gradient Descent (PGD) and MTS PGD with random restarts (PGD_r) attacks. Then, we evaluate the impact of MTS PGD and PGD_r along with MTS Fast Gradient Sign Method (FGSM) and MTS Basic Iterative Method (BIM) on Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Convolutional Neural Network (CNN), and Bi-directional LSTM based PdM system. Our results using NASA's turbofan engine dataset show that adversarial attacks can cause a severe defect (up to 11X) in the RUL prediction, outperforming the effectiveness of the state-of-the-art PdM attacks by 3X. Furthermore, we present a novel approximate adversarial training method to defend against adversarial attacks. We observe that approximate adversarial training can significantly improve the robustness of PdM models (up to 54X) and outperforms the state-of-the-art PdM defense methods by offering 3X more robustness.  ( 2 min )
    Salesforce CausalAI Library: A Fast and Scalable Framework for Causal Analysis of Time Series and Tabular Data. (arXiv:2301.10859v1 [cs.LG])
    We introduce the Salesforce CausalAI Library, an open-source library for causal analysis using observational data. It supports causal discovery and causal inference for tabular and time series data, of both discrete and continuous types. This library includes algorithms that handle linear and non-linear causal relationships between variables, and uses multi-processing for speed-up. We also include a data generator capable of generating synthetic data with specified structural equation model for both the aforementioned data formats and types, that helps users control the ground-truth causal process while investigating various algorithms. Finally, we provide a user interface (UI) that allows users to perform causal analysis on data without coding. The goal of this library is to provide a fast and flexible solution for a variety of problems in the domain of causality. This technical report describes the Salesforce CausalAI API along with its capabilities, the implementations of the supported algorithms, and experiments demonstrating their performance and speed. Our library is available at \url{https://github.com/salesforce/causalai}.  ( 2 min )
    Improved Bitcoin Price Prediction based on COVID-19 data. (arXiv:2301.10840v1 [cs.LG])
    Social turbulence can affect people financial decisions, causing changes in spending and saving. During a global turbulence as significant as the COVID-19 pandemic, such changes are inevitable. Here we examine how the effects of COVID-19 on various jurisdictions influenced the global price of Bitcoin. We hypothesize that lock downs and expectations of economic recession erode people trust in fiat (government-issued) currencies, thus elevating cryptocurrencies. Hence, we expect to identify a causal relation between the turbulence caused by the pandemic, demand for Bitcoin, and ultimately its price. To test the hypothesis, we merged datasets of Bitcoin prices and COVID-19 cases and deaths. We also engineered extra features and applied statistical and machine learning (ML) models. We applied a Random Forest model (RF) to identify and rank the feature importance, and ran a Long Short-Term Memory (LSTM) model on Bitcoin prices data set twice: with and without accounting for COVID-19 related features. We find that adding COVID-19 data into the LSTM model improved prediction of Bitcoin prices.  ( 2 min )
    On the inconsistency of separable losses for structured prediction. (arXiv:2301.10810v1 [cs.LG])
    In this paper, we prove that separable negative log-likelihood losses for structured prediction are not necessarily Bayes consistent, or, in other words, minimizing these losses may not result in a model that predicts the most probable structure in the data distribution for a given input. This fact opens the question of whether these losses are well-adapted for structured prediction and, if so, why.  ( 2 min )
    Unravelling physics beyond the standard model with classical and quantum anomaly detection. (arXiv:2301.10787v1 [hep-ex])
    Much hope for finding new physics phenomena at microscopic scale relies on the observations obtained from High Energy Physics experiments, like the ones performed at the Large Hadron Collider (LHC). However, current experiments do not indicate clear signs of new physics that could guide the development of additional Beyond Standard Model (BSM) theories. Identifying signatures of new physics out of the enormous amount of data produced at the LHC falls into the class of anomaly detection and constitutes one of the greatest computational challenges. In this article, we propose a novel strategy to perform anomaly detection in a supervised learning setting, based on the artificial creation of anomalies through a random process. For the resulting supervised learning problem, we successfully apply classical and quantum Support Vector Classifiers (CSVC and QSVC respectively) to identify the artificial anomalies among the SM events. Even more promising, we find that employing an SVC trained to identify the artificial anomalies, it is possible to identify realistic BSM events with high accuracy. In parallel, we also explore the potential of quantum algorithms for improving the classification accuracy and provide plausible conditions for the best exploitation of this novel computational paradigm.  ( 2 min )
    Increasing Fairness in Compromise on Accuracy via Weighted Vote with Learning Guarantees. (arXiv:2301.10813v1 [cs.LG])
    As the bias issue is being taken more and more seriously in widely applied machine learning systems, the decrease in accuracy in most cases deeply disturbs researchers when increasing fairness. To address this problem, we present a novel analysis of the expected fairness quality via weighted vote, suitable for both binary and multi-class classification. The analysis takes the correction of biased predictions by ensemble members into account and provides learning bounds that are amenable to efficient minimisation. We further propose a pruning method based on this analysis and the concepts of domination and Pareto optimality, which is able to increase fairness under a prerequisite of little or even no accuracy decline. The experimental results indicate that the proposed learning bounds are faithful and that the proposed pruning method can indeed increase ensemble fairness without much accuracy degradation.  ( 2 min )
    Gene-SGAN: a method for discovering disease subtypes with imaging and genetic signatures via multi-view weakly-supervised deep clustering. (arXiv:2301.10772v1 [q-bio.QM])
    Disease heterogeneity has been a critical challenge for precision diagnosis and treatment, especially in neurologic and neuropsychiatric diseases. Many diseases can display multiple distinct brain phenotypes across individuals, potentially reflecting disease subtypes that can be captured using MRI and machine learning methods. However, biological interpretability and treatment relevance are limited if the derived subtypes are not associated with genetic drivers or susceptibility factors. Herein, we describe Gene-SGAN - a multi-view, weakly-supervised deep clustering method - which dissects disease heterogeneity by jointly considering phenotypic and genetic data, thereby conferring genetic correlations to the disease subtypes and associated endophenotypic signatures. We first validate the generalizability, interpretability, and robustness of Gene-SGAN in semi-synthetic experiments. We then demonstrate its application to real multi-site datasets from 28,858 individuals, deriving subtypes of Alzheimer's disease and brain endophenotypes associated with hypertension, from MRI and SNP data. Derived brain phenotypes displayed significant differences in neuroanatomical patterns, genetic determinants, biological and clinical biomarkers, indicating potentially distinct underlying neuropathologic processes, genetic drivers, and susceptibility factors. Overall, Gene-SGAN is broadly applicable to disease subtyping and endophenotype discovery, and is herein tested on disease-related, genetically-driven neuroimaging phenotypes.  ( 2 min )
    Quantum anomaly detection in the latent space of proton collision events at the LHC. (arXiv:2301.10780v1 [quant-ph])
    We propose a new strategy for anomaly detection at the LHC based on unsupervised quantum machine learning algorithms. To accommodate the constraints on the problem size dictated by the limitations of current quantum hardware we develop a classical convolutional autoencoder. The designed quantum anomaly detection models, namely an unsupervised kernel machine and two clustering algorithms, are trained to find new-physics events in the latent representation of LHC data produced by the autoencoder. The performance of the quantum algorithms is benchmarked against classical counterparts on different new-physics scenarios and its dependence on the dimensionality of the latent space and the size of the training dataset is studied. For kernel-based anomaly detection, we identify a regime where the quantum model significantly outperforms its classical counterpart. An instance of the kernel machine is implemented on a quantum computer to verify its suitability for available hardware. We demonstrate that the observed consistent performance advantage is related to the inherent quantum properties of the circuit used.  ( 2 min )
    Evaluating Probabilistic Classifiers: The Triptych. (arXiv:2301.10803v1 [stat.ME])
    Probability forecasts for binary outcomes, often referred to as probabilistic classifiers or confidence scores, are ubiquitous in science and society, and methods for evaluating and comparing them are in great demand. We propose and study a triptych of diagnostic graphics that focus on distinct and complementary aspects of forecast performance: The reliability diagram addresses calibration, the receiver operating characteristic (ROC) curve diagnoses discrimination ability, and the Murphy diagram visualizes overall predictive performance and value. A Murphy curve shows a forecast's mean elementary scores, including the widely used misclassification rate, and the area under a Murphy curve equals the mean Brier score. For a calibrated forecast, the reliability curve lies on the diagonal, and for competing calibrated forecasts, the ROC and Murphy curves share the same number of crossing points. We invoke the recently developed CORP (Consistent, Optimally binned, Reproducible, and Pool-Adjacent-Violators (PAV) algorithm based) approach to craft reliability diagrams and decompose a mean score into miscalibration (MCB), discrimination (DSC), and uncertainty (UNC) components. Plots of the DSC measure of discrimination ability versus the calibration metric MCB visualize classifier performance across multiple competitors. The proposed tools are illustrated in empirical examples from astrophysics, economics, and social science.  ( 2 min )
    Graph Neural Tangent Kernel: Convergence on Large Graphs. (arXiv:2301.10808v1 [cs.LG])
    Graph neural networks (GNNs) achieve remarkable performance in graph machine learning tasks but can be hard to train on large-graph data, where their learning dynamics are not well understood. We investigate the training dynamics of large-graph GNNs using graph neural tangent kernels (GNTKs) and graphons. In the limit of large width, optimization of an overparametrized NN is equivalent to kernel regression on the NTK. Here, we investigate how the GNTK evolves as another independent dimension is varied: the graph size. We use graphons to define limit objects -- graphon NNs for GNNs, and graphon NTKs for GNTKs, and prove that, on a sequence of growing graphs, the GNTKs converge to the graphon NTK. We further prove that the eigenspaces of the GNTK, which are related to the problem learning directions and associated learning speeds, converge to the spectrum of the GNTK. This implies that in the large-graph limit, the GNTK fitted on a graph of moderate size can be used to solve the same task on the large-graph and infer the learning dynamics of the large-graph GNN. These results are verified empirically on node regression and node classification tasks.  ( 2 min )
    Generative Tertiary Structure-based RNA Design. (arXiv:2301.10774v1 [q-bio.BM])
    Learning from 3D biological macromolecules with artificial intelligence technologies has been an emerging area. Computational protein design, known as the inverse of protein structure prediction, aims to generate protein sequences that will fold into the defined structure. Analogous to protein design, RNA design is also an important topic in synthetic biology, which aims to generate RNA sequences by given structures. However, existing RNA design methods mainly focus on the secondary structure, ignoring the informative tertiary structure, which is commonly used in protein design. To explore the complex coupling between RNA sequence and 3D structure, we introduce an RNA tertiary structure modeling method to efficiently capture useful information from the 3D structure of RNA. For a fair comparison, we collect abundant RNA data and split the data according to tertiary structures. With the standard dataset, we conduct a benchmark by employing structure-based protein design approaches with our RNA tertiary structure modeling method. We believe our work will stimulate the future development of tertiary structure-based RNA design and bridge the gap between the RNA 3D structures and sequences.  ( 2 min )
  • Open

    Flowification: Everything is a Normalizing Flow. (arXiv:2205.15209v3 [cs.LG] UPDATED)
    The two key characteristics of a normalizing flow is that it is invertible (in particular, dimension preserving) and that it monitors the amount by which it changes the likelihood of data points as samples are propagated along the network. Recently, multiple generalizations of normalizing flows have been introduced that relax these two conditions. On the other hand, neural networks only perform a forward pass on the input, there is neither a notion of an inverse of a neural network nor is there one of its likelihood contribution. In this paper we argue that certain neural network architectures can be enriched with a stochastic inverse pass and that their likelihood contribution can be monitored in a way that they fall under the generalized notion of a normalizing flow mentioned above. We term this enrichment flowification. We prove that neural networks only containing linear layers, convolutional layers and invertible activations such as LeakyReLU can be flowified and evaluate them in the generative setting on image datasets.  ( 2 min )
    Smoothed Online Learning for Prediction in Piecewise Affine Systems. (arXiv:2301.11187v1 [stat.ML])
    The problem of piecewise affine (PWA) regression and planning is of foundational importance to the study of online learning, control, and robotics, where it provides a theoretically and empirically tractable setting to study systems undergoing sharp changes in the dynamics. Unfortunately, due to the discontinuities that arise when crossing into different ``pieces,'' learning in general sequential settings is impossible and practical algorithms are forced to resort to heuristic approaches. This paper builds on the recently developed smoothed online learning framework and provides the first algorithms for prediction and simulation in PWA systems whose regret is polynomial in all relevant problem parameters under a weak smoothness assumption; moreover, our algorithms are efficient in the number of calls to an optimization oracle. We further apply our results to the problems of one-step prediction and multi-step simulation regret in piecewise affine dynamical systems, where the learner is tasked with simulating trajectories and regret is measured in terms of the Wasserstein distance between simulated and true data. Along the way, we develop several technical tools of more general interest.  ( 2 min )
    Two-step interpretable modeling of Intensive Care Acquired Infections. (arXiv:2301.11146v1 [stat.AP])
    We present a novel methodology for integrating high resolution longitudinal data with the dynamic prediction capabilities of survival models. The aim is two-fold: to improve the predictive power while maintaining interpretability of the models. To go beyond the black box paradigm of artificial neural networks, we propose a parsimonious and robust semi-parametric approach (i.e., a landmarking competing risks model) that combines routinely collected low-resolution data with predictive features extracted from a convolutional neural network, that was trained on high resolution time-dependent information. We then use saliency maps to analyze and explain the extra predictive power of this model. To illustrate our methodology, we focus on healthcare-associated infections in patients admitted to an intensive care unit.  ( 2 min )
    Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons. (arXiv:2301.11270v1 [cs.LG])
    We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). Our analysis shows that when the true reward function is linear, the widely used maximum likelihood estimator (MLE) converges under both the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. However, we show that when training a policy based on the learned reward model, MLE fails while a pessimistic MLE provides policies with improved performance under certain coverage assumptions. Additionally, we demonstrate that under the PL model, the true MLE and an alternative MLE that splits the $K$-wise comparison into pairwise comparisons both converge. Moreover, the true MLE is asymptotically more efficient. Our results validate the empirical success of existing RLHF algorithms in InstructGPT and provide new insights for algorithm design. Furthermore, our results unify the problem of RLHF and Max Entropy Inverse Reinforcement Learning, and provide the first sample complexity bound for both problems.  ( 2 min )
    WL meet VC. (arXiv:2301.11039v1 [cs.LG])
    Recently, many works studied the expressive power of graph neural networks (GNNs) by linking it to the $1$-dimensional Weisfeiler--Leman algorithm ($1\text{-}\mathsf{WL}$). Here, the $1\text{-}\mathsf{WL}$ is a well-studied heuristic for the graph isomorphism problem, which iteratively colors or partitions a graph's vertex set. While this connection has led to significant advances in understanding and enhancing GNNs' expressive power, it does not provide insights into their generalization performance, i.e., their ability to make meaningful predictions beyond the training set. In this paper, we study GNNs' generalization ability through the lens of Vapnik--Chervonenkis (VC) dimension theory in two settings, focusing on graph-level predictions. First, when no upper bound on the graphs' order is known, we show that the bitlength of GNNs' weights tightly bounds their VC dimension. Further, we derive an upper bound for GNNs' VC dimension using the number of colors produced by the $1\text{-}\mathsf{WL}$. Secondly, when an upper bound on the graphs' order is known, we show a tight connection between the number of graphs distinguishable by the $1\text{-}\mathsf{WL}$ and GNNs' VC dimension. Our empirical study confirms the validity of our theoretical findings.  ( 2 min )
    Extending Adversarial Attacks to Produce Adversarial Class Probability Distributions. (arXiv:2004.06383v3 [cs.LG] UPDATED)
    Despite the remarkable performance and generalization levels of deep learning models in a wide range of artificial intelligence tasks, it has been demonstrated that these models can be easily fooled by the addition of imperceptible yet malicious perturbations to natural inputs. These altered inputs are known in the literature as adversarial examples. In this paper, we propose a novel probabilistic framework to generalize and extend adversarial attacks in order to produce a desired probability distribution for the classes when we apply the attack method to a large number of inputs. This novel attack paradigm provides the adversary with greater control over the target model, thereby exposing, in a wide range of scenarios, threats against deep learning models that cannot be conducted by the conventional paradigms. We introduce four different strategies to efficiently generate such attacks, and illustrate our approach by extending multiple adversarial attack algorithms. We also experimentally validate our approach for the spoken command classification task and the Tweet emotion classification task, two exemplary machine learning problems in the audio and text domain, respectively. Our results demonstrate that we can closely approximate any probability distribution for the classes while maintaining a high fooling rate and even prevent the attacks from being detected by label-shift detection methods.  ( 2 min )
    Maximum Optimality Margin: A Unified Approach for Contextual Linear Programming and Inverse Linear Programming. (arXiv:2301.11260v1 [cs.LG])
    In this paper, we study the predict-then-optimize problem where the output of a machine learning prediction task is used as the input of some downstream optimization problem, say, the objective coefficient vector of a linear program. The problem is also known as predictive analytics or contextual linear programming. The existing approaches largely suffer from either (i) optimization intractability (a non-convex objective function)/statistical inefficiency (a suboptimal generalization bound) or (ii) requiring strong condition(s) such as no constraint or loss calibration. We develop a new approach to the problem called \textit{maximum optimality margin} which designs the machine learning loss function by the optimality condition of the downstream optimization. The max-margin formulation enjoys both computational efficiency and good theoretical properties for the learning procedure. More importantly, our new approach only needs the observations of the optimal solution in the training data rather than the objective function, which makes it a new and natural approach to the inverse linear programming problem under both contextual and context-free settings; we also analyze the proposed method under both offline and online settings, and demonstrate its performance using numerical experiments.  ( 2 min )
    Inspecting class hierarchies in classification-based metric learning models. (arXiv:2301.11065v1 [cs.LG])
    Most classification models treat all misclassifications equally. However, different classes may be related, and these hierarchical relationships must be considered in some classification problems. These problems can be addressed by using hierarchical information during training. Unfortunately, this information is not available for all datasets. Many classification-based metric learning methods use class representatives in embedding space to represent different classes. The relationships among the learned class representatives can then be used to estimate class hierarchical structures. If we have a predefined class hierarchy, the learned class representatives can be assessed to determine whether the metric learning model learned semantic distances that match our prior knowledge. In this work, we train a softmax classifier and three metric learning models with several training options on benchmark and real-world datasets. In addition to the standard classification accuracy, we evaluate the hierarchical inference performance by inspecting learned class representatives and the hierarchy-informed performance, i.e., the classification performance, and the metric learning performance by considering predefined hierarchical structures. Furthermore, we investigate how the considered measures are affected by various models and training options. When our proposed ProxyDR model is trained without using predefined hierarchical structures, the hierarchical inference performance is significantly better than that of the popular NormFace model. Additionally, our model enhances some hierarchy-informed performance measures under the same training options. We also found that convolutional neural networks (CNNs) with random weights correspond to the predefined hierarchies better than random chance.  ( 2 min )
    Uncertain Evidence in Probabilistic Models and Stochastic Simulators. (arXiv:2210.12236v2 [stat.ML] UPDATED)
    We consider the problem of performing Bayesian inference in probabilistic models where observations are accompanied by uncertainty, referred to as "uncertain evidence." We explore how to interpret uncertain evidence, and by extension the importance of proper interpretation as it pertains to inference about latent variables. We consider a recently-proposed method "distributional evidence" as well as revisit two older methods: Jeffrey's rule and virtual evidence. We devise guidelines on how to account for uncertain evidence and we provide new insights, particularly regarding consistency. To showcase the impact of different interpretations of the same uncertain evidence, we carry out experiments in which one interpretation is defined as "correct." We then compare inference results from each different interpretation illustrating the importance of careful consideration of uncertain evidence.  ( 2 min )
    Your diffusion model secretly knows the dimension of the data manifold. (arXiv:2212.12611v2 [cs.LG] UPDATED)
    In this work, we propose a novel framework for estimating the dimension of the data manifold using a trained diffusion model. A diffusion model approximates the score function i.e. the gradient of the log density of a noise-corrupted version of the target distribution for varying levels of corruption. If the data concentrates around a manifold embedded in the high-dimensional ambient space, then as the level of corruption decreases, the score function points towards the manifold, as this direction becomes the direction of maximal likelihood increase. Therefore, for small levels of corruption, the diffusion model provides us with access to an approximation of the normal bundle of the data manifold. This allows us to estimate the dimension of the tangent space, thus, the intrinsic dimension of the data manifold. To the best of our knowledge, our method is the first deep-learning based estimator of the data manifold dimension and it outperforms well established statistical estimators in controlled experiments on both Euclidean and image data.  ( 2 min )
    Neural Continuous-Discrete State Space Models for Irregularly-Sampled Time Series. (arXiv:2301.11308v1 [cs.LG])
    Learning accurate predictive models of real-world dynamic phenomena (e.g., climate, biological) remains a challenging task. One key issue is that the data generated by both natural and artificial processes often comprise time series that are irregularly sampled and/or contain missing observations. In this work, we propose the Neural Continuous-Discrete State Space Model (NCDSSM) for continuous-time modeling of time series through discrete-time observations. NCDSSM employs auxiliary variables to disentangle recognition from dynamics, thus requiring amortized inference only for the auxiliary variables. Leveraging techniques from continuous-discrete filtering theory, we demonstrate how to perform accurate Bayesian inference for the dynamic states. We propose three flexible parameterizations of the latent dynamics and an efficient training objective that marginalizes the dynamic states during inference. Empirical results on multiple benchmark datasets across various domains show improved imputation and forecasting performance of NCDSSM over existing models.  ( 2 min )
    Causal Inference with Hidden Mediators. (arXiv:2111.02927v2 [math.ST] UPDATED)
    Proximal causal inference was recently proposed as a framework to identify causal effects from observational data in the presence of hidden confounders for which proxies are available. In this paper, we extend the proximal causal inference approach to settings where identification of causal effects hinges upon a set of mediators which are not observed, yet error prone proxies of the hidden mediators are measured. Specifically, (i) We establish causal hidden mediation analysis, which extends classical causal mediation analysis methods for identifying natural direct and indirect effects under no unmeasured confounding to a setting where the mediator of interest is hidden, but proxies of it are available. (ii) We establish hidden front-door criterion, which extends the classical front-door criterion to allow for hidden mediators for which proxies are available. (iii) We show that the identification of a certain causal effect called population intervention indirect effect remains possible with hidden mediators in settings where challenges in (i) and (ii) might co-exist. We view (i)-(iii) as important steps towards the practical application of front-door criteria and mediation analysis as mediators are almost always measured with error and thus, the most one can hope for in practice is that the measurements are at best proxies of mediating mechanisms. We propose identification approaches for the parameters of interest in our considered models. For the estimation aspect, we propose an influence function-based estimation method and provide an analysis for the robustness of the estimators.  ( 2 min )
    Incorporating Prior Knowledge into Neural Networks through an Implicit Composite Kernel. (arXiv:2205.07384v5 [cs.LG] UPDATED)
    It is challenging to guide neural network (NN) learning with prior knowledge. In contrast, many known properties, such as spatial smoothness or seasonality, are straightforward to model by choosing an appropriate kernel in a Gaussian process (GP). Many deep learning applications could be enhanced by modeling such known properties. For example, convolutional neural networks (CNNs) are frequently used in remote sensing, which is subject to strong seasonal effects. We propose to blend the strengths of deep learning and the clear modeling capabilities of GPs by using a composite kernel that combines a kernel implicitly defined by a neural network with a second kernel function chosen to model known properties (e.g., seasonality). We implement this idea by combining a deep network and an efficient mapping based on the Nystrom approximation, which we call Implicit Composite Kernel (ICK). We then adopt a sample-then-optimize approach to approximate the full GP posterior distribution. We demonstrate that ICK has superior performance and flexibility on both synthetic and real-world data sets. We believe that ICK framework can be used to include prior information into neural networks in many applications.  ( 2 min )
    On the Dissipation of Ideal Hamiltonian Monte Carlo Sampler. (arXiv:2209.07438v2 [stat.CO] UPDATED)
    We report on what seems to be an intriguing connection between variable integration time and partial velocity refreshment of Ideal Hamiltonian Monte Carlo samplers, both of which can be used for reducing the dissipative behavior of the dynamics. More concretely, we show that on quadratic potentials, efficiency can be improved through these means by a $\sqrt{\kappa}$ factor in Wasserstein-2 distance, compared to classical constant integration time, fully refreshed HMC. We additionally explore the benefit of randomized integrators for simulating the Hamiltonian dynamics under higher order regularity conditions.  ( 2 min )
    Conformal Prediction for Trustworthy Detection of Railway Signals. (arXiv:2301.11136v1 [stat.ML])
    We present an application of conformal prediction, a form of uncertainty quantification with guarantees, to the detection of railway signals. State-of-the-art architectures are tested and the most promising one undergoes the process of conformalization, where a correction is applied to the predicted bounding boxes (i.e. to their height and width) such that they comply with a predefined probability of success. We work with a novel exploratory dataset of images taken from the perspective of a train operator, as a first step to build and validate future trustworthy machine learning models for the detection of railway signals.  ( 2 min )
    Scale-Free Adversarial Multi-Armed Bandit with Arbitrary Feedback Delays. (arXiv:2110.13400v3 [cs.LG] UPDATED)
    We consider the Scale-Free Adversarial Multi-Armed Bandit (MAB) problem with unrestricted feedback delays. In contrast to the standard assumption that all losses are $[0,1]$-bounded, in our setting, losses can fall in a general bounded interval $[-L, L]$, unknown to the agent beforehand. Furthermore, the feedback of each arm pull can experience arbitrary delays. We propose a novel approach named Scale-Free Delayed INF (SFD-INF) for this novel setting, which combines a recent "convex combination trick" together with a novel doubling and skipping technique. We then present two instances of SFD-INF, each with carefully designed delay-adapted learning scales. The first one SFD-TINF uses $\frac 12$-Tsallis entropy regularizer and can achieve $\widetilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret when the losses are non-negative, where $K$ is the number of actions, $T$ is the number of steps, and $D$ is the total feedback delay. This bound nearly matches the $\Omega((\sqrt{KT}+\sqrt{D\log K})L)$ lower-bound when regarding $K$ as a constant independent of $T$. The second one, SFD-LBINF, works for general scale-free losses and achieves a small-loss style adaptive regret bound $\widetilde{\mathcal O}(\sqrt{K\mathbb{E}[\tilde{\mathfrak L}_T^2]}+\sqrt{KDL})$, which falls to the $\widetilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret in the worst case and is thus more general than SFD-TINF despite a more complicated analysis and several extra logarithmic dependencies. Moreover, both instances also outperform the existing algorithms for non-delayed (i.e., $D=0$) scale-free adversarial MAB problems, which can be of independent interest.  ( 2 min )
    Evaluating Probabilistic Classifiers: The Triptych. (arXiv:2301.10803v1 [stat.ME])
    Probability forecasts for binary outcomes, often referred to as probabilistic classifiers or confidence scores, are ubiquitous in science and society, and methods for evaluating and comparing them are in great demand. We propose and study a triptych of diagnostic graphics that focus on distinct and complementary aspects of forecast performance: The reliability diagram addresses calibration, the receiver operating characteristic (ROC) curve diagnoses discrimination ability, and the Murphy diagram visualizes overall predictive performance and value. A Murphy curve shows a forecast's mean elementary scores, including the widely used misclassification rate, and the area under a Murphy curve equals the mean Brier score. For a calibrated forecast, the reliability curve lies on the diagonal, and for competing calibrated forecasts, the ROC and Murphy curves share the same number of crossing points. We invoke the recently developed CORP (Consistent, Optimally binned, Reproducible, and Pool-Adjacent-Violators (PAV) algorithm based) approach to craft reliability diagrams and decompose a mean score into miscalibration (MCB), discrimination (DSC), and uncertainty (UNC) components. Plots of the DSC measure of discrimination ability versus the calibration metric MCB visualize classifier performance across multiple competitors. The proposed tools are illustrated in empirical examples from astrophysics, economics, and social science.  ( 2 min )
    Banker Online Mirror Descent. (arXiv:2106.08943v2 [cs.LG] UPDATED)
    We propose Banker-OMD, a novel framework generalizing the classical Online Mirror Descent (OMD) technique in online learning algorithm design. Banker-OMD allows algorithms to robustly handle delayed feedback, and offers a general methodology for achieving $\tilde{O}(\sqrt{T} + \sqrt{D})$-style regret bounds in various delayed-feedback online learning tasks, where $T$ is the time horizon length and $D$ is the total feedback delay. We demonstrate the power of Banker-OMD with applications to three important bandit scenarios with delayed feedback, including delayed adversarial Multi-armed bandits (MAB), delayed adversarial linear bandits, and a novel delayed best-of-both-worlds MAB setting. Banker-OMD achieves nearly-optimal performance in all the three settings. In particular, it leads to the first delayed adversarial linear bandit algorithm achieving $\tilde{O}(\text{poly}(n)(\sqrt{T} + \sqrt{D}))$ regret.  ( 2 min )
    Proximal Causal Learning of Heterogeneous Treatment Effects. (arXiv:2301.10913v1 [stat.ML])
    Efficiently and flexibly estimating treatment effect heterogeneity is an important task in a wide variety of settings ranging from medicine to marketing, and there are a considerable number of promising conditional average treatment effect estimators currently available. These, however, typically rely on the assumption that the measured covariates are enough to justify conditional exchangeability. We propose the P-learner, motivated by the R-learner, a tailored two-stage loss function for learning heterogeneous treatment effects in settings where exchangeability given observed covariates is an implausible assumption, and we wish to rely on proxy variables for causal inference. Our proposed estimator can be implemented by off-the-shelf loss-minimizing machine learning methods, which in the case of kernel regression satisfies an oracle bound on the estimated error as long as the nuisance components are estimated reasonably well.  ( 2 min )
    KSD Aggregated Goodness-of-fit Test. (arXiv:2202.00824v5 [stat.ML] UPDATED)
    We investigate properties of goodness-of-fit tests based on the Kernel Stein Discrepancy (KSD). We introduce a strategy to construct a test, called KSDAgg, which aggregates multiple tests with different kernels. KSDAgg avoids splitting the data to perform kernel selection (which leads to a loss in test power), and rather maximises the test power over a collection of kernels. We provide non-asymptotic guarantees on the power of KSDAgg: we show it achieves the smallest uniform separation rate of the collection, up to a logarithmic term. For compactly supported densities with bounded model score function, we derive the rate for KSDAgg over restricted Sobolev balls; this rate corresponds to the minimax optimal rate over unrestricted Sobolev balls, up to an iterated logarithmic term. KSDAgg can be computed exactly in practice as it relies either on a parametric bootstrap or on a wild bootstrap to estimate the quantiles and the level corrections. In particular, for the crucial choice of bandwidth of a fixed kernel, it avoids resorting to arbitrary heuristics (such as median or standard deviation) or to data splitting. We find on both synthetic and real-world data that KSDAgg outperforms other state-of-the-art quadratic-time adaptive KSD-based goodness-of-fit testing procedures.  ( 2 min )
    Coin Sampling: Gradient-Based Bayesian Inference without Learning Rates. (arXiv:2301.11294v1 [stat.ML])
    In recent years, particle-based variational inference (ParVI) methods such as Stein variational gradient descent (SVGD) have grown in popularity as scalable methods for Bayesian inference. Unfortunately, the properties of such methods invariably depend on hyperparameters such as the learning rate, which must be carefully tuned by the practitioner in order to ensure convergence to the target measure at a suitable rate. In this paper, we introduce a suite of new particle-based methods for scalable Bayesian inference based on coin betting, which are entirely learning-rate free. We illustrate the performance of our approach on a range of numerical examples, including several high-dimensional models and datasets, demonstrating comparable performance to other ParVI algorithms.  ( 2 min )
    Re-embedding data to strengthen recovery guarantees of clustering. (arXiv:2301.10901v1 [cs.LG])
    We propose a clustering method that involves chaining four known techniques into a pipeline yielding an algorithm with stronger recovery guarantees than any of the four components separately. Given $n$ points in $\mathbb R^d$, the first component of our pipeline, which we call leapfrog distances, is reminiscent of density-based clustering, yielding an $n\times n$ distance matrix. The leapfrog distances are then translated to new embeddings using multidimensional scaling and spectral methods, two other known techniques, yielding new embeddings of the $n$ points in $\mathbb R^{d'}$, where $d'$ satisfies $d'\ll d$ in general. Finally, sum-of-norms (SON) clustering is applied to the re-embedded points. Although the fourth step (SON clustering) can in principle be replaced by any other clustering method, our focus is on provable guarantees of recovery of underlying structure. Therefore, we establish that the re-embedding improves recovery SON clustering, since SON clustering is a well-studied method that already has provable guarantees.  ( 2 min )
    Granger Causal Chain Discovery for Sepsis-Associated Derangements via Multivariate Hawkes Processes. (arXiv:2209.04480v4 [stat.AP] UPDATED)
    Modern health care systems are conducting continuous, automated surveillance of the electronic medical record (EMR) to identify adverse events with increasing frequency; however, many events such as sepsis do not have elucidated prodromes (i.e., event chains) that can be used to identify and intercept the adverse event early in its course. Currently, there does not exist reliable framework for discovering or describing causal chains that precede adverse hospital events. Clinically relevant and interpretable results require a framework that can (1) infer temporal interactions across multiple patient features found in EMR data (e.g., labs, vital signs, etc.) and (2) can identify patterns that precede and are specific to an impending adverse event (e.g., sepsis). In this work, we propose a linear multivariate Hawkes process model, coupled with ReLU link function, to recover a Granger Causal (GC) graph with both exciting and inhibiting effects. We develop a scalable two-phase gradient-based method to maximize a surrogate-likelihood and estimate the problem parameters, which is shown to be effective via extensive numerical simulation. Our method is subsequently extended to a data set of patients admitted to an academic level 1 trauma center located in Atalanta, GA, where the estimated GC graph identifies several highly interpretable chains that precede sepsis. Here, we demonstrate the effectiveness of our approach in learning a GC graph over Sepsis Associated Derangements (SADs), but it can be generalized to other applications with similar requirements.  ( 2 min )
    Efficient Aggregated Kernel Tests using Incomplete $U$-statistics. (arXiv:2206.09194v3 [stat.ML] UPDATED)
    We propose a series of computationally efficient nonparametric tests for the two-sample, independence, and goodness-of-fit problems, using the Maximum Mean Discrepancy (MMD), Hilbert Schmidt Independence Criterion (HSIC), and Kernel Stein Discrepancy (KSD), respectively. Our test statistics are incomplete $U$-statistics, with a computational cost that interpolates between linear time in the number of samples, and quadratic time, as associated with classical $U$-statistic tests. The three proposed tests aggregate over several kernel bandwidths to detect departures from the null on various scales: we call the resulting tests MMDAggInc, HSICAggInc and KSDAggInc. This procedure provides a solution to the fundamental kernel selection problem as we can aggregate a large number of kernels with several bandwidths without incurring a significant loss of test power. For the test thresholds, we derive a quantile bound for wild bootstrapped incomplete $U$-statistics, which is of independent interest. We derive non-asymptotic uniform separation rates for MMDAggInc and HSICAggInc, and quantify exactly the trade-off between computational efficiency and the attainable rates: this result is novel for tests based on incomplete $U$-statistics, to our knowledge. We further show that in the quadratic-time case, the wild bootstrap incurs no penalty to test power over the more widespread permutation-based approach, since both attain the same minimax optimal rates (which in turn match the rates that use oracle quantiles). We support our claims with numerical experiments on the trade-off between computational efficiency and test power. In all three testing frameworks, the linear-time versions of our proposed tests perform at least as well as the current linear-time state-of-the-art tests.  ( 2 min )
    Bayesian Detection of Mesoscale Structures in Pathway Data on Graphs. (arXiv:2301.11120v1 [stat.ME])
    Mesoscale structures are an integral part of the abstraction and analysis of complex systems. They reveal a node's function in the network, and facilitate our understanding of the network dynamics. For example, they can represent communities in social or citation networks, roles in corporate interactions, or core-periphery structures in transportation networks. We usually detect mesoscale structures under the assumption of independence of interactions. Still, in many cases, the interactions invalidate this assumption by occurring in a specific order. Such patterns emerge in pathway data; to capture them, we have to model the dependencies between interactions using higher-order network models. However, the detection of mesoscale structures in higher-order networks is still under-researched. In this work, we derive a Bayesian approach that simultaneously models the optimal partitioning of nodes in groups and the optimal higher-order network dynamics between the groups. In synthetic data we demonstrate that our method can recover both standard proximity-based communities and role-based groupings of nodes. In synthetic and real world data we show that it can compete with baseline techniques, while additionally providing interpretable abstractions of network dynamics.  ( 2 min )
    Minimax estimation of discontinuous optimal transport maps: The semi-discrete case. (arXiv:2301.11302v1 [math.ST])
    We consider the problem of estimating the optimal transport map between two probability distributions, $P$ and $Q$ in $\mathbb R^d$, on the basis of i.i.d. samples. All existing statistical analyses of this problem require the assumption that the transport map is Lipschitz, a strong requirement that, in particular, excludes any examples where the transport map is discontinuous. As a first step towards developing estimation procedures for discontinuous maps, we consider the important special case where the data distribution $Q$ is a discrete measure supported on a finite number of points in $\mathbb R^d$. We study a computationally efficient estimator initially proposed by Pooladian and Niles-Weed (2021), based on entropic optimal transport, and show in the semi-discrete setting that it converges at the minimax-optimal rate $n^{-1/2}$, independent of dimension. Other standard map estimation techniques both lack finite-sample guarantees in this setting and provably suffer from the curse of dimensionality. We confirm these results in numerical experiments, and provide experiments for other settings, not covered by our theory, which indicate that the entropic estimator is a promising methodology for other discontinuous transport map estimation problems.  ( 2 min )
    Returning The Favour: When Regression Benefits From Probabilistic Causal Knowledge. (arXiv:2301.11214v1 [stat.ML])
    A directed acyclic graph (DAG) provides valuable prior knowledge that is often discarded in regression tasks in machine learning. We show that the independences arising from the presence of collider structures in DAGs provide meaningful inductive biases, which constrain the regression hypothesis space and improve predictive performance. We introduce collider regression, a framework to incorporate probabilistic causal knowledge from a collider in a regression problem. When the hypothesis space is a reproducing kernel Hilbert space, we prove a strictly positive generalisation benefit under mild assumptions and provide closed-form estimators of the empirical risk minimiser. Experiments on synthetic and climate model data demonstrate performance gains of the proposed methodology.  ( 2 min )
    Learning from Mistakes: Self-Regularizing Hierarchical Semantic Representations in Point Cloud Segmentation. (arXiv:2301.11145v1 [cs.CV])
    Recent advances in autonomous robotic technologies have highlighted the growing need for precise environmental analysis. LiDAR semantic segmentation has gained attention to accomplish fine-grained scene understanding by acting directly on raw content provided by sensors. Recent solutions showed how different learning techniques can be used to improve the performance of the model, without any architectural or dataset change. Following this trend, we present a coarse-to-fine setup that LEArns from classification mistaKes (LEAK) derived from a standard model. First, classes are clustered into macro groups according to mutual prediction errors; then, the learning process is regularized by: (1) aligning class-conditional prototypical feature representation for both fine and coarse classes, (2) weighting instances with a per-class fairness index. Our LEAK approach is very general and can be seamlessly applied on top of any segmentation architecture; indeed, experimental results showed that it enables state-of-the-art performances on different architectures, datasets and tasks, while ensuring more balanced class-wise results and faster convergence.  ( 2 min )
    Learning Large Scale Sparse Models. (arXiv:2301.10958v1 [stat.ML])
    In this work, we consider learning sparse models in large scale settings, where the number of samples and the feature dimension can grow as large as millions or billions. Two immediate issues occur under such challenging scenario: (i) computational cost; (ii) memory overhead. In particular, the memory issue precludes a large volume of prior algorithms that are based on batch optimization technique. To remedy the problem, we propose to learn sparse models such as Lasso in an online manner where in each iteration, only one randomly chosen sample is revealed to update a sparse iterate. Thereby, the memory cost is independent of the sample size and gradient evaluation for one sample is efficient. Perhaps amazingly, we find that with the same parameter, sparsity promoted by batch methods is not preserved in online fashion. We analyze such interesting phenomenon and illustrate some effective variants including mini-batch methods and a hard thresholding based stochastic gradient algorithm. Extensive experiments are carried out on a public dataset which supports our findings and algorithms.  ( 2 min )
    On the inconsistency of separable losses for structured prediction. (arXiv:2301.10810v1 [cs.LG])
    In this paper, we prove that separable negative log-likelihood losses for structured prediction are not necessarily Bayes consistent, or, in other words, minimizing these losses may not result in a model that predicts the most probable structure in the data distribution for a given input. This fact opens the question of whether these losses are well-adapted for structured prediction and, if so, why.  ( 2 min )
    simple diffusion: End-to-end diffusion for high resolution images. (arXiv:2301.11093v1 [cs.CV])
    Currently, applying diffusion models in pixel space of high resolution images is difficult. Instead, existing approaches focus on diffusion in lower dimensional spaces (latent diffusion), or have multiple super-resolution levels of generation referred to as cascades. The downside is that these approaches add additional complexity to the diffusion framework. This paper aims to improve denoising diffusion for high resolution images while keeping the model as simple as possible. The paper is centered around the research question: How can one train a standard denoising diffusion models on high resolution images, and still obtain performance comparable to these alternate approaches? The four main findings are: 1) the noise schedule should be adjusted for high resolution images, 2) It is sufficient to scale only a particular part of the architecture, 3) dropout should be added at specific locations in the architecture, and 4) downsampling is an effective strategy to avoid high resolution feature maps. Combining these simple yet effective techniques, we achieve state-of-the-art on image generation among diffusion models without sampling modifiers on ImageNet.  ( 2 min )
    Graph Encoder Ensemble for Simultaneous Vertex Embedding and Community Detection. (arXiv:2301.11290v1 [cs.SI])
    In this paper we propose a novel and computationally efficient method to simultaneously achieve vertex embedding, community detection, and community size determination. By utilizing a normalized one-hot graph encoder and a new rank-based cluster size measure, the proposed graph encoder ensemble algorithm achieves excellent numerical performance throughout a variety of simulations and real data experiments.  ( 2 min )
    Random Grid Neural Processes for Parametric Partial Differential Equations. (arXiv:2301.11040v1 [cs.LG])
    We introduce a new class of spatially stochastic physics and data informed deep latent models for parametric partial differential equations (PDEs) which operate through scalable variational neural processes. We achieve this by assigning probability measures to the spatial domain, which allows us to treat collocation grids probabilistically as random variables to be marginalised out. Adapting this spatial statistics view, we solve forward and inverse problems for parametric PDEs in a way that leads to the construction of Gaussian process models of solution fields. The implementation of these random grids poses a unique set of challenges for inverse physics informed deep learning frameworks and we propose a new architecture called Grid Invariant Convolutional Networks (GICNets) to overcome these challenges. We further show how to incorporate noisy data in a principled manner into our physics informed model to improve predictions for problems where data may be available but whose measurement location does not coincide with any fixed mesh or grid. The proposed method is tested on a nonlinear Poisson problem, Burgers equation, and Navier-Stokes equations, and we provide extensive numerical comparisons. We demonstrate significant computational advantages over current physics informed neural learning methods for parametric PDEs while improving the predictive capabilities and flexibility of these models.  ( 2 min )
    On the Global Convergence of Risk-Averse Policy Gradient Methods with Dynamic Time-Consistent Risk Measures. (arXiv:2301.10932v1 [cs.LG])
    Risk-sensitive reinforcement learning (RL) has become a popular tool to control the risk of uncertain outcomes and ensure reliable performance in various sequential decision-making problems. While policy gradient methods have been developed for risk-sensitive RL, it remains unclear if these methods enjoy the same global convergence guarantees as in the risk-neutral case. In this paper, we consider a class of dynamic time-consistent risk measures, called Expected Conditional Risk Measures (ECRMs), and derive policy gradient updates for ECRM-based objective functions. Under both constrained direct parameterization and unconstrained softmax parameterization, we provide global convergence of the corresponding risk-averse policy gradient algorithms. We further test a risk-averse variant of REINFORCE algorithm on a stochastic Cliffwalk environment to demonstrate the efficacy of our algorithm and the importance of risk control.  ( 2 min )
    Graph Neural Tangent Kernel: Convergence on Large Graphs. (arXiv:2301.10808v1 [cs.LG])
    Graph neural networks (GNNs) achieve remarkable performance in graph machine learning tasks but can be hard to train on large-graph data, where their learning dynamics are not well understood. We investigate the training dynamics of large-graph GNNs using graph neural tangent kernels (GNTKs) and graphons. In the limit of large width, optimization of an overparametrized NN is equivalent to kernel regression on the NTK. Here, we investigate how the GNTK evolves as another independent dimension is varied: the graph size. We use graphons to define limit objects -- graphon NNs for GNNs, and graphon NTKs for GNTKs, and prove that, on a sequence of growing graphs, the GNTKs converge to the graphon NTK. We further prove that the eigenspaces of the GNTK, which are related to the problem learning directions and associated learning speeds, converge to the spectrum of the GNTK. This implies that in the large-graph limit, the GNTK fitted on a graph of moderate size can be used to solve the same task on the large-graph and infer the learning dynamics of the large-graph GNN. These results are verified empirically on node regression and node classification tasks.  ( 2 min )

  • Open

    ChatGPT can definitely print Russian propaganda including why Prime Minister Justin Trudeau should be charged with treason despite its Wikipedia page
    submitted by /u/Robinsonc1988 [link] [comments]  ( 40 min )
    Text-To-4D Dynamic Scene Generation
    submitted by /u/bperki8 [link] [comments]  ( 40 min )
    Bright Eye: mobile AI app that generates art, code, poems, essays, short stories, answers questions, and more!
    Bright Eye: mobile AI app that generates art, code, poems, essays, short stories, answers questions, and more! Hey guys, I’m the cofounder of a tech startup focused on providing free AI services. We’re one of the first mobile multipurpose AI apps. We’ve developed a pretty cool app that offers AI services like image generation, code generation, image captioning, and more for free. We’re sort of like a Swiss Army knife of generative and analytical AI. We’ve released a new feature called AAIA(Ask AI Anything), which is capable of answering all types of questions, even requests to generate literature, storylines, answer questions and more, (think of chatgpt). We’d love to have some people try it out, give us feedback, and keep in touch with us. https://apps.apple.com/us/app/bright-eye/id1593932475 submitted by /u/SonnyDoge22 [link] [comments]  ( 41 min )
    We need AI to take over all creative mediums and hobbies
    Before you quick draw shoot the messenger here, I want you to hear me out. This is coming from a person who enjoys drawing digital art and a person who used to make music for fun. Humans tend to be very selfish creatures, not in an evil sense, but rather a "phew, glad that was you and not me" sense, like if someone stepped in a pile of of dog poop that sucks for them, but it's not your problem. Once everybodies creative (not labor intenseive) livelyhoods get taken, then yeah, I could see how now AI replicating humanity could be a problem. Then maybe after that, people can enjoy not being replaced. That is all I have to say, god bless my reddit karma, I can feel this subreddit getting ready to dislike bomb this, but as long as this message gets out, that's all I want. submitted by /u/Zima_Re-L [link] [comments]  ( 41 min )
    AI Dream 150 - MY INCREDIBLE DREAM VISUALIZED BY AI - Part3 TEASER - AI ...
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Humanity May Reach Singularity Within Just 7 Years, Trend Shows
    submitted by /u/Tao_Dragon [link] [comments]  ( 40 min )
    VoiceGPT - ChatGPT Voice Assistant
    submitted by /u/nickbild [link] [comments]  ( 40 min )
    The Big Tech Royal Rumble for AI
    submitted by /u/foundersblock [link] [comments]  ( 40 min )
    🚀Rodin: 3D Avatars Using Diffusion
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    Google MusicLM turns language into music
    submitted by /u/Number_5_alive [link] [comments]  ( 40 min )
    Looking to Convert Photos to Video - Anything Out There?
    I want to be able to upload several photos (10 - 100) and have the AI generate a video using those photos. I've seen Genmo, but that only uses one photo at a time. Is there anything that may be able to do this? Thanks submitted by /u/venicerocco [link] [comments]  ( 40 min )
    Freaky A.I concept..
    submitted by /u/KTMark [link] [comments]  ( 40 min )
    fully AI made video i found on yt
    https://www.youtube.com/watch?v=Vw-t826JcDQ submitted by /u/Optimal_Studio_2050 [link] [comments]  ( 40 min )
    Is there any AI that is able to decipher the lyrics of a song?
    Specifically this one: https://www.youtube.com/watch?v=MFv7apjatwM&ab_channel=Lux-Topic If there is no current AI that is able to listen to a song and write down the lyrics accurately, then I provide this idea freely. submitted by /u/A_Very_Horny_Zed [link] [comments]  ( 40 min )
    📌[Searchcolab] Voice Cloning + Image Processing + Lip Syncing. Link in comments
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    Pix2Pix AI Model Inside Stable Diffusion Installation Guide!
    submitted by /u/PuppetHere [link] [comments]  ( 40 min )
    What will happen to the internet once ChatGPT disintermediates websites?
    If ChatGPT and AI search engines become the central place for acquiring knowledge, essentially scraping and disintermediating websites; what commercial incentive will many website owners have to generate new content once their ad revenue is gone? What will happen to the internet? submitted by /u/DelPrive235 [link] [comments]  ( 41 min )
    🧬 The Age of A.I., Longevity and Biotech - Is a "Synthetic biology singularity" coming?
    submitted by /u/BackgroundResult [link] [comments]  ( 41 min )
    Outsmarting AI Detection Tools: How to Make Your AI-Generated Content Fly Under the Radar
    submitted by /u/PapaDudu [link] [comments]  ( 43 min )
    Don’t Chat With ChatGPT: Amazon’s Warning To Employees
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    what is the best way to upscale comics? BSRGAN doesn't work well with text.
    submitted by /u/mhczbnoykrqvzazfth [link] [comments]  ( 41 min )
    📌[Searchcolab] Text-To-4D Dynamic Scene Generation.
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 41 min )
    Why is AI better at visual art than music generation?
    I've just been wondering why people have been able to develop AI models such as GANs and more recently diffusion models that are so good at creating images but music generation has remained somewhat stagnant. From what I understand, there are two ways AI are generally trained on music: through MIDI files and using raw audio. The MIDI files are probably easier for the AI to work with but there's massive loss of relevant data that an AI would need to really become proficient in music generation. On the other hand, AIs trained on the raw audio tend to produce somewhat fuzzy audio that often includes what sounds like nearly distinguishable lyrics. I'm personally convinced the latter is the method that will be universally used in the future, but it clearly has a long way to go. So does anyone have any insights regarding why it's straggling a bit? Perhaps it's because software like Stable Diffusion are piggybacking off the image recognition advancements that have been in very high demand the past couple decades? Perhaps music is, in fact, a more complicated or more strict artform and our human minds are biased to think that these two tasks should be roughly equivalent in difficulty? It's certainly not due to a lack of data to train on, so maybe we just don't have a model that's really suited to analyzing and imitating music. submitted by /u/No-Phrase1116 [link] [comments]  ( 44 min )
  • Open

    MusicLM: Generating Music From Text
    submitted by /u/nickb [link] [comments]  ( 40 min )
    🚀Rodin: 3D Avatars Using Diffusion
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
  • Open

    Pygame window does not closes and kernel freezes
    I was rendering 'MountainCar-v0' environment in "human" render_mode. When I call env.close(), my pygame window doesn't closes automatically and I need to Force Quit it, and unfortunately my kernel dies. I'm using Jupyter Notebook. Following are my versions: PC: Mac OS Ventura 13.1 - Python: Python 3.9.9 - Jupyter:jupyter 1.0.0 - gym: 0.26.2 - ipykernel: 6.6.0 - ipython: 7.30.1 ​ Please let me know how I can solve this issue, Thanks! submitted by /u/Character-Yellow-796 [link] [comments]  ( 41 min )
    Tuning hyperparamters in RL.
    Hello everyone. I have a question about hyperparameter tuning for PPO. usually, when we tune them ( using Optuna for example), each set of hyperparameters is tested for a small number of steps ( 150000 for example ) then we pick the ones that yielded the best reward. Yet, in the long run, the asymptotic convergence might not be reached by the best hyperparameters found during tuning but rather, other hyperparameters that performed worse during tuning are the ones that might results in better asymptotic convergence. Is there any way to overcome this issue and maybe find the best hyperparameters without having to test for a large number of steps. Note: I'm using a custom environment with a continuous action space and image observations. submitted by /u/Many_Reception_4921 [link] [comments]  ( 42 min )
    Model-based hierarchical reinforcement learning
    Hi, do you know papers that combine model-based and hierarchical reinforcement learning, where also the lower level is a model-based approach? I cannot find sufficient paper about it submitted by /u/aika98oe [link] [comments]  ( 42 min )
  • Open

    [D] Monitoring and Retraining Models with Label-Changing Interventions
    When a trained ML model is implemented to predict an adverse event, the user might take steps to avoid that event. The outcome of the user's actions could either be successful or failure. In training with strictly observational data, a typical confusion matrix contains: ŷ=0, y=0 -> True Negative ŷ=0, y=1 -> False Negative ŷ=1, y=1 -> True Positive ŷ=1, y=0 -> False Positive When using the model, some results get confounded if the user acts based on the predictions ŷ=0, y=0, a=0 -> True Negative, No Intervention ŷ=0, y=1, a=0 -> False Negative, No Intervention ŷ=1, y=1, a=1 -> True Positive, Failed Intervention ŷ=1, y=0, a=1 -> False Positive OR Successful Intervention Ignoring the possibility that the intervention caused the adverse event, the involvement of the user may lead to an increase in the number of false positives that are perceived. Continuous monitoring becomes difficult due to perceived faster degradation of the model. Furthermore, retraining the model in the future may be hindered by labels that do not accurately reflect the true values. One approach that I've been proposed is to make sure there is always a hold-out set. Allow some random records to get scores, but do never act on them. This gives both a monitoring and retraining dataset. Are there other solutions that people use here? I've found the papers below, but I cannot say that I completely understand how to practically implement them. Monitoring machine learning (ML)-based risk prediction algorithms in the presence of confounding medical interventions (https://arxiv.org/abs/2211.09781) Model updating after interventions paradoxically introduces bias (https://arxiv.org/abs/2010.11530) submitted by /u/waiting4omscs [link] [comments]  ( 43 min )
    [P] Using algorithms or models from papers for commercial use
    Hey! I am reading the GET3D paper by Nvidia. The paper is listed with the Nvidia license which states: 3.3 Use Limitation. The Work and any derivative works thereof only may be used or intended for use non-commercially. The Work or derivative works thereof may be used or intended for use by Nvidia or its affiliates commercially or non-commercially. As used herein, "non-commercially" means for research or evaluation purposes only and not for any direct or indirect monetary gain. Does it mean there is no commercial way of using the ideas in the paper? Is it possible to use the ideas from that paper or any other paper by Nvidia in some product? As the idea from the paper is only the tool or a part of the product but is not the product itself. submitted by /u/romantimm25 [link] [comments]  ( 44 min )
    [D] Google Predoctoral Program (India) 2023
    Has anyone got any interview email? Did they start the interview process? submitted by /u/Around-star [link] [comments]  ( 42 min )
    [D] ImageNet2012 Advice
    I'm currently at the point in my PhD career that I've developed some extremely successful components of CNNs, architecture, activation, etc. Outperforming default choices on CIFAR10, CIFAR100, Flowers, Caltech101, and other smaller datasets. With how success the results currently are, we want to publish to a top tier conference, specifically NeurIPS this Spring, deadline around May 13th. However, we (me and my advisor) agree that to publish at NeurIPS, our developments need to be backed up by ImageNet. The problem is that we have never trained on ImageNet before (so no experience), and have a limited computational budget. Although our university personally owns 2 A100 40 GB GPUs that we can use, they are shared within the entire university, so a 2 day job takes about 1 week in queue (don't know if we can get the results in time by May). On the other hand, we also don't know if we can get a $2500 grant in time to use cloud resources. For those who have trained on ImageNet, what are some common pitfalls, best ways to transfer data, downloading the dataset, etc? If you performed it on the cloud, how did you do so? How long was your time to train? Expenses? Did you run each model once or three times? Early stopping using validation or test set? NOTE: We will only be using Tensorflow... submitted by /u/MyActualUserName99 [link] [comments]  ( 44 min )
    [D] Best large language model for Named Entity Extraction?
    I'd like to extract named entities, something like this: "[Text]: Microsoft (the word being a portmanteau of "microcomputer software") was founded by Bill Gates on April 4, 1975, to develop and sell BASIC interpreters for the Altair 8800. Steve Ballmer replaced Gates as CEO in 2000, and later envisioned a "devices and services" strategy. [Name]: Steve Ballmer [Position]: CEO [Company]: Microsoft " Tried it on GPT-Neox with 20b parameters with mixed success, is there anything better out there to try for a few-shot learning (without fine tuning)? submitted by /u/TankAttack [link] [comments]  ( 44 min )
    [R] ETLP: Event-based Three-factor Local Plasticity for online learning with neuromorphic hardware
    Neuromorphic perception with event-based sensors, asynchronous hardware and spiking neurons is showing promising results for real-time and energy-efficient inference in embedded systems. The next promise of brain-inspired computing is to enable adaptation to changes at the edge with online learning. However, the parallel and distributed architectures of neuromorphic hardware based on co-localized compute and memory imposes locality constraints to the on-chip learning rules. We propose in this work the Event-based Three-factor Local Plasticity (ETLP) rule that uses (1) the pre-synaptic spike trace, (2) the post-synaptic membrane voltage and (3) a third factor in the form of projected labels with no error calculation, that also serve as update triggers. We apply ETLP with feedforward and recurrent spiking neural networks on visual and auditory event-based pattern recognition, and compare it to Back-Propagation Through Time (BPTT) and eProp. We show a competitive performance in accuracy with a clear advantage in the computational complexity for ETLP. We also show that when using local plasticity, threshold adaptation in spiking neurons and a recurrent topology are necessary to learn spatio-temporal patterns with a rich temporal structure. Finally, we provide a proof of concept hardware implementation of ETLP on FPGA to highlight the simplicity of its computational primitives and how they can be mapped into neuromorphic hardware for online learning with low-energy consumption and real-time interaction. Full paper: https://arxiv.org/abs/2301.08281 submitted by /u/ferquinve [link] [comments]  ( 43 min )
    [D] MusicLM: Generating Music From Text
    How far do you think this can go? Is it a memorization machine or can it create new songs? https://google-research.github.io/seanet/musiclm/examples/ submitted by /u/carlthome [link] [comments]  ( 44 min )
    [R] SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
    Large Language Models (LLMs) from the Generative Pretrained Transformer (GPT) family have shown remarkable performance on a wide range of tasks, but are difficult to deploy because of their massive size and computational costs. For instance, the top-performing GPT-175B model has 175 billion parameters, which total at least 320GB (counting multiples of 1024) of storage in half-precision (FP16) format, leading it to require at least five A100 GPUs with 80GB of memory each for inference. It is therefore natural that there has been significant interest in reducing these costs via model compression. To date, virtually all existing GPT compression approaches have focused on quantization, that is, reducing the precision of the numerical representation of individual weights. A complementary approa…  ( 47 min )
    [D] Meta AI Residency 2023
    Now that apps are closed did anyone hear back yet? please follow this thread and update your status below, tbh I don't really think I have much of a chance but I'm excited none the less. ​ to follow a thread please press the bell on the top right submitted by /u/BeautyInUgly [link] [comments]  ( 42 min )
    [D] Moving away from Unicode for more equal token representation across global languages?
    Edit: as has been explained in the comments, unicode is not the issue so much as the byte-pair encoding scheme, which artificially limits the vocabulary size of the model and leads to less common language using more tokens. I'd like to discuss the impacts of increasing the vocabulary size on transformer model computational requirements. Many languages, like Chinese, Japanese Kanji, Korean, Telugu, etc use complex logograms to represent words and concepts. Unfortunately, these languages are severely "punished" in GPT3 because they are expensive to tokenize due to the way unicode represents them. Instead of unicode representing them as a single code point, logograms are typically represented as a sum of multiple graphemes, meaning that multiple unicode code points underlie their descriptio…  ( 49 min )
    [D] Why are there no End2End Speech Recognition models using the same Encoder-Decoder learning process as BART (no CTC) ?
    I'm new to CTC. After learning about CTC and its application in End2End training for Speech Recognition, I figured that if we want to generate a target sequence (transcript), given a source sequence features, we could use the vanilla Encoder-Decoder architecture in Transformer (also used in T5, BART, etc) alone, without the need of CTC, yet why people are only using CTC for End2End Speech Recoginition, or using hybrid of CTC and Decoder in some papers ? Thanks. submitted by /u/KarmaCut132 [link] [comments]  ( 43 min )
  • Open

    A Unified and Constructive Framework for the Universality of Neural Networks. (arXiv:2112.14877v3 [cs.LG] UPDATED)
    One of the reasons why many neural networks are capable of replicating complicated tasks or functions is their universal property. Though the past few decades have seen tremendous advances in theories of neural networks, a single constructive framework for neural network universality remains unavailable. This paper is the first effort to provide a unified and constructive framework for the universality of a large class of activation functions including most of existing ones. At the heart of the framework is the concept of neural network approximate identity (nAI). The main result is: {\em any nAI activation function is universal}. It turns out that most of existing activation functions are nAI, and thus universal in the space of continuous functions on compacta. The framework induces {\bf several advantages} over the contemporary counterparts. First, it is constructive with elementary means from functional analysis, probability theory, and numerical analysis. Second, it is the first unified attempt that is valid for most of existing activation functions. Third, as a by product, the framework provides the first universality proof for some of the existing activation functions including Mish, SiLU, ELU, GELU, and etc. Fourth, it provides new proofs for most activation functions. Fifth, it discovers new activation functions with guaranteed universality property. Sixth, for a given activation and error tolerance, the framework provides precisely the architecture of the corresponding one-hidden neural network with predetermined number of neurons, and the values of weights/biases. Seventh, the framework allows us to abstractly present the first universal approximation with favorable non-asymptotic rate.  ( 3 min )
    The Devil is the Classifier: Investigating Long Tail Relation Classification with Decoupling Analysis. (arXiv:2009.07022v1 [cs.LG] CROSS LISTED)
    Long-tailed relation classification is a challenging problem as the head classes may dominate the training phase, thereby leading to the deterioration of the tail performance. Existing solutions usually address this issue via class-balancing strategies, e.g., data re-sampling and loss re-weighting, but all these methods adhere to the schema of entangling learning of the representation and classifier. In this study, we conduct an in-depth empirical investigation into the long-tailed problem and found that pre-trained models with instance-balanced sampling already capture the well-learned representations for all classes; moreover, it is possible to achieve better long-tailed classification ability at low cost by only adjusting the classifier. Inspired by this observation, we propose a robust classifier with attentive relation routing, which assigns soft weights by automatically aggregating the relations. Extensive experiments on two datasets demonstrate the effectiveness of our proposed approach. Code and datasets are available in https://github.com/zjunlp/deepke.  ( 2 min )
    Contrastive Triple Extraction with Generative Transformer. (arXiv:2009.06207v8 [cs.CL] CROSS LISTED)
    Triple extraction is an essential task in information extraction for natural language processing and knowledge graph construction. In this paper, we revisit the end-to-end triple extraction task for sequence generation. Since generative triple extraction may struggle to capture long-term dependencies and generate unfaithful triples, we introduce a novel model, contrastive triple extraction with a generative transformer. Specifically, we introduce a single shared transformer module for encoder-decoder-based generation. To generate faithful results, we propose a novel triplet contrastive training object. Moreover, we introduce two mechanisms to further improve model performance (i.e., batch-wise dynamic attention-masking and triple-wise calibration). Experimental results on three datasets (i.e., NYT, WebNLG, and MIE) show that our approach achieves better performance than that of baselines.  ( 2 min )
    Interaction Modeling with Multiplex Attention. (arXiv:2208.10660v2 [cs.LG] UPDATED)
    Modeling multi-agent systems requires understanding how agents interact. Such systems are often difficult to model because they can involve a variety of types of interactions that layer together to drive rich social behavioral dynamics. Here we introduce a method for accurately modeling multi-agent systems. We present Interaction Modeling with Multiplex Attention (IMMA), a forward prediction model that uses a multiplex latent graph to represent multiple independent types of interactions and attention to account for relations of different strengths. We also introduce Progressive Layer Training, a training strategy for this architecture. We show that our approach outperforms state-of-the-art models in trajectory forecasting and relation inference, spanning three multi-agent scenarios: social navigation, cooperative task achievement, and team sports. We further demonstrate that our approach can improve zero-shot generalization and allows us to probe how different interactions impact agent behavior.  ( 2 min )
    Deep learning in a bilateral brain with hemispheric specialization. (arXiv:2209.06862v6 [q-bio.NC] UPDATED)
    The brains of all bilaterally symmetric animals on Earth are divided into left and right hemispheres. The anatomy and functionality of the hemispheres have a large degree of overlap, but there are asymmetries and they specialize to possess different attributes. Several studies have used computational models to mimic hemispheric asymmetries with a focus on reproducing human data on semantic and visual processing tasks. In this study, we aimed to understand how dual hemispheres could interact in a given task. We propose a bilateral artificial neural network that imitates a lateralization observed in nature: that the left hemisphere specializes in specificity and the right in generalities. We used two ResNet-9 convolutional neural networks with different training objectives and tested it on an image classification task. Our analysis found that the hemispheres represent complementary features that are exploited by a network head which implements a type of weighted attention. The bilateral architecture outperformed a range of baselines of similar representational capacity that don't exploit differential specialization. The results demonstrate the efficacy of bilateralism, contribute to an understanding of bilateralism in biological brains and the architecture serves as an inductive bias when designing new AI systems.  ( 2 min )
    Long-tail Relation Extraction via Knowledge Graph Embeddings and Graph Convolution Networks. (arXiv:1903.01306v1 [cs.IR] CROSS LISTED)
    We propose a distance supervised relation extraction approach for long-tailed, imbalanced data which is prevalent in real-world settings. Here, the challenge is to learn accurate "few-shot" models for classes existing at the tail of the class distribution, for which little data is available. Inspired by the rich semantic correlations between classes at the long tail and those at the head, we take advantage of the knowledge from data-rich classes at the head of the distribution to boost the performance of the data-poor classes at the tail. First, we propose to leverage implicit relational knowledge among class labels from knowledge graph embeddings and learn explicit relational knowledge using graph convolution networks. Second, we integrate that relational knowledge into relation extraction model by coarse-to-fine knowledge-aware attention mechanism. We demonstrate our results for a large-scale benchmark dataset which show that our approach significantly outperforms other baselines, especially for long-tail relations.  ( 2 min )
    Towards Dynamic Stability Assessment of Power Grid Topologies using Graph Neural Networks. (arXiv:2206.06369v3 [cs.LG] UPDATED)
    To mitigate climate change, the share of renewable energies in power production needs to be increased. Renewables introduce new challenges to power grids regarding the dynamic stability due to decentralization, reduced inertia and volatility in production. Since dynamic stability simulations are intractable and exceedingly expensive for large grids, graph neural networks (GNNs) are a promising method to reduce the computational effort of analyzing dynamic stability of power grids. We provide new datasets of dynamic stability of synthetic power grids and find that GNNs are surprisingly effective at predicting the highly non-linear targets from topological information only. Furthermore, we use GNNs to demonstrate the accurate identification of particularly vulnerable nodes in power grids, so-called troublemakers. Lastly, we find that GNNs trained on small grids generate accurate predictions on a large synthetic model of the Texan power grid, which illustrates the potential for real-world applications of the presented approach.  ( 2 min )
    Why the pseudo label based semi-supervised learning algorithm is effective?. (arXiv:2211.10039v2 [cs.LG] UPDATED)
    Recently, pseudo label based semi-supervised learning has achieved great success in many fields. The core idea of the pseudo label based semi-supervised learning algorithm is to use the model trained on the labeled data to generate pseudo labels on the unlabeled data, and then train a model to fit the previously generated pseudo labels. We give a theory analysis for why pseudo label based semi-supervised learning is effective in this paper. We mainly compare the generalization error of the model trained under two settings: (1) There are N labeled data. (2) There are N unlabeled data and a suitable initial model. Our analysis shows that, firstly, when the amount of unlabeled data tends to infinity, the pseudo label based semi-supervised learning algorithm can obtain model which have the same generalization error upper bound as model obtained by normally training in the condition of the amount of labeled data tends to infinity. More importantly, we prove that when the amount of unlabeled data is large enough, the generalization error upper bound of the model obtained by pseudo label based semi-supervised learning algorithm can converge to the optimal upper bound with linear convergence rate. We also give the lower bound on sampling complexity to achieve linear convergence rate. Our analysis contributes to understanding the empirical successes of pseudo label-based semi-supervised learning.  ( 2 min )
    ZJUKLAB at SemEval-2021 Task 4: Negative Augmentation with Language Model for Reading Comprehension of Abstract Meaning. (arXiv:2102.12828v3 [cs.CL] CROSS LISTED)
    This paper presents our systems for the three Subtasks of SemEval Task4: Reading Comprehension of Abstract Meaning (ReCAM). We explain the algorithms used to learn our models and the process of tuning the algorithms and selecting the best model. Inspired by the similarity of the ReCAM task and the language pre-training, we propose a simple yet effective technology, namely, negative augmentation with language model. Evaluation results demonstrate the effectiveness of our proposed approach. Our models achieve the 4th rank on both official test sets of Subtask 1 and Subtask 2 with an accuracy of 87.9% and an accuracy of 92.8%, respectively. We further conduct comprehensive model analysis and observe interesting error cases, which may promote future researches.  ( 2 min )
    Ultra-NeRF: Neural Radiance Fields for Ultrasound Imaging. (arXiv:2301.10520v1 [eess.IV])
    We present a physics-enhanced implicit neural representation (INR) for ultrasound (US) imaging that learns tissue properties from overlapping US sweeps. Our proposed method leverages a ray-tracing-based neural rendering for novel view US synthesis. Recent publications demonstrated that INR models could encode a representation of a three-dimensional scene from a set of two-dimensional US frames. However, these models fail to consider the view-dependent changes in appearance and geometry intrinsic to US imaging. In our work, we discuss direction-dependent changes in the scene and show that a physics-inspired rendering improves the fidelity of US image synthesis. In particular, we demonstrate experimentally that our proposed method generates geometrically accurate B-mode images for regions with ambiguous representation owing to view-dependent differences of the US images. We conduct our experiments using simulated B-mode US sweeps of the liver and acquired US sweeps of a spine phantom tracked with a robotic arm. The experiments corroborate that our method generates US frames that enable consistent volume compounding from previously unseen views. To the best of our knowledge, the presented work is the first to address view-dependent US image synthesis using INR.
    Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners. (arXiv:2108.13161v7 [cs.CL] CROSS LISTED)
    Large-scale pre-trained language models have contributed significantly to natural language processing by demonstrating remarkable abilities as few-shot learners. However, their effectiveness depends mainly on scaling the model parameters and prompt design, hindering their implementation in most real-world applications. This study proposes a novel pluggable, extensible, and efficient approach named DifferentiAble pRompT (DART), which can convert small language models into better few-shot learners without any prompt engineering. The main principle behind this approach involves reformulating potential natural language processing tasks into the task of a pre-trained language model and differentially optimizing the prompt template as well as the target label with backpropagation. Furthermore, the proposed approach can be: (i) Plugged to any pre-trained language models; (ii) Extended to widespread classification tasks. A comprehensive evaluation of standard NLP tasks demonstrates that the proposed approach achieves a better few-shot performance. Code is available in https://github.com/zjunlp/DART.
    PromptKG: A Prompt Learning Framework for Knowledge Graph Representation Learning and Application. (arXiv:2210.00305v1 [cs.CL] CROSS LISTED)
    Knowledge Graphs (KGs) often have two characteristics: heterogeneous graph structure and text-rich entity/relation information. KG representation models should consider graph structures and text semantics, but no comprehensive open-sourced framework is mainly designed for KG regarding informative text description. In this paper, we present PromptKG, a prompt learning framework for KG representation learning and application that equips the cutting-edge text-based methods, integrates a new prompt learning model and supports various tasks (e.g., knowledge graph completion, question answering, recommendation, and knowledge probing). PromptKG is publicly open-sourced at https://github.com/zjunlp/PromptKG with long-term technical support.
    Conceptualized Representation Learning for Chinese Biomedical Text Mining. (arXiv:2008.10813v1 [cs.CL] CROSS LISTED)
    Biomedical text mining is becoming increasingly important as the number of biomedical documents and web data rapidly grows. Recently, word representation models such as BERT has gained popularity among researchers. However, it is difficult to estimate their performance on datasets containing biomedical texts as the word distributions of general and biomedical corpora are quite different. Moreover, the medical domain has long-tail concepts and terminologies that are difficult to be learned via language models. For the Chinese biomedical text, it is more difficult due to its complex structure and the variety of phrase combinations. In this paper, we investigate how the recently introduced pre-trained language model BERT can be adapted for Chinese biomedical corpora and propose a novel conceptualized representation learning approach. We also release a new Chinese Biomedical Language Understanding Evaluation benchmark (\textbf{ChineseBLUE}). We examine the effectiveness of Chinese pre-trained models: BERT, BERT-wwm, RoBERTa, and our approach. Experimental results on the benchmark show that our approach could bring significant gain. We release the pre-trained model on GitHub: https://github.com/alibaba-research/ChineseBLUE.
    Normal vs. Adversarial: Salience-based Analysis of Adversarial Samples for Relation Extraction. (arXiv:2104.00312v4 [cs.CL] CROSS LISTED)
    Recent neural-based relation extraction approaches, though achieving promising improvement on benchmark datasets, have reported their vulnerability towards adversarial attacks. Thus far, efforts mostly focused on generating adversarial samples or defending adversarial attacks, but little is known about the difference between normal and adversarial samples. In this work, we take the first step to leverage the salience-based method to analyze those adversarial samples. We observe that salience tokens have a direct correlation with adversarial perturbations. We further find the adversarial perturbations are either those tokens not existing in the training set or superficial cues associated with relation labels. To some extent, our approach unveils the characters against adversarial samples. We release an open-source testbed, "DiagnoseAdv" in https://github.com/zjunlp/DiagnoseAdv.
    Reasoning Through Memorization: Nearest Neighbor Knowledge Graph Embeddings. (arXiv:2201.05575v2 [cs.CL] CROSS LISTED)
    Previous knowledge graph embedding approaches usually map entities to representations and utilize score functions to predict the target entities, yet they struggle to reason rare or emerging unseen entities. In this paper, we propose kNN-KGE, a new knowledge graph embedding approach with pre-trained language models, by linearly interpolating its entity distribution with k-nearest neighbors. We compute the nearest neighbors based on the distance in the entity embedding space from the knowledge store. Our approach can allow rare or emerging entities to be memorized explicitly rather than implicitly in model parameters. Experimental results demonstrate that our approach can improve inductive and transductive link prediction results and yield better performance for low-resource settings with only a few triples, which might be easier to reason via explicit memory. Code is available at https://github.com/zjunlp/KNN-KG.
    Data-Driven Certification of Neural Networks with Random Input Noise. (arXiv:2010.01171v2 [cs.LG] UPDATED)
    Methods to certify the robustness of neural networks in the presence of input uncertainty are vital in safety-critical settings. Most certification methods in the literature are designed for adversarial or worst-case inputs, but researchers have recently shown a need for methods that consider random input noise. In this paper, we examine the setting where inputs are subject to random noise coming from an arbitrary probability distribution. We propose a robustness certification method that lower-bounds the probability that network outputs are safe. This bound is cast as a chance-constrained optimization problem, which is then reformulated using input-output samples to make the optimization constraints tractable. We develop sufficient conditions for the resulting optimization to be convex, as well as on the number of samples needed to make the robustness bound hold with overwhelming probability. We show for a special case that the proposed optimization reduces to an intuitive closed-form solution. Case studies on synthetic, MNIST, and CIFAR-10 networks experimentally demonstrate that this method is able to certify robustness against various input noise regimes over larger uncertainty regions than prior state-of-the-art techniques.
    Logic-Based Explainability in Machine Learning. (arXiv:2211.00541v2 [cs.AI] UPDATED)
    The last decade witnessed an ever-increasing stream of successes in Machine Learning (ML). These successes offer clear evidence that ML is bound to become pervasive in a wide range of practical uses, including many that directly affect humans. Unfortunately, the operation of the most successful ML models is incomprehensible for human decision makers. As a result, the use of ML models, especially in high-risk and safety-critical settings is not without concern. In recent years, there have been efforts on devising approaches for explaining ML models. Most of these efforts have focused on so-called model-agnostic approaches. However, all model-agnostic and related approaches offer no guarantees of rigor, hence being referred to as non-formal. For example, such non-formal explanations can be consistent with different predictions, which renders them useless in practice. This paper overviews the ongoing research efforts on computing rigorous model-based explanations of ML models; these being referred to as formal explanations. These efforts encompass a variety of topics, that include the actual definitions of explanations, the characterization of the complexity of computing explanations, the currently best logical encodings for reasoning about different ML models, and also how to make explanations interpretable for human decision makers, among others.
    Disentangled Contrastive Learning for Learning Robust Textual Representations. (arXiv:2104.04907v2 [cs.CL] CROSS LISTED)
    Although the self-supervised pre-training of transformer models has resulted in the revolutionizing of natural language processing (NLP) applications and the achievement of state-of-the-art results with regard to various benchmarks, this process is still vulnerable to small and imperceptible permutations originating from legitimate inputs. Intuitively, the representations should be similar in the feature space with subtle input permutations, while large variations occur with different meanings. This motivates us to investigate the learning of robust textual representation in a contrastive manner. However, it is non-trivial to obtain opposing semantic instances for textual samples. In this study, we propose a disentangled contrastive learning method that separately optimizes the uniformity and alignment of representations without negative sampling. Specifically, we introduce the concept of momentum representation consistency to align features and leverage power normalization while conforming the uniformity. Our experimental results for the NLP benchmarks demonstrate that our approach can obtain better results compared with the baselines, as well as achieve promising improvements with invariance tests and adversarial attacks. The code is available in https://github.com/zxlzr/DCL.
    Multi-Agent Deep Reinforcement Learning for Efficient Passenger Delivery in Urban Air Mobility. (arXiv:2211.06890v2 [cs.MA] UPDATED)
    It has been considered that urban air mobility (UAM), also known as drone-taxi or electrical vertical takeoff and landing (eVTOL), will play a key role in future transportation. By putting UAM into practical future transportation, several benefits can be realized, i.e., (i) the total travel time of passengers can be reduced compared to traditional transportation and (ii) there is no environmental pollution and no special labor costs to operate the system because electric batteries will be used in UAM system. However, there are various dynamic and uncertain factors in the flight environment, i.e., passenger sudden service requests, battery discharge, and collision among UAMs. Therefore, this paper proposes a novel cooperative MADRL algorithm based on centralized training and distributed execution (CTDE) concepts for reliable and efficient passenger delivery in UAM networks. According to the performance evaluation results, we confirm that the proposed algorithm outperforms other existing algorithms in terms of the number of serviced passengers increase (30%) and the waiting time per serviced passenger decrease (26%).
    Self-Supervised Hierarchical Metrical Structure Modeling. (arXiv:2210.17183v2 [cs.SD] UPDATED)
    We propose a novel method to model hierarchical metrical structures for both symbolic music and audio signals in a self-supervised manner with minimal domain knowledge. The model trains and inferences on beat-aligned music signals and predicts an 8-layer hierarchical metrical tree from beat, measure to the section level. The training procedure does not require any hierarchical metrical labeling except for beats, purely relying on the nature of metrical regularity and inter-voice consistency as inductive biases. We show in experiments that the method achieves comparable performance with supervised baselines on multiple metrical structure analysis tasks on both symbolic music and audio signals. All demos, source code and pre-trained models are publicly available on GitHub.
    On Robustness and Bias Analysis of BERT-based Relation Extraction. (arXiv:2009.06206v5 [cs.CL] CROSS LISTED)
    Fine-tuning pre-trained models have achieved impressive performance on standard natural language processing benchmarks. However, the resultant model generalizability remains poorly understood. We do not know, for example, how excellent performance can lead to the perfection of generalization models. In this study, we analyze a fine-tuned BERT model from different perspectives using relation extraction. We also characterize the differences in generalization techniques according to our proposed improvements. From empirical experimentation, we find that BERT suffers a bottleneck in terms of robustness by way of randomizations, adversarial and counterfactual tests, and biases (i.e., selection and semantic). These findings highlight opportunities for future improvements. Our open-sourced testbed DiagnoseRE is available in \url{https://github.com/zjunlp/DiagnoseRE}.
    Kformer: Knowledge Injection in Transformer Feed-Forward Layers. (arXiv:2201.05742v2 [cs.CL] CROSS LISTED)
    Recent days have witnessed a diverse set of knowledge injection models for pre-trained language models (PTMs); however, most previous studies neglect the PTMs' own ability with quantities of implicit knowledge stored in parameters. A recent study has observed knowledge neurons in the Feed Forward Network (FFN), which are responsible for expressing factual knowledge. In this work, we propose a simple model, Kformer, which takes advantage of the knowledge stored in PTMs and external knowledge via knowledge injection in Transformer FFN layers. Empirically results on two knowledge-intensive tasks, commonsense reasoning (i.e., SocialIQA) and medical question answering (i.e., MedQA-USMLE), demonstrate that Kformer can yield better performance than other knowledge injection technologies such as concatenation or attention-based injection. We think the proposed simple model and empirical findings may be helpful for the community to develop more powerful knowledge injection methods. Code available in https://github.com/zjunlp/Kformer.
    Non-Asymptotic Analysis of a UCB-based Top Two Algorithm. (arXiv:2210.05431v2 [stat.ML] UPDATED)
    A Top Two sampling rule for bandit identification is a method which selects the next arm to sample from among two candidate arms, a leader and a challenger. Due to their simplicity and good empirical performance, they have received increased attention in recent years. However, for fixed-confidence best arm identification, theoretical guarantees for Top Two methods have only been obtained in the asymptotic regime, when the error level vanishes. In this paper, we derive the first non-asymptotic upper bound on the expected sample complexity of a Top Two algorithm, which holds for any error level. Our analysis highlights sufficient properties for a regret minimization algorithm to be used as leader. These properties are satisfied by the UCB algorithm, and our proposed UCB-based Top Two algorithm simultaneously enjoys non-asymptotic guarantees and competitive empirical performance.
    On the Probability of Necessity and Sufficiency of Explaining Graph Neural Networks: A Lower Bound Optimization Approach. (arXiv:2212.07056v2 [cs.LG] UPDATED)
    The explainability of Graph Neural Networks (GNNs) is critical to various GNN applications but remains an open challenge. A convincing explanation should be both necessary and sufficient simultaneously. However, existing GNN explaining approaches focus on only one of the two aspects, necessity or sufficiency, or a heuristic trade-off between the two. Theoretically, the Probability of Necessity and Sufficiency (PNS) can be applied to search for the most necessary and sufficient explanation since it can mathematically quantify the necessity and sufficiency of an explanation. Nevertheless, the difficulty of obtaining PNS due to non-monotonicity and the challenge of counterfactual estimation limit its wide use. To address the non-identifiability of PNS, we resort to a lower bound of PNS that can be optimized via counterfactual estimation, and propose Necessary and Sufficient Explanation for GNN (NSEG) via optimizing that lower bound. Specifically, we employ nearest neighbor matching to generate counterfactual samples and leverage continuous masks with a sampling strategy to optimize the lower bound. Empirical study shows that NSEG achieves excellent performance in generating the most necessary and sufficient explanations among a series of state-of-the-art methods.
    Learning to Ask for Data-Efficient Event Argument Extraction. (arXiv:2110.00479v1 [cs.CL] CROSS LISTED)
    Event argument extraction (EAE) is an important task for information extraction to discover specific argument roles. In this study, we cast EAE as a question-based cloze task and empirically analyze fixed discrete token template performance. As generating human-annotated question templates is often time-consuming and labor-intensive, we further propose a novel approach called "Learning to Ask," which can learn optimized question templates for EAE without human annotations. Experiments using the ACE-2005 dataset demonstrate that our method based on optimized questions achieves state-of-the-art performance in both the few-shot and supervised settings.
    Communication-Efficient Diffusion Strategy for Performance Improvement of Federated Learning with Non-IID Data. (arXiv:2207.07493v3 [cs.DC] UPDATED)
    Federated learning (FL) is a novel learning paradigm that addresses the privacy leakage challenge of centralized learning. However, in FL, users with non-independent and identically distributed (non-IID) characteristics can deteriorate the performance of the global model. Specifically, the global model suffers from the weight divergence challenge owing to non-IID data. To address the aforementioned challenge, we propose a novel diffusion strategy of the machine learning (ML) model (FedDif) to maximize the FL performance with non-IID data. In FedDif, users spread local models to neighboring users over D2D communications. FedDif enables the local model to experience different distributions before parameter aggregation. Furthermore, we theoretically demonstrate that FedDif can circumvent the weight divergence challenge. On the theoretical basis, we propose the communication-efficient diffusion strategy of the ML model, which can determine the trade-off between the learning performance and communication cost based on auction theory. The performance evaluation results show that FedDif improves the test accuracy of the global model by 10.37% compared to the baseline FL with non-IID settings. Moreover, FedDif improves the number of consumed sub-frames by 1.28 to 2.85 folds to the latest methods except for the model compression scheme. FedDif also improves the number of transmitted models by 1.43 to 2.67 folds to the latest methods.
    Relation Adversarial Network for Low Resource Knowledge Graph Completion. (arXiv:1911.03091v6 [cs.CL] CROSS LISTED)
    Knowledge Graph Completion (KGC) has been proposed to improve Knowledge Graphs by filling in missing connections via link prediction or relation extraction. One of the main difficulties for KGC is a low resource problem. Previous approaches assume sufficient training triples to learn versatile vectors for entities and relations, or a satisfactory number of labeled sentences to train a competent relation extraction model. However, low resource relations are very common in KGs, and those newly added relations often do not have many known samples for training. In this work, we aim at predicting new facts under a challenging setting where only limited training instances are available. We propose a general framework called Weighted Relation Adversarial Network, which utilizes an adversarial procedure to help adapt knowledge/features learned from high resource relations to different but related low resource relations. Specifically, the framework takes advantage of a relation discriminator to distinguish between samples from different relations, and help learn relation-invariant features more transferable from source relations to target relations. Experimental results show that the proposed approach outperforms previous methods regarding low resource settings for both link prediction and relation extraction.
    Schema-aware Reference as Prompt Improves Data-Efficient Relational Triple and Event Extraction. (arXiv:2210.10709v3 [cs.CL] CROSS LISTED)
    Information Extraction, which aims to extract structural relational triple or event from unstructured texts, often suffers from data scarcity issues. With the development of pre-trained language models, many prompt-based approaches to data-efficient information extraction have been proposed and achieved impressive performance. However, existing prompt learning methods for information extraction are still susceptible to several potential limitations: (i) semantic gap between natural language and output structure knowledge with pre-defined schema; (ii) representation learning with locally individual instances limits the performance given the insufficient features. In this paper, we propose a novel approach of schema-aware Reference As Prompt (RAP), which dynamically leverage schema and knowledge inherited from global (few-shot) training data for each sample. Specifically, we propose a schema-aware reference store, which unifies symbolic schema and relevant textual instances. Then, we employ a dynamic reference integration module to retrieve pertinent knowledge from the datastore as prompts during training and inference. Experimental results demonstrate that RAP can be plugged into various existing models and outperforms baselines in low-resource settings on four datasets of relational triple extraction and event extraction. In addition, we provide comprehensive empirical ablations and case analysis regarding different types and scales of knowledge in order to better understand the mechanisms of RAP. Code is available in https://github.com/zjunlp/RAP.
    LOGEN: Few-shot Logical Knowledge-Conditioned Text Generation with Self-training. (arXiv:2112.01404v2 [cs.CL] CROSS LISTED)
    Natural language generation from structured data mainly focuses on surface-level descriptions, suffering from uncontrollable content selection and low fidelity. Previous works leverage logical forms to facilitate logical knowledge-conditioned text generation. Though achieving remarkable progress, they are data-hungry, which makes the adoption for real-world applications challenging with limited data. To this end, this paper proposes a unified framework for logical knowledge-conditioned text generation in the few-shot setting. With only a few seeds logical forms (e.g., 20/100 shot), our approach leverages self-training and samples pseudo logical forms based on content and structure consistency. Experimental results demonstrate that our approach can obtain better few-shot performance than baselines.
    Document-level Relation Extraction as Semantic Segmentation. (arXiv:2106.03618v2 [cs.CL] CROSS LISTED)
    Document-level relation extraction aims to extract relations among multiple entity pairs from a document. Previously proposed graph-based or transformer-based models utilize the entities independently, regardless of global information among relational triples. This paper approaches the problem by predicting an entity-level relation matrix to capture local and global information, parallel to the semantic segmentation task in computer vision. Herein, we propose a Document U-shaped Network for document-level relation extraction. Specifically, we leverage an encoder module to capture the context information of entities and a U-shaped segmentation module over the image-style feature map to capture global interdependency among triples. Experimental results show that our approach can obtain state-of-the-art performance on three benchmark datasets DocRED, CDR, and GDA.
    Signature Methods in Machine Learning. (arXiv:2206.14674v3 [stat.ML] UPDATED)
    Signature-based techniques give mathematical insight into the interactions between complex streams of evolving data. These insights can be quite naturally translated into numerical approaches to understanding streamed data, and perhaps because of their mathematical precision, have proved useful in analysing streamed data in situations where the data is irregular, and not stationary, and the dimension of the data and the sample sizes are both moderate. Understanding streamed multi-modal data is exponential: a word in $n$ letters from an alphabet of size $d$ can be any one of $d^n$ messages. Signatures remove the exponential amount of noise that arises from sampling irregularity, but an exponential amount of information still remain. This survey aims to stay in the domain where that exponential scaling can be managed directly. Scalability issues are an important challenge in many problems but would require another survey article and further ideas. This survey describes a range of contexts where the data sets are small enough to remove the possibility of massive machine learning, and the existence of small sets of context free and principled features can be used effectively. The mathematical nature of the tools can make their use intimidating to non-mathematicians. The examples presented in this article are intended to bridge this communication gap and provide tractable working examples drawn from the machine learning context. Notebooks are available online for several of these examples. This survey builds on the earlier paper of Ilya Chevryev and Andrey Kormilitzin which had broadly similar aims at an earlier point in the development of this machinery. This article illustrates how the theoretical insights offered by signatures are simply realised in the analysis of application data in a way that is largely agnostic to the data type.
    Variance-Reduced Conservative Policy Iteration. (arXiv:2212.06283v2 [cs.LG] UPDATED)
    We study the sample complexity of reducing reinforcement learning to a sequence of empirical risk minimization problems over the policy space. Such reductions-based algorithms exhibit local convergence in the function space, as opposed to the parameter space for policy gradient algorithms, and thus are unaffected by the possibly non-linear or discontinuous parameterization of the policy class. We propose a variance-reduced variant of Conservative Policy Iteration that improves the sample complexity of producing a $\varepsilon$-functional local optimum from $O(\varepsilon^{-4})$ to $O(\varepsilon^{-3})$. Under state-coverage and policy-completeness assumptions, the algorithm enjoys $\varepsilon$-global optimality after sampling $O(\varepsilon^{-2})$ times, improving upon the previously established $O(\varepsilon^{-3})$ sample requirement.
    Linear TreeShap. (arXiv:2209.08192v2 [cs.LG] UPDATED)
    Decision trees are well-known due to their ease of interpretability. To improve accuracy, we need to grow deep trees or ensembles of trees. These are hard to interpret, offsetting their original benefits. Shapley values have recently become a popular way to explain the predictions of tree-based machine learning models. It provides a linear weighting to features independent of the tree structure. The rise in popularity is mainly due to TreeShap, which solves a general exponential complexity problem in polynomial time. Following extensive adoption in the industry, more efficient algorithms are required. This paper presents a more efficient and straightforward algorithm: Linear TreeShap. Like TreeShap, Linear TreeShap is exact and requires the same amount of memory.
    On the Semi-supervised Expectation Maximization. (arXiv:2211.00537v2 [cs.LG] UPDATED)
    The Expectation Maximization (EM) algorithm is widely used as an iterative modification to maximum likelihood estimation when the data is incomplete. We focus on a semi-supervised case to learn the model from labeled and unlabeled samples. Existing work in the semi-supervised case has focused mainly on performance rather than convergence guarantee, however we focus on the contribution of the labeled samples to the convergence rate. The analysis clearly demonstrates how the labeled samples improve the convergence rate for the exponential family mixture model. In this case, we assume that the population EM (EM with unlimited data) is initialized within the neighborhood of global convergence for the population EM that consists solely of samples that have not been labeled. The analysis for the labeled samples provides a comprehensive description of the convergence rate for the Gaussian mixture model. In addition, we extend the findings for labeled samples and offer an alternative proof for the population EM's convergence rate with unlabeled samples for the symmetric mixture of two Gaussians.
    A Sequential Deep Learning Algorithm for Sampled Mixed-integer Optimisation Problems. (arXiv:2301.10703v1 [math.OC])
    Mixed-integer optimisation problems can be computationally challenging. Here, we introduce and analyse two efficient algorithms with a specific sequential design that are aimed at dealing with sampled problems within this class. At each iteration step of both algorithms, we first test the feasibility of a given test solution for each and every constraint associated with the sampled optimisation at hand, while also identifying those constraints that are violated. Subsequently, an optimisation problem is constructed with a constraint set consisting of the current basis -- namely the smallest set of constraints that fully specifies the current test solution -- as well as constraints related to a limited number of the identified violating samples. We show that both algorithms exhibit finite-time convergence towards the optimal solution. Algorithm 2 features a neural network classifier that notably improves the computational performance compared to Algorithm 1. We establish quantitatively the efficacy of these algorithms by means of three numerical tests: robust optimal power flow, robust unit commitment, and robust random mixed-integer linear program.
    Obstacle Identification and Ellipsoidal Decomposition for Fast Motion Planning in Unknown Dynamic Environments. (arXiv:2209.14233v2 [cs.RO] UPDATED)
    Collision avoidance in the presence of dynamic obstacles in unknown environments is one of the most critical challenges for unmanned systems. In this paper, we present a method that identifies obstacles in terms of ellipsoids to estimate linear and angular obstacle velocities. Our proposed method is based on the idea of any object can be approximately expressed by ellipsoids. To achieve this, we propose a method based on variational Bayesian estimation of Gaussian mixture model, the Kyachiyan algorithm, and a refinement algorithm. Our proposed method does not require knowledge of the number of clusters and can operate in real-time, unlike existing optimization-based methods. In addition, we define an ellipsoid-based feature vector to match obstacles given two timely close point frames. Our method can be applied to any environment with static and dynamic obstacles, including the ones with rotating obstacles. We compare our algorithm with other clustering methods and show that when coupled with a trajectory planner, the overall system can efficiently traverse unknown environments in the presence of dynamic obstacles.
    A General Stochastic Optimization Framework for Convergence Bidding. (arXiv:2210.06543v3 [math.OC] UPDATED)
    Convergence (virtual) bidding is an important part of two-settlement electric power markets as it can effectively reduce discrepancies between the day-ahead and real-time markets. Consequently, there is extensive research into the bidding strategies of virtual participants aiming to obtain optimal bids to submit to the day-ahead market. In this paper, we introduce a price-based general stochastic optimization framework to obtain optimal convergence bid curves. Within this framework, we develop a computationally tractable linear programming-based optimization model, which produces bid prices and volumes simultaneously. We also show that different approximations and simplifications in the general model lead naturally to state-of-the-art convergence bidding approaches, such as self-scheduling and opportunistic approaches. Our general framework also provides a straightforward way to compare the performance of these models, which is demonstrated by numerical experiments on the California (CAISO) market.
    Adversarial De-confounding in Individualised Treatment Effects Estimation. (arXiv:2210.10530v3 [cs.LG] UPDATED)
    Observational studies have recently received significant attention from the machine learning community due to the increasingly available non-experimental observational data and the limitations of the experimental studies, such as considerable cost, impracticality, small and less representative sample sizes, etc. In observational studies, de-confounding is a fundamental problem of individualised treatment effects (ITE) estimation. This paper proposes disentangled representations with adversarial training to selectively balance the confounders in the binary treatment setting for the ITE estimation. The adversarial training of treatment policy selectively encourages treatment-agnostic balanced representations for the confounders and helps to estimate the ITE in the observational studies via counterfactual inference. Empirical results on synthetic and real-world datasets, with varying degrees of confounding, prove that our proposed approach improves the state-of-the-art methods in achieving lower error in the ITE estimation.
    Context-aware Deep Model for Entity Recommendation in Search Engine at Alibaba. (arXiv:1909.04493v1 [cs.IR] CROSS LISTED)
    Entity recommendation, providing search users with an improved experience via assisting them in finding related entities for a given query, has become an indispensable feature of today's search engines. Existing studies typically only consider the queries with explicit entities. They usually fail to handle complex queries that without entities, such as "what food is good for cold weather", because their models could not infer the underlying meaning of the input text. In this work, we believe that contexts convey valuable evidence that could facilitate the semantic modeling of queries, and take them into consideration for entity recommendation. In order to better model the semantics of queries and entities, we learn the representation of queries and entities jointly with attentive deep neural networks. We evaluate our approach using large-scale, real-world search logs from a widely used commercial Chinese search engine. Our system has been deployed in ShenMa Search Engine and you can fetch it in UC Browser of Alibaba. Results from online A/B test suggest that the impression efficiency of click-through rate increased by 5.1% and page view increased by 5.5%.
    Convergence of Random Reshuffling Under The Kurdyka-{\L}ojasiewicz Inequality. (arXiv:2110.04926v4 [math.OC] UPDATED)
    We study the random reshuffling (RR) method for smooth nonconvex optimization problems with a finite-sum structure. Though this method is widely utilized in practice such as the training of neural networks, its convergence behavior is only understood in several limited settings. In this paper, under the well-known Kurdyka-Lojasiewicz (KL) inequality, we establish strong limit-point convergence results for RR with appropriate diminishing step sizes, namely, the whole sequence of iterates generated by RR is convergent and converges to a single stationary point in an almost sure sense. In addition, we derive the corresponding rate of convergence, depending on the KL exponent and the suitably selected diminishing step sizes. When the KL exponent lies in $[0,\frac12]$, the convergence is at a rate of $\mathcal{O}(t^{-1})$ with $t$ counting the iteration number. When the KL exponent belongs to $(\frac12,1)$, our derived convergence rate is of the form $\mathcal{O}(t^{-q})$ with $q\in (0,1)$ depending on the KL exponent. The standard KL inequality-based convergence analysis framework only applies to algorithms with a certain descent property. We conduct a novel convergence analysis for the non-descent RR method with diminishing step sizes based on the KL inequality, which generalizes the standard KL framework. We summarize our main steps and core ideas in an informal analysis framework, which is of independent interest. As a direct application of this framework, we also establish similar strong limit-point convergence results for the reshuffled proximal point method.
    A systematic review of biologically-informed deep learning models for cancer: fundamental trends for encoding and interpreting oncology data. (arXiv:2207.00812v3 [q-bio.QM] UPDATED)
    There is an increasing interest in the use of Deep Learning (DL) based methods as a supporting analytical framework in oncology. However, most direct applications of DL will deliver models with limited transparency and explainability, which constrain their deployment in biomedical settings. This systematic review discusses DL models used to support inference in cancer biology with a particular emphasis on multi-omics analysis. It focuses on how existing models address the need for better dialogue with prior knowledge, biological plausibility and interpretability, fundamental properties in the biomedical domain. For this, we retrieved and analyzed 42 studies focusing on emerging architectural and methodological advances, the encoding of biological domain knowledge and the integration of explainability methods. We discuss the recent evolutionary arch of DL models in the direction of integrating prior biological relational and network knowledge to support better generalisation (e.g. pathways or Protein-Protein-Interaction networks) and interpretability. This represents a fundamental functional shift towards models which can integrate mechanistic and statistical inference aspects. We introduce a concept of bio-centric interpretability and according to its taxonomy, we discuss representational methodologies for the integration of domain prior knowledge in such models. The paper provides a critical outlook into contemporary methods for explainability and interpretabiltiy used in DL for cancer. The analysis points in the direction of a convergence between encoding prior knowledge and improved interpretability. We introduce bio-centric interpretability which is an important step towards formalisation of biological interpretability of DL models and developing methods that are less problem- or application-specific.
    Asymptotic Analysis of Deep Residual Networks. (arXiv:2212.08199v2 [cs.LG] UPDATED)
    We investigate the asymptotic properties of deep Residual networks (ResNets) as the number of layers increases. We first show the existence of scaling regimes for trained weights markedly different from those implicitly assumed in the neural ODE literature. We study the convergence of the hidden state dynamics in these scaling regimes, showing that one may obtain an ODE, a stochastic differential equation (SDE) or neither of these. In particular, our findings point to the existence of a diffusive regime in which the deep network limit is described by a class of stochastic differential equations (SDEs). Finally, we derive the corresponding scaling limits for the backpropagation dynamics.
    To be or not to be stable, that is the question: understanding neural networks for inverse problems. (arXiv:2211.13692v2 [math.NA] UPDATED)
    The solution of linear inverse problems arising, for example, in signal and image processing is a challenging problem since the ill-conditioning amplifies the noise on the data. Recently introduced algorithms based on deep learning overwhelm the more traditional model-based approaches, but they typically suffer from instability with respect to data perturbation. In this paper, we theoretically analyze the trade-off between neural networks stability and accuracy in the solution of linear inverse problems. Moreover, we propose different supervised and unsupervised solutions to increase network stability that maintains good accuracy by inheriting, in the network training, regularization from a model-based iterative scheme. Extensive numerical experiments on image deblurring confirm the theoretical results and the effectiveness of the proposed deep learning-based solutions to stably solve noisy inverse problems.
    Connecting metrics for shape-texture knowledge in computer vision. (arXiv:2301.10608v1 [cs.CV])
    Modern artificial neural networks, including convolutional neural networks and vision transformers, have mastered several computer vision tasks, including object recognition. However, there are many significant differences between the behavior and robustness of these systems and of the human visual system. Deep neural networks remain brittle and susceptible to many changes in the image that do not cause humans to misclassify images. Part of this different behavior may be explained by the type of features humans and deep neural networks use in vision tasks. Humans tend to classify objects according to their shape while deep neural networks seem to rely mostly on texture. Exploring this question is relevant, since it may lead to better performing neural network architectures and to a better understanding of the workings of the vision system of primates. In this work, we advance the state of the art in our understanding of this phenomenon, by extending previous analyses to a much larger set of deep neural network architectures. We found that the performance of models in image classification tasks is highly correlated with their shape bias measured at the output and penultimate layer. Furthermore, our results showed that the number of neurons that represent shape and texture are strongly anti-correlated, thus providing evidence that there is competition between these two types of features. Finally, we observed that while in general there is a correlation between performance and shape bias, there are significant variations between architecture families.
    FewShotTextGCN: K-hop neighborhood regularization for few-shot learning on graphs. (arXiv:2301.10481v1 [cs.CL])
    We present FewShotTextGCN, a novel method designed to effectively utilize the properties of word-document graphs for improved learning in low-resource settings. We introduce K-hop Neighbourhood Regularization, a regularizer for heterogeneous graphs, and show that it stabilizes and improves learning when only a few training samples are available. We furthermore propose a simplification in the graph-construction method, which results in a graph that is $\sim$7 times less dense and yields better performance in little-resource settings while performing on par with the state of the art in high-resource settings. Finally, we introduce a new variant of Adaptive Pseudo-Labeling tailored for word-document graphs. When using as little as 20 samples for training, we outperform a strong TextGCN baseline with 17% in absolute accuracy on average over eight languages. We demonstrate that our method can be applied to document classification without any language model pretraining on a wide range of typologically diverse languages while performing on par with large pretrained language models.
    User-Interactive Offline Reinforcement Learning. (arXiv:2205.10629v2 [cs.LG] UPDATED)
    Offline reinforcement learning algorithms still lack trust in practice due to the risk that the learned policy performs worse than the original policy that generated the dataset or behaves in an unexpected way that is unfamiliar to the user. At the same time, offline RL algorithms are not able to tune their most important hyperparameter - the proximity of the learned policy to the original policy. We propose an algorithm that allows the user to tune this hyperparameter at runtime, thereby addressing both of the above mentioned issues simultaneously. This allows users to start with the original behavior and grant successively greater deviation, as well as stopping at any time when the policy deteriorates or the behavior is too far from the familiar one.
    Near-Optimal No-Regret Learning for Correlated Equilibria in Multi-Player General-Sum Games. (arXiv:2111.06008v3 [cs.LG] UPDATED)
    Recently, Daskalakis, Fishelson, and Golowich (DFG) (NeurIPS`21) showed that if all agents in a multi-player general-sum normal-form game employ Optimistic Multiplicative Weights Update (OMWU), the external regret of every player is $O(\textrm{polylog}(T))$ after $T$ repetitions of the game. We extend their result from external regret to internal regret and swap regret, thereby establishing uncoupled learning dynamics that converge to an approximate correlated equilibrium at the rate of $\tilde{O}(T^{-1})$. This substantially improves over the prior best rate of convergence for correlated equilibria of $O(T^{-3/4})$ due to Chen and Peng (NeurIPS`20), and it is optimal -- within the no-regret framework -- up to polylogarithmic factors in $T$. To obtain these results, we develop new techniques for establishing higher-order smoothness for learning dynamics involving fixed point operations. Specifically, we establish that the no-internal-regret learning dynamics of Stoltz and Lugosi (Mach Learn`05) are equivalently simulated by no-external-regret dynamics on a combinatorial space. This allows us to trade the computation of the stationary distribution on a polynomial-sized Markov chain for a (much more well-behaved) linear transformation on an exponential-sized set, enabling us to leverage similar techniques as DFG to near-optimally bound the internal regret. Moreover, we establish an $O(\textrm{polylog}(T))$ no-swap-regret bound for the classic algorithm of Blum and Mansour (BM) (JMLR`07). We do so by introducing a technique based on the Cauchy Integral Formula that circumvents the more limited combinatorial arguments of DFG. In addition to shedding clarity on the near-optimal regret guarantees of BM, our arguments provide insights into the various ways in which the techniques by DFG can be extended and leveraged in the analysis of more involved learning algorithms.
    Distributed Control of Partial Differential Equations Using Convolutional Reinforcement Learning. (arXiv:2301.10737v1 [cs.LG])
    We present a convolutional framework which significantly reduces the complexity and thus, the computational effort for distributed reinforcement learning control of dynamical systems governed by partial differential equations (PDEs). Exploiting translational invariances, the high-dimensional distributed control problem can be transformed into a multi-agent control problem with many identical, uncoupled agents. Furthermore, using the fact that information is transported with finite velocity in many cases, the dimension of the agents' environment can be drastically reduced using a convolution operation over the state space of the PDE. In this setting, the complexity can be flexibly adjusted via the kernel width or by using a stride greater than one. Moreover, scaling from smaller to larger systems -- or the transfer between different domains -- becomes a straightforward task requiring little effort. We demonstrate the performance of the proposed framework using several PDE examples with increasing complexity, where stabilization is achieved by training a low-dimensional deep deterministic policy gradient agent using minimal computing resources.
    Tighter Bounds on the Expressivity of Transformer Encoders. (arXiv:2301.10743v1 [cs.LG])
    Characterizing neural networks in terms of better-understood formal systems has the potential to yield new insights into the power and limitations of these networks. Doing so for transformers remains an active area of research. Bhattamishra and others have shown that transformer encoders are at least as expressive as a certain kind of counter machine, while Merrill and Sabharwal have shown that fixed-precision transformer encoders recognize only languages in uniform $TC^0$. We connect and strengthen these results by identifying a variant of first-order logic with counting quantifiers that is simultaneously an upper bound for fixed-precision transformer encoders and a lower bound for transformer encoders. This brings us much closer than before to an exact characterization of the languages that transformer encoders recognize.
    Spatio-Temporal Graph Neural Networks: A Survey. (arXiv:2301.10569v1 [cs.LG])
    Graph Neural Networks have gained huge interest in the past few years. These powerful algorithms expanded deep learning models to non-Euclidean space and were able to achieve state of art performance in various applications including recommender systems and social networks. However, this performance is based on static graph structures assumption which limits the Graph Neural Networks performance when the data varies with time. Temporal Graph Neural Networks are extension of Graph Neural Networks that takes the time factor into account. Recently, various Temporal Graph Neural Network algorithms were proposed and achieved superior performance compared to other deep learning algorithms in several time dependent applications. This survey discusses interesting topics related to Spatio temporal Graph Neural Networks, including algorithms, application, and open challenges.
    Evaluation of the syllables pronunciation quality in speech rehabilitation through the solution of the classification problem. (arXiv:2301.10585v1 [cs.LG])
    The solution of the problem of assessing the quality of the pronunciation of syllables during speech rehabilitation after surgical treatment of oncological diseases of the organs of the speech-forming tract is considered in the work. The assessment is carried out by solving the problem of classifying syllables into two classes: before and immediately after surgical treatment. A classifier is built on the basis of the LSTM neural network and trained on the records before the operation and immediately after it, before the start of speech rehabilitation. The measure of assessing the quality of syllables pronunciation in the process of rehabilitation is the metric of belonging to the class before the operation. A study is being made of the influence of taking into account problematic phonemes, the gender of the patient, his individual characteristics on the resulting estimates of the quality of pronunciation. A comparison with existing types of syllable pronunciation quality assessments is carried out, recommendations are given for the practical application of the resulting new class of pronunciation quality assessments.
    Prediction of COVID-19 by Its Variants using Multivariate Data-driven Deep Learning Models. (arXiv:2301.10616v1 [cs.CE])
    The Coronavirus Disease 2019 or the COVID-19 pandemic has swept almost all parts of the world since the first case was found in Wuhan, China, in December 2019. With the increasing number of COVID-19 cases in the world, SARS-CoV-2 has mutated into various variants. Given the increasingly dangerous conditions of the pandemic, it is crucial to know when the pandemic will stop by predicting confirmed cases of COVID-19. Therefore, many studies have raised COVID-19 as a case study to overcome the ongoing pandemic using the Deep Learning method, namely LSTM, with reasonably accurate results and small error values. LSTM training is used to predict confirmed cases of COVID-19 based on variants that have been identified using ECDC's COVID-19 dataset containing confirmed cases of COVID-19 that have been identified from 30 countries in Europe. Tests were conducted using the LSTM and BiLSTM models with the addition of RNN as comparisons on hidden size and layer size. The obtained result showed that in testing hidden sizes 25, 50, 75 to 100, the RNN model provided better results, with the minimum MSE value of 0.01 and the RMSE value of 0.012 for B.1.427/B.1.429 variant with hidden size 100. In further testing of layer sizes 2, 3, 4, and 5, the result shows that the BiLSTM model provided better results, with minimum MSE value of 0.01 and the RMSE of 0.01 for the B.1.427/B.1.429 variant with hidden size 100 and layer size 2.
    Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs. (arXiv:2106.02684v3 [cs.LG] UPDATED)
    We address the issue of safety in reinforcement learning. We pose the problem in an episodic framework of a constrained Markov decision process. Existing results have shown that it is possible to achieve a reward regret of $\tilde{\mathcal{O}}(\sqrt{K})$ while allowing an $\tilde{\mathcal{O}}(\sqrt{K})$ constraint violation in $K$ episodes. A critical question that arises is whether it is possible to keep the constraint violation even smaller. We show that when a strictly safe policy is known, then one can confine the system to zero constraint violation with arbitrarily high probability while keeping the reward regret of order $\tilde{\mathcal{O}}(\sqrt{K})$. The algorithm which does so employs the principle of optimistic pessimism in the face of uncertainty to achieve safe exploration. When no strictly safe policy is known, though one is known to exist, then it is possible to restrict the system to bounded constraint violation with arbitrarily high probability. This is shown to be realized by a primal-dual algorithm with an optimistic primal estimate and a pessimistic dual update.
    Trainable Loss Weights in Super-Resolution. (arXiv:2301.10575v1 [cs.CV])
    In recent years, research on super-resolution has primarily focused on the development of unsupervised models, blind networks, and the use of optimization methods in non-blind models. But, limited research has discussed the loss function in the super-resolution process. The majority of those studies have only used perceptual similarity in a conventional way. This is while the development of appropriate loss can improve the quality of other methods as well. In this article, a new weighting method for pixel-wise loss is proposed. With the help of this method, it is possible to use trainable weights based on the general structure of the image and its perceptual features while maintaining the advantages of pixel-wise loss. Also, a criterion for comparing weights of loss is introduced so that the weights can be estimated directly by a convolutional neural network using this criterion. In addition, in this article, the expectation-maximization method is used for the simultaneous estimation super-resolution network and weighting network. In addition, a new activation function, called "FixedSum", is introduced which can keep the sum of all components of vector constants while keeping the output components between zero and one. As shown in the experimental results section, weighted loss by the proposed method leads to better results than the unweighted loss in both signal-to-noise and perceptual similarity senses.
    Certifiable 3D Object Pose Estimation: Foundations, Learning Models, and Self-Training. (arXiv:2206.11215v3 [cs.CV] UPDATED)
    We consider a certifiable object pose estimation problem, where -- given a partial point cloud of an object -- the goal is to not only estimate the object pose, but also to provide a certificate of correctness for the resulting estimate. Our first contribution is a general theory of certification for end-to-end perception models. In particular, we introduce the notion of $\zeta$-correctness, which bounds the distance between an estimate and the ground truth. We show that $\zeta$-correctness can be assessed by implementing two certificates: (i) a certificate of observable correctness, that asserts if the model output is consistent with the input data and prior information, (ii) a certificate of non-degeneracy, that asserts whether the input data is sufficient to compute a unique estimate. Our second contribution is to apply this theory and design a new learning-based certifiable pose estimator. We propose C-3PO, a semantic-keypoint-based pose estimation model, augmented with the two certificates, to solve the certifiable pose estimation problem. C-3PO also includes a keypoint corrector, implemented as a differentiable optimization layer, that can correct large detection errors (e.g. due to the sim-to-real gap). Our third contribution is a novel self-supervised training approach that uses our certificate of observable correctness to provide the supervisory signal to C-3PO during training. In it, the model trains only on the observably correct input-output pairs, in each training iteration. As training progresses, we see that the observably correct input-output pairs grow, eventually reaching near 100% in many cases. Our experiments show that (i) standard semantic-keypoint-based methods outperform more recent alternatives, (ii) C-3PO further improves performance and significantly outperforms all the baselines, and (iii) C-3PO's certificates are able to discern correct pose estimates.
    Two Efficient Ridge Solutions for the Incremental Broad Learning System on Added Inputs. (arXiv:1911.07292v5 [cs.LG] UPDATED)
    This paper proposes the recursive and square-root BLS algorithms to improve the original BLS for new added inputs, which utilize the inverse and inverse Cholesky factor of the Hermitian matrix in the ridge inverse, respectively, to update the ridge solution. The recursive BLS updates the inverse by the matrix inversion lemma, while the square-root BLS updates the upper-triangular inverse Cholesky factor by multiplying it with an upper-triangular intermediate matrix. When the added p training samples are more than the total k nodes in the network, i.e., p>k, the inverse of a sum of matrices is applied to take a smaller matrix inversion or inverse Cholesky factorization. For the distributed BLS with data-parallelism, we introduce the parallel implementation of the square-root BLS, which is deduced from the parallel implementation of the inverse Cholesky factorization. The original BLS based on the generalized inverse with the ridge regression assumes the ridge parameter lamda->0 in the ridge inverse. When lambda->0 is not satisfied, the numerical experiments on the MNIST and NORB datasets show that both the proposed ridge solutions improve the testing accuracy of the original BLS, and the improvement becomes more significant as lambda is bigger. On the other hand, compared to the original BLS, both the proposed BLS algorithms theoretically require less complexities, and are significantly faster in the simulations on the MNIST dataset. The speedups in total training time of the recursive and square-root BLS algorithms over the original BLS are 4.41 and 6.92 respectively when p > k, and are 2.80 and 1.59 respectively when p < k.
    Certifying Neural Network Robustness to Random Input Noise from Samples. (arXiv:2010.07532v2 [cs.LG] UPDATED)
    Methods to certify the robustness of neural networks in the presence of input uncertainty are vital in safety-critical settings. Most certification methods in the literature are designed for adversarial input uncertainty, but researchers have recently shown a need for methods that consider random uncertainty. In this paper, we propose a novel robustness certification method that upper bounds the probability of misclassification when the input noise follows an arbitrary probability distribution. This bound is cast as a chance-constrained optimization problem, which is then reformulated using input-output samples to replace the optimization constraints. The resulting optimization reduces to a linear program with an analytical solution. Furthermore, we develop a sufficient condition on the number of samples needed to make the misclassification bound hold with overwhelming probability. Our case studies on MNIST classifiers show that this method is able to certify a uniform infinity-norm uncertainty region with a radius of nearly 50 times larger than what the current state-of-the-art method can certify.
    Bridging Text and Knowledge with Multi-Prototype Embedding for Few-Shot Relational Triple Extraction. (arXiv:2010.16059v1 [cs.CL] CROSS LISTED)
    Current supervised relational triple extraction approaches require huge amounts of labeled data and thus suffer from poor performance in few-shot settings. However, people can grasp new knowledge by learning a few instances. To this end, we take the first step to study the few-shot relational triple extraction, which has not been well understood. Unlike previous single-task few-shot problems, relational triple extraction is more challenging as the entities and relations have implicit correlations. In this paper, We propose a novel multi-prototype embedding network model to jointly extract the composition of relational triples, namely, entity pairs and corresponding relations. To be specific, we design a hybrid prototypical learning mechanism that bridges text and knowledge concerning both entities and relations. Thus, implicit correlations between entities and relations are injected. Additionally, we propose a prototype-aware regularization to learn more representative prototypes. Experimental results demonstrate that the proposed method can improve the performance of the few-shot triple extraction.
    LightNER: A Lightweight Tuning Paradigm for Low-resource NER via Pluggable Prompting. (arXiv:2109.00720v5 [cs.CL] CROSS LISTED)
    Most NER methods rely on extensive labeled data for model training, which struggles in the low-resource scenarios with limited training data. Existing dominant approaches usually suffer from the challenge that the target domain has different label sets compared with a resource-rich source domain, which can be concluded as class transfer and domain transfer. In this paper, we propose a lightweight tuning paradigm for low-resource NER via pluggable prompting (LightNER). Specifically, we construct the unified learnable verbalizer of entity categories to generate the entity span sequence and entity categories without any label-specific classifiers, thus addressing the class transfer issue. We further propose a pluggable guidance module by incorporating learnable parameters into the self-attention layer as guidance, which can re-modulate the attention and adapt pre-trained weights. Note that we only tune those inserted module with the whole parameter of the pre-trained language model fixed, thus, making our approach lightweight and flexible for low-resource scenarios and can better transfer knowledge across domains. Experimental results show that LightNER can obtain comparable performance in the standard supervised setting and outperform strong baselines in low-resource settings. Code is in https://github.com/zjunlp/DeepKE/tree/main/example/ner/few-shot.
    VN-Transformer: Rotation-Equivariant Attention for Vector Neurons. (arXiv:2206.04176v3 [cs.CV] UPDATED)
    Rotation equivariance is a desirable property in many practical applications such as motion forecasting and 3D perception, where it can offer benefits like sample efficiency, better generalization, and robustness to input perturbations. Vector Neurons (VN) is a recently developed framework offering a simple yet effective approach for deriving rotation-equivariant analogs of standard machine learning operations by extending one-dimensional scalar neurons to three-dimensional "vector neurons." We introduce a novel "VN-Transformer" architecture to address several shortcomings of the current VN models. Our contributions are: $(i)$ we derive a rotation-equivariant attention mechanism which eliminates the need for the heavy feature preprocessing required by the original Vector Neurons models; $(ii)$ we extend the VN framework to support non-spatial attributes, expanding the applicability of these models to real-world datasets; $(iii)$ we derive a rotation-equivariant mechanism for multi-scale reduction of point-cloud resolution, greatly speeding up inference and training; $(iv)$ we show that small tradeoffs in equivariance ($\epsilon$-approximate equivariance) can be used to obtain large improvements in numerical stability and training robustness on accelerated hardware, and we bound the propagation of equivariance violations in our models. Finally, we apply our VN-Transformer to 3D shape classification and motion forecasting with compelling results.
    PULL: Reactive Log Anomaly Detection Based On Iterative PU Learning. (arXiv:2301.10681v1 [cs.LG])
    Due to the complexity of modern IT services, failures can be manifold, occur at any stage, and are hard to detect. For this reason, anomaly detection applied to monitoring data such as logs allows gaining relevant insights to improve IT services steadily and eradicate failures. However, existing anomaly detection methods that provide high accuracy often rely on labeled training data, which are time-consuming to obtain in practice. Therefore, we propose PULL, an iterative log analysis method for reactive anomaly detection based on estimated failure time windows provided by monitoring systems instead of labeled data. Our attention-based model uses a novel objective function for weak supervision deep learning that accounts for imbalanced data and applies an iterative learning strategy for positive and unknown samples (PU learning) to identify anomalous logs. Our evaluation shows that PULL consistently outperforms ten benchmark baselines across three different datasets and detects anomalous log messages with an F1-score of more than 0.99 even within imprecise failure time windows.
    Probing Taxonomic and Thematic Embeddings for Taxonomic Information. (arXiv:2301.10656v1 [cs.CL])
    Modelling taxonomic and thematic relatedness is important for building AI with comprehensive natural language understanding. The goal of this paper is to learn more about how taxonomic information is structurally encoded in embeddings. To do this, we design a new hypernym-hyponym probing task and perform a comparative probing study of taxonomic and thematic SGNS and GloVe embeddings. Our experiments indicate that both types of embeddings encode some taxonomic information, but the amount, as well as the geometric properties of the encodings, are independently related to both the encoder architecture, as well as the embedding training data. Specifically, we find that only taxonomic embeddings carry taxonomic information in their norm, which is determined by the underlying distribution in the data.
    XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models. (arXiv:2301.10472v1 [cs.CL])
    Large multilingual language models typically rely on a single vocabulary shared across 100+ languages. As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged. This vocabulary bottleneck limits the representational capabilities of multilingual models like XLM-R. In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically more semantically meaningful and shorter compared to XLM-R. Leveraging this improved vocabulary, we train XLM-V, a multilingual language model with a one million token vocabulary. XLM-V outperforms XLM-R on every task we tested on ranging from natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), and named entity recognition (WikiAnn) to low-resource tasks (Americas NLI, MasakhaNER).
    A Boosting Approach to Reinforcement Learning. (arXiv:2108.09767v2 [cs.LG] UPDATED)
    Reducing reinforcement learning to supervised learning is a well-studied and effective approach that leverages the benefits of compact function approximation to deal with large-scale Markov decision processes. Independently, the boosting methodology (e.g. AdaBoost) has proven to be indispensable in designing efficient and accurate classification algorithms by combining inaccurate rules-of-thumb. In this paper, we take a further step: we reduce reinforcement learning to a sequence of weak learning problems. Since weak learners perform only marginally better than random guesses, such subroutines constitute a weaker assumption than the availability of an accurate supervised learning oracle. We prove that the sample complexity and running time bounds of the proposed method do not explicitly depend on the number of states. While existing results on boosting operate on convex losses, the value function over policies is non-convex. We show how to use a non-convex variant of the Frank-Wolfe method for boosting, that additionally improves upon the known sample complexity and running time even for reductions to supervised learning.
    Transfer Learning in Deep Learning Models for Building Load Forecasting: Case of Limited Data. (arXiv:2301.10663v1 [cs.LG])
    Precise load forecasting in buildings could increase the bill savings potential and facilitate optimized strategies for power generation planning. With the rapid evolution of computer science, data-driven techniques, in particular the Deep Learning models, have become a promising solution for the load forecasting problem. These models have showed accurate forecasting results; however, they need abundance amount of historical data to maintain the performance. Considering the new buildings and buildings with low resolution measuring equipment, it is difficult to get enough historical data from them, leading to poor forecasting performance. In order to adapt Deep Learning models for buildings with limited and scarce data, this paper proposes a Building-to-Building Transfer Learning framework to overcome the problem and enhance the performance of Deep Learning models. The transfer learning approach was applied to a new technique known as Transformer model due to its efficacy in capturing data trends. The performance of the algorithm was tested on a large commercial building with limited data. The result showed that the proposed approach improved the forecasting accuracy by 56.8% compared to the case of conventional deep learning where training from scratch is used. The paper also compared the proposed Transformer model to other sequential deep learning models such as Long-short Term Memory (LSTM) and Recurrent Neural Network (RNN). The accuracy of the transformer model outperformed other models by reducing the root mean square error to 0.009, compared to LSTM with 0.011 and RNN with 0.051.
    Dimensionality Expansion of Load Monitoring Time Series and Transfer Learning for EMS. (arXiv:2204.02802v3 [cs.LG] UPDATED)
    Energy management systems (EMS) rely on (non)-intrusive load monitoring (N)ILM to monitor and manage appliances and help residents be more energy efficient and thus more frugal. The robustness as well as the transfer potential of the most promising machine learning solutions for (N)ILM is not yet fully understood as they are trained and evaluated on relatively limited data. In this paper, we propose a new approach for load monitoring in building EMS based on dimensionality expansion of time series and transfer learning. We perform an extensive evaluation on 5 different low-frequency datasets. The proposed feature dimensionality expansion using video-like transformation and resource-aware deep learning architecture achieves an average weighted F1 score of 0.88 across the datasets with 29 appliances and is computationally more efficient compared to the state-of-the-art imaging methods. Investigating the proposed method for cross-dataset intra-domain transfer learning, we find that 1) our method performs with an average weighted F1 score of 0.80 while requiring 3-times fewer epochs for model training compared to the non-transfer approach, 2) can achieve an F1 score of 0.75 with only 230 data samples, and 3) our transfer approach outperforms the state-of-the-art in precision drop by up to 12 percentage points for unseen appliances.
    What are the Machine Learning best practices reported by practitioners on Stack Exchange?. (arXiv:2301.10516v1 [cs.SE])
    Machine Learning (ML) is being used in multiple disciplines due to its powerful capability to infer relationships within data. In particular, Software Engineering (SE) is one of those disciplines in which ML has been used for multiple tasks, like software categorization, bugs prediction, and testing. In addition to the multiple ML applications, some studies have been conducted to detect and understand possible pitfalls and issues when using ML. However, to the best of our knowledge, only a few studies have focused on presenting ML best practices or guidelines for the application of ML in different domains. In addition, the practices and literature presented in previous literature (i) are domain-specific (e.g., concrete practices in biomechanics), (ii) describe few practices, or (iii) the practices lack rigorous validation and are presented in gray literature. In this paper, we present a study listing 127 ML best practices systematically mining 242 posts of 14 different Stack Exchange (STE) websites and validated by four independent ML experts. The list of practices is presented in a set of categories related to different stages of the implementation process of an ML-enabled system; for each practice, we include explanations and examples. In all the practices, the provided examples focus on SE tasks. We expect this list of practices could help practitioners to understand better the practices and use ML in a more informed way, in particular newcomers to this new area that sits at the intersection of software engineering and machine learning.
    Infinitesimal gradient boosting. (arXiv:2104.13208v2 [stat.ML] UPDATED)
    We define infinitesimal gradient boosting as a limit of the popular tree-based gradient boosting algorithm from machine learning. The limit is considered in the vanishing-learning-rate asymptotic, that is when the learning rate tends to zero and the number of gradient trees is rescaled accordingly. For this purpose, we introduce a new class of randomized regression trees bridging totally randomized trees and Extra Trees and using a softmax distribution for binary splitting. Our main result is the convergence of the associated stochastic algorithm and the characterization of the limiting procedure as the unique solution of a nonlinear ordinary differential equation in a infinite dimensional function space. Infinitesimal gradient boosting defines a smooth path in the space of continuous functions along which the training error decreases, the residuals remain centered and the total variation is well controlled.
    Backward Compatibility During Data Updates by Weight Interpolation. (arXiv:2301.10546v1 [cs.LG])
    Backward compatibility of model predictions is a desired property when updating a machine learning driven application. It allows to seamlessly improve the underlying model without introducing regression bugs. In classification tasks these bugs occur in the form of negative flips. This means an instance that was correctly classified by the old model is now classified incorrectly by the updated model. This has direct negative impact on the user experience of such systems e.g. a frequently used voice assistant query is suddenly misclassified. A common reason to update the model is when new training data becomes available and needs to be incorporated. Simply retraining the model with the updated data introduces the unwanted negative flips. We study the problem of regression during data updates and propose Backward Compatible Weight Interpolation (BCWI). This method interpolates between the weights of the old and new model and we show in extensive experiments that it reduces negative flips without sacrificing the improved accuracy of the new model. BCWI is straight forward to implement and does not increase inference cost. We also explore the use of importance weighting during interpolation and averaging the weights of multiple new models in order to further reduce negative flips.
    Meta-Learning PAC-Bayes Priors in Model Averaging. (arXiv:1912.11252v3 [cs.LG] UPDATED)
    Nowadays model uncertainty has become one of the most important problems in both academia and industry. In this paper, we mainly consider the scenario in which we have a common model set used for model averaging instead of selecting a single final model via a model selection procedure to account for this model's uncertainty to improve the reliability and accuracy of inferences. Here one main challenge is to learn the prior over the model set. To tackle this problem, we propose two data-based algorithms to get proper priors for model averaging. One is for meta-learner, the analysts should use historical similar tasks to extract the information about the prior. The other one is for base-learner, a subsampling method is used to deal with the data step by step. Theoretically, an upper bound of risk for our algorithm is presented to guarantee the performance of the worst situation. In practice, both methods perform well in simulations and real data studies, especially with poor-quality data.
    Truthful Self-Play. (arXiv:2106.03007v4 [stat.ML] UPDATED)
    We present a general optimization framework for emergent belief-state representation without any supervision. We employed the common configuration of multiagent reinforcement learning and communication to improve exploration coverage over an environment by leveraging the knowledge of each agent. In this paper, we obtained that recurrent neural nets (RNNs) with shared weights are highly biased in partially observable environments because of their noncooperativity. To address this, we designated an unbiased version of self-play via mechanism design, also known as reverse game theory, to clarify unbiased knowledge at the Bayesian Nash equilibrium. The key idea is to add imaginary rewards using the peer prediction mechanism, i.e., a mechanism for mutually criticizing information in a decentralized environment. Numerical analyses, including StarCraft exploration tasks with up to 20 agents and off-the-shelf RNNs, demonstrate the state-of-the-art performance.
    Automated multilingual detection of Pro-Kremlin propaganda in newspapers and Telegram posts. (arXiv:2301.10604v1 [cs.CL])
    The full-scale conflict between the Russian Federation and Ukraine generated an unprecedented amount of news articles and social media data reflecting opposing ideologies and narratives. These polarized campaigns have led to mutual accusations of misinformation and fake news, shaping an atmosphere of confusion and mistrust for readers worldwide. This study analyses how the media affected and mirrored public opinion during the first month of the war using news articles and Telegram news channels in Ukrainian, Russian, Romanian and English. We propose and compare two methods of multilingual automated pro-Kremlin propaganda identification, based on Transformers and linguistic features. We analyse the advantages and disadvantages of both methods, their adaptability to new genres and languages, and ethical considerations of their usage for content moderation. With this work, we aim to lay the foundation for further development of moderation tools tailored to the current conflict.  ( 2 min )
    E(n)-equivariant Graph Neural Cellular Automata. (arXiv:2301.10497v1 [cs.LG])
    Cellular automata (CAs) are computational models exhibiting rich dynamics emerging from the local interaction of cells arranged in a regular lattice. Graph CAs (GCAs) generalise standard CAs by allowing for arbitrary graphs rather than regular lattices, similar to how Graph Neural Networks (GNNs) generalise Convolutional NNs. Recently, Graph Neural CAs (GNCAs) have been proposed as models built on top of standard GNNs that can be trained to approximate the transition rule of any arbitrary GCA. Existing GNCAs are anisotropic in the sense that their transition rules are not equivariant to translation, rotation, and reflection of the nodes' spatial locations. However, it is desirable for instances related by such transformations to be treated identically by the model. By replacing standard graph convolutions with E(n)-equivariant ones, we avoid anisotropy by design and propose a class of isotropic automata that we call E(n)-GNCAs. These models are lightweight, but can nevertheless handle large graphs, capture complex dynamics and exhibit emergent self-organising behaviours. We showcase the broad and successful applicability of E(n)-GNCAs on three different tasks: (i) pattern formation, (ii) graph auto-encoding, and (iii) simulation of E(n)-equivariant dynamical systems.  ( 2 min )
    When to Trust Aggregated Gradients: Addressing Negative Client Sampling in Federated Learning. (arXiv:2301.10400v1 [cs.LG])
    Federated Learning has become a widely-used framework which allows learning a global model on decentralized local datasets under the condition of protecting local data privacy. However, federated learning faces severe optimization difficulty when training samples are not independently and identically distributed (non-i.i.d.). In this paper, we point out that the client sampling practice plays a decisive role in the aforementioned optimization difficulty. We find that the negative client sampling will cause the merged data distribution of currently sampled clients heavily inconsistent with that of all available clients, and further make the aggregated gradient unreliable. To address this issue, we propose a novel learning rate adaptation mechanism to adaptively adjust the server learning rate for the aggregated gradient in each round, according to the consistency between the merged data distribution of currently sampled clients and that of all available clients. Specifically, we make theoretical deductions to find a meaningful and robust indicator that is positively related to the optimal server learning rate and can effectively reflect the merged data distribution of sampled clients, and we utilize it for the server learning rate adaptation. Extensive experiments on multiple image and text classification tasks validate the great effectiveness of our method.  ( 2 min )
    Channel-wise Mixed-precision Assignment for DNN Inference on Constrained Edge Nodes. (arXiv:2206.08852v2 [cs.LG] UPDATED)
    Quantization is widely employed in both cloud and edge systems to reduce the memory occupation, latency, and energy consumption of deep neural networks. In particular, mixed-precision quantization, i.e., the use of different bit-widths for different portions of the network, has been shown to provide excellent efficiency gains with limited accuracy drops, especially with optimized bit-width assignments determined by automated Neural Architecture Search (NAS) tools. State-of-the-art mixed-precision works layer-wise, i.e., it uses different bit-widths for the weights and activations tensors of each network layer. In this work, we widen the search space, proposing a novel NAS that selects the bit-width of each weight tensor channel independently. This gives the tool the additional flexibility of assigning a higher precision only to the weights associated with the most informative features. Testing on the MLPerf Tiny benchmark suite, we obtain a rich collection of Pareto-optimal models in the accuracy vs model size and accuracy vs energy spaces. When deployed on the MPIC RISC-V edge processor, our networks reduce the memory and energy for inference by up to 63% and 27% respectively compared to a layer-wise approach, for the same accuracy.
    Improved Stock Price Movement Classification Using News Articles Based on Embeddings and Label Smoothing. (arXiv:2301.10458v1 [cs.LG])
    Stock price movement prediction is a challenging and essential problem in finance. While it is well established in modern behavioral finance that the share prices of related stocks often move after the release of news via reactions and overreactions of investors, how to capture the relationships between price movements and news articles via quantitative models is an active area research; existing models have achieved success with variable degrees. In this paper, we propose to improve stock price movement classification using news articles by incorporating regularization and optimization techniques from deep learning. More specifically, we capture the dependencies between news articles and stocks through embeddings and bidirectional recurrent neural networks as in recent models. We further incorporate weight decay, batch normalization, dropout, and label smoothing to improve the generalization of the trained models. To handle high fluctuations of validation accuracy of batch normalization, we propose dual-phase training to realize the improvements reliably. Our experimental results on a commonly used dataset show significant improvements, achieving average accuracy of 80.7% on the test set, which is more than 10.0% absolute improvement over existing models. Our ablation studies show batch normalization and label smoothing are most effective, leading to 6.0% and 3.4% absolute improvement, respectively on average.  ( 2 min )
    SGCN: Exploiting Compressed-Sparse Features in Deep Graph Convolutional Network Accelerators. (arXiv:2301.10388v1 [cs.LG])
    Graph convolutional networks (GCNs) are becoming increasingly popular as they overcome the limited applicability of prior neural networks. A GCN takes as input an arbitrarily structured graph and executes a series of layers which exploit the graph's structure to calculate their output features. One recent trend in GCNs is the use of deep network architectures. As opposed to the traditional GCNs which only span around two to five layers deep, modern GCNs now incorporate tens to hundreds of layers with the help of residual connections. From such deep GCNs, we find an important characteristic that they exhibit very high intermediate feature sparsity. We observe that with deep layers and residual connections, the number of zeros in the intermediate features sharply increases. This reveals a new opportunity for accelerators to exploit in GCN executions that was previously not present. In this paper, we propose SGCN, a fast and energy-efficient GCN accelerator which fully exploits the sparse intermediate features of modern GCNs. SGCN suggests several techniques to achieve significantly higher performance and energy efficiency than the existing accelerators. First, SGCN employs a GCN-friendly feature compression format. We focus on reducing the off-chip memory traffic, which often is the bottleneck for GCN executions. Second, we propose microarchitectures for seamlessly handling the compressed feature format. Third, to better handle locality in the existence of the varying sparsity, SGCN employs sparsity-aware cooperation. Sparsity-aware cooperation creates a pattern that exhibits multiple reuse windows, such that the cache can capture diverse sizes of working sets and therefore adapt to the varying level of sparsity. We show that SGCN achieves 1.71x speedup and 43.9% higher energy efficiency compared to the existing accelerators.  ( 2 min )
    Overcoming Prior Misspecification in Online Learning to Rank. (arXiv:2301.10651v1 [cs.LG])
    The recent literature on online learning to rank (LTR) has established the utility of prior knowledge to Bayesian ranking bandit algorithms. However, a major limitation of existing work is the requirement for the prior used by the algorithm to match the true prior. In this paper, we propose and analyze adaptive algorithms that address this issue and additionally extend these results to the linear and generalized linear models. We also consider scalar relevance feedback on top of click feedback. Moreover, we demonstrate the efficacy of our algorithms using both synthetic and real-world experiments.  ( 2 min )
    Banker Online Mirror Descent: A Universal Approach for Delayed Online Bandit Learning. (arXiv:2301.10500v1 [cs.LG])
    We propose `Banker-OMD`, a novel framework generalizing the classical Online Mirror Descent (OMD) technique in the online learning literature. The `Banker-OMD` framework almost completely decouples feedback delay handling and the task-specific OMD algorithm design, thus allowing the easy design of new algorithms capable of easily and robustly handling feedback delays. Specifically, it offers a general methodology for achieving $\tilde{\mathcal O}(\sqrt{T} + \sqrt{D})$-style regret bounds in online bandit learning tasks with delayed feedback, where $T$ is the number of rounds and $D$ is the total feedback delay. We demonstrate the power of \texttt{Banker-OMD} by applications to two important bandit learning scenarios with delayed feedback, including delayed scale-free adversarial Multi-Armed Bandits (MAB) and delayed adversarial linear bandits. `Banker-OMD` leads to the first delayed scale-free adversarial MAB algorithm achieving $\tilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret and the first delayed adversarial linear bandit algorithm achieving $\tilde{\mathcal O}(\text{poly}(n)(\sqrt{T} + \sqrt{D}))$ regret. As a corollary, the first application also implies $\tilde{\mathcal O}(\sqrt{KT}L)$ regret for non-delayed scale-free adversarial MABs, which is the first to match the $\Omega(\sqrt{KT}L)$ lower bound up to logarithmic factors and can be of independent interest.  ( 2 min )
    Integrating Local Real Data with Global Gradient Prototypes for Classifier Re-Balancing in Federated Long-Tailed Learning. (arXiv:2301.10394v1 [cs.LG])
    Federated Learning (FL) has become a popular distributed learning paradigm that involves multiple clients training a global model collaboratively in a data privacy-preserving manner. However, the data samples usually follow a long-tailed distribution in the real world, and FL on the decentralized and long-tailed data yields a poorly-behaved global model severely biased to the head classes with the majority of the training samples. To alleviate this issue, decoupled training has recently been introduced to FL, considering it has achieved promising results in centralized long-tailed learning by re-balancing the biased classifier after the instance-balanced training. However, the current study restricts the capacity of decoupled training in federated long-tailed learning with a sub-optimal classifier re-trained on a set of pseudo features, due to the unavailability of a global balanced dataset in FL. In this work, in order to re-balance the classifier more effectively, we integrate the local real data with the global gradient prototypes to form the local balanced datasets, and thus re-balance the classifier during the local training. Furthermore, we introduce an extra classifier in the training phase to help model the global data distribution, which addresses the problem of contradictory optimization goals caused by performing classifier re-balancing locally. Extensive experiments show that our method consistently outperforms the existing state-of-the-art methods in various settings.  ( 2 min )
    ViDeBERTa: A powerful pre-trained language model for Vietnamese. (arXiv:2301.10439v1 [cs.CL])
    This paper presents ViDeBERTa, a new pre-trained monolingual language model for Vietnamese, with three versions - ViDeBERTa_xsmall, ViDeBERTa_base, and ViDeBERTa_large, which are pre-trained on a large-scale corpus of high-quality and diverse Vietnamese texts using DeBERTa architecture. Although many successful pre-trained language models based on Transformer have been widely proposed for the English language, there are still few pre-trained models for Vietnamese, a low-resource language, that perform good results on downstream tasks, especially Question answering. We fine-tune and evaluate our model on three important natural language downstream tasks, Part-of-speech tagging, Named-entity recognition, and Question answering. The empirical results demonstrate that ViDeBERTa with far fewer parameters surpasses the previous state-of-the-art models on multiple Vietnamese-specific natural language understanding tasks. Notably, ViDeBERTa_base with 86M parameters, which is only about 23% of PhoBERT_large with 370M parameters, still performs the same or better results than the previous state-of-the-art model. Our ViDeBERTa models are available at: https://github.com/HySonLab/ViDeBERTa.  ( 2 min )
    Capacity Analysis of Vector Symbolic Architectures. (arXiv:2301.10352v1 [cs.LG])
    Hyperdimensional computing (HDC) is a biologically-inspired framework that uses high-dimensional vectors and various vector operations to represent and manipulate symbols. The ensemble of a particular vector space and two vector operations (one addition-like for "bundling" and one outer-product-like for "binding") form what is called a "vector symbolic architecture" (VSA). While VSAs have been employed in numerous applications and studied empirically, many theoretical questions about VSAs remain open. We provide theoretical analyses for the *representation capacities* of three popular VSAs: MAP-I, MAP-B, and Binary Sparse. Representation capacity here refers to upper bounds on the dimensions of the VSA vectors required to perform certain symbolic tasks (such as testing for set membership $i \in S$ and estimating set intersection sizes $|S \cap T|$) to a given degree of accuracy. We also describe a relationship between the MAP-I VSA to Hopfield networks, which are simple models of associative memory, and analyze the ability of Hopfield networks to perform some of the same tasks that are typically asked of VSAs. Our analysis of MAP-I casts the VSA vectors as the outputs of *sketching* (dimensionality reduction) algorithms such as the Johnson-Lindenstrauss transform; this provides a clean, simple framework for obtaining bounds on MAP-I's representation capacity. We also provide, to our knowledge, the first analysis of testing set membership in a bundle of general pairwise bindings from MAP-I. Binary sparse VSAs are well-known to be related to Bloom filters; we give analyses of set intersection for Bloom and Counting Bloom filters. Our analysis of MAP-B and Binary Sparse bundling include new applications of several concentration inequalities.  ( 2 min )
    Imitating Human Behaviour with Diffusion Models. (arXiv:2301.10677v1 [cs.AI])
    Diffusion models have emerged as powerful generative models in the text-to-image domain. This paper studies their application as observation-to-action models for imitating human behaviour in sequential environments. Human behaviour is stochastic and multimodal, with structured correlations between action dimensions. Meanwhile, standard modelling choices in behaviour cloning are limited in their expressiveness and may introduce bias into the cloned policy. We begin by pointing out the limitations of these choices. We then propose that diffusion models are an excellent fit for imitating human behaviour, since they learn an expressive distribution over the joint action space. We introduce several innovations to make diffusion models suitable for sequential environments; designing suitable architectures, investigating the role of guidance, and developing reliable sampling strategies. Experimentally, diffusion models closely match human demonstrations in a simulated robotic control task and a modern 3D gaming environment.  ( 2 min )
    HealthEdge: A Machine Learning-Based Smart Healthcare Framework for Prediction of Type 2 Diabetes in an Integrated IoT, Edge, and Cloud Computing System. (arXiv:2301.10450v1 [cs.LG])
    Diabetes Mellitus has no permanent cure to date and is one of the leading causes of death globally. The alarming increase in diabetes calls for the need to take precautionary measures to avoid/predict the occurrence of diabetes. This paper proposes HealthEdge, a machine learning-based smart healthcare framework for type 2 diabetes prediction in an integrated IoT-edge-cloud computing system. Numerical experiments and comparative analysis were carried out between the two most used machine learning algorithms in the literature, Random Forest (RF) and Logistic Regression (LR), using two real-life diabetes datasets. The results show that RF predicts diabetes with 6% more accuracy on average compared to LR.  ( 2 min )
    AutoCost: Evolving Intrinsic Cost for Zero-violation Reinforcement Learning. (arXiv:2301.10339v1 [cs.LG])
    Safety is a critical hurdle that limits the application of deep reinforcement learning (RL) to real-world control tasks. To this end, constrained reinforcement learning leverages cost functions to improve safety in constrained Markov decision processes. However, such constrained RL methods fail to achieve zero violation even when the cost limit is zero. This paper analyzes the reason for such failure, which suggests that a proper cost function plays an important role in constrained RL. Inspired by the analysis, we propose AutoCost, a simple yet effective framework that automatically searches for cost functions that help constrained RL to achieve zero-violation performance. We validate the proposed method and the searched cost function on the safe RL benchmark Safety Gym. We compare the performance of augmented agents that use our cost function to provide additive intrinsic costs with baseline agents that use the same policy learners but with only extrinsic costs. Results show that the converged policies with intrinsic costs in all environments achieve zero constraint violation and comparable performance with baselines.  ( 2 min )
    A Data-Centric Approach for Improving Adversarial Training Through the Lens of Out-of-Distribution Detection. (arXiv:2301.10454v1 [cs.LG])
    Current machine learning models achieve super-human performance in many real-world applications. Still, they are susceptible against imperceptible adversarial perturbations. The most effective solution for this problem is adversarial training that trains the model with adversarially perturbed samples instead of original ones. Various methods have been developed over recent years to improve adversarial training such as data augmentation or modifying training attacks. In this work, we examine the same problem from a new data-centric perspective. For this purpose, we first demonstrate that the existing model-based methods can be equivalent to applying smaller perturbation or optimization weights to the hard training examples. By using this finding, we propose detecting and removing these hard samples directly from the training procedure rather than applying complicated algorithms to mitigate their effects. For detection, we use maximum softmax probability as an effective method in out-of-distribution detection since we can consider the hard samples as the out-of-distribution samples for the whole data distribution. Our results on SVHN and CIFAR-10 datasets show the effectiveness of this method in improving the adversarial training without adding too much computational cost.  ( 2 min )
    Near-Optimal No-Regret Learning in General Games. (arXiv:2108.06924v2 [cs.LG] UPDATED)
    We show that Optimistic Hedge -- a common variant of multiplicative-weights-updates with recency bias -- attains ${\rm poly}(\log T)$ regret in multi-player general-sum games. In particular, when every player of the game uses Optimistic Hedge to iteratively update her strategy in response to the history of play so far, then after $T$ rounds of interaction, each player experiences total regret that is ${\rm poly}(\log T)$. Our bound improves, exponentially, the $O({T}^{1/2})$ regret attainable by standard no-regret learners in games, the $O(T^{1/4})$ regret attainable by no-regret learners with recency bias (Syrgkanis et al., 2015), and the ${O}(T^{1/6})$ bound that was recently shown for Optimistic Hedge in the special case of two-player games (Chen & Pen, 2020). A corollary of our bound is that Optimistic Hedge converges to coarse correlated equilibrium in general games at a rate of $\tilde{O}\left(\frac 1T\right)$.
    A Provable Splitting Approach for Symmetric Nonnegative Matrix Factorization. (arXiv:2301.10499v1 [cs.LG])
    The symmetric Nonnegative Matrix Factorization (NMF), a special but important class of the general NMF, has found numerous applications in data analysis such as various clustering tasks. Unfortunately, designing fast algorithms for the symmetric NMF is not as easy as for its nonsymmetric counterpart, since the latter admits the splitting property that allows state-of-the-art alternating-type algorithms. To overcome this issue, we first split the decision variable and transform the symmetric NMF to a penalized nonsymmetric one, paving the way for designing efficient alternating-type algorithms. We then show that solving the penalized nonsymmetric reformulation returns a solution to the original symmetric NMF. Moreover, we design a family of alternating-type algorithms and show that they all admit strong convergence guarantee: the generated sequence of iterates is convergent and converges at least sublinearly to a critical point of the original symmetric NMF. Finally, we conduct experiments on both synthetic data and real image clustering to support our theoretical results and demonstrate the performance of the alternating-type algorithms.  ( 2 min )
    RDIS: Random Drop Imputation with Self-Training for Incomplete Time Series Data. (arXiv:2010.10075v2 [cs.LG] UPDATED)
    Time-series data with missing values are commonly encountered in many fields, such as healthcare, meteorology, and robotics. The imputation aims to fill the missing values with valid values. Most imputation methods trained the models implicitly because missing values have no ground truth. In this paper, we propose Random Drop Imputation with Self-training (RDIS), a novel training method for time-series data imputation models. In RDIS, we generate extra missing values by applying a random drop on the observed values in incomplete data. We can explicitly train the imputation models by filling in the randomly dropped values. In addition, we adopt self-training with pseudo values to exploit the original missing values. To improve the quality of pseudo values, we set the threshold and filter them by calculating the entropy. To verify the effectiveness of RDIS on the time series imputation, we test RDIS to various imputation models and achieve competitive results on two real-world datasets.
    ARDIAS: AI-Enhanced Research Management, Discovery, and Advisory System. (arXiv:2301.10577v1 [cs.CL])
    In this work, we present ARDIAS, a web-based application that aims to provide researchers with a full suite of discovery and collaboration tools. ARDIAS currently allows searching for authors and articles by name and gaining insights into the research topics of a particular researcher. With the aid of AI-based tools, ARDIAS aims to recommend potential collaborators and topics to researchers. In the near future, we aim to add tools that allow researchers to communicate with each other and start new projects.
    Multimodal Analogical Reasoning over Knowledge Graphs. (arXiv:2210.00312v3 [cs.CL] UPDATED)
    Analogical reasoning is fundamental to human cognition and holds an important place in various fields. However, previous studies mainly focus on single-modal analogical reasoning and ignore taking advantage of structure knowledge. Notably, the research in cognitive psychology has demonstrated that information from multimodal sources always brings more powerful cognitive transfer than single modality sources. To this end, we introduce the new task of multimodal analogical reasoning over knowledge graphs, which requires multimodal reasoning ability with the help of background knowledge. Specifically, we construct a Multimodal Analogical Reasoning dataSet (MARS) and a multimodal knowledge graph MarKG. We evaluate with multimodal knowledge graph embedding and pre-trained Transformer baselines, illustrating the potential challenges of the proposed task. We further propose a novel model-agnostic Multimodal analogical reasoning framework with Transformer (MarT) motivated by the structure mapping theory, which can obtain better performance. Code and datasets are available in https://github.com/zjunlp/MKG_Analogy.
    OCD: Learning to Overfit with Conditional Diffusion Models. (arXiv:2210.00471v4 [cs.LG] UPDATED)
    We present a dynamic model in which the weights are conditioned on an input sample x and are learned to match those that would be obtained by finetuning a base model on x and its label y. This mapping between an input sample and network weights is approximated by a denoising diffusion model. The diffusion model we employ focuses on modifying a single layer of the base model and is conditioned on the input, activations, and output of this layer. Since the diffusion model is stochastic in nature, multiple initializations generate different networks, forming an ensemble, which leads to further improvements. Our experiments demonstrate the wide applicability of the method for image classification, 3D reconstruction, tabular data, speech separation, and natural language processing. Our code is available at https://github.com/ShaharLutatiPersonal/OCD
    Batch Bayesian Optimization on Permutations using the Acquisition Weighted Kernel. (arXiv:2102.13382v2 [stat.ML] UPDATED)
    In this work we propose a batch Bayesian optimization method for combinatorial problems on permutations, which is well suited for expensive-to-evaluate objectives. We first introduce LAW, an efficient batch acquisition method based on determinantal point processes using the acquisition weighted kernel. Relying on multiple parallel evaluations, LAW enables accelerated search on combinatorial spaces. We then apply the framework to permutation problems, which have so far received little attention in the Bayesian Optimization literature, despite their practical importance. We call this method LAW2ORDER. On the theoretical front, we prove that LAW2ORDER has vanishing simple regret by showing that the batch cumulative regret is sublinear. Empirically, we assess the method on several standard combinatorial problems involving permutations such as quadratic assignment, flowshop scheduling and the traveling salesman, as well as on a structure learning task.
    A blob method for inhomogeneous diffusion with applications to multi-agent control and sampling. (arXiv:2202.12927v3 [math.AP] UPDATED)
    As a counterpoint to classical stochastic particle methods for linear diffusion equations, we develop a deterministic particle method for the weighted porous medium equation (WPME) and prove its convergence on bounded time intervals. This generalizes related work on blob methods for unweighted porous medium equations. From a numerical analysis perspective, our method has several advantages: it is meshfree, preserves the gradient flow structure of the underlying PDE, converges in arbitrary dimension, and captures the correct asymptotic behavior in simulations. That our method succeeds in capturing the long time behavior of WPME is significant from the perspective of related problems in quantization. Just as the Fokker-Planck equation provides a way to quantize a probability measure $\bar{\rho}$ by evolving an empirical measure according to stochastic Langevin dynamics so that the empirical measure flows toward $\bar{\rho}$, our particle method provides a way to quantize $\bar{\rho}$ according to deterministic particle dynamics approximating WMPE. In this way, our method has natural applications to multi-agent coverage algorithms and sampling probability measures. A specific case of our method corresponds exactly to confined mean-field dynamics of training a two-layer neural network for a radial basis function activation function. From this perspective, our convergence result shows that, in the overparametrized regime and as the variance of the radial basis functions goes to zero, the continuum limit is given by WPME. This generalizes previous results, which considered the case of a uniform data distribution, to the more general inhomogeneous setting. As a consequence of our convergence result, we identify conditions on the target function and data distribution for which convexity of the energy landscape emerges in the continuum limit.
    Learning to Rank Normalized Entropy Curves with Differentiable Window Transformation. (arXiv:2301.10443v1 [cs.LG])
    Recent automated machine learning systems often use learning curves ranking models to inform decisions about when to stop unpromising trials and identify better model configurations. In this paper, we present a novel learning curve ranking model specifically tailored for ranking normalized entropy (NE) learning curves, which are commonly used in online advertising and recommendation systems. Our proposed model, self-Adaptive Curve Transformation augmented Relative curve Ranking (ACTR2), features an adaptive curve transformation layer that transforms raw lifetime NE curves into composite window NE curves with the window sizes adaptively optimized based on both the position on the learning curve and the curve's dynamics. We also introduce a novel differentiable indexing method for the proposed adaptive curve transformation, which allows gradients with respect to the discrete indices to flow freely through the curve transformation layer, enabling the learned window sizes to be updated flexibly during training. Additionally, we propose a pairwise curve ranking architecture that directly models the difference between the two learning curves and is better at capturing subtle changes in relative performance that may not be evident when modeling each curve individually as the existing approaches did. Our extensive experiments on a real-world NE curve dataset demonstrate the effectiveness of each key component of ACTR2 and its improved performance over the state-of-the-art.
    MLPGradientFlow: going with the flow of multilayer perceptrons (and finding minima fast and accurately). (arXiv:2301.10638v1 [cs.LG])
    MLPGradientFlow is a software package to solve numerically the gradient flow differential equation $\dot \theta = -\nabla \mathcal L(\theta; \mathcal D)$, where $\theta$ are the parameters of a multi-layer perceptron, $\mathcal D$ is some data set, and $\nabla \mathcal L$ is the gradient of a loss function. We show numerically that adaptive first- or higher-order integration methods based on Runge-Kutta schemes have better accuracy and convergence speed than gradient descent with the Adam optimizer. However, we find Newton's method and approximations like BFGS preferable to find fixed points (local and global minima of $\mathcal L$) efficiently and accurately. For small networks and data sets, gradients are usually computed faster than in pytorch and Hessian are computed at least $5\times$ faster. Additionally, the package features an integrator for a teacher-student setup with bias-free, two-layer networks trained with standard Gaussian input in the limit of infinite data. The code is accessible at https://github.com/jbrea/MLPGradientFlow.jl.  ( 2 min )
    Understanding and Improving Deep Graph Neural Networks: A Probabilistic Graphical Model Perspective. (arXiv:2301.10536v1 [cs.LG])
    Recently, graph-based models designed for downstream tasks have significantly advanced research on graph neural networks (GNNs). GNN baselines based on neural message-passing mechanisms such as GCN and GAT perform worse as the network deepens. Therefore, numerous GNN variants have been proposed to tackle this performance degradation problem, including many deep GNNs. However, a unified framework is still lacking to connect these existing models and interpret their effectiveness at a high level. In this work, we focus on deep GNNs and propose a novel view for understanding them. We establish a theoretical framework via inference on a probabilistic graphical model. Given the fixed point equation (FPE) derived from the variational inference on the Markov random fields, the deep GNNs, including JKNet, GCNII, DGCN, and the classical GNNs, such as GCN, GAT, and APPNP, can be regarded as different approximations of the FPE. Moreover, given this framework, more accurate approximations of FPE are brought, guiding us to design a more powerful GNN: coupling graph neural network (CoGNet). Extensive experiments are carried out on citation networks and natural language processing downstream tasks. The results demonstrate that the CoGNet outperforms the SOTA models.  ( 2 min )
    DEJA VU: Continual Model Generalization For Unseen Domains. (arXiv:2301.10418v1 [cs.LG])
    In real-world applications, deep learning models often run in non-stationary environments where the target data distribution continually shifts over time. There have been numerous domain adaptation (DA) methods in both online and offline modes to improve cross-domain adaptation ability. However, these DA methods typically only provide good performance after a long period of adaptation, and perform poorly on new domains before and during adaptation - in what we call the "Unfamiliar Period", especially when domain shifts happen suddenly and significantly. On the other hand, domain generalization (DG) methods have been proposed to improve the model generalization ability on unadapted domains. However, existing DG works are ineffective for continually changing domains due to severe catastrophic forgetting of learned knowledge. To overcome these limitations of DA and DG in handling the Unfamiliar Period during continual domain shift, we propose RaTP, a framework that focuses on improving models' target domain generalization (TDG) capability, while also achieving effective target domain adaptation (TDA) capability right after training on certain domains and forgetting alleviation (FA) capability on past domains. RaTP includes a training-free data augmentation module to prepare data for TDG, a novel pseudo-labeling mechanism to provide reliable supervision for TDA, and a prototype contrastive alignment algorithm to align different domains for achieving TDG, TDA and FA. Extensive experiments on Digits, PACS, and DomainNet demonstrate that RaTP significantly outperforms state-of-the-art works from Continual DA, Source-Free DA, Test-Time/Online DA, Single DG, Multiple DG and Unified DA&DG in TDG, and achieves comparable TDA and FA capabilities.  ( 2 min )
    Off-Policy Evaluation for Action-Dependent Non-Stationary Environments. (arXiv:2301.10330v1 [cs.LG])
    Methods for sequential decision-making are often built upon a foundational assumption that the underlying decision process is stationary. This limits the application of such methods because real-world problems are often subject to changes due to external factors (passive non-stationarity), changes induced by interactions with the system itself (active non-stationarity), or both (hybrid non-stationarity). In this work, we take the first steps towards the fundamental challenge of on-policy and off-policy evaluation amidst structured changes due to active, passive, or hybrid non-stationarity. Towards this goal, we make a higher-order stationarity assumption such that non-stationarity results in changes over time, but the way changes happen is fixed. We propose, OPEN, an algorithm that uses a double application of counterfactual reasoning and a novel importance-weighted instrument-variable regression to obtain both a lower bias and a lower variance estimate of the structure in the changes of a policy's past performances. Finally, we show promising results on how OPEN can be used to predict future performances for several domains inspired by real-world applications that exhibit non-stationarity.  ( 2 min )
    Exact Fractional Inference via Re-Parametrization & Interpolation between Tree-Re-Weighted- and Belief Propagation- Algorithms. (arXiv:2301.10369v1 [cs.LG])
    Inference efforts -- required to compute partition function, $Z$, of an Ising model over a graph of $N$ ``spins" -- are most likely exponential in $N$. Efficient variational methods, such as Belief Propagation (BP) and Tree Re-Weighted (TRW) algorithms, compute $Z$ approximately minimizing respective (BP- or TRW-) free energy. We generalize the variational scheme building a $\lambda$-fractional-homotopy, $Z^{(\lambda)}$, where $\lambda=0$ and $\lambda=1$ correspond to TRW- and BP-approximations, respectively, and $Z^{(\lambda)}$ decreases with $\lambda$ monotonically. Moreover, this fractional scheme guarantees that in the attractive (ferromagnetic) case $Z^{(TRW)}\geq Z^{(\lambda)}\geq Z^{(BP)}$, and there exists a unique (``exact") $\lambda_*$ such that, $Z=Z^{(\lambda_*)}$. Generalizing the re-parametrization approach of \cite{wainwright_tree-based_2002} and the loop series approach of \cite{chertkov_loop_2006}, we show how to express $Z$ as a product, $\forall \lambda:\ Z=Z^{(\lambda)}{\cal Z}^{(\lambda)}$, where the multiplicative correction, ${\cal Z}^{(\lambda)}$, is an expectation over a node-independent probability distribution built from node-wise fractional marginals. Our theoretical analysis is complemented by extensive experiments with models from Ising ensembles over planar and random graphs of medium- and large- sizes. The empirical study yields a number of interesting observations, such as (a) ability to estimate ${\cal Z}^{(\lambda)}$ with $O(N^4)$ fractional samples; (b) suppression of $\lambda_*$ fluctuations with increase in $N$ for instances from a particular random Ising ensemble.
    On Batching Variable Size Inputs for Training End-to-End Speech Enhancement Systems. (arXiv:2301.10587v1 [cs.SD])
    The performance of neural network-based speech enhancement systems is primarily influenced by the model architecture, whereas training times and computational resource utilization are primarily affected by training parameters such as the batch size. Since noisy and reverberant speech mixtures can have different duration, a batching strategy is required to handle variable size inputs during training, in particular for state-of-the-art end-to-end systems. Such strategies usually strive a compromise between zero-padding and data randomization, and can be combined with a dynamic batch size for a more consistent amount of data in each batch. However, the effect of these practices on resource utilization and more importantly network performance is not well documented. This paper is an empirical study of the effect of different batching strategies and batch sizes on the training statistics and speech enhancement performance of a Conv-TasNet, evaluated in both matched and mismatched conditions. We find that using a small batch size during training improves performance in both conditions for all batching strategies. Moreover, using sorted or bucket batching with a dynamic batch size allows for reduced training time and GPU memory usage while achieving similar performance compared to random batching with a fixed batch size.
    Audience-Centric Natural Language Generation via Style Infusion. (arXiv:2301.10283v1 [cs.CL])
    Adopting contextually appropriate, audience-tailored linguistic styles is critical to the success of user-centric language generation systems (e.g., chatbots, computer-aided writing, dialog systems). While existing approaches demonstrate textual style transfer with large volumes of parallel or non-parallel data, we argue that grounding style on audience-independent external factors is innately limiting for two reasons. First, it is difficult to collect large volumes of audience-specific stylistic data. Second, some stylistic objectives (e.g., persuasiveness, memorability, empathy) are hard to define without audience feedback. In this paper, we propose the novel task of style infusion - infusing the stylistic preferences of audiences in pretrained language generation models. Since humans are better at pairwise comparisons than direct scoring - i.e., is Sample-A more persuasive/polite/empathic than Sample-B - we leverage limited pairwise human judgments to bootstrap a style analysis model and augment our seed set of judgments. We then infuse the learned textual style in a GPT-2 based text generator while balancing fluency and style adoption. With quantitative and qualitative assessments, we show that our infusion approach can generate compelling stylized examples with generic text prompts. The code and data are accessible at https://github.com/CrowdDynamicsLab/StyleInfusion.  ( 2 min )
    NIAPU: network-informed adaptive positive-unlabeled learning for disease gene identification. (arXiv:2108.06158v4 [cs.LG] UPDATED)
    Gene-disease associations are fundamental for understanding disease etiology and developing effective interventions and treatments. Identifying genes not yet associated with a disease due to a lack of studies is a challenging task in which prioritization based on prior knowledge is an important element. The computational search for new candidate disease genes may be eased by positive-unlabeled learning, the machine learning setting in which only a subset of instances are labeled as positive while the rest of the data set is unlabeled. In this work, we propose a set of effective network-based features to be used in a novel Markov diffusion-based multi-class labeling strategy for putative disease gene discovery. The performances of the new labeling algorithm and the effectiveness of the proposed features have been tested on ten different disease data sets using three machine learning algorithms. The new features have been compared against classical topological and functional/ontological features and a set of network- and biological-derived features already used in gene discovery tasks. The predictive power of the integrated methodology in searching for new disease genes has been found to be competitive against state-of-the-art algorithms.
    Voint Cloud: Multi-View Point Cloud Representation for 3D Understanding. (arXiv:2111.15363v2 [cs.CV] UPDATED)
    Multi-view projection methods have demonstrated promising performance on 3D understanding tasks like 3D classification and segmentation. However, it remains unclear how to combine such multi-view methods with the widely available 3D point clouds. Previous methods use unlearned heuristics to combine features at the point level. To this end, we introduce the concept of the multi-view point cloud (Voint cloud), representing each 3D point as a set of features extracted from several view-points. This novel 3D Voint cloud representation combines the compactness of 3D point cloud representation with the natural view-awareness of multi-view representation. Naturally, we can equip this new representation with convolutional and pooling operations. We deploy a Voint neural network (VointNet) to learn representations in the Voint space. Our novel representation achieves \sota performance on 3D classification, shape retrieval, and robust 3D part segmentation on standard benchmarks ( ScanObjectNN, ShapeNet Core55, and ShapeNet Parts).
    Lightweight Neural Architecture Search for Temporal Convolutional Networks at the Edge. (arXiv:2301.10281v1 [cs.LG])
    Neural Architecture Search (NAS) is quickly becoming the go-to approach to optimize the structure of Deep Learning (DL) models for complex tasks such as Image Classification or Object Detection. However, many other relevant applications of DL, especially at the edge, are based on time-series processing and require models with unique features, for which NAS is less explored. This work focuses in particular on Temporal Convolutional Networks (TCNs), a convolutional model for time-series processing that has recently emerged as a promising alternative to more complex recurrent architectures. We propose the first NAS tool that explicitly targets the optimization of the most peculiar architectural parameters of TCNs, namely dilation, receptive-field and number of features in each layer. The proposed approach searches for networks that offer good trade-offs between accuracy and number of parameters/operations, enabling an efficient deployment on embedded platforms. We test the proposed NAS on four real-world, edge-relevant tasks, involving audio and bio-signals. Results show that, starting from a single seed network, our method is capable of obtaining a rich collection of Pareto optimal architectures, among which we obtain models with the same accuracy as the seed, and 15.9-152x fewer parameters. Compared to three state-of-the-art NAS tools, ProxylessNAS, MorphNet and FBNetV2, our method explores a larger search space for TCNs (up to 10^12x) and obtains superior solutions, while requiring low GPU memory and search time. We deploy our NAS outputs on two distinct edge devices, the multicore GreenWaves Technology GAP8 IoT processor and the single-core STMicroelectronics STM32H7 microcontroller. With respect to the state-of-the-art hand-tuned models, we reduce latency and energy of up to 5.5x and 3.8x on the two targets respectively, without any accuracy loss.  ( 3 min )
    Learning Dynamical Systems from Data: A Simple Cross-Validation Perspective, Part V: Sparse Kernel Flows for 132 Chaotic Dynamical Systems. (arXiv:2301.10321v1 [stat.ML])
    Regressing the vector field of a dynamical system from a finite number of observed states is a natural way to learn surrogate models for such systems. A simple and interpretable way to learn a dynamical system from data is to interpolate its vector-field with a data-adapted kernel which can be learned by using Kernel Flows. The method of Kernel Flows is a trainable machine learning method that learns the optimal parameters of a kernel based on the premise that a kernel is good if there is no significant loss in accuracy if half of the data is used. The objective function could be a short-term prediction or some other objective for other variants of Kernel Flows). However, this method is limited by the choice of the base kernel. In this paper, we introduce the method of \emph{Sparse Kernel Flows } in order to learn the ``best'' kernel by starting from a large dictionary of kernels. It is based on sparsifying a kernel that is a linear combination of elemental kernels. We apply this approach to a library of 132 chaotic systems.  ( 2 min )
    Pre-computed memory or on-the-fly encoding? A hybrid approach to retrieval augmentation makes the most of your compute. (arXiv:2301.10448v1 [cs.CL])
    Retrieval-augmented language models such as Fusion-in-Decoder are powerful, setting the state of the art on a variety of knowledge-intensive tasks. However, they are also expensive, due to the need to encode a large number of retrieved passages. Some work avoids this cost by pre-encoding a text corpus into a memory and retrieving dense representations directly. However, pre-encoding memory incurs a severe quality penalty as the memory representations are not conditioned on the current input. We propose LUMEN, a hybrid between these two extremes, pre-computing the majority of the retrieval representation and completing the encoding on the fly using a live encoder that is conditioned on the question and fine-tuned for the task. We show that LUMEN significantly outperforms pure memory on multiple question-answering tasks while being much cheaper than FiD, and outperforms both for any given compute budget. Moreover, the advantage of LUMEN over FiD increases with model size.  ( 2 min )
    One Model for All Domains: Collaborative Domain-Prefix Tuning for Cross-Domain NER. (arXiv:2301.10410v1 [cs.CL])
    Cross-domain NER is a challenging task to address the low-resource problem in practical scenarios. Previous typical solutions mainly obtain a NER model by pre-trained language models (PLMs) with data from a rich-resource domain and adapt it to the target domain. Owing to the mismatch issue among entity types in different domains, previous approaches normally tune all parameters of PLMs, ending up with an entirely new NER model for each domain. Moreover, current models only focus on leveraging knowledge in one general source domain while failing to successfully transfer knowledge from multiple sources to the target. To address these issues, we introduce Collaborative Domain-Prefix Tuning for cross-domain NER (CP-NER) based on text-to-text generative PLMs. Specifically, we present text-to-text generation grounding domain-related instructors to transfer knowledge to new domain NER tasks without structural modifications. We utilize frozen PLMs and conduct collaborative domain-prefix tuning to stimulate the potential of PLMs to handle NER tasks across various domains. Experimental results on the Cross-NER benchmark show that the proposed approach has flexible transfer ability and performs better on both one-source and multiple-source cross-domain NER tasks. Codes will be available in https://github.com/zjunlp/DeepKE/tree/main/example/ner/cross.  ( 2 min )
    Editing Language Model-based Knowledge Graph Embeddings. (arXiv:2301.10405v1 [cs.CL])
    Recently decades have witnessed the empirical success of framing Knowledge Graph (KG) embeddings via language models. However, language model-based KG embeddings are usually deployed as static artifacts, which are challenging to modify without re-training after deployment. To address this issue, we propose a new task of editing language model-based KG embeddings in this paper. The proposed task aims to enable data-efficient and fast updates to KG embeddings without damaging the performance of the rest. We build four new datasets: E-FB15k237, A-FB15k237, E-WN18RR, and A-WN18RR, and evaluate several knowledge editing baselines demonstrating the limited ability of previous models to handle the proposed challenging task. We further propose a simple yet strong baseline dubbed KGEditor, which utilizes additional parametric layers of the hyper network to edit/add facts. Comprehensive experimental results demonstrate that KGEditor can perform better when updating specific facts while not affecting the rest with low training resources. Code and datasets will be available in https://github.com/zjunlp/PromptKG/tree/main/deltaKG.  ( 2 min )
    Designing Data: Proactive Data Collection and Iteration for Machine Learning. (arXiv:2301.10319v1 [cs.HC])
    Lack of diversity in data collection has caused significant failures in machine learning (ML) applications. While ML developers perform post-collection interventions, these are time intensive and rarely comprehensive. Thus, new methods to track and manage data collection, iteration, and model training are necessary for evaluating whether datasets reflect real world variability. We present designing data, an iterative, bias mitigating approach to data collection connecting HCI concepts with ML techniques. Our process includes (1) Pre-Collection Planning, to reflexively prompt and document expected data distributions; (2) Collection Monitoring, to systematically encourage sampling diversity; and (3) Data Familiarity, to identify samples that are unfamiliar to a model through Out-of-Distribution (OOD) methods. We instantiate designing data through our own data collection and applied ML case study. We find models trained on "designed" datasets generalize better across intersectional groups than those trained on similarly sized but less targeted datasets, and that data familiarity is effective for debugging datasets.  ( 2 min )
    Weakly Supervised Headline Dependency Parsing. (arXiv:2301.10371v1 [cs.CL])
    English news headlines form a register with unique syntactic properties that have been documented in linguistics literature since the 1930s. However, headlines have received surprisingly little attention from the NLP syntactic parsing community. We aim to bridge this gap by providing the first news headline corpus of Universal Dependencies annotated syntactic dependency trees, which enables us to evaluate existing state-of-the-art dependency parsers on news headlines. To improve English news headline parsing accuracies, we develop a projection method to bootstrap silver training data from unlabeled news headline-article lead sentence pairs. Models trained on silver headline parses demonstrate significant improvements in performance over models trained solely on gold-annotated long-form texts. Ultimately, we find that, although projected silver training data improves parser performance across different news outlets, the improvement is moderated by constructions idiosyncratic to outlet.  ( 2 min )
    Exact and rapid linear clustering of networks with dynamic programming. (arXiv:2301.10403v1 [cs.SI])
    We study the problem of clustering networks whose nodes have imputed or physical positions in a single dimension, such as prestige hierarchies or the similarity dimension of hyperbolic embeddings. Existing algorithms, such as the critical gap method and other greedy strategies, only offer approximate solutions. Here, we introduce a dynamic programming approach that returns provably optimal solutions in polynomial time -- O(n^2) steps -- for a broad class of clustering objectives. We demonstrate the algorithm through applications to synthetic and empirical networks, and show that it outperforms existing heuristics by a significant margin, with a similar execution time.  ( 2 min )
    Predicting mental health using social media: A roadmap for future development. (arXiv:2301.10453v1 [cs.IR])
    Mental disorders such as depression and suicidal ideation are hazardous, affecting more than 300 million people over the world. However, on social media, mental disorder symptoms can be observed, and automated approaches are increasingly capable of detecting them. The considerable number of social media users and the tremendous quantity of user-generated data on social platforms provide a unique opportunity for researchers to distinguish patterns that correlate with mental status. This research offers a roadmap for analysis, where mental state detection can be based on machine learning techniques. We describe the common approaches for predicting and identifying the disorder using user-generated content. This research is organized according to the data collection, feature extraction, and prediction algorithms. Furthermore, we review several recent studies conducted to explore different features of candidate profiles and their analytical methods. Following, we debate various aspects of the development of experimental auto-detection frameworks for identifying users who suffer from disorders, and we conclude with a discussion of future trends. The introduced methods can help complement screening procedures, identify at-risk people through social media monitoring on a large scale, and make disorders easier to treat in the future.  ( 2 min )
    Parameterizing the cost function of Dynamic Time Warping with application to time series classification. (arXiv:2301.10350v1 [cs.LG])
    Dynamic Time Warping (DTW) is a popular time series distance measure that aligns the points in two series with one another. These alignments support warping of the time dimension to allow for processes that unfold at differing rates. The distance is the minimum sum of costs of the resulting alignments over any allowable warping of the time dimension. The cost of an alignment of two points is a function of the difference in the values of those points. The original cost function was the absolute value of this difference. Other cost functions have been proposed. A popular alternative is the square of the difference. However, to our knowledge, this is the first investigation of both the relative impacts of using different cost functions and the potential to tune cost functions to different tasks. We do so in this paper by using a tunable cost function {\lambda}{\gamma} with parameter {\gamma}. We show that higher values of {\gamma} place greater weight on larger pairwise differences, while lower values place greater weight on smaller pairwise differences. We demonstrate that training {\gamma} significantly improves the accuracy of both the DTW nearest neighbor and Proximity Forest classifiers.  ( 2 min )
    Data Consistent Deep Rigid MRI Motion Correction. (arXiv:2301.10365v1 [eess.IV])
    Motion artifacts are a pervasive problem in MRI, leading to misdiagnosis or mischaracterization in population-level imaging studies. Current retrospective rigid intra-slice motion correction techniques jointly optimize estimates of the image and the motion parameters. In this paper, we use a deep network to reduce the joint image-motion parameter search to a search over rigid motion parameters alone. Our network produces a reconstruction as a function of two inputs: corrupted k-space data and motion parameters. We train the network using simulated, motion-corrupted k-space data generated from known motion parameters. At test-time, we estimate unknown motion parameters by minimizing a data consistency loss between the motion parameters, the network-based image reconstruction given those parameters, and the acquired measurements. Intra-slice motion correction experiments on simulated and realistic 2D fast spin echo brain MRI achieve high reconstruction fidelity while retaining the benefits of explicit data consistency-based optimization. Our code is publicly available at https://www.github.com/nalinimsingh/neuroMoCo.  ( 2 min )
    Multilingual Multiaccented Multispeaker TTS with RADTTS. (arXiv:2301.10335v1 [cs.SD])
    We work to create a multilingual speech synthesis system which can generate speech with the proper accent while retaining the characteristics of an individual voice. This is challenging to do because it is expensive to obtain bilingual training data in multiple languages, and the lack of such data results in strong correlations that entangle speaker, language, and accent, resulting in poor transfer capabilities. To overcome this, we present a multilingual, multiaccented, multispeaker speech synthesis model based on RADTTS with explicit control over accent, language, speaker and fine-grained $F_0$ and energy features. Our proposed model does not rely on bilingual training data. We demonstrate an ability to control synthesized accent for any speaker in an open-source dataset comprising of 7 accents. Human subjective evaluation demonstrates that our model can better retain a speaker's voice and accent quality than controlled baselines while synthesizing fluent speech in all target languages and accents in our dataset.  ( 2 min )
    Generating Multidimensional Clusters With Support Lines. (arXiv:2301.10327v1 [cs.LG])
    Synthetic data is essential for assessing clustering techniques, complementing and extending real data, and allowing for a more complete coverage of a given problem's space. In turn, synthetic data generators have the potential of creating vast amounts of data -- a crucial activity when real-world data is at premium -- while providing a well-understood generation procedure and an interpretable instrument for methodically investigating cluster analysis algorithms. Here, we present \textit{Clugen}, a modular procedure for synthetic data generation, capable of creating multidimensional clusters supported by line segments using arbitrary distributions. \textit{Clugen} is open source, 100\% unit tested and fully documented, and is available for the Python, R, Julia and MATLAB/Octave ecosystems. We demonstrate that our proposal is able to produce rich and varied results in various dimensions, is fit for use in the assessment of clustering algorithms, and has the potential to be a widely used framework in diverse clustering-related research tasks.  ( 2 min )
    Score Matching via Differentiable Physics. (arXiv:2301.10250v1 [cs.LG])
    Diffusion models based on stochastic differential equations (SDEs) gradually perturb a data distribution $p(\mathbf{x})$ over time by adding noise to it. A neural network is trained to approximate the score $\nabla_\mathbf{x} \log p_t(\mathbf{x})$ at time $t$, which can be used to reverse the corruption process. In this paper, we focus on learning the score field that is associated with the time evolution according to a physics operator in the presence of natural non-deterministic physical processes like diffusion. A decisive difference to previous methods is that the SDE underlying our approach transforms the state of a physical system to another state at a later time. For that purpose, we replace the drift of the underlying SDE formulation with a differentiable simulator or a neural network approximation of the physics. We propose different training strategies based on the so-called probability flow ODE to fit a training set of simulation trajectories and discuss their relation to the score matching objective. For inference, we sample plausible trajectories that evolve towards a given end state using the reverse-time SDE and demonstrate the competitiveness of our approach for different challenging inverse problems.  ( 2 min )
    Interactive-Chain-Prompting: Ambiguity Resolution for Crosslingual Conditional Generation with Interaction. (arXiv:2301.10309v1 [cs.LG])
    Crosslingual conditional generation (e.g., machine translation) has long enjoyed the benefits of scaling. Nonetheless, there are still issues that scale alone may not overcome. A source query in one language, for instance, may yield several translation options in another language without any extra context. Only one translation could be acceptable however, depending on the translator's preferences and goals. Choosing the incorrect option might significantly affect translation usefulness and quality. We propose a novel method interactive-chain prompting -- a series of question, answering and generation intermediate steps between a Translator model and a User model -- that reduces translations into a list of subproblems addressing ambiguities and then resolving such subproblems before producing the final text to be translated. To check ambiguity resolution capabilities and evaluate translation quality, we create a dataset exhibiting different linguistic phenomena which leads to ambiguities at inference for four languages. To encourage further exploration in this direction, we release all datasets. We note that interactive-chain prompting, using eight interactions as exemplars, consistently surpasses prompt-based methods with direct access to background information to resolve ambiguities.  ( 2 min )
    Towards Robust Metrics for Concept Representation Evaluation. (arXiv:2301.10367v1 [cs.LG])
    Recent work on interpretability has focused on concept-based explanations, where deep learning models are explained in terms of high-level units of information, referred to as concepts. Concept learning models, however, have been shown to be prone to encoding impurities in their representations, failing to fully capture meaningful features of their inputs. While concept learning lacks metrics to measure such phenomena, the field of disentanglement learning has explored the related notion of underlying factors of variation in the data, with plenty of metrics to measure the purity of such factors. In this paper, we show that such metrics are not appropriate for concept learning and propose novel metrics for evaluating the purity of concept representations in both approaches. We show the advantage of these metrics over existing ones and demonstrate their utility in evaluating the robustness of concept representations and interventions performed on them. In addition, we show their utility for benchmarking state-of-the-art methods from both families and find that, contrary to common assumptions, supervision alone may not be sufficient for pure concept representations.  ( 2 min )
    Language Model Detoxification in Dialogue with Contextualized Stance Control. (arXiv:2301.10368v1 [cs.CL])
    To reduce the toxic degeneration in a pretrained Language Model (LM), previous work on Language Model detoxification has focused on reducing the toxicity of the generation itself (self-toxicity) without consideration of the context. As a result, a type of implicit offensive language where the generations support the offensive language in the context is ignored. Different from the LM controlling tasks in previous work, where the desired attributes are fixed for generation, the desired stance of the generation depends on the offensiveness of the context. Therefore, we propose a novel control method to do context-dependent detoxification with the stance taken into consideration. We introduce meta prefixes to learn the contextualized stance control strategy and to generate the stance control prefix according to the input context. The generated stance prefix is then combined with the toxicity control prefix to guide the response generation. Experimental results show that our proposed method can effectively learn the context-dependent stance control strategies while keeping a low self-toxicity of the underlying LM.  ( 2 min )
    ClimaX: A foundation model for weather and climate. (arXiv:2301.10343v1 [cs.LG])
    Most state-of-the-art approaches for weather and climate modeling are based on physics-informed numerical models of the atmosphere. These approaches aim to model the non-linear dynamics and complex interactions between multiple variables, which are challenging to approximate. Additionally, many such numerical models are computationally intensive, especially when modeling the atmospheric phenomenon at a fine-grained spatial and temporal resolution. Recent data-driven approaches based on machine learning instead aim to directly solve a downstream forecasting or projection task by learning a data-driven functional mapping using deep neural networks. However, these networks are trained using curated and homogeneous climate datasets for specific spatiotemporal tasks, and thus lack the generality of numerical models. We develop and demonstrate ClimaX, a flexible and generalizable deep learning model for weather and climate science that can be trained using heterogeneous datasets spanning different variables, spatio-temporal coverage, and physical groundings. ClimaX extends the Transformer architecture with novel encoding and aggregation blocks that allow effective use of available compute while maintaining general utility. ClimaX is pre-trained with a self-supervised learning objective on climate datasets derived from CMIP6. The pre-trained ClimaX can then be fine-tuned to address a breadth of climate and weather tasks, including those that involve atmospheric variables and spatio-temporal scales unseen during pretraining. Compared to existing data-driven baselines, we show that this generality in ClimaX results in superior performance on benchmarks for weather forecasting and climate projections, even when pretrained at lower resolutions and compute budgets.  ( 2 min )
    Evolve Smoothly, Fit Consistently: Learning Smooth Latent Dynamics For Advection-Dominated Systems. (arXiv:2301.10391v1 [cs.LG])
    We present a data-driven, space-time continuous framework to learn surrogatemodels for complex physical systems described by advection-dominated partialdifferential equations. Those systems have slow-decaying Kolmogorovn-widththat hinders standard methods, including reduced order modeling, from producinghigh-fidelity simulations at low cost. In this work, we construct hypernetwork-based latent dynamical models directly on the parameter space of a compactrepresentation network. We leverage the expressive power of the network and aspecially designed consistency-inducing regularization to obtain latent trajectoriesthat are both low-dimensional and smooth. These properties render our surrogatemodels highly efficient at inference time. We show the efficacy of our frameworkby learning models that generate accurate multi-step rollout predictions at muchfaster inference speed compared to competitors, for several challenging examples.  ( 2 min )
    Learned Interferometric Imaging for the SPIDER Instrument. (arXiv:2301.10260v1 [astro-ph.IM])
    The Segmented Planar Imaging Detector for Electro-Optical Reconnaissance (SPIDER) is an optical interferometric imaging device that aims to offer an alternative to the large space telescope designs of today with reduced size, weight and power consumption. This is achieved through interferometric imaging. State-of-the-art methods for reconstructing images from interferometric measurements adopt proximal optimization techniques, which are computationally expensive and require handcrafted priors. In this work we present two data-driven approaches for reconstructing images from measurements made by the SPIDER instrument. These approaches use deep learning to learn prior information from training data, increasing the reconstruction quality, and significantly reducing the computation time required to recover images by orders of magnitude. Reconstruction time is reduced to ${\sim} 10$ milliseconds, opening up the possibility of real-time imaging with SPIDER for the first time. Furthermore, we show that these methods can also be applied in domains where training data is scarce, such as astronomical imaging, by leveraging transfer learning from domains where plenty of training data are available.  ( 2 min )
  • Open

    Two Efficient Ridge Solutions for the Incremental Broad Learning System on Added Inputs. (arXiv:1911.07292v5 [cs.LG] UPDATED)
    This paper proposes the recursive and square-root BLS algorithms to improve the original BLS for new added inputs, which utilize the inverse and inverse Cholesky factor of the Hermitian matrix in the ridge inverse, respectively, to update the ridge solution. The recursive BLS updates the inverse by the matrix inversion lemma, while the square-root BLS updates the upper-triangular inverse Cholesky factor by multiplying it with an upper-triangular intermediate matrix. When the added p training samples are more than the total k nodes in the network, i.e., p>k, the inverse of a sum of matrices is applied to take a smaller matrix inversion or inverse Cholesky factorization. For the distributed BLS with data-parallelism, we introduce the parallel implementation of the square-root BLS, which is deduced from the parallel implementation of the inverse Cholesky factorization. The original BLS based on the generalized inverse with the ridge regression assumes the ridge parameter lamda->0 in the ridge inverse. When lambda->0 is not satisfied, the numerical experiments on the MNIST and NORB datasets show that both the proposed ridge solutions improve the testing accuracy of the original BLS, and the improvement becomes more significant as lambda is bigger. On the other hand, compared to the original BLS, both the proposed BLS algorithms theoretically require less complexities, and are significantly faster in the simulations on the MNIST dataset. The speedups in total training time of the recursive and square-root BLS algorithms over the original BLS are 4.41 and 6.92 respectively when p > k, and are 2.80 and 1.59 respectively when p < k.
    Certifying Neural Network Robustness to Random Input Noise from Samples. (arXiv:2010.07532v2 [cs.LG] UPDATED)
    Methods to certify the robustness of neural networks in the presence of input uncertainty are vital in safety-critical settings. Most certification methods in the literature are designed for adversarial input uncertainty, but researchers have recently shown a need for methods that consider random uncertainty. In this paper, we propose a novel robustness certification method that upper bounds the probability of misclassification when the input noise follows an arbitrary probability distribution. This bound is cast as a chance-constrained optimization problem, which is then reformulated using input-output samples to replace the optimization constraints. The resulting optimization reduces to a linear program with an analytical solution. Furthermore, we develop a sufficient condition on the number of samples needed to make the misclassification bound hold with overwhelming probability. Our case studies on MNIST classifiers show that this method is able to certify a uniform infinity-norm uncertainty region with a radius of nearly 50 times larger than what the current state-of-the-art method can certify.
    Infinitesimal gradient boosting. (arXiv:2104.13208v2 [stat.ML] UPDATED)
    We define infinitesimal gradient boosting as a limit of the popular tree-based gradient boosting algorithm from machine learning. The limit is considered in the vanishing-learning-rate asymptotic, that is when the learning rate tends to zero and the number of gradient trees is rescaled accordingly. For this purpose, we introduce a new class of randomized regression trees bridging totally randomized trees and Extra Trees and using a softmax distribution for binary splitting. Our main result is the convergence of the associated stochastic algorithm and the characterization of the limiting procedure as the unique solution of a nonlinear ordinary differential equation in a infinite dimensional function space. Infinitesimal gradient boosting defines a smooth path in the space of continuous functions along which the training error decreases, the residuals remain centered and the total variation is well controlled.
    Batch Bayesian Optimization on Permutations using the Acquisition Weighted Kernel. (arXiv:2102.13382v2 [stat.ML] UPDATED)
    In this work we propose a batch Bayesian optimization method for combinatorial problems on permutations, which is well suited for expensive-to-evaluate objectives. We first introduce LAW, an efficient batch acquisition method based on determinantal point processes using the acquisition weighted kernel. Relying on multiple parallel evaluations, LAW enables accelerated search on combinatorial spaces. We then apply the framework to permutation problems, which have so far received little attention in the Bayesian Optimization literature, despite their practical importance. We call this method LAW2ORDER. On the theoretical front, we prove that LAW2ORDER has vanishing simple regret by showing that the batch cumulative regret is sublinear. Empirically, we assess the method on several standard combinatorial problems involving permutations such as quadratic assignment, flowshop scheduling and the traveling salesman, as well as on a structure learning task.
    Data-Driven Certification of Neural Networks with Random Input Noise. (arXiv:2010.01171v2 [cs.LG] UPDATED)
    Methods to certify the robustness of neural networks in the presence of input uncertainty are vital in safety-critical settings. Most certification methods in the literature are designed for adversarial or worst-case inputs, but researchers have recently shown a need for methods that consider random input noise. In this paper, we examine the setting where inputs are subject to random noise coming from an arbitrary probability distribution. We propose a robustness certification method that lower-bounds the probability that network outputs are safe. This bound is cast as a chance-constrained optimization problem, which is then reformulated using input-output samples to make the optimization constraints tractable. We develop sufficient conditions for the resulting optimization to be convex, as well as on the number of samples needed to make the robustness bound hold with overwhelming probability. We show for a special case that the proposed optimization reduces to an intuitive closed-form solution. Case studies on synthetic, MNIST, and CIFAR-10 networks experimentally demonstrate that this method is able to certify robustness against various input noise regimes over larger uncertainty regions than prior state-of-the-art techniques.
    Non-Asymptotic Analysis of a UCB-based Top Two Algorithm. (arXiv:2210.05431v2 [stat.ML] UPDATED)
    A Top Two sampling rule for bandit identification is a method which selects the next arm to sample from among two candidate arms, a leader and a challenger. Due to their simplicity and good empirical performance, they have received increased attention in recent years. However, for fixed-confidence best arm identification, theoretical guarantees for Top Two methods have only been obtained in the asymptotic regime, when the error level vanishes. In this paper, we derive the first non-asymptotic upper bound on the expected sample complexity of a Top Two algorithm, which holds for any error level. Our analysis highlights sufficient properties for a regret minimization algorithm to be used as leader. These properties are satisfied by the UCB algorithm, and our proposed UCB-based Top Two algorithm simultaneously enjoys non-asymptotic guarantees and competitive empirical performance.
    RDIS: Random Drop Imputation with Self-Training for Incomplete Time Series Data. (arXiv:2010.10075v2 [cs.LG] UPDATED)
    Time-series data with missing values are commonly encountered in many fields, such as healthcare, meteorology, and robotics. The imputation aims to fill the missing values with valid values. Most imputation methods trained the models implicitly because missing values have no ground truth. In this paper, we propose Random Drop Imputation with Self-training (RDIS), a novel training method for time-series data imputation models. In RDIS, we generate extra missing values by applying a random drop on the observed values in incomplete data. We can explicitly train the imputation models by filling in the randomly dropped values. In addition, we adopt self-training with pseudo values to exploit the original missing values. To improve the quality of pseudo values, we set the threshold and filter them by calculating the entropy. To verify the effectiveness of RDIS on the time series imputation, we test RDIS to various imputation models and achieve competitive results on two real-world datasets.
    Posterior Covariance Information Criterion for Weighted Inference. (arXiv:2106.13694v4 [stat.ME] UPDATED)
    For predictive evaluation based on quasi-posterior distributions, we develop a new information criterion, the posterior covariance information criterion (PCIC. PCIC generalises the widely applicable information criterion WAIC so as to effectively handle predictive scenarios where likelihoods for the estimation and the evaluation of the model may be different. A typical example of such scenarios is the weighted likelihood inference, including prediction under covariate shift and counterfactual prediction. The proposed criterion utilises a posterior covariance form and is computed by using only one Markov chain Monte Carlo run. Through numerical examples, we demonstrate how PCIC can apply in practice. Further, we show that PCIC is asymptotically unbiased to the quasi-Bayesian generalization error under mild conditions in weighted inference with both regular and singular statistical models.
    Meta-Learning PAC-Bayes Priors in Model Averaging. (arXiv:1912.11252v3 [cs.LG] UPDATED)
    Nowadays model uncertainty has become one of the most important problems in both academia and industry. In this paper, we mainly consider the scenario in which we have a common model set used for model averaging instead of selecting a single final model via a model selection procedure to account for this model's uncertainty to improve the reliability and accuracy of inferences. Here one main challenge is to learn the prior over the model set. To tackle this problem, we propose two data-based algorithms to get proper priors for model averaging. One is for meta-learner, the analysts should use historical similar tasks to extract the information about the prior. The other one is for base-learner, a subsampling method is used to deal with the data step by step. Theoretically, an upper bound of risk for our algorithm is presented to guarantee the performance of the worst situation. In practice, both methods perform well in simulations and real data studies, especially with poor-quality data.
    Truthful Self-Play. (arXiv:2106.03007v4 [stat.ML] UPDATED)
    We present a general optimization framework for emergent belief-state representation without any supervision. We employed the common configuration of multiagent reinforcement learning and communication to improve exploration coverage over an environment by leveraging the knowledge of each agent. In this paper, we obtained that recurrent neural nets (RNNs) with shared weights are highly biased in partially observable environments because of their noncooperativity. To address this, we designated an unbiased version of self-play via mechanism design, also known as reverse game theory, to clarify unbiased knowledge at the Bayesian Nash equilibrium. The key idea is to add imaginary rewards using the peer prediction mechanism, i.e., a mechanism for mutually criticizing information in a decentralized environment. Numerical analyses, including StarCraft exploration tasks with up to 20 agents and off-the-shelf RNNs, demonstrate the state-of-the-art performance.
    On the Semi-supervised Expectation Maximization. (arXiv:2211.00537v2 [cs.LG] UPDATED)
    The Expectation Maximization (EM) algorithm is widely used as an iterative modification to maximum likelihood estimation when the data is incomplete. We focus on a semi-supervised case to learn the model from labeled and unlabeled samples. Existing work in the semi-supervised case has focused mainly on performance rather than convergence guarantee, however we focus on the contribution of the labeled samples to the convergence rate. The analysis clearly demonstrates how the labeled samples improve the convergence rate for the exponential family mixture model. In this case, we assume that the population EM (EM with unlimited data) is initialized within the neighborhood of global convergence for the population EM that consists solely of samples that have not been labeled. The analysis for the labeled samples provides a comprehensive description of the convergence rate for the Gaussian mixture model. In addition, we extend the findings for labeled samples and offer an alternative proof for the population EM's convergence rate with unlabeled samples for the symmetric mixture of two Gaussians.
    Imitating Human Behaviour with Diffusion Models. (arXiv:2301.10677v1 [cs.AI])
    Diffusion models have emerged as powerful generative models in the text-to-image domain. This paper studies their application as observation-to-action models for imitating human behaviour in sequential environments. Human behaviour is stochastic and multimodal, with structured correlations between action dimensions. Meanwhile, standard modelling choices in behaviour cloning are limited in their expressiveness and may introduce bias into the cloned policy. We begin by pointing out the limitations of these choices. We then propose that diffusion models are an excellent fit for imitating human behaviour, since they learn an expressive distribution over the joint action space. We introduce several innovations to make diffusion models suitable for sequential environments; designing suitable architectures, investigating the role of guidance, and developing reliable sampling strategies. Experimentally, diffusion models closely match human demonstrations in a simulated robotic control task and a modern 3D gaming environment.
    Signature Methods in Machine Learning. (arXiv:2206.14674v3 [stat.ML] UPDATED)
    Signature-based techniques give mathematical insight into the interactions between complex streams of evolving data. These insights can be quite naturally translated into numerical approaches to understanding streamed data, and perhaps because of their mathematical precision, have proved useful in analysing streamed data in situations where the data is irregular, and not stationary, and the dimension of the data and the sample sizes are both moderate. Understanding streamed multi-modal data is exponential: a word in $n$ letters from an alphabet of size $d$ can be any one of $d^n$ messages. Signatures remove the exponential amount of noise that arises from sampling irregularity, but an exponential amount of information still remain. This survey aims to stay in the domain where that exponential scaling can be managed directly. Scalability issues are an important challenge in many problems but would require another survey article and further ideas. This survey describes a range of contexts where the data sets are small enough to remove the possibility of massive machine learning, and the existence of small sets of context free and principled features can be used effectively. The mathematical nature of the tools can make their use intimidating to non-mathematicians. The examples presented in this article are intended to bridge this communication gap and provide tractable working examples drawn from the machine learning context. Notebooks are available online for several of these examples. This survey builds on the earlier paper of Ilya Chevryev and Andrey Kormilitzin which had broadly similar aims at an earlier point in the development of this machinery. This article illustrates how the theoretical insights offered by signatures are simply realised in the analysis of application data in a way that is largely agnostic to the data type.
    A Unified and Constructive Framework for the Universality of Neural Networks. (arXiv:2112.14877v3 [cs.LG] UPDATED)
    One of the reasons why many neural networks are capable of replicating complicated tasks or functions is their universal property. Though the past few decades have seen tremendous advances in theories of neural networks, a single constructive framework for neural network universality remains unavailable. This paper is the first effort to provide a unified and constructive framework for the universality of a large class of activation functions including most of existing ones. At the heart of the framework is the concept of neural network approximate identity (nAI). The main result is: {\em any nAI activation function is universal}. It turns out that most of existing activation functions are nAI, and thus universal in the space of continuous functions on compacta. The framework induces {\bf several advantages} over the contemporary counterparts. First, it is constructive with elementary means from functional analysis, probability theory, and numerical analysis. Second, it is the first unified attempt that is valid for most of existing activation functions. Third, as a by product, the framework provides the first universality proof for some of the existing activation functions including Mish, SiLU, ELU, GELU, and etc. Fourth, it provides new proofs for most activation functions. Fifth, it discovers new activation functions with guaranteed universality property. Sixth, for a given activation and error tolerance, the framework provides precisely the architecture of the corresponding one-hidden neural network with predetermined number of neurons, and the values of weights/biases. Seventh, the framework allows us to abstractly present the first universal approximation with favorable non-asymptotic rate.
    Semiparametric discrete data regression with Monte Carlo inference and prediction. (arXiv:2110.12316v5 [stat.ME] UPDATED)
    Discrete data are abundant and often arise as counts or rounded data. These data commonly exhibit complex distributional features such as zero-inflation, over- or under-dispersion, boundedness, and heaping, which render many parametric models inadequate. Yet even for parametric regression models, approximations such as MCMC typically are needed for posterior inference. This paper introduces a Bayesian modeling and algorithmic framework that enables semiparametric regression analysis for discrete data with Monte Carlo (not MCMC) sampling. The proposed approach pairs a nonparametric marginal model with a latent linear regression model to encourage both flexibility and interpretability, and delivers posterior consistency even under model misspecification. For a parametric or large-sample approximation of this model, we identify a class of conjugate priors with (pseudo) closed-form posteriors. All posterior and predictive distributions are available analytically or via Monte Carlo sampling. These tools are broadly useful for linear regression, nonlinear models via basis expansions, and variable selection with discrete data. Simulation studies demonstrate significant advantages in computing, prediction, estimation, and selection relative to existing alternatives. This novel approach is applied to self-reported mental health data that exhibit zero-inflation, overdispersion, boundedness, and heaping.
    Learning Dynamical Systems from Data: A Simple Cross-Validation Perspective, Part V: Sparse Kernel Flows for 132 Chaotic Dynamical Systems. (arXiv:2301.10321v1 [stat.ML])
    Regressing the vector field of a dynamical system from a finite number of observed states is a natural way to learn surrogate models for such systems. A simple and interpretable way to learn a dynamical system from data is to interpolate its vector-field with a data-adapted kernel which can be learned by using Kernel Flows. The method of Kernel Flows is a trainable machine learning method that learns the optimal parameters of a kernel based on the premise that a kernel is good if there is no significant loss in accuracy if half of the data is used. The objective function could be a short-term prediction or some other objective for other variants of Kernel Flows). However, this method is limited by the choice of the base kernel. In this paper, we introduce the method of \emph{Sparse Kernel Flows } in order to learn the ``best'' kernel by starting from a large dictionary of kernels. It is based on sparsifying a kernel that is a linear combination of elemental kernels. We apply this approach to a library of 132 chaotic systems.

  • Open

    Insane face rendering A.I technology..
    submitted by /u/KTMark [link] [comments]  ( 40 min )
    AI Music - Eminem - 'Slim shady is alive'
    submitted by /u/DANGERD0OM [link] [comments]  ( 40 min )
    Finding the right AI for a specific task
    Hi all, We're developing an internal application that groups customers together based on attributes that adhere to a ruleset on how they should be grouped. It does this fine. However, some nuance is then applied via human effort to modify groupings based on some customer notes (a text string) that sometimes dictate that two customers need to be in different groups for x reason, even if the original grouping adheres to the ruleset. The application itself has a UI that sorts customers into columns, which are manipulated by staff via dragging and dropping a customer/customers between one column and another. I had a thought to employ an AI model that compares the original generated grouping config that our code produces against the modified groupings that staff adjust based on that nuance. The idea is that we could analyze the why of a modification and use that insight to generate better default groupings. Is there a model out there that would be ideally suited for this kind of learning? Keen to dive into it further on my own but any recommendations as a starting point would be great. submitted by /u/premiumnougat [link] [comments]  ( 41 min )
    Create Your Chat GPT-3 Web App with Streamlit in Python
    submitted by /u/pasticciociccio [link] [comments]  ( 40 min )
    What do employers and job seekers need to know about artificial intelligence's role in hiring?
    University of Florida - Warrington College of Business's Mo Wang offers advice for the future of work. Full Story: https://explore.research.ufl.edu/the-future-of-work.html#ai-hiring submitted by /u/ufexplore [link] [comments]  ( 40 min )
    BuzzFeed to Use ChatGPT Creator OpenAI to Help Create Quizzes and Other Content
    submitted by /u/trueslicky [link] [comments]  ( 40 min )
    Member of Congress Reads AI-Generated Speech on House Floor
    submitted by /u/dahmedahe [link] [comments]  ( 6 min )
    Synthesizing the Businessmen-Smile
    submitted by /u/walt74 [link] [comments]  ( 45 min )
    AI Dream 150 - ENTERING DREAMWORLD Part2 TEASER - AI Video vqgan clip
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    NIST Risk Management Framework Aims to Improve Trustworthiness of Artificial Intelligence
    submitted by /u/Harley109 [link] [comments]  ( 40 min )
    📌[Searchcolab] It's impressive to see how far generative AI has come in the past 5 years. What should we expect the trajectory of this field to be in the next 5 years? Btw the pictures attached are also generated using AI.
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 41 min )
    "Father" from Equilibrium movie
    Imagine we could create an AGI/ASI that will protect our values. Something like "Father" from Equilibrium (but that's a dystopian version). Let's call it modern God. What values should it protect? submitted by /u/chuguruk [link] [comments]  ( 40 min )
    InstructPix2Pix lets you edit images using only text prompts
    submitted by /u/much_successes [link] [comments]  ( 40 min )
    what are your favorite AI subreddit?
    submitted by /u/Chaserivx [link] [comments]  ( 40 min )
    AI "Upscale" With Only 1000 Training Examples(All examples were dogs)
    submitted by /u/TheRPGGamerMan [link] [comments]  ( 43 min )
    Proud Pollution A movie script Written by AI
    Movie script: Proud Pollution The website used- (https://www.plot-generator.org.uk/) Be free to share your thoughts on it! ​ Proud Pollution A Screenplay by Mr. Pseudonym EXT. VASQUEZ ROCKS, CALIFORNIA - AFTERNOON Misunderstood piolet FLAMOUS JACK THORNTON is arguing with mean scout MISS HELEN FISH. JACK tries to hug HELEN but she shakes him off. JACK Please, Helen, don't leave me. HELEN I'm sorry Jack, but I'm looking for somebody a bit braver. Somebody who faces his fears head on, inhead-onstead of running away. JACK I am such a person! HELEN frowns. HELEN I'm sorry, Jack. I just don't feel excited by this relationship anymore. HELEN leaves. JACK sits down, looking defeated. Moments later, noble navigator MASTER CUTHBERT MACDONALD barges in looking flustered. JACK Go…  ( 47 min )
    Meta's chief AI scientist says "ChatGPT is not innovative".
    What happened? So, there has been a lot of excitement around OpenAI's ChatGPT which generates natural-language responses to human prompts. But what if... it's not as amazing as we all think it is? Yann LeCun, Meta's chief AI scientist, argues that the program is not innovative. He also states that similar technology has been developed by many companies and research labs, and that ChatGPT is composed of multiple pieces of technology developed over many years by many parties. (sounds salty to me... and I like my cookies sweet!) But maybe Yann has a point! What's happening now? ChatGPT is perceived by many as a unique and innovative program. People are using it everyday to make their lives easier. So, no matter what, ChatGPT is still awesome. What's happening next? It's unclear what will happen next in terms of the development and perception of ChatGPT. However, it can be expected that as AI technology continues to evolve, there will be further advancements on what is possible with this tech. It's likely that ChatGPT will face many spin-offs and competitors this year. If you enjoyed this and want 500+ AI tools, I write a daily AI newsletter: https://chriscookies.beehiiv.com/p/metas-chief-ai-scientist-says-chatgpt-not-innovative-7581 submitted by /u/ZaKodiak [link] [comments]  ( 41 min )
    Looking for some ideas to research about making money using AI!
    Hey yall, I'm looking for ideas of how to make money using language models like ChatGPT. I want to go in depth and research a bit as well as begin designing tutorials based on my experiences of what's the best way to make money. I am open to any suggestions or things that have worked for you all! Thanks! submitted by /u/Chadcash [link] [comments]  ( 41 min )
    What is Atomic AI? - Is AI going to have a Drug development breakthrough soon?
    submitted by /u/BackgroundResult [link] [comments]  ( 40 min )
    AI Video to Fill Missing Frames/Smooth Animation?
    Hey all, Was wondering if you know some kind of AI tool that exists to fill missing frames and therefore smooth animation in animated videos? Trying to get something cleaned up and really nailed in. Thanks! submitted by /u/miseryleech [link] [comments]  ( 40 min )
    ChatGPT: OpenAI’s Last Resort Turns Out To Be A Winner
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    41 AI Written Articles Out Of 77 On CNET Have Plagiarism And Errors
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 6 min )
    Chrome Extension that uses AI to write emails.
    submitted by /u/bobsandalex [link] [comments]  ( 40 min )
    I told an AI to freak out on camera. it was ALL made by AI.
    submitted by /u/25dopren [link] [comments]  ( 40 min )
  • Open

    [D] score based vs. Diffusion models
    I know there is a mathematical way to show that the two approaches of score matching models and diffusion models are the same. I wonder, if there in practice/code are the same either? I already tried to find some PyTorch implementations of score based models but didn’t find anything yet - just for diffusion models. submitted by /u/Individual-Cause-616 [link] [comments]  ( 43 min )
    [D] Why are GANs worse than (Latent) Diffusion Models for text2img generation?
    I guess what I'm trying to figure out is, what are the main reasons that DMs are outperforming GANs in text2img generation? Thanks! submitted by /u/TheCockatoo [link] [comments]  ( 46 min )
    [P] A python module to generate optimized prompts & solve different NLP problems using GPT-n based models and return structured python object for easy parsing
    Hi folks, I was working on a personal experimental project related to GPT-3, which I thought of making it open source now. It saves much time while working with LLMs. If you are an industrial researcher or application developer, you probably have worked with GPT-3 apis. A common challenge when utilizing LLMs such as GPT-3 and BLOOM is their tendency to produce uncontrollable & unstructured outputs, making it difficult to use them for various NLP tasks and applications.To address this, we developed Promptify, a library that allows for the use of LLMs to solve NLP problems including Named Entity Recognition, Binary Classification, Multi-Label Classification, and Question-Answering and return a python object for easy parsing to construct additional applications on top of GPT-n based models. Features 🚀 🧙‍♀️ NLP Tasks (NER, Binary Text Classification, Multi-Label Classification etc) in 2 lines of code with no training data required 🔨 Easily add one shot, two shot, or few shot examples to the prompt ✌ Output always provided as a Python object (e.g. list, dictionary) for easy parsing and filtering 💥 Custom examples and samples can be easily added to the prompt 💰 Optimized prompts to reduce OpenAI token costs GITHUB: https://github.com/promptslab/Promptify Examples: https://github.com/promptslab/Promptify/tree/main/examples For quick demo -> Colab I hope it will be helpful in your research. Thanks :) NER example ​ https://preview.redd.it/vnz4mf0i6gea1.png?width=1398&format=png&auto=webp&s=74c70bd9d518423f913c1fb9c68cf2565cf8cffc submitted by /u/aadityaura [link] [comments]  ( 43 min )
    A Watermark for Large Language Models
    submitted by /u/lookinsidemybutthole [link] [comments]  ( 42 min )
    [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers
    Dec 2022 paper from Microsoft research: https://arxiv.org/abs/2212.10559v2 Large pretrained language models have shown surprising In-Context Learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without additional parameter updates. Despite the great success in performance, the working mechanism of ICL still remains an open problem. In order to better understand how ICL works, this paper explains language models as meta-optimizers and understands ICL as a kind of implicit finetuning. submitted by /u/currentscurrents [link] [comments]  ( 43 min )
    [Discussion] Github like alternative for ML?
    Versioning and collaboration on code for software engineers is a reasonably solved problem through GitHub since the task at hand predominantly involves just maintaining different copies of just simple vanilla code in different folders. On the other hand, ML engineers face the humungous task of maintaining different versions on not just code, but hyper parameters, data, models, data lineage and labels and storing this on GitHub currently does not allow you to track the changes on each variable well. What are the software/open source tools currently used for the same? Is their a space for a new company to be built here? submitted by /u/angkhandelwal749 [link] [comments]  ( 44 min )
    Are there any projects working at an open source version of Constitutional AI? [D]
    I'm looking into projects which augment the RLHF training approach of chatGPT with explicit rules, such as in https://paperswithcode.com/paper/constitutional-ai-harmlessness-from-ai. Ideally there would be both rules and priority levels between the rules, similarly to the Asimov laws of robotics. The Open-Assistant project (https://github.com/LAION-AI/Open-Assistant) captures the spirit, but it is looking to replicate chatGPT at the moment. submitted by /u/lorepieri [link] [comments]  ( 42 min )
    [D] Quantitative measure for smoothness of NLP autoencoder latent space
    I would like to measure the smoothness of an NLP-autoencoder's latent space. The idea is to sample two Gaussian vectors v1 and v2 in the latent space of the AE, and generate N-1 points between them like so: vi = v1 + (v2 - v1) / (N * i) My idea is to then decode these vectors and measure the BLEU score between d(vi) and d(vi+1) for all N-2 comparisons. Is this idea reasonable, do you have a better one? Is there a technique from AEs with images that can be useful here? submitted by /u/Blutorangensaft [link] [comments]  ( 43 min )
    [D] What are some of your favorite ML research posters?
    And what are your own best practices when creating one (e.g. adding a QR code that links to the GitHub project or paper PDF)? submitted by /u/epistoteles [link] [comments]  ( 42 min )
    [D] Fastest and most accurate model for casing
    What is the state of the art regarding freely available casing models, i.e. DNNs, that try to restore the original casing of a text with uniform (either lowercase or capital letters) casing? I value both speed and accuracy, as I have to process a large corpus of text. submitted by /u/Blutorangensaft [link] [comments]  ( 43 min )
    Few questions about scalability of chatGPT [D]
    I have two questions about chatGPT. I don't come from a machine learning background. I am just a programmer. So bear with me if they sound a bit dumb. I was checking about chatGPT a bit the last week. I went through their papers and also tried out a fine tuning by myself by creating some fictional world and giving it some examples. The first thing I wondered is what is very special about the model than the large data and parameter set it has, that other competitors can't do. I ask this because I have seen a lot of "google killer" discussions in some places. From what I understood from their papers I thought it is something another company with the computing power and the filtered data can have up and running in few months. I see their advantage in rolling out to the public because with feedbacks from actual users all over the world it can potentially be retrained. The second thing I wondered is its scalability. It feels to me that it is a very big challenge to keep it scalable in the future. Currently getting a long text out of it is kind of painful because it has to continuously generate. I think it is continuously calculating with the huge parameter set it has. I wonder also about new trends, if it needs to be retrained. I also used it for a fine tuning, where I created a fictional world with its own law and rules and the fine tuning took hours in the queue - so is it creating separate parameters for my case? that would be a lot considering how much parameter set they have. submitted by /u/besabestin [link] [comments]  ( 50 min )
    [P] EvoTorch 0.4.0 dropped with GPU-accelerated implementations of CMA-ES, MAP-Elites and NSGA-II.
    Find the release notes here: https://github.com/nnaisense/evotorch/releases/tag/v0.4.0 A big highlight is how fast these implementations are! I genuinely believe GPU-acceleration is the future of Evolutionary algorithms, and EvoTorch and its integration into the PyTorch ecosystem is a fantastic enabler for this. To demonstrate the raw speed provided by the new release, I compared EvoTorch's CMA-ES implementation to that provided by the popular pycma package on the 80-dimensional Rastrigin problem and tracked the run-time: Performance was measured over 50 runs on the 80-dimensional Rastrigin problem The crazy thing to note is that when we switch to GPU (Tesla V100), we can efficiently run CMA-ES with population sizes going into 100k+! submitted by /u/NaturalGradient [link] [comments]  ( 45 min )
    Machine learning and black box numerical solver[D]
    Anybody know some methods and techniques for integrating a numerical solver with the neural network .. how do you calculate the gradients of the solver when you don’t know the details of such solver- black box solver. submitted by /u/Due-Wall-915 [link] [comments]  ( 43 min )
    [P] Diffusion models best practices
    I'm about to start an experimental project that involves training a denoising diffusion model on the medical data (small dataset). Could you please share useful resources, tips, tricks and heuristics for dealing with diffusion models? submitted by /u/debrises [link] [comments]  ( 42 min )
  • Open

    "Cheaters Hacked an AI Bot—and Beat the 'Rocket League' Elite"
    submitted by /u/gwern [link] [comments]  ( 40 min )
    Insights and learnings
    Hey all, I am part of an incubator and interested in building developer tooling for reinforcement learning. I would love to understand, from the RL community, what some of the biggest pinpoints are in developing and productionising RL agents. Would love to hear about your implementations too, if you are happy to share! submitted by /u/paramkumar1992 [link] [comments]  ( 41 min )
    Are there papers that do an empirical investigation on DRL hyperparameters?
    Could someone please help with this - https://ai.stackexchange.com/questions/38894/are-there-papers-that-do-an-empirical-investigation-on-drl-hyperparameters submitted by /u/Academic-Rent7800 [link] [comments]  ( 41 min )
    DQN application
    I want to train a DQN model in an off-policy fashion, where my behavior policy is an older agent. I have a big memory of a lot of episodes of this agent. Now I want to find a better policy using DQN. Now I am just wondering, in the "normal" DQN case you would use the experience replay buffer and would update behavior and target policy online (behavior not really online but with the time lag introduced after which these parameters are also updated). In my case, I already have all the experience and would like to learn from it. Do you think it makes sense to use the exact same procedure in this context, so sampling one new action, state and immediate reward, follow up action or could it be better here to use the fact that all experience is already stored to exchange the immediate reward + gamma*maxQ(s',a') with some more future information about the rewards (up to the point of Monte Carlo where you take G_t so the discounted cumulative reward seen during the episode from point t onwards)? submitted by /u/PatrickSVM [link] [comments]  ( 42 min )
    "Imitating Human Behaviour with Diffusion Models", Pearce et al 2023 {MS}
    submitted by /u/gwern [link] [comments]  ( 40 min )
    "Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning", Wang et al 2022 {Twitter}
    submitted by /u/gwern [link] [comments]  ( 40 min )
    "Learning with Queried Hints" [on "Online Learning and Bandits with Queried Hints", Bhaskara et al 2022 {G}]
    submitted by /u/gwern [link] [comments]  ( 40 min )
  • Open

    Why can't my neural network recognize my own digits, but it has 97% accuracy on mnist test samples?
    I've been learning about neural networks for the past few days and I decided to write my own in Python. To keep it simple, I didn't write code for convolution layers and such, only fully connected layers with logistic activation function. I first trained it to do XOR with a 2 -> 2 -> 1 layers layout and it worked. Then I tried to train a 28*28 -> 100 -> 10 network on the MNIST digit dataset to recognize digits. When running it on the test samples, the accuracy was 97%, but when running on my own samples it barely ever manages to get it right. Does anyone have any idea on why this would happen? submitted by /u/FidgetSpinzz [link] [comments]  ( 42 min )
    Why add bias instead of subtracting bias?
    Pretty much the title, why do we add the bias instead of subtracting? Also when i watched 3blue1browns video about neural networks nad he said that you subtract the with the bias, but other sources tell me or explain that you simply add the bias in the dot product instead of subtracting. //Newbie submitted by /u/ArthurLCTTheCool [link] [comments]  ( 41 min )
    Machine Learning Framework with Neural Networks for Java
    Hey guys, a buddy of mine and me created a Framework for Machine Learning in Java. It provides the possibilities to train the neural Networks via backpropagation but we also implemented a Genetic Algorithm which can also be used for the training. The project is intended to be used by people who only learn Java in school and want to try out ML without the need of learning Python or complex Java Libraries. It's designed to be easy to use and to be played around with. Qualifications needed to use: basic Java understanding A brief understanding of what Machine Learning is Here is a step-by-step tutorial how to predict diabetes with this framework: https://easy-ml.gitbook.io/easy-ml-for-java/fundamentals/implement-your-first-ai (This example is using the genetic Algorithm, there is already one example in the source code published using the Backpropagation approach but the tutorial for it is gonna follow in the next few days) Please also look at the GitHub repository and leave some feedback about code and design. (Especially considering the ReadMe) https://github.com/tomLamprecht/Easy-ML-For-Java https://easy-ml.gitbook.io/easy-ml-for-java/ (Doc) Right now, I'm working on adding Convolutional Neural Networks as well. Feel free to also check open issues on our GitHub if you want to contribute! :) Thanks so much, and also especially for those who contributed to our project with pull requests. PS: we earn no cent with this project, and we just do it for the experience. So feedback is basically our payment :D (and ofc stars on GitHub hehe) submitted by /u/Lampard557 [link] [comments]  ( 42 min )
  • Open

    Best Egg achieved three times faster ML model training with Amazon SageMaker Automatic Model Tuning
    This post is co-authored by Tristan Miller from Best Egg. Best Egg is a leading financial confidence platform that provides lending products and resources focused on helping people feel more confident as they manage their everyday finances. Since March 2014, Best Egg has delivered $22 billion in consumer personal loans with strong credit performance, welcomed […]  ( 8 min )
  • Open

    What Are Large Language Models Used For?
    AI applications are summarizing articles, writing stories and engaging in long conversations — and large language models are doing the heavy lifting. A large language model, or LLM, is a deep learning algorithm that can recognize, summarize, translate, predict and generate text and other content based on knowledge gained from massive datasets. Large language models Read article >  ( 7 min )
    DLSS 3 Delivers Ultimate Boost in Latest Game Updates on GeForce NOW
    GeForce NOW RTX 4080 SuperPODs are rolling out now, bringing RTX 4080-class performance and features to Ultimate members — including support for NVIDIA Ada Lovelace GPU architecture technologies like NVIDIA DLSS 3.  This GFN Thursday brings updates to some of GeForce NOW’s hottest games that take advantage of these amazing technologies, all from the cloud. Read article >  ( 6 min )
  • Open

    Remove algorithmic filters from what you read
    I typically announce new blog posts from my most relevant twitter account: data science from @DataSciFact, algebra and miscellaneous math from @AlgebraFact, TeX and typography from @TeXtip, etc. If you’d like to be sure that you’re notified of each post, regardless of what algorithms Twitter applies to your feed, you can subscribe to this blog […] Remove algorithmic filters from what you read first appeared on John D. Cook.  ( 5 min )
  • Open

    Robustness through Data Augmentation Loss Consistency. (arXiv:2110.11205v3 [cs.LG] UPDATED)
    While deep learning through empirical risk minimization (ERM) has succeeded at achieving human-level performance at a variety of complex tasks, ERM is not robust to distribution shifts or adversarial attacks. Synthetic data augmentation followed by empirical risk minimization (DA-ERM) is a simple and widely used solution to improve robustness in ERM. In addition, consistency regularization can be applied to further improve the robustness of the model by forcing the representation of the original sample and the augmented one to be similar. However, existing consistency regularization methods are not applicable to covariant data augmentation, where the label in the augmented sample is dependent on the augmentation function. For example, dialog state covaries with named entity when we augment data with a new named entity. In this paper, we propose data augmented loss invariant regularization (DAIR), a simple form of consistency regularization that is applied directly at the loss level rather than intermediate features, making it widely applicable to both invariant and covariant data augmentation regardless of network architecture, problem setup, and task. We apply DAIR to real-world learning problems involving covariant data augmentation: robust neural task-oriented dialog state tracking and robust visual question answering. We also apply DAIR to tasks involving invariant data augmentation: robust regression, robust classification against adversarial attacks, and robust ImageNet classification under distribution shift. Our experiments show that DAIR consistently outperforms ERM and DA-ERM with little marginal computational cost and sets new state-of-the-art results in several benchmarks involving covariant data augmentation. Our code of all experiments is available at: https://github.com/optimization-for-data-driven-science/DAIR.git  ( 3 min )
    Neuronal architecture extracts statistical temporal patterns. (arXiv:2301.10203v1 [q-bio.NC])
    Neuronal systems need to process temporal signals. We here show how higher-order temporal (co-)fluctuations can be employed to represent and process information. Concretely, we demonstrate that a simple biologically inspired feedforward neuronal model is able to extract information from up to the third order cumulant to perform time series classification. This model relies on a weighted linear summation of synaptic inputs followed by a nonlinear gain function. Training both - the synaptic weights and the nonlinear gain function - exposes how the non-linearity allows for the transfer of higher order correlations to the mean, which in turn enables the synergistic use of information encoded in multiple cumulants to maximize the classification accuracy. The approach is demonstrated both on a synthetic and on real world datasets of multivariate time series. Moreover, we show that the biologically inspired architecture makes better use of the number of trainable parameters as compared to a classical machine-learning scheme. Our findings emphasize the benefit of biological neuronal architectures, paired with dedicated learning algorithms, for the processing of information embedded in higher-order statistical cumulants of temporal (co-)fluctuations.  ( 2 min )
    Gradient-adjusted Incremental Target Propagation Provides Effective Credit Assignment in Deep Neural Networks. (arXiv:2102.11598v3 [cs.LG] UPDATED)
    Many of the recent advances in the field of artificial intelligence have been fueled by the highly successful backpropagation of error (BP) algorithm, which efficiently solves the credit assignment problem in artificial neural networks. However, it is unlikely that BP is implemented in its usual form within biological neural networks, because of its reliance on non-local information in propagating error gradients. Since biological neural networks are capable of highly efficient learning and responses from BP trained models can be related to neural responses, it seems reasonable that a biologically viable approximation of BP underlies synaptic plasticity in the brain. Gradient-adjusted incremental target propagation (GAIT-prop or GP for short) has recently been derived directly from BP and has been shown to successfully train networks in a more biologically plausible manner. However, so far, GP has only been shown to work on relatively low-dimensional problems, such as handwritten-digit recognition. This work addresses some of the scaling issues in GP and shows it to perform effective multi-layer credit assignment in deeper networks and on the much more challenging ImageNet dataset.  ( 2 min )
    Denoising Diffusion Probabilistic Models for Generation of Realistic Fully-Annotated Microscopy Image Data Sets. (arXiv:2301.10227v1 [eess.IV])
    Denoising diffusion probabilistic models have shown great potential in generating realistic image data. We show how those models can be used to generate realistic microscopy image data in 2D and 3D based on simulated sketches of cellular structures. Multiple data sets are used as an inspiration to simulate sketches of different cellular structures, allowing to generate fully-annotated image data sets without requiring human interactions. Those data sets are used to train segmentation approaches and demonstrate that annotation-free segmentation of cellular structures in fluorescence microscopy image data can be achieved, thereby leaping towards the ultimate goal of eliminating the necessity of human annotation efforts.  ( 2 min )
    CovidRhythm: A Deep Learning Model for Passive Prediction of Covid-19 using Biobehavioral Rhythms Derived from Wearable Physiological Data. (arXiv:2301.10168v1 [eess.SP])
    To investigate whether a deep learning model can detect Covid-19 from disruptions in the human body's physiological (heart rate) and rest-activity rhythms (rhythmic dysregulation) caused by the SARS-CoV-2 virus. We propose CovidRhythm, a novel Gated Recurrent Unit (GRU) Network with Multi-Head Self-Attention (MHSA) that combines sensor and rhythmic features extracted from heart rate and activity (steps) data gathered passively using consumer-grade smart wearable to predict Covid-19. A total of 39 features were extracted (standard deviation, mean, min/max/avg length of sedentary and active bouts) from wearable sensor data. Biobehavioral rhythms were modeled using nine parameters (mesor, amplitude, acrophase, and intra-daily variability). These features were then input to CovidRhythm for predicting Covid-19 in the incubation phase (one day before biological symptoms manifest). A combination of sensor and biobehavioral rhythm features achieved the highest AUC-ROC of 0.79 [Sensitivity = 0.69, Specificity=0.89, F$_{0.1}$ = 0.76], outperforming prior approaches in discriminating Covid-positive patients from healthy controls using 24 hours of historical wearable physiological. Rhythmic features were the most predictive of Covid-19 infection when utilized either alone or in conjunction with sensor features. Sensor features predicted healthy subjects best. Circadian rest-activity rhythms that combine 24h activity and sleep information were the most disrupted. CovidRhythm demonstrates that biobehavioral rhythms derived from consumer-grade wearable data can facilitate timely Covid-19 detection. To the best of our knowledge, our work is the first to detect Covid-19 using deep learning and biobehavioral rhythms features derived from consumer-grade wearable data.  ( 2 min )
    Mesostructures: Beyond Spectrogram Loss in Differentiable Time-Frequency Analysis. (arXiv:2301.10183v1 [cs.SD])
    Computer musicians refer to mesostructures as the intermediate levels of articulation between the microstructure of waveshapes and the macrostructure of musical forms. Examples of mesostructures include melody, arpeggios, syncopation, polyphonic grouping, and textural contrast. Despite their central role in musical expression, they have received limited attention in deep learning. Currently, autoencoders and neural audio synthesizers are only trained and evaluated at the scale of microstructure: i.e., local amplitude variations up to 100 milliseconds or so. In this paper, we formulate and address the problem of mesostructural audio modeling via a composition of a differentiable arpeggiator and time-frequency scattering. We empirically demonstrate that time--frequency scattering serves as a differentiable model of similarity between synthesis parameters that govern mesostructure. By exposing the sensitivity of short-time spectral distances to time alignment, we motivate the need for a time-invariant and multiscale differentiable time--frequency model of similarity at the level of both local spectra and spectrotemporal modulations.  ( 2 min )
    On the Tradeoff between Energy, Precision, and Accuracy in Federated Quantized Neural Networks. (arXiv:2111.07911v3 [cs.LG] UPDATED)
    Deploying federated learning (FL) over wireless networks with resource-constrained devices requires balancing between accuracy, energy efficiency, and precision. Prior art on FL often requires devices to train deep neural networks (DNNs) using a 32-bit precision level for data representation to improve accuracy. However, such algorithms are impractical for resource-constrained devices since DNNs could require execution of millions of operations. Thus, training DNNs with a high precision level incurs a high energy cost for FL. In this paper, a quantized FL framework, that represents data with a finite level of precision in both local training and uplink transmission, is proposed. Here, the finite level of precision is captured through the use of quantized neural networks (QNNs) that quantize weights and activations in fixed-precision format. In the considered FL model, each device trains its QNN and transmits a quantized training result to the base station. Energy models for the local training and the transmission with the quantization are rigorously derived. An energy minimization problem is formulated with respect to the level of precision while ensuring convergence. To solve the problem, we first analytically derive the FL convergence rate and use a line search method. Simulation results show that our FL framework can reduce energy consumption by up to 53% compared to a standard FL model. The results also shed light on the tradeoff between precision, energy, and accuracy in FL over wireless networks.  ( 2 min )
    Neyman-Pearson Multi-class Classification via Cost-sensitive Learning. (arXiv:2111.04597v2 [stat.ML] UPDATED)
    Most existing classification methods aim to minimize the overall misclassification error rate. However, in applications, different types of errors can have different consequences. Two popular paradigms have been developed to account for this asymmetry issue: the Neyman-Pearson (NP) paradigm and the cost-sensitive (CS) paradigm. Compared to the CS paradigm, the NP paradigm does not require a specification of costs. Most previous works on the NP paradigm focused on the binary case. In this work, we study the multi-class NP problem by connecting it to the CS problem and propose two algorithms. We extend the NP oracle inequalities and consistency from the binary case to the multi-class case, showing that our two algorithms enjoy these properties under certain conditions. The simulation and real data studies demonstrate the effectiveness of our algorithms. To our knowledge, this is the first work to solve the multi-class NP problem via cost-sensitive learning techniques with theoretical guarantees. The proposed algorithms are implemented in the R package npcs on CRAN.  ( 2 min )
    Read the Signs: Towards Invariance to Gradient Descent's Hyperparameter Initialization. (arXiv:2301.10133v1 [cs.LG])
    We propose ActiveLR, an optimization meta algorithm that localizes the learning rate, $\alpha$, and adapts them at each epoch according to whether the gradient at each epoch changes sign or not. This sign-conscious algorithm is aware of whether from the previous step to the current one the update of each parameter has been too large or too small and adjusts the $\alpha$ accordingly. We implement the Active version (ours) of widely used and recently published gradient descent optimizers, namely SGD with momentum, AdamW, RAdam, and AdaBelief. Our experiments on ImageNet, CIFAR-10, WikiText-103, WikiText-2, and PASCAL VOC using different model architectures, such as ResNet and Transformers, show an increase in generalizability and training set fit, and decrease in training time for the Active variants of the tested optimizers. The results also show robustness of the Active variant of these optimizers to different values of the initial learning rate. Furthermore, the detrimental effects of using large mini-batch sizes are mitigated. ActiveLR, thus, alleviates the need for hyper-parameter search for two of the most commonly tuned hyper-parameters that require heavy time and computational costs to pick. We encourage AI researchers and practitioners to use the Active variant of their optimizer of choice for faster training, better generalizability, and reducing carbon footprint of training deep neural networks.  ( 2 min )
    Proportional Fairness in Federated Learning. (arXiv:2202.01666v3 [cs.LG] UPDATED)
    With the increasingly broad deployment of federated learning (FL) systems in the real world, it is critical but challenging to ensure fairness in FL, i.e. reasonably satisfactory performances for each of the numerous diverse clients. In this work, we introduce and study a new fairness notion in FL, called proportional fairness (PF), which is based on the relative change of each client's performance. From its connection with the bargaining games, we propose PropFair, a novel and easy-to-implement algorithm for finding proportionally fair solutions in FL and study its convergence properties. Through extensive experiments on vision and language datasets, we demonstrate that PropFair can approximately find PF solutions, and it achieves a good balance between the average performances of all clients and of the worst 10% clients.  ( 2 min )
    Analysis of Arrhythmia Classification on ECG Dataset. (arXiv:2301.10174v1 [cs.LG])
    The heart is one of the most vital organs in the human body. It supplies blood and nutrients in other parts of the body. Therefore, maintaining a healthy heart is essential. As a heart disorder, arrhythmia is a condition in which the heart's pumping mechanism becomes aberrant. The Electrocardiogram is used to analyze the arrhythmia problem from the ECG signals because of its fewer difficulties and cheapness. The heart peaks shown in the ECG graph are used to detect heart diseases, and the R peak is used to analyze arrhythmia disease. Arrhythmia is grouped into two groups - Tachycardia and Bradycardia for detection. In this paper, we discussed many different techniques such as Deep CNNs, LSTM, SVM, NN classifier, Wavelet, TQWT, etc., that have been used for detecting arrhythmia using various datasets throughout the previous decade. This work shows the analysis of some arrhythmia classification on the ECG dataset. Here, Data preprocessing, feature extraction, classification processes were applied on most research work and achieved better performance for classifying ECG signals to detect arrhythmia. Automatic arrhythmia detection can help cardiologists make the right decisions immediately to save human life. In addition, this research presents various previous research limitations with some challenges in detecting arrhythmia that will help in future research.  ( 2 min )
    Sleep Activity Recognition and Characterization from Multi-Source Passively Sensed Data. (arXiv:2301.10156v1 [eess.SP])
    Sleep constitutes a key indicator of human health, performance, and quality of life. Sleep deprivation has long been related to the onset, development, and worsening of several mental and metabolic disorders, constituting an essential marker for preventing, evaluating, and treating different health conditions. Sleep Activity Recognition methods can provide indicators to assess, monitor, and characterize subjects' sleep-wake cycles and detect behavioral changes. In this work, we propose a general method that continuously operates on passively sensed data from smartphones to characterize sleep and identify significant sleep episodes. Thanks to their ubiquity, these devices constitute an excellent alternative data source to profile subjects' biorhythms in a continuous, objective, and non-invasive manner, in contrast to traditional sleep assessment methods that usually rely on intrusive and subjective procedures. A Heterogeneous Hidden Markov Model is used to model a discrete latent variable process associated with the Sleep Activity Recognition task in a self-supervised way. We validate our results against sleep metrics reported by tested wearables, proving the effectiveness of the proposed approach and advocating its use to assess sleep without more reliable sources.  ( 2 min )
    EEG Opto-processor: epileptic seizure detection using diffractive photonic computing units. (arXiv:2301.10167v1 [eess.SP])
    Electroencephalography (EEG) analysis extracts critical information from brain signals, which has provided fundamental support for various applications, including brain-disease diagnosis and brain-computer interface. However, the real-time processing of large-scale EEG signals at high energy efficiency has placed great challenges for electronic processors on edge computing devices. Here, we propose the EEG opto-processor based on diffractive photonic computing units (DPUs) to effectively process the extracranial and intracranial EEG signals and perform epileptic seizure detection. The signals of EEG channels within a second-time window are optically encoded as inputs to the constructed diffractive neural networks for classification, which monitors the brain state to determine whether it's the symptom of an epileptic seizure or not. We developed both the free-space and integrated DPUs as edge computing systems and demonstrated their applications for real-time epileptic seizure detection with the benchmark datasets, i.e., the CHB-MIT extracranial EEG dataset and Epilepsy-iEEG-Multicenter intracranial EEG dataset, at high computing performance. Along with the channel selection mechanism, both the numerical evaluations and experimental results validated the sufficient high classification accuracies of the proposed opto-processors for supervising the clinical diagnosis. Our work opens up a new research direction of utilizing photonic computing techniques for processing large-scale EEG signals in promoting its broader applications.  ( 2 min )
    Federated Learning Meets Multi-objective Optimization. (arXiv:2006.11489v2 [cs.LG] UPDATED)
    Federated learning has emerged as a promising, massively distributed way to train a joint deep model over large amounts of edge devices while keeping private user data strictly on device. In this work, motivated from ensuring fairness among users and robustness against malicious adversaries, we formulate federated learning as multi-objective optimization and propose a new algorithm FedMGDA+ that is guaranteed to converge to Pareto stationary solutions. FedMGDA+ is simple to implement, has fewer hyperparameters to tune, and refrains from sacrificing the performance of any participating user. We establish the convergence properties of FedMGDA+ and point out its connections to existing approaches. Extensive experiments on a variety of datasets confirm that FedMGDA+ compares favorably against state-of-the-art.  ( 2 min )
    VaiPhy: a Variational Inference Based Algorithm for Phylogeny. (arXiv:2203.01121v3 [q-bio.PE] UPDATED)
    Phylogenetics is a classical methodology in computational biology that today has become highly relevant for medical investigation of single-cell data, e.g., in the context of cancer development. The exponential size of the tree space is, unfortunately, a substantial obstacle for Bayesian phylogenetic inference using Markov chain Monte Carlo based methods since these rely on local operations. And although more recent variational inference (VI) based methods offer speed improvements, they rely on expensive auto-differentiation operations for learning the variational parameters. We propose VaiPhy, a remarkably fast VI based algorithm for approximate posterior inference in an augmented tree space. VaiPhy produces marginal log-likelihood estimates on par with the state-of-the-art methods on real data and is considerably faster since it does not require auto-differentiation. Instead, VaiPhy combines coordinate ascent update equations with two novel sampling schemes: (i) SLANTIS, a proposal distribution for tree topologies in the augmented tree space, and (ii) the JC sampler, to the best of our knowledge, the first-ever scheme for sampling branch lengths directly from the popular Jukes-Cantor model. We compare VaiPhy in terms of density estimation and runtime. Additionally, we evaluate the reproducibility of the baselines. We provide our code on GitHub: \url{https://github.com/Lagergren-Lab/VaiPhy}.  ( 2 min )
    Lowering Detection in Sport Climbing Based on Orientation of the Sensor Enhanced Quickdraw. (arXiv:2301.10164v1 [eess.SP])
    Tracking climbers' activity to improve services and make the best use of their infrastructure is a concern for climbing gyms. Each climbing session must be analyzed from beginning till lowering of the climber. Therefore, spotting the climbers descending is crucial since it indicates when the ascent has come to an end. This problem must be addressed while preserving privacy and convenience of the climbers and the costs of the gyms. To this aim, a hardware prototype is developed to collect data using accelerometer sensors attached to a piece of climbing equipment mounted on the wall, called quickdraw, that connects the climbing rope to the bolt anchors. The corresponding sensors are configured to be energy-efficient, hence become practical in terms of expenses and time consumption for replacement when using in large quantity in a climbing gym. This paper describes hardware specifications, studies data measured by the sensors in ultra-low power mode, detect sensors' orientation patterns during lowering different routes, and develop an supervised approach to identify lowering.  ( 2 min )
    Sequential Graph Attention Learning for Predicting Dynamic Stock Trends (Student Abstract). (arXiv:2301.10153v1 [q-fin.ST])
    The stock market is characterized by a complex relationship between companies and the market. This study combines a sequential graph structure with attention mechanisms to learn global and local information within temporal time. Specifically, our proposed "GAT-AGNN" module compares model performance across multiple industries as well as within single industries. The results show that the proposed framework outperforms the state-of-the-art methods in predicting stock trends across multiple industries on Taiwan Stock datasets.  ( 2 min )
    How Jellyfish Characterise Alternating Group Equivariant Neural Networks. (arXiv:2301.10152v1 [cs.LG])
    We provide a full characterisation of all of the possible alternating group ($A_n$) equivariant neural networks whose layers are some tensor power of $\mathbb{R}^{n}$. In particular, we find a basis of matrices for the learnable, linear, $A_n$-equivariant layer functions between such tensor power spaces in the standard basis of $\mathbb{R}^{n}$. We also describe how our approach generalises to the construction of neural networks that are equivariant to local symmetries.  ( 2 min )
    Computational Solar Energy -- Ensemble Learning Methods for Prediction of Solar Power Generation based on Meteorological Parameters in Eastern India. (arXiv:2301.10159v1 [cs.LG])
    The challenges in applications of solar energy lies in its intermittency and dependency on meteorological parameters such as; solar radiation, ambient temperature, rainfall, wind-speed etc., and many other physical parameters like dust accumulation etc. Hence, it is important to estimate the amount of solar photovoltaic (PV) power generation for a specific geographical location. Machine learning (ML) models have gained importance and are widely used for prediction of solar power plant performance. In this paper, the impact of weather parameters on solar PV power generation is estimated by several Ensemble ML (EML) models like Bagging, Boosting, Stacking, and Voting for the first time. The performance of chosen ML algorithms is validated by field dataset of a 10kWp solar PV power plant in Eastern India region. Furthermore, a complete test-bed framework has been designed for data mining as well as to select appropriate learning models. It also supports feature selection and reduction for dataset to reduce space and time complexity of the learning models. The results demonstrate greater prediction accuracy of around 96% for Stacking and Voting EML models. The proposed work is a generalized one and can be very useful for predicting the performance of large-scale solar PV power plants also.  ( 2 min )
    Inducing Point Allocation for Sparse Gaussian Processes in High-Throughput Bayesian Optimisation. (arXiv:2301.10123v1 [cs.LG])
    Sparse Gaussian Processes are a key component of high-throughput Bayesian Optimisation (BO) loops; however, we show that existing methods for allocating their inducing points severely hamper optimisation performance. By exploiting the quality-diversity decomposition of Determinantal Point Processes, we propose the first inducing point allocation strategy designed specifically for use in BO. Unlike existing methods which seek only to reduce global uncertainty in the objective function, our approach provides the local high-fidelity modelling of promising regions required for precise optimisation. More generally, we demonstrate that our proposed framework provides a flexible way to allocate modelling capacity in sparse models and so is suitable broad range of downstream sequential decision making tasks.  ( 2 min )
    Pex: Memory-efficient Microcontroller Deep Learning through Partial Execution. (arXiv:2211.17246v2 [cs.LG] UPDATED)
    Embedded and IoT devices, largely powered by microcontroller units (MCUs), could be made more intelligent by leveraging on-device deep learning. One of the main challenges of neural network inference on an MCU is the extremely limited amount of read-write on-chip memory (SRAM, < 512 kB). SRAM is consumed by the neural network layer (operator) input and output buffers, which, traditionally, must be in memory (materialised) for an operator to execute. We discuss a novel execution paradigm for microcontroller deep learning, which modifies the execution of neural networks to avoid materialising full buffers in memory, drastically reducing SRAM usage with no computation overhead. This is achieved by exploiting the properties of operators, which can consume/produce a fraction of their input/output at a time. We describe a partial execution compiler, Pex, which produces memory-efficient execution schedules automatically by identifying subgraphs of operators whose execution can be split along the feature ("channel") dimension. Memory usage is reduced further by targeting memory bottlenecks with structured pruning, leading to the co-design of the network architecture and its execution schedule. Our evaluation of image and audio classification models: (a) establishes state-of-the-art performance in low SRAM usage regimes for considered tasks with up to +2.9% accuracy increase; (b) finds that a 4x memory reduction is possible by applying partial execution alone, or up to 10.5x when using the compiler-pruning co-design, while maintaining the classification accuracy compared to prior work; (c) uses the recovered SRAM to process higher resolution inputs instead, increasing accuracy by up to +3.9% on Visual Wake Words.
    Multi-Agent Patrolling with Battery Constraints through Deep Reinforcement Learning. (arXiv:2212.08230v2 [cs.AI] UPDATED)
    Autonomous vehicles are suited for continuous area patrolling problems. However, finding an optimal patrolling strategy can be challenging for many reasons. Firstly, patrolling environments are often complex and can include unknown environmental factors. Secondly, autonomous vehicles can have failures or hardware constraints, such as limited battery life. Importantly, patrolling large areas often requires multiple agents that need to collectively coordinate their actions. In this work, we consider these limitations and propose an approach based on model-free, deep multi-agent reinforcement learning. In this approach, the agents are trained to automatically recharge themselves when required, to support continuous collective patrolling. A distributed homogeneous multi-agent architecture is proposed, where all patrolling agents execute identical policies locally based on their local observations and shared information. This architecture provides a fault-tolerant and robust patrolling system that can tolerate agent failures and allow supplementary agents to be added to replace failed agents or to increase the overall patrol performance. The solution is validated through simulation experiments from multiple perspectives, including the overall patrol performance, the efficiency of battery recharging strategies, and the overall fault tolerance and robustness.
    Dirac signal processing of higher-order topological signals. (arXiv:2301.10137v1 [eess.SP])
    We consider topological signals corresponding to variables supported on nodes, links and triangles of higher-order networks and simplicial complexes. So far such signals are typically processed independently of each other, and algorithms that can enforce a consistent processing of topological signals across different levels are largely lacking. Here we propose Dirac signal processing, an adaptive, unsupervised signal processing algorithm that learns to jointly filter topological signals supported on nodes, links and (filled) triangles of simplicial complexes in a consistent way. The proposed Dirac signal processing algorithm is rooted in algebraic topology and formulated in terms of the discrete Dirac operator which can be interpreted as ``square root" of a higher-order (Hodge) Laplacian matrix acting on nodes, links and triangles of simplicial complexes. We test our algorithms on noisy synthetic data and noisy data of drifters in the ocean and find that the algorithm can learn to efficiently reconstruct the true signals outperforming algorithms based exclusively on the Hodge Laplacian.
    Towards Asteroid Detection in Microlensing Surveys with Deep Learning. (arXiv:2211.02239v2 [astro-ph.EP] UPDATED)
    Asteroids are an indelible part of most astronomical surveys though only a few surveys are dedicated to their detection. Over the years, high cadence microlensing surveys have amassed several terabytes of data while scanning primarily the Galactic Bulge and Magellanic Clouds for microlensing events and thus provide a treasure trove of opportunities for scientific data mining. In particular, numerous asteroids have been observed by visual inspection of selected images. This paper presents novel deep learning-based solutions for the recovery and discovery of asteroids in the microlensing data gathered by the MOA project. Asteroid tracklets can be clearly seen by combining all the observations on a given night and these tracklets inform the structure of the dataset. Known asteroids were identified within these composite images and used for creating the labelled datasets required for supervised learning. Several custom CNN models were developed to identify images with asteroid tracklets. Model ensembling was then employed to reduce the variance in the predictions as well as to improve the generalisation error, achieving a recall of 97.67%. Furthermore, the YOLOv4 object detector was trained to localize asteroid tracklets, achieving a mean Average Precision (mAP) of 90.97%. These trained networks will be applied to 16 years of MOA archival data to find both known and unknown asteroids that have been observed by the survey over the years. The methodologies developed can be adapted for use by other surveys for asteroid recovery and discovery.
    Neural Implicit k-Space for Binning-free Non-Cartesian Cardiac MR Imaging. (arXiv:2212.08479v2 [eess.IV] UPDATED)
    In this work, we propose a novel image reconstruction framework that directly learns a neural implicit representation in k-space for ECG-triggered non-Cartesian Cardiac Magnetic Resonance Imaging (CMR). While existing methods bin acquired data from neighboring time points to reconstruct one phase of the cardiac motion, our framework allows for a continuous, binning-free, and subject-specific k-space representation.We assign a unique coordinate that consists of time, coil index, and frequency domain location to each sampled k-space point. We then learn the subject-specific mapping from these unique coordinates to k-space intensities using a multi-layer perceptron with frequency domain regularization. During inference, we obtain a complete k-space for Cartesian coordinates and an arbitrary temporal resolution. A simple inverse Fourier transform recovers the image, eliminating the need for density compensation and costly non-uniform Fourier transforms for non-Cartesian data. This novel imaging framework was tested on 42 radially sampled datasets from 6 subjects. The proposed method outperforms other techniques qualitatively and quantitatively using data from four and one heartbeat(s) and 30 cardiac phases. Our results for one heartbeat reconstruction of 50 cardiac phases show improved artifact removal and spatio-temporal resolution, leveraging the potential for real-time CMR.
    A Learning Based Hypothesis Test for Harmful Covariate Shift. (arXiv:2212.02742v3 [cs.LG] UPDATED)
    The ability to quickly and accurately identify covariate shift at test time is a critical and often overlooked component of safe machine learning systems deployed in high-risk domains. While methods exist for detecting when predictions should not be made on out-of-distribution test examples, identifying distributional level differences between training and test time can help determine when a model should be removed from the deployment setting and retrained. In this work, we define harmful covariate shift (HCS) as a change in distribution that may weaken the generalization of a predictive model. To detect HCS, we use the discordance between an ensemble of classifiers trained to agree on training data and disagree on test data. We derive a loss function for training this ensemble and show that the disagreement rate and entropy represent powerful discriminative statistics for HCS. Empirically, we demonstrate the ability of our method to detect harmful covariate shift with statistical certainty on a variety of high-dimensional datasets. Across numerous domains and modalities, we show state-of-the-art performance compared to existing methods, particularly when the number of observed test samples is small.
    Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction. (arXiv:2210.10246v2 [cs.LG] UPDATED)
    Training deep learning models can be computationally expensive. Prior works have shown that increasing the batch size can potentially lead to better overall throughput. However, the batch size is frequently limited by the accelerator memory capacity due to the activations/feature maps stored for the training backward pass, as larger batch sizes require larger feature maps to be stored. Transformer-based models, which have recently seen a surge in popularity due to their good performance and applicability to a variety of tasks, have a similar problem. To remedy this issue, we propose Tempo, a new approach to efficiently use accelerator (e.g., GPU) memory resources for training Transformer-based models. Our approach provides drop-in replacements for the GELU, LayerNorm, and Attention layers, reducing the memory usage and ultimately leading to more efficient training. We implement Tempo and evaluate the throughput, memory usage, and accuracy/loss on the BERT Large pre-training task. We demonstrate that Tempo enables up to 2x higher batch sizes and 16% higher training throughput over the state-of-the-art baseline. We also evaluate Tempo on GPT2 and RoBERTa models, showing 19% and 26% speedup over the baseline.
    Quadruple-star systems are not always nested triples: a machine learning approach to dynamical stability. (arXiv:2301.09930v1 [cs.LG])
    The dynamical stability of quadruple-star systems has traditionally been treated as a problem involving two `nested' triples which constitute a quadruple. In this novel study, we employed a machine learning algorithm, the multi-layer perceptron (MLP), to directly classify 2+2 and 3+1 quadruples based on their stability (or long-term boundedness). The training data sets for the classification, comprised of $5\times10^5$ quadruples each, were integrated using the highly accurate direct $N$-body code MSTAR. We also carried out a limited parameter space study of zero-inclination systems to directly compare quadruples to triples. We found that both our quadruple MLP models perform better than a `nested' triple MLP approach, which is especially significant for 3+1 quadruples. The classification accuracies for the 2+2 MLP and 3+1 MLP models are 94% and 93% respectively, while the scores for the `nested' triple approach are 88% and 66% respectively. This is a crucial implication for quadruple population synthesis studies. Our MLP models, which are very simple and almost instantaneous to implement, are available on GitHub, along with Python3 scripts to access them.
    Exploring Effects of Computational Parameter Changes to Image Recognition Systems. (arXiv:2211.00471v3 [cs.LG] UPDATED)
    Image recognition tasks typically use deep learning and require enormous processing power, thus relying on hardware accelerators like GPUs and FPGAs for fast, timely processing. Failure in real-time image recognition tasks can occur due to incorrect mapping on hardware accelerators, which may lead to timing uncertainty and incorrect behavior. Owing to the increased use of image recognition tasks in safety-critical applications like autonomous driving and medical imaging, it is imperative to assess their robustness to changes in the computational environment as parameters like deep learning frameworks, compiler optimizations for code generation, and hardware devices are not regulated with varying impact on model performance and correctness. In this paper we conduct robustness analysis of four popular image recognition models (MobileNetV2, ResNet101V2, DenseNet121 and InceptionV3) with the ImageNet dataset, assessing the impact of the following parameters in the model's computational environment: (1) deep learning frameworks; (2) compiler optimizations; and (3) hardware devices. We report sensitivity of model performance in terms of output label and inference time for changes in each of these environment parameters. We find that output label predictions for all four models are sensitive to choice of deep learning framework (by up to 57%) and insensitive to other parameters. On the other hand, model inference time was affected by all environment parameters with changes in hardware device having the most effect. The extent of effect was not uniform across models.
    Autoencoded sparse Bayesian in-IRT factorization, calibration, and amortized inference for the Work Disability Functional Assessment Battery. (arXiv:2210.10952v2 [stat.ME] UPDATED)
    The Work Disability Functional Assessment Battery (WD-FAB) is a multidimensional item response theory (IRT) instrument designed for assessing work-related mental and physical function based on responses to an item bank. In prior iterations it was developed using traditional means -- linear factorization and null hypothesis statistical testing for item partitioning/selection, and finally, posthoc calibration of disjoint unidimensional IRT models. As a result, the WD-FAB, like many other IRT instruments, is a posthoc model. Its item partitioning, based on exploratory factor analysis, is blind to the final nonlinear IRT model and is not performed in a manner consistent with goodness of fit to the final model. In this manuscript, we develop a Bayesian hierarchical model for self-consistently performing the following simultaneous tasks: scale factorization, item selection, parameter identification, and response scoring. This method uses sparsity-based shrinkage to obviate the linear factorization and null hypothesis statistical tests that are usually required for developing multidimensional IRT models, so that item partitioning is consistent with the ultimate nonlinear factor model. We also analogize our multidimensional IRT model to probabilistic autoencoders, specifying an encoder function that amortizes the inference of ability parameters from item responses. The encoder function is equivalent to the "VBE" step in a stochastic variational Bayesian expectation maximization (VBEM) procedure that we use for approxiamte Bayesian inference on the entire model. We use the method on a sample of WD-FAB item responses and compare the resulting item discriminations to those obtained using the traditional posthoc method.
    Unsupervised Model Selection for Time-series Anomaly Detection. (arXiv:2210.01078v3 [cs.LG] UPDATED)
    Anomaly detection in time-series has a wide range of practical applications. While numerous anomaly detection methods have been proposed in the literature, a recent survey concluded that no single method is the most accurate across various datasets. To make matters worse, anomaly labels are scarce and rarely available in practice. The practical problem of selecting the most accurate model for a given dataset without labels has received little attention in the literature. This paper answers this question i.e. Given an unlabeled dataset and a set of candidate anomaly detectors, how can we select the most accurate model? To this end, we identify three classes of surrogate (unsupervised) metrics, namely, prediction error, model centrality, and performance on injected synthetic anomalies, and show that some metrics are highly correlated with standard supervised anomaly detection performance metrics such as the $F_1$ score, but to varying degrees. We formulate metric combination with multiple imperfect surrogate metrics as a robust rank aggregation problem. We then provide theoretical justification behind the proposed approach. Large-scale experiments on multiple real-world datasets demonstrate that our proposed unsupervised approach is as effective as selecting the most accurate model based on partially labeled data.
    Broken Neural Scaling Laws. (arXiv:2210.14891v5 [cs.LG] UPDATED)
    We present a smoothly broken power law functional form that accurately models and extrapolates the scaling behaviors of deep neural networks (i.e. how the evaluation metric of interest varies as the amount of compute used for training, number of model parameters, training dataset size, or upstream performance varies) for various architectures and for each of various tasks within a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings. This set includes large-scale vision, language, audio, video, diffusion generative modeling, multimodal learning, contrastive learning, AI alignment, robotics, arithmetic, unsupervised/self-supervised learning, and reinforcement learning (single agent and multi-agent). When compared to other functional forms for neural scaling behavior, this functional form yields extrapolations of scaling behavior that are considerably more accurate on this set. Moreover, this functional form accurately models and extrapolates scaling behavior that other functional forms are incapable of expressing such as the non-monotonic transitions present in the scaling behavior of phenomena such as double descent and the delayed, sharp inflection points present in the scaling behavior of tasks such as arithmetic. Lastly, we use this functional form to glean insights about the limit of the predictability of scaling behavior. Code is available at https://github.com/ethancaballero/broken_neural_scaling_laws
    Tracking the industrial growth of modern China with high-resolution panchromatic imagery: A sequential convolutional approach. (arXiv:2301.09620v1 [cs.CV] CROSS LISTED)
    Due to insufficient or difficult to obtain data on development in inaccessible regions, remote sensing data is an important tool for interested stakeholders to collect information on economic growth. To date, no studies have utilized deep learning to estimate industrial growth at the level of individual sites. In this study, we harness high-resolution panchromatic imagery to estimate development over time at 419 industrial sites in the People's Republic of China using a multi-tier computer vision framework. We present two methods for approximating development: (1) structural area coverage estimated through a Mask R-CNN segmentation algorithm, and (2) imputing development directly with visible & infrared radiance from the Visible Infrared Imaging Radiometer Suite (VIIRS). Labels generated from these methods are comparatively evaluated and tested. On a dataset of 2,078 50 cm resolution images spanning 19 years, the results indicate that two dimensions of industrial development can be estimated using high-resolution daytime imagery, including (a) the total square meters of industrial development (average error of 0.021 $\textrm{km}^2$), and (b) the radiance of lights (average error of 9.8 $\mathrm{\frac{nW}{cm^{2}sr}}$). Trend analysis of the techniques reveal estimates from a Mask R-CNN-labeled CNN-LSTM track ground truth measurements most closely. The Mask R-CNN estimates positive growth at every site from the oldest image to the most recent, with an average change of 4,084 $\textrm{m}^2$.
    Self-Supervised Learning Through Efference Copies. (arXiv:2210.09224v2 [cs.LG] UPDATED)
    Self-supervised learning (SSL) methods aim to exploit the abundance of unlabelled data for machine learning (ML), however the underlying principles are often method-specific. An SSL framework derived from biological first principles of embodied learning could unify the various SSL methods, help elucidate learning in the brain, and possibly improve ML. SSL commonly transforms each training datapoint into a pair of views, uses the knowledge of this pairing as a positive (i.e. non-contrastive) self-supervisory sign, and potentially opposes it to unrelated, (i.e. contrastive) negative examples. Here, we show that this type of self-supervision is an incomplete implementation of a concept from neuroscience, the Efference Copy (EC). Specifically, the brain also transforms the environment through efference, i.e. motor commands, however it sends to itself an EC of the full commands, i.e. more than a mere SSL sign. In addition, its action representations are likely egocentric. From such a principled foundation we formally recover and extend SSL methods such as SimCLR, BYOL, and ReLIC under a common theoretical framework, i.e. Self-supervision Through Efference Copies (S-TEC). Empirically, S-TEC restructures meaningfully the within- and between-class representations. This manifests as improvement in recent strong SSL baselines in image classification, segmentation, object detection, and in audio. These results hypothesize a testable positive influence from the brain's motor outputs onto its sensory representations.
    Foresight -- Generative Pretrained Transformer (GPT) for Modelling of Patient Timelines using EHRs. (arXiv:2212.08072v2 [cs.CL] UPDATED)
    Background: Electronic Health Records hold detailed longitudinal information about each patient's health status and general clinical history, a large portion of which is stored within the unstructured text. Existing approaches focus mostly on structured data and a subset of single-domain outcomes. We explore how temporal modelling of patients from free text and structured data, using deep generative transformers can be used to forecast a wide range of future disorders, substances, procedures or findings. Methods: We present Foresight, a novel transformer-based pipeline that uses named entity recognition and linking tools to convert document text into structured, coded concepts, followed by providing probabilistic forecasts for future medical events such as disorders, substances, procedures and findings. We processed the entire free-text portion from three different hospital datasets totalling 811336 patients covering both physical and mental health. Findings: On tests in two UK hospitals (King's College Hospital, South London and Maudsley) and the US MIMIC-III dataset precision@10 0.68, 0.76 and 0.88 was achieved for forecasting the next disorder in a patient timeline, while precision@10 of 0.80, 0.81 and 0.91 was achieved for forecasting the next biomedical concept. Foresight was also validated on 34 synthetic patient timelines by five clinicians and achieved relevancy of 97% for the top forecasted candidate disorder. As a generative model, it can forecast follow-on biomedical concepts for as many steps as required. Interpretation: Foresight is a general-purpose model for biomedical concept modelling that can be used for real-world risk forecasting, virtual trials and clinical research to study the progression of disorders, simulate interventions and counterfactuals, and educational purposes.
    ESTAS: Effective and Stable Trojan Attacks in Self-supervised Encoders with One Target Unlabelled Sample. (arXiv:2211.10908v2 [cs.CV] UPDATED)
    Emerging self-supervised learning (SSL) has become a popular image representation encoding method to obviate the reliance on labeled data and learn rich representations from large-scale, ubiquitous unlabelled data. Then one can train a downstream classifier on top of the pre-trained SSL image encoder with few or no labeled downstream data. Although extensive works show that SSL has achieved remarkable and competitive performance on different downstream tasks, its security concerns, e.g, Trojan attacks in SSL encoders, are still not well-studied. In this work, we present a novel Trojan Attack method, denoted by ESTAS, that can enable an effective and stable attack in SSL encoders with only one target unlabeled sample. In particular, we propose consistent trigger poisoning and cascade optimization in ESTAS to improve attack efficacy and model accuracy, and eliminate the expensive target-class data sample extraction from large-scale disordered unlabelled data. Our substantial experiments on multiple datasets show that ESTAS stably achieves > 99% attacks success rate (ASR) with one target-class sample. Compared to prior works, ESTAS attains > 30% ASR increase and > 8.3% accuracy improvement on average.
    Visual Simulation Software Demonstration for Quantum Multi-Drone Reinforcement Learning. (arXiv:2211.15375v2 [quant-ph] UPDATED)
    Quantum computing (QC) has received a lot of attention according to its light training parameter numbers and computational speeds by qubits. Moreover, various researchers have tried to enable quantum machine learning (QML) using QC, where there are also multifarious efforts to use QC to implement quantum multi-agent reinforcement learning (QMARL). Existing classical multi-agent reinforcement learning (MARL) using neural network features non-stationarity and uncertain properties due to its large number of parameters. Therefore, this paper presents a visual simulation software framework for a novel QMARL algorithm to control autonomous multi-drone systems to take advantage of QC. Our proposed QMARL framework accomplishes reasonable reward convergence and service quality performance with fewer trainable parameters than the classical MARL. Furthermore, QMARL shows more stable training results than existing MARL algorithms. Lastly, our proposed visual simulation software allows us to analyze the agents' training process and results.
    Green, Quantized Federated Learning over Wireless Networks: An Energy-Efficient Design. (arXiv:2207.09387v2 [cs.LG] UPDATED)
    In this paper, a green-quantized FL framework, which represents data with a finite precision level in both local training and uplink transmission, is proposed. Here, the finite precision level is captured through the use of quantized neural networks (QNNs) that quantize weights and activations in fixed-precision format. In the considered FL model, each device trains its QNN and transmits a quantized training result to the base station. Energy models for the local training and the transmission with quantization are rigorously derived. To minimize the energy consumption and the number of communication rounds simultaneously, a multi-objective optimization problem is formulated with respect to the number of local iterations, the number of selected devices, and the precision levels for both local training and transmission while ensuring convergence under a target accuracy constraint. To solve this problem, the convergence rate of the proposed FL system is analytically derived with respect to the system control variables. Then, the Pareto boundary of the problem is characterized to provide efficient solutions using the normal boundary inspection method. Design insights on balancing the tradeoff between the two objectives while achieving a target accuracy are drawn from using the Nash bargaining solution and analyzing the derived convergence rate. Simulation results show that the proposed FL framework can reduce energy consumption until convergence by up to 70\% compared to a baseline FL algorithm that represents data with full precision without damaging the convergence rate.
    SIAN: Style-Guided Instance-Adaptive Normalization for Multi-Organ Histopathology Image Synthesis. (arXiv:2209.02412v2 [eess.IV] UPDATED)
    Existing deep neural networks for histopathology image synthesis cannot generate image styles that align with different organs, and cannot produce accurate boundaries of clustered nuclei. To address these issues, we propose a style-guided instance-adaptive normalization (SIAN) approach to synthesize realistic color distributions and textures for histopathology images from different organs. SIAN contains four phases, semantization, stylization, instantiation, and modulation. The first two phases synthesize image semantics and styles by using semantic maps and learned image style vectors. The instantiation module integrates geometrical and topological information and generates accurate nuclei boundaries. We validate the proposed approach on a multiple-organ dataset, Extensive experimental results demonstrate that the proposed method generates more realistic histopathology images than four state-of-the-art approaches for five organs. By incorporating synthetic images from the proposed approach to model training, an instance segmentation network can achieve state-of-the-art performance.
    RAIN: RegulArization on Input and Network for Black-Box Domain Adaptation. (arXiv:2208.10531v2 [cs.CV] UPDATED)
    Source-Free domain adaptation transits the source-trained model towards target domain without exposing the source data, trying to dispel these concerns about data privacy and security. However, this paradigm is still at risk of data leakage due to adversarial attacks on the source model. Hence, the Black-Box setting only allows to use the outputs of source model, but still suffers from overfitting on the source domain more severely due to source model's unseen weights. In this paper, we propose a novel approach named RAIN (RegulArization on Input and Network) for Black-Box domain adaptation from both input-level and network-level regularization. For the input-level, we design a new data augmentation technique as Phase MixUp, which highlights task-relevant objects in the interpolations, thus enhancing input-level regularization and class consistency for target models. For network-level, we develop a Subnetwork Distillation mechanism to transfer knowledge from the target subnetwork to the full target network via knowledge distillation, which thus alleviates overfitting on the source domain by learning diverse target representations. Extensive experiments show that our method achieves state-of-the-art performance on several cross-domain benchmarks under both single- and multi-source black-box domain adaptation.
    A computational framework for physics-informed symbolic regression with straightforward integration of domain knowledge. (arXiv:2209.06257v3 [cs.LG] UPDATED)
    Discovering a meaningful symbolic expression that explains experimental data is a fundamental challenge in many scientific fields. We present a novel, open-source computational framework called Scientist-Machine Equation Detector (SciMED), which integrates scientific discipline wisdom in a scientist-in-the-loop approach, with state-of-the-art symbolic regression (SR) methods. SciMED combines a wrapper selection method, that is based on a genetic algorithm, with automatic machine learning and two levels of SR methods. We test SciMED on five configurations of a settling sphere, with and without aerodynamic non-linear drag force, and with excessive noise in the measurements. We show that SciMED is sufficiently robust to discover the correct physically meaningful symbolic expressions from the data, and demonstrate how the integration of domain knowledge enhances its performance. Our results indicate better performance on these tasks than the state-of-the-art SR software packages , even in cases where no knowledge is integrated. Moreover, we demonstrate how SciMED can alert the user about possible missing features, unlike the majority of current SR systems.
    Measuring Fairness Under Unawareness of Sensitive Attributes: A Quantification-Based Approach. (arXiv:2109.08549v4 [cs.CY] UPDATED)
    Algorithms and models are increasingly deployed to inform decisions about people, inevitably affecting their lives. As a consequence, those in charge of developing these models must carefully evaluate their impact on different groups of people and favour group fairness, that is, ensure that groups determined by sensitive demographic attributes, such as race or sex, are not treated unjustly. To achieve this goal, the availability (awareness) of these demographic attributes to those evaluating the impact of these models is fundamental. Unfortunately, collecting and storing these attributes is often in conflict with industry practices and legislation on data minimisation and privacy. For this reason, it can be hard to measure the group fairness of trained models, even from within the companies developing them. In this work, we tackle the problem of measuring group fairness under unawareness of sensitive attributes, by using techniques from quantification, a supervised learning task concerned with directly providing group-level prevalence estimates (rather than individual-level class labels). We show that quantification approaches are particularly suited to tackle the fairness-under-unawareness problem, as they are robust to inevitable distribution shifts while at the same time decoupling the (desirable) objective of measuring group fairness from the (undesirable) side effect of allowing the inference of sensitive attributes of individuals. More in detail, we show that fairness under unawareness can be cast as a quantification problem and solved with proven methods from the quantification literature. We show that these methods outperform previous approaches to measure demographic parity in five experimental protocols, corresponding to important challenges that complicate the estimation of classifier fairness under unawareness.
    Learning to Counter: Stochastic Feature-based Learning for Diverse Counterfactual Explanations. (arXiv:2209.13446v2 [cs.AI] UPDATED)
    Interpretable machine learning seeks to understand the reasoning process of complex black-box systems that are long notorious for lack of explainability. One flourishing approach is through counterfactual explanations, which provide suggestions on what a user can do to alter an outcome. Not only must a counterfactual example counter the original prediction from the black-box classifier but it should also satisfy various constraints for practical applications. Diversity is one of the critical constraints that however remains less discussed. While diverse counterfactuals are ideal, it is computationally challenging to simultaneously address some other constraints. Furthermore, there is a growing privacy concern over the released counterfactual data. To this end, we propose a feature-based learning framework that effectively handles the counterfactual constraints and contributes itself to the limited pool of private explanation models. We demonstrate the flexibility and effectiveness of our method in generating diverse counterfactuals of actionability and plausibility. Our counterfactual engine is more efficient than counterparts of the same capacity while yielding the lowest re-identification risks.
    Incorporating functional summary information in Bayesian neural networks using a Dirichlet process likelihood approach. (arXiv:2207.01234v2 [cs.LG] UPDATED)
    Bayesian neural networks (BNNs) can account for both aleatoric and epistemic uncertainty. However, in BNNs the priors are often specified over the weights which rarely reflects true prior knowledge in large and complex neural network architectures. We present a simple approach to incorporate prior knowledge in BNNs based on external summary information about the predicted classification probabilities for a given dataset. The available summary information is incorporated as augmented data and modeled with a Dirichlet process, and we derive the corresponding \emph{Summary Evidence Lower BOund}. The approach is founded on Bayesian principles, and all hyperparameters have a proper probabilistic interpretation. We show how the method can inform the model about task difficulty and class imbalance. Extensive experiments show that, with negligible computational overhead, our method parallels and in many cases outperforms popular alternatives in accuracy, uncertainty calibration, and robustness against corruptions with both balanced and imbalanced data.
    Can large language models reason about medical questions?. (arXiv:2207.08143v3 [cs.CL] UPDATED)
    Although large language models (LLMs) often produce impressive outputs, it remains unclear how they perform in real-world scenarios requiring strong reasoning skills and expert domain knowledge. We set out to investigate whether GPT-3.5 (Codex and InstructGPT) can be applied to answer and reason about difficult real-world-based questions. We utilize two multiple-choice medical exam questions (USMLE and MedMCQA) and a medical reading comprehension dataset (PubMedQA). We investigate multiple prompting scenarios: Chain-of-Thought (CoT, think step-by-step), zero- and few-shot (prepending the question with question-answer exemplars) and retrieval augmentation (injecting Wikipedia passages into the prompt). For a subset of the USMLE questions, a medical expert reviewed and annotated the model's CoT. We found that InstructGPT can often read, reason and recall expert knowledge. Failure are primarily due to lack of knowledge and reasoning errors and trivial guessing heuristics are observed, e.g.\ too often predicting labels A and D on USMLE. Sampling and combining many completions overcome some of these limitations. Using 100 samples, Codex 5-shot CoT not only gives close to well-calibrated predictive probability but also achieves human-level performances on the three datasets. USMLE: 60.2%, MedMCQA: 62.7% and PubMedQA: 78.2%.
    To Trust or Not To Trust Prediction Scores for Membership Inference Attacks. (arXiv:2111.09076v3 [cs.LG] UPDATED)
    Membership inference attacks (MIAs) aim to determine whether a specific sample was used to train a predictive model. Knowing this may indeed lead to a privacy breach. Most MIAs, however, make use of the model's prediction scores - the probability of each output given some input - following the intuition that the trained model tends to behave differently on its training data. We argue that this is a fallacy for many modern deep network architectures. Consequently, MIAs will miserably fail since overconfidence leads to high false-positive rates not only on known domains but also on out-of-distribution data and implicitly acts as a defense against MIAs. Specifically, using generative adversarial networks, we are able to produce a potentially infinite number of samples falsely classified as part of the training data. In other words, the threat of MIAs is overestimated, and less information is leaked than previously assumed. Moreover, there is actually a trade-off between the overconfidence of models and their susceptibility to MIAs: the more classifiers know when they do not know, making low confidence predictions, the more they reveal the training data.
    Topogivity: A Machine-Learned Chemical Rule for Discovering Topological Materials. (arXiv:2202.05255v3 [cond-mat.mtrl-sci] UPDATED)
    Topological materials present unconventional electronic properties that make them attractive for both basic science and next-generation technological applications. The majority of currently known topological materials have been discovered using methods that involve symmetry-based analysis of the quantum wavefunction. Here we use machine learning to develop a simple-to-use heuristic chemical rule that diagnoses with a high accuracy whether a material is topological using only its chemical formula. This heuristic rule is based on a notion that we term topogivity, a machine-learned numerical value for each element that loosely captures its tendency to form topological materials. We next implement a high-throughput procedure for discovering topological materials based on the heuristic topogivity-rule prediction followed by ab initio validation. This way, we discover new topological materials that are not diagnosable using symmetry indicators, including several that may be promising for experimental observation.
    Gaze-based Object Detection in the Wild. (arXiv:2203.15651v2 [cs.RO] UPDATED)
    In human-robot collaboration, one challenging task is to teach a robot new yet unknown objects enabling it to interact with them. Thereby, gaze can contain valuable information. We investigate if it is possible to detect objects (object or no object) merely from gaze data and determine their bounding box parameters. For this purpose, we explore different sizes of temporal windows, which serve as a basis for the computation of heatmaps, i.e., the spatial distribution of the gaze data. Additionally, we analyze different grid sizes of these heatmaps, and demonstrate the functionality in a proof of concept using different machine learning techniques. Our method is characterized by its speed and resource efficiency compared to conventional object detectors. In order to generate the required data, we conducted a study with five subjects who could move freely and thus, turn towards arbitrary objects. This way, we chose a scenario for our data collection that is as realistic as possible. Since the subjects move while facing objects, the heatmaps also contain gaze data trajectories, complicating the detection and parameter regression. We make our data set publicly available to the research community for download.
    Mixed Effects Random Forests for Personalised Predictions of Clinical Depression Severity. (arXiv:2301.09815v1 [cs.LG])
    This work demonstrates how mixed effects random forests enable accurate predictions of depression severity using multimodal physiological and digital activity data collected from an 8-week study involving 31 patients with major depressive disorder. We show that mixed effects random forests outperform standard random forests and personal average baselines when predicting clinical Hamilton Depression Rating Scale scores (HDRS_17). Compared to the latter baseline, accuracy is significantly improved for each patient by an average of 0.199-0.276 in terms of mean absolute error (p<0.05). This is noteworthy as these simple baselines frequently outperform machine learning methods in mental health prediction tasks. We suggest that this improved performance results from the ability of the mixed effects random forest to personalise model parameters to individuals in the dataset. However, we find that these improvements pertain exclusively to scenarios where labelled patient data are available to the model at training time. Investigating methods that improve accuracy when generalising to new patients is left as important future work.
    Integrating Reward Maximization and Population Estimation: Sequential Decision-Making for Internal Revenue Service Audit Selection. (arXiv:2204.11910v3 [cs.LG] UPDATED)
    We introduce a new setting, optimize-and-estimate structured bandits. Here, a policy must select a batch of arms, each characterized by its own context, that would allow it to both maximize reward and maintain an accurate (ideally unbiased) population estimate of the reward. This setting is inherent to many public and private sector applications and often requires handling delayed feedback, small data, and distribution shifts. We demonstrate its importance on real data from the United States Internal Revenue Service (IRS). The IRS performs yearly audits of the tax base. Two of its most important objectives are to identify suspected misreporting and to estimate the "tax gap" -- the global difference between the amount paid and true amount owed. Based on a unique collaboration with the IRS, we cast these two processes as a unified optimize-and-estimate structured bandit. We analyze optimize-and-estimate approaches to the IRS problem and propose a novel mechanism for unbiased population estimation that achieves rewards comparable to baseline approaches. This approach has the potential to improve audit efficacy, while maintaining policy-relevant estimates of the tax gap. This has important social consequences given that the current tax gap is estimated at nearly half a trillion dollars. We suggest that this problem setting is fertile ground for further research and we highlight its interesting challenges. The results of this and related research are currently being incorporated into the continual improvement of the IRS audit selection methods.
    Efficient Planning in a Compact Latent Action Space. (arXiv:2208.10291v3 [cs.LG] UPDATED)
    Planning-based reinforcement learning has shown strong performance in tasks in discrete and low-dimensional continuous action spaces. However, planning usually brings significant computational overhead for decision-making, and scaling such methods to high-dimensional action spaces remains challenging. To advance efficient planning for high-dimensional continuous control, we propose Trajectory Autoencoding Planner (TAP), which learns low-dimensional latent action codes with a state-conditional VQ-VAE. The decoder of the VQ-VAE thus serves as a novel dynamics model that takes latent actions and current state as input and reconstructs long-horizon trajectories. During inference time, given a starting state, TAP searches over discrete latent actions to find trajectories that have both high probability under the training distribution and high predicted cumulative reward. Empirical evaluation in the offline RL setting demonstrates low decision latency which is indifferent to the growing raw action dimensionality. For Adroit robotic hand manipulation tasks with high-dimensional continuous action space, TAP surpasses existing model-based methods by a large margin and also beats strong model-free actor-critic baselines.
    Context-specific kernel-based hidden Markov model for time series analysis. (arXiv:2301.09870v1 [stat.ML])
    Traditional hidden Markov models have been a useful tool to understand and model stochastic dynamic linear data; in the case of non-Gaussian data or not linear in mean data, models such as mixture of Gaussian hidden Markov models suffer from the computation of precision matrices and have a lot of unnecessary parameters. As a consequence, such models often perform better when it is assumed that all variables are independent, a hypothesis that may be unrealistic. Hidden Markov models based on kernel density estimation is also capable of modeling non Gaussian data, but they assume independence between variables. In this article, we introduce a new hidden Markov model based on kernel density estimation, which is capable of introducing kernel dependencies using context-specific Bayesian networks. The proposed model is described, together with a learning algorithm based on the expectation-maximization algorithm. Additionally, the model is compared with related HMMs using synthetic and real data. From the results, the benefits in likelihood and classification accuracy from the proposed model are quantified and analyzed.
    A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range. (arXiv:2203.13273v5 [cs.LG] UPDATED)
    We make contributions towards improving adaptive-optimizer performance. Our improvements are based on suppression of the range of adaptive stepsizes in the AdaBelief optimizer. Firstly, we show that the particular placement of the parameter epsilon within the update expressions of AdaBelief reduces the range of the adaptive stepsizes, making AdaBelief closer to SGD with momentum. Secondly, we extend AdaBelief by further suppressing the range of the adaptive stepsizes. To achieve the above goal, we perform mutual layerwise vector projections between the gradient g_t and its first momentum m_t before using them to estimate the second momentum. The new optimization method is referred to as Aida. Thirdly, extensive experimental results show that Aida outperforms nine optimizers when training transformers and LSTMs for NLP, and VGG and ResNet for image classification over CIAF10 and CIFAR100 while matching the best performance of the nine methods when training WGAN-GP models for image generation tasks. Furthermore, Aida produces higher validation accuracies than AdaBelief for training ResNet18 over ImageNet. Code is available at this URL
    Planckian Jitter: countering the color-crippling effects of color jitter on self-supervised training. (arXiv:2202.07993v2 [cs.CV] UPDATED)
    Several recent works on self-supervised learning are trained by mapping different augmentations of the same image to the same feature representation. The data augmentations used are of crucial importance to the quality of learned feature representations. In this paper, we analyze how the color jitter traditionally used in data augmentation negatively impacts the quality of the color features in learned feature representations. To address this problem, we propose a more realistic, physics-based color data augmentation - which we call Planckian Jitter - that creates realistic variations in chromaticity and produces a model robust to illumination changes that can be commonly observed in real life, while maintaining the ability to discriminate image content based on color information. Experiments confirm that such a representation is complementary to the representations learned with the currently-used color jitter augmentation and that a simple concatenation leads to significant performance gains on a wide range of downstream datasets. In addition, we present a color sensitivity analysis that documents the impact of different training methods on model neurons and shows that the performance of the learned features is robust with respect to illuminant variations.
    Model-Agnostic Confidence Intervals for Feature Importance: A Fast and Powerful Approach Using Minipatch Ensembles. (arXiv:2206.02088v2 [stat.ML] UPDATED)
    To promote new scientific discoveries from complex data sets, feature importance inference has been a long-standing statistical problem. Instead of testing for parameters that are only interpretable for specific models, there has been increasing interest in model-agnostic methods, often in the form of feature occlusion or leave-one-covariate-out (LOCO) inference. Existing approaches often make distributional assumptions, which can be difficult to verify in practice, or require model refitting and data splitting, which are computationally intensive and lead to losses in power. In this work, we develop a novel, mostly model-agnostic and distribution-free inference framework for feature importance that is computationally efficient and statistically powerful. Our approach is fast as we avoid model refitting by leveraging a form of random observation and feature subsampling called minipatch ensembles; this approach also improves statistical power by avoiding data splitting. Our framework can be applied on tabular data and with any machine learning algorithm, together with minipatch ensembles, for regression and classification tasks. Despite the dependencies induced by using minipatch ensembles, we show that our approach provides asymptotic coverage for the feature importance score of any model under mild assumptions. Finally, our same procedure can also be leveraged to provide valid confidence intervals for predictions, hence providing fast, simultaneous quantification of the uncertainty of both predictions and feature importance. We validate our intervals on a series of synthetic and real data examples, including non-linear settings, showing that our approach detects the correct important features and exhibits many computational and statistical advantages over existing methods.
    MTTN: Multi-Pair Text to Text Narratives for Prompt Generation. (arXiv:2301.10172v1 [cs.CL])
    The explosive popularity of diffusion models[ 1][ 2][ 3 ] has provided a huge stage for further development in generative-text modelling. As prompt based models are very nuanced, such that a carefully generated prompt can produce truely breath taking images, on the contrary producing powerful or even meaningful prompt is a hit or a miss. To lavish on this we have introduced a large scale derived and synthesized dataset built with on real prompts and indexed with popular image-text datasets like MS-COCO[4 ], Flickr[ 5], etc. We have also introduced staging for these sentences that sequentially reduce the context and increase the complexity, that will further strengthen the output because of the complex annotations that are being created. MTTN consists of over 2.4M sentences that are divided over 5 stages creating a combination amounting to over 12M pairs, along with a vocab size of consisting more than 300 thousands unique words that creates an abundance of variations. The original 2.4M million pairs are broken down in such a manner that it produces a true scenario of internet lingo that is used globally thereby heightening the robustness of the dataset, and any model trained on it.
    RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving. (arXiv:2301.10222v1 [cs.CV])
    Casting semantic segmentation of outdoor LiDAR point clouds as a 2D problem, e.g., via range projection, is an effective and popular approach. These projection-based methods usually benefit from fast computations and, when combined with techniques which use other point cloud representations, achieve state-of-the-art results. Today, projection-based methods leverage 2D CNNs but recent advances in computer vision show that vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks. In this work, we question if projection-based methods for 3D semantic segmentation can benefit from these latest improvements on ViTs. We answer positively but only after combining them with three key ingredients: (a) ViTs are notoriously hard to train and require a lot of training data to learn powerful representations. By preserving the same backbone architecture as for RGB images, we can exploit the knowledge from long training on large image collections that are much cheaper to acquire and annotate than point clouds. We reach our best results with pre-trained ViTs on large image datasets. (b) We compensate ViTs' lack of inductive bias by substituting a tailored convolutional stem for the classical linear embedding layer. (c) We refine pixel-wise predictions with a convolutional decoder and a skip connection from the convolutional stem to combine low-level but fine-grained features of the the convolutional stem with the high-level but coarse predictions of the ViT encoder. With these ingredients, we show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and SemanticKITTI. We provide the implementation code at https://github.com/valeoai/rangevit.
    A Watermark for Large Language Models. (arXiv:2301.10226v1 [cs.LG])
    Potential harms of large language models can be mitigated by watermarking model output, i.e., embedding signals into generated text that are invisible to humans but algorithmically detectable from a short span of tokens. We propose a watermarking framework for proprietary language models. The watermark can be embedded with negligible impact on text quality, and can be detected using an efficient open-source algorithm without access to the language model API or parameters. The watermark works by selecting a randomized set of whitelist tokens before a word is generated, and then softly promoting use of whitelist tokens during sampling. We propose a statistical test for detecting the watermark with interpretable p-values, and derive an information-theoretic framework for analyzing the sensitivity of the watermark. We test the watermark using a multi-billion parameter model from the Open Pretrained Transformer (OPT) family, and discuss robustness and security.
    HTMOT : Hierarchical Topic Modelling Over Time. (arXiv:2112.03104v2 [cs.IR] UPDATED)
    Over the years, topic models have provided an efficient way of extracting insights from text. However, while many models have been proposed, none are able to model topic temporality and hierarchy jointly. Modelling time provide more precise topics by separating lexically close but temporally distinct topics while modelling hierarchy provides a more detailed view of the content of a document corpus. In this study, we therefore propose a novel method, HTMOT, to perform Hierarchical Topic Modelling Over Time. We train HTMOT using a new implementation of Gibbs sampling, which is more efficient. Specifically, we show that only applying time modelling to deep sub-topics provides a way to extract specific stories or events while high level topics extract larger themes in the corpus. Our results show that our training procedure is fast and can extract accurate high-level topics and temporally precise sub-topics. We measured our model's performance using the Word Intrusion task and outlined some limitations of this evaluation method, especially for hierarchical models. As a case study, we focused on the various developments in the space industry in 2020.
    Leveraging Vision-Language Models for Granular Market Change Prediction. (arXiv:2301.10166v1 [q-fin.ST])
    Predicting future direction of stock markets using the historical data has been a fundamental component in financial forecasting. This historical data contains the information of a stock in each specific time span, such as the opening, closing, lowest, and highest price. Leveraging this data, the future direction of the market is commonly predicted using various time-series models such as Long-Short Term Memory networks. This work proposes modeling and predicting market movements with a fundamentally new approach, namely by utilizing image and byte-based number representation of the stock data processed with the recently introduced Vision-Language models. We conduct a large set of experiments on the hourly stock data of the German share index and evaluate various architectures on stock price prediction using historical stock data. We conduct a comprehensive evaluation of the results with various metrics to accurately depict the actual performance of various approaches. Our evaluation results show that our novel approach based on representation of stock data as text (bytes) and image significantly outperforms strong deep learning-based baselines.
    WEASEL 2.0 -- A Random Dilated Dictionary Transform for Fast, Accurate and Memory Constrained Time Series Classification. (arXiv:2301.10194v1 [cs.LG])
    A time series is a sequence of sequentially ordered real values in time. Time series classification (TSC) is the task of assigning a time series to one of a set of predefined classes, usually based on a model learned from examples. Dictionary-based methods for TSC rely on counting the frequency of certain patterns in time series and are important components of the currently most accurate TSC ensembles. One of the early dictionary-based methods was WEASEL, which at its time achieved SotA results while also being very fast. However, it is outperformed both in terms of speed and accuracy by other methods. Furthermore, its design leads to an unpredictably large memory footprint, making it inapplicable for many applications. In this paper, we present WEASEL 2.0, a complete overhaul of WEASEL based on two recent advancements in TSC: Dilation and ensembling of randomized hyper-parameter settings. These two techniques allow WEASEL 2.0 to work with a fixed-size memory footprint while at the same time improving accuracy. Compared to 15 other SotA methods on the UCR benchmark set, WEASEL 2.0 is significantly more accurate than other dictionary methods and not significantly worse than the currently best methods. Actually, it achieves the highest median accuracy over all data sets, and it performs best in 5 out of 12 problem classes. We thus believe that WEASEL 2.0 is a viable alternative for current TSC and also a potentially interesting input for future ensembles.
    A Wholistic View of Continual Learning with Deep Neural Networks: Forgotten Lessons and the Bridge to Active and Open World Learning. (arXiv:2009.01797v3 [cs.LG] UPDATED)
    Current deep learning methods are regarded as favorable if they empirically perform well on dedicated test sets. This mentality is seamlessly reflected in the resurfacing area of continual learning, where consecutively arriving data is investigated. The core challenge is framed as protecting previously acquired representations from being catastrophically forgotten. However, comparison of individual methods is nevertheless performed in isolation from the real world by monitoring accumulated benchmark test set performance. The closed world assumption remains predominant, i.e. models are evaluated on data that is guaranteed to originate from the same distribution as used for training. This poses a massive challenge as neural networks are well known to provide overconfident false predictions on unknown and corrupted instances. In this work we critically survey the literature and argue that notable lessons from open set recognition, identifying unknown examples outside of the observed set, and the adjacent field of active learning, querying data to maximize the expected performance gain, are frequently overlooked in the deep learning era. Hence, we propose a consolidated view to bridge continual learning, active learning and open set recognition in deep neural networks. Finally, the established synergies are supported empirically, showing joint improvement in alleviating catastrophic forgetting, querying data, selecting task orders, while exhibiting robust open world application.
    Fine-grained Early Frequency Attention for Deep Speaker Representation Learning. (arXiv:2009.01822v2 [eess.AS] UPDATED)
    Deep learning techniques have considerably improved speech processing in recent years. Speaker representations extracted by deep learning models are being used in a wide range of tasks such as speaker recognition and speech emotion recognition. Attention mechanisms have started to play an important role in improving deep learning models in the field of speech processing. Nonetheless, despite the fact that important speaker-related information can be embedded in individual frequency-bins of the input spectral representations, current attention models are unable to attend to fine-grained information items in spectral representations. In this paper we propose Fine-grained Early Frequency Attention (FEFA) for speaker representation learning. Our model is a simple and lightweight model that can be integrated into various CNN pipelines and is capable of focusing on information items as small as frequency-bins. We evaluate the proposed model on three tasks of speaker recognition, speech emotion recognition, and spoken digit recognition. We use Three widely used public datasets, namely VoxCeleb, IEMOCAP, and Free Spoken Digit Dataset for our experiments. We attach FEFA to several prominent deep learning models and evaluate its impact on the final performance. We also compare our work with other related works in the area. Our experiments show that by adding FEFA to different CNN architectures, performance is consistently improved by substantial margins, and the models equipped with FEFA outperform all the other attentive models. We also test our model against different levels of added noise showing improvements in robustness and less sensitivity compared to the backbone networks.
    Improving Open-Set Semi-Supervised Learning with Self-Supervision. (arXiv:2301.10127v1 [cs.LG])
    Open-set semi-supervised learning (OSSL) is a realistic setting of semi-supervised learning where the unlabeled training set contains classes that are not present in the labeled set. Many existing OSSL methods assume that these out-of-distribution data are harmful and put effort into excluding data from unknown classes from the training objective. In contrast, we propose an OSSL framework that facilitates learning from all unlabeled data through self-supervision. Additionally, we utilize an energy-based score to accurately recognize data belonging to the known classes, making our method well-suited for handling uncurated data in deployment. We show through extensive experimental evaluations on several datasets that our method shows overall unmatched robustness and performance in terms of closed-set accuracy and open-set recognition compared with state-of-the-art for OSSL. Our code will be released upon publication.  ( 2 min )
    Topological Understanding of Neural Networks, a survey. (arXiv:2301.09742v1 [cs.LG])
    We look at the internal structure of neural networks which is usually treated as a black box. The easiest and the most comprehensible thing to do is to look at a binary classification and try to understand the approach a neural network takes. We review the significance of different activation functions, types of network architectures associated to them, and some empirical data. We find some interesting observations and a possibility to build upon the ideas to verify the process for real datasets. We suggest some possible experiments to look forward to in three different directions.
    A predictive physics-aware hybrid reduced order model for reacting flows. (arXiv:2301.09860v1 [cs.LG])
    In this work, a new hybrid predictive Reduced Order Model (ROM) is proposed to solve reacting flow problems. This algorithm is based on a dimensionality reduction using Proper Orthogonal Decomposition (POD) combined with deep learning architectures. The number of degrees of freedom is reduced from thousands of temporal points to a few POD modes with their corresponding temporal coefficients. Two different deep learning architectures have been tested to predict the temporal coefficients, based on recursive (RNN) and convolutional (CNN) neural networks. From each architecture, different models have been created to understand the behavior of each parameter of the neural network. Results show that these architectures are able to predict the temporal coefficients of the POD modes, as well as the whole snapshots. The RNN shows lower prediction error for all the variables analyzed. The model was also found capable of predicting more complex simulations showing transfer learning capabilities.
    Towards Modular Machine Learning Solution Development: Benefits and Trade-offs. (arXiv:2301.09753v1 [cs.LG])
    Machine learning technologies have demonstrated immense capabilities in various domains. They play a key role in the success of modern businesses. However, adoption of machine learning technologies has a lot of untouched potential. Cost of developing custom machine learning solutions that solve unique business problems is a major inhibitor to far-reaching adoption of machine learning technologies. We recognize that the monolithic nature prevalent in today's machine learning applications stands in the way of efficient and cost effective customized machine learning solution development. In this work we explore the benefits of modular machine learning solutions and discuss how modular machine learning solutions can overcome some of the major solution engineering limitations of monolithic machine learning solutions. We analyze the trade-offs between modular and monolithic machine learning solutions through three deep learning problems; one text based and the two image based. Our experimental results show that modular machine learning solutions have a promising potential to reap the solution engineering advantages of modularity while gaining performance and data advantages in a way the monolithic machine learning solutions do not permit.
    Upper and Lower Bounds on the Performance of Kernel PCA. (arXiv:2012.10369v2 [cs.LG] UPDATED)
    Principal Component Analysis (PCA) is a popular method for dimension reduction and has attracted an unfailing interest for decades. More recently, kernel PCA (KPCA) has emerged as an extension of PCA but, despite its use in practice, a sound theoretical understanding of KPCA is missing. We contribute several lower and upper bounds on the efficiency of KPCA, involving the empirical eigenvalues of the kernel Gram matrix and new quantities involving a notion of variance. These bounds show how much information is captured by KPCA on average and contribute a better theoretical understanding of its efficiency. We demonstrate that fast convergence rates are achievable for a widely used class of kernels and we highlight the importance of some desirable properties of datasets to ensure KPCA efficiency.
    Double Matching Under Complementary Preferences. (arXiv:2301.10230v1 [stat.ML])
    In this paper, we propose a new algorithm for addressing the problem of matching markets with complementary preferences, where agents' preferences are unknown a priori and must be learned from data. The presence of complementary preferences can lead to instability in the matching process, making this problem challenging to solve. To overcome this challenge, we formulate the problem as a bandit learning framework and propose the Multi-agent Multi-type Thompson Sampling (MMTS) algorithm. The algorithm combines the strengths of Thompson Sampling for exploration with a double matching technique to achieve a stable matching outcome. Our theoretical analysis demonstrates the effectiveness of MMTS as it is able to achieve stability at every matching step, satisfies the incentive-compatibility property, and has a sublinear Bayesian regret over time. Our approach provides a useful method for addressing complementary preferences in real-world scenarios.
    A Robust Hypothesis Test for Tree Ensemble Pruning. (arXiv:2301.10115v1 [cs.LG])
    Gradient boosted decision trees are some of the most popular algorithms in applied machine learning. They are a flexible and powerful tool that can robustly fit to any tabular dataset in a scalable and computationally efficient way. One of the most critical parameters to tune when fitting these models are the various penalty terms used to distinguish signal from noise in the current model. These penalties are effective in practice, but are lacking in robust theoretical justifications. In this paper we develop and present a novel theoretically justified hypothesis test of split quality for gradient boosted tree ensembles and demonstrate that using this method instead of the common penalty terms leads to a significant reduction in out of sample loss. Additionally, this method provides a theoretically well-justified stopping condition for the tree growing algorithm. We also present several innovative extensions to the method, opening the door for a wide variety of novel tree pruning algorithms.  ( 2 min )
    Interpretable Tsetlin Machine-based Premature Ventricular Contraction Identification. (arXiv:2301.10181v1 [eess.SP])
    Neural network-based models have found wide use in automatic long-term electrocardiogram (ECG) analysis. However, such black box models are inadequate for analysing physiological signals where credibility and interpretability are crucial. Indeed, how to make ECG analysis transparent is still an open problem. In this study, we develop a Tsetlin machine (TM) based architecture for premature ventricular contraction (PVC) identification by analysing long-term ECG signals. The architecture is transparent by describing patterns directly with logical AND rules. To validate the accuracy of our approach, we compare the TM performance with those of convolutional neural networks (CNNs). Our numerical results demonstrate that TM provides comparable performance with CNNs on the MIT-BIH database. To validate interpretability, we provide explanatory diagrams that show how TM makes the PVC identification from confirming and invalidating patterns. We argue that these are compatible with medical knowledge so that they can be readily understood and verified by a medical doctor. Accordingly, we believe this study paves the way for machine learning (ML) for ECG analysis in clinical practice.
    Inference of Continuous Linear Systems from Data with Guaranteed Stability. (arXiv:2301.10060v1 [cs.LG])
    Machine-learning technologies for learning dynamical systems from data play an important role in engineering design. This research focuses on learning continuous linear models from data. Stability, a key feature of dynamic systems, is especially important in design tasks such as prediction and control. Thus, there is a need to develop methodologies that provide stability guarantees. To that end, we leverage the parameterization of stable matrices proposed in [Gillis/Sharma, Automatica, 2017] to realize the desired models. Furthermore, to avoid the estimation of derivative information to learn continuous systems, we formulate the inference problem in an integral form. We also discuss a few extensions, including those related to control systems. Numerical experiments show that the combination of a stable matrix parameterization and an integral form of differential equations allows us to learn stable systems without requiring derivative information, which can be challenging to obtain in situations with noisy or limited data.
    Minimal Value-Equivalent Partial Models for Scalable and Robust Planning in Lifelong Reinforcement Learning. (arXiv:2301.10119v1 [cs.LG])
    Learning models of the environment from pure interaction is often considered an essential component of building lifelong reinforcement learning agents. However, the common practice in model-based reinforcement learning is to learn models that model every aspect of the agent's environment, regardless of whether they are important in coming up with optimal decisions or not. In this paper, we argue that such models are not particularly well-suited for performing scalable and robust planning in lifelong reinforcement learning scenarios and we propose new kinds of models that only model the relevant aspects of the environment, which we call "minimal value-equivalent partial models". After providing a formal definition for these models, we provide theoretical results demonstrating the scalability advantages of performing planning with such models and then perform experiments to empirically illustrate our theoretical results. Then, we provide some useful heuristics on how to learn these kinds of models with deep learning architectures and empirically demonstrate that models learned in such a way can allow for performing planning that is robust to distribution shifts and compounding model errors. Overall, both our theoretical and empirical results suggest that minimal value-equivalent partial models can provide significant benefits to performing scalable and robust planning in lifelong reinforcement learning scenarios.  ( 2 min )
    Autonomous particles. (arXiv:2301.10077v1 [cs.LG])
    Consider a reinforcement learning problem where an agent has access to a very large amount of information about the environment, but it can only take very few actions to accomplish its task and to maximize its reward. Evidently, the main problem for the agent is to learn a map from a very high-dimensional space (which represents its environment) to a very low-dimensional space (which represents its actions). The high-to-low dimensional map implies that most of the information about the environment is irrelevant for the actions to be taken, and only a small fraction of information is relevant. In this paper we argue that the relevant information need not be learned by brute force (which is the standard approach), but can be identified from the intrinsic symmetries of the system. We analyze in details a reinforcement learning problem of autonomous driving, where the corresponding symmetry is the Galilean symmetry, and argue that the learning task can be accomplished with very few relevant parameters, or, more precisely, invariants. For a numerical demonstration, we show that the autonomous vehicles (which we call autonomous particles since they describe very primitive vehicles) need only four relevant invariants to learn how to drive very well without colliding with other particles. The simple model can be easily generalized to include different types of particles (e.g. for cars, for pedestrians, for buildings, for road signs, etc.) with different types of relevant invariants describing interactions between them. We also argue that there must exist a field theory description of the learning system where autonomous particles would be described by fermionic degrees of freedom and interactions mediated by the relevant invariants would be described by bosonic degrees of freedom.  ( 2 min )
    Differentiable bit-rate estimation for neural-based video codec enhancement. (arXiv:2301.09776v1 [eess.IV])
    Neural networks (NN) can improve standard video compression by pre- and post-processing the encoded video. For optimal NN training, the standard codec needs to be replaced with a codec proxy that can provide derivatives of estimated bit-rate and distortion, which are used for gradient back-propagation. Since entropy coding of standard codecs is designed to take into account non-linear dependencies between transform coefficients, bit-rates cannot be well approximated with simple per-coefficient estimators. This paper presents a new approach for bit-rate estimation that is similar to the type employed in training end-to-end neural codecs, and able to efficiently take into account those statistical dependencies. It is defined from a mathematical model that provides closed-form formulas for the estimates and their gradients, reducing the computational complexity. Experimental results demonstrate the method's accuracy in estimating HEVC/H.265 codec bit-rates.  ( 2 min )
    Explainable Data-Driven Optimization: From Context to Decision and Back Again. (arXiv:2301.10074v1 [cs.LG])
    Data-driven optimization uses contextual information and machine learning algorithms to find solutions to decision problems with uncertain parameters. While a vast body of work is dedicated to interpreting machine learning models in the classification setting, explaining decision pipelines involving learning algorithms remains unaddressed. This lack of interpretability can block the adoption of data-driven solutions as practitioners may not understand or trust the recommended decisions. We bridge this gap by introducing a counterfactual explanation methodology tailored to explain solutions to data-driven problems. We introduce two classes of explanations and develop methods to find nearest explanations of random forest and nearest-neighbor predictors. We demonstrate our approach by explaining key problems in operations management such as inventory management and routing.  ( 2 min )
    PolarAir: A Compressed Sensing Scheme for Over-the-Air Federated Learning. (arXiv:2301.10110v1 [cs.IT])
    We explore a scheme that enables the training of a deep neural network in a Federated Learning configuration over an additive white Gaussian noise channel. The goal is to create a low complexity, linear compression strategy, called PolarAir, that reduces the size of the gradient at the user side to lower the number of channel uses needed to transmit it. The suggested approach belongs to the family of compressed sensing techniques, yet it constructs the sensing matrix and the recovery procedure using multiple access techniques. Simulations show that it can reduce the number of channel uses by ~30% when compared to conveying the gradient without compression. The main advantage of the proposed scheme over other schemes in the literature is its low time complexity. We also investigate the behavior of gradient updates and the performance of PolarAir throughout the training process to obtain insight on how best to construct this compression scheme based on compressed sensing.  ( 2 min )
    Intrinsic Motivation in Model-based Reinforcement Learning: A Brief Review. (arXiv:2301.10067v1 [cs.LG])
    The reinforcement learning research area contains a wide range of methods for solving the problems of intelligent agent control. Despite the progress that has been made, the task of creating a highly autonomous agent is still a significant challenge. One potential solution to this problem is intrinsic motivation, a concept derived from developmental psychology. This review considers the existing methods for determining intrinsic motivation based on the world model obtained by the agent. We propose a systematic approach to current research in this field, which consists of three categories of methods, distinguished by the way they utilize a world model in the agent's components: complementary intrinsic reward, exploration policy, and intrinsically motivated goals. The proposed unified framework describes the architecture of agents using a world model and intrinsic motivation to improve learning. The potential for developing new techniques in this area of research is also examined.  ( 2 min )
    A Linear Reconstruction Approach for Attribute Inference Attacks against Synthetic Data. (arXiv:2301.10053v1 [cs.LG])
    Personal data collected at scale from surveys or digital devices offers important insights for statistical analysis and scientific research. Safely sharing such data while protecting privacy is however challenging. Anonymization allows data to be shared while minimizing privacy risks, but traditional anonymization techniques have been repeatedly shown to provide limited protection against re-identification attacks in practice. Among modern anonymization techniques, synthetic data generation (SDG) has emerged as a potential solution to find a good tradeoff between privacy and statistical utility. Synthetic data is typically generated using algorithms that learn the statistical distribution of the original records, to then generate "artificial" records that are structurally and statistically similar to the original ones. Yet, the fact that synthetic records are "artificial" does not, per se, guarantee that privacy is protected. In this work, we systematically evaluate the tradeoffs between protecting privacy and preserving statistical utility for a wide range of synthetic data generation algorithms. Modeling privacy as protection against attribute inference attacks (AIAs), we extend and adapt linear reconstruction attacks, which have not been previously studied in the context of synthetic data. While prior work suggests that AIAs may be effective only on few outlier records, we show they can be very effective even on randomly selected records. We evaluate attacks on synthetic datasets ranging from 10^3 to 10^6 records, showing that even for the same generative model, the attack effectiveness can drastically increase when a larger number of synthetic records is generated. Overall, our findings prove that synthetic data is subject to privacy-utility tradeoffs just like other anonymization techniques: when good utility is preserved, attribute inference can be a risk for many data subjects.  ( 2 min )
    SMART: Self-supervised Multi-task pretrAining with contRol Transformers. (arXiv:2301.09816v1 [cs.LG])
    Self-supervised pretraining has been extensively studied in language and vision domains, where a unified model can be easily adapted to various downstream tasks by pretraining representations without explicit labels. When it comes to sequential decision-making tasks, however, it is difficult to properly design such a pretraining approach that can cope with both high-dimensional perceptual information and the complexity of sequential control over long interaction horizons. The challenge becomes combinatorially more complex if we want to pretrain representations amenable to a large variety of tasks. To tackle this problem, in this work, we formulate a general pretraining-finetuning pipeline for sequential decision making, under which we propose a generic pretraining framework \textit{Self-supervised Multi-task pretrAining with contRol Transformer (SMART)}. By systematically investigating pretraining regimes, we carefully design a Control Transformer (CT) coupled with a novel control-centric pretraining objective in a self-supervised manner. SMART encourages the representation to capture the common essential information relevant to short-term control and long-term control, which is transferrable across tasks. We show by extensive experiments in DeepMind Control Suite that SMART significantly improves the learning efficiency among seen and unseen downstream tasks and domains under different learning scenarios including Imitation Learning (IL) and Reinforcement Learning (RL). Benefiting from the proposed control-centric objective, SMART is resilient to distribution shift between pretraining and finetuning, and even works well with low-quality pretraining datasets that are randomly collected.  ( 2 min )
    Koopman neural operator as a mesh-free solver of non-linear partial differential equations. (arXiv:2301.10022v1 [cs.LG])
    The lacking of analytic solutions of diverse partial differential equations (PDEs) gives birth to series of computational techniques for numerical solutions. In machine learning, numerous latest advances of solver designs are accomplished in developing neural operators, a kind of mesh-free approximators of the infinite-dimensional operators that map between different parameterization spaces of equation solutions. Although neural operators exhibit generalization capacities for learning an entire PDE family simultaneously, they become less accurate and explainable while learning long-term behaviours of non-linear PDE families. In this paper, we propose Koopman neural operator (KNO), a new neural operator, to overcome these challenges. With the same objective of learning an infinite-dimensional mapping between Banach spaces that serves as the solution operator of target PDE family, our approach differs from existing models by formulating a non-linear dynamic system of equation solution. By approximating the Koopman operator, an infinite-dimensional linear operator governing all possible observations of the dynamic system, to act on the flow mapping of dynamic system, we can equivalently learn the solution of an entire non-linear PDE family by solving simple linear prediction problems. In zero-shot prediction and long-term prediction experiments on representative PDEs (e.g., the Navier-Stokes equation), KNO exhibits notable advantages in breaking the tradeoff between accuracy and efficiency (e.g., model size) while previous state-of-the-art models are limited. These results suggest that more efficient PDE solvers can be developed by the joint efforts from physics and machine learning.  ( 2 min )
    Robust Fair Clustering: A Novel Fairness Attack and Defense Framework. (arXiv:2210.01953v2 [cs.LG] UPDATED)
    Clustering algorithms are widely used in many societal resource allocation applications, such as loan approvals and candidate recruitment, among others, and hence, biased or unfair model outputs can adversely impact individuals that rely on these applications. To this end, many fair clustering approaches have been recently proposed to counteract this issue. Due to the potential for significant harm, it is essential to ensure that fair clustering algorithms provide consistently fair outputs even under adversarial influence. However, fair clustering algorithms have not been studied from an adversarial attack perspective. In contrast to previous research, we seek to bridge this gap and conduct a robustness analysis against fair clustering by proposing a novel black-box fairness attack. Through comprehensive experiments, we find that state-of-the-art models are highly susceptible to our attack as it can reduce their fairness performance significantly. Finally, we propose Consensus Fair Clustering (CFC), the first robust fair clustering approach that transforms consensus clustering into a fair graph partitioning problem, and iteratively learns to generate fair cluster outputs. Experimentally, we observe that CFC is highly robust to the proposed attack and is thus a truly robust fair clustering alternative.  ( 2 min )
    3D UX-Net: A Large Kernel Volumetric ConvNet Modernizing Hierarchical Transformer for Medical Image Segmentation. (arXiv:2209.15076v3 [cs.CV] UPDATED)
    The recent 3D medical ViTs (e.g., SwinUNETR) achieve the state-of-the-art performances on several 3D volumetric data benchmarks, including 3D medical image segmentation. Hierarchical transformers (e.g., Swin Transformers) reintroduced several ConvNet priors and further enhanced the practical viability of adapting volumetric segmentation in 3D medical datasets. The effectiveness of hybrid approaches is largely credited to the large receptive field for non-local self-attention and the large number of model parameters. In this work, we propose a lightweight volumetric ConvNet, termed 3D UX-Net, which adapts the hierarchical transformer using ConvNet modules for robust volumetric segmentation. Specifically, we revisit volumetric depth-wise convolutions with large kernel size (e.g. starting from $7\times7\times7$) to enable the larger global receptive fields, inspired by Swin Transformer. We further substitute the multi-layer perceptron (MLP) in Swin Transformer blocks with pointwise depth convolutions and enhance model performances with fewer normalization and activation layers, thus reducing the number of model parameters. 3D UX-Net competes favorably with current SOTA transformers (e.g. SwinUNETR) using three challenging public datasets on volumetric brain and abdominal imaging: 1) MICCAI Challenge 2021 FLARE, 2) MICCAI Challenge 2021 FeTA, and 3) MICCAI Challenge 2022 AMOS. 3D UX-Net consistently outperforms SwinUNETR with improvement from 0.929 to 0.938 Dice (FLARE2021) and 0.867 to 0.874 Dice (Feta2021). We further evaluate the transfer learning capability of 3D UX-Net with AMOS2022 and demonstrates another improvement of $2.27\%$ Dice (from 0.880 to 0.900). The source code with our proposed model are available at https://github.com/MASILab/3DUX-Net.  ( 2 min )
    Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression. (arXiv:2301.09830v1 [cs.LG])
    In training of modern large natural language processing (NLP) models, it has become a common practice to split models using 3D parallelism to multiple GPUs. Such technique, however, suffers from a high overhead of inter-node communication. Compressing the communication is one way to mitigate the overhead by reducing the inter-node traffic volume; however, the existing compression techniques have critical limitations to be applied for NLP models with 3D parallelism in that 1) only the data parallelism traffic is targeted, and 2) the existing compression schemes already harm the model quality too much. In this paper, we present Optimus-CC, a fast and scalable distributed training framework for large NLP models with aggressive communication compression. Optimus-CC differs from existing communication compression frameworks in the following ways: First, we compress pipeline parallel (inter-stage) traffic. In specific, we compress the inter-stage backpropagation and the embedding synchronization in addition to the existing data-parallel traffic compression methods. Second, we propose techniques to avoid the model quality drop that comes from the compression. We further provide mathematical and empirical analyses to show that our techniques can successfully suppress the compression error. Lastly, we analyze the pipeline and opt to selectively compress those traffic lying on the critical path. This further helps reduce the compression error. We demonstrate our solution on a GPU cluster, and achieve superior speedup from the baseline state-of-the-art solutions for distributed training without sacrificing the model quality.  ( 2 min )
    When does the student surpass the teacher? Federated Semi-supervised Learning with Teacher-Student EMA. (arXiv:2301.10114v1 [cs.LG])
    Semi-Supervised Learning (SSL) has received extensive attention in the domain of computer vision, leading to development of promising approaches such as FixMatch. In scenarios where training data is decentralized and resides on client devices, SSL must be integrated with privacy-aware training techniques such as Federated Learning. We consider the problem of federated image classification and study the performance and privacy challenges with existing federated SSL (FSSL) approaches. Firstly, we note that even state-of-the-art FSSL algorithms can trivially compromise client privacy and other real-world constraints such as client statelessness and communication cost. Secondly, we observe that it is challenging to integrate EMA (Exponential Moving Average) updates into the federated setting, which comes at a trade-off between performance and communication cost. We propose a novel approach FedSwitch, that improves privacy as well as generalization performance through Exponential Moving Average (EMA) updates. FedSwitch utilizes a federated semi-supervised teacher-student EMA framework with two features - local teacher adaptation and adaptive switching between teacher and student for pseudo-label generation. Our proposed approach outperforms the state-of-the-art on federated image classification, can be adapted to real-world constraints, and achieves good generalization performance with minimal communication cost overhead.  ( 2 min )
    Proceedings of the 1st International Workshop on Reading Music Systems. (arXiv:2301.10062v1 [cs.CV])
    The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 1st International Workshop on Reading Music Systems, held in Paris on the 20th of September 2018.  ( 2 min )
    Quantum Heavy-tailed Bandits. (arXiv:2301.09680v1 [cs.LG])
    In this paper, we study multi-armed bandits (MAB) and stochastic linear bandits (SLB) with heavy-tailed rewards and quantum reward oracle. Unlike the previous work on quantum bandits that assumes bounded/sub-Gaussian distributions for rewards, here we investigate the quantum bandits problem under a weaker assumption that the distributions of rewards only have bounded $(1+v)$-th moment for some $v\in (0,1]$. In order to achieve regret improvements for heavy-tailed bandits, we first propose a new quantum mean estimator for heavy-tailed distributions, which is based on the Quantum Monte Carlo Mean Estimator and achieves a quadratic improvement of estimation error compared to the classical one. Based on our quantum mean estimator, we focus on quantum heavy-tailed MAB and SLB and propose quantum algorithms based on the Upper Confidence Bound (UCB) framework for both problems with $\Tilde{O}(T^{\frac{1-v}{1+v}})$ regrets, polynomially improving the dependence in terms of $T$ as compared to classical (near) optimal regrets of $\Tilde{O}(T^{\frac{1}{1+v}})$, where $T$ is the number of rounds. Finally, experiments also support our theoretical results and show the effectiveness of our proposed methods.  ( 2 min )
    From Robots to Books: An Introduction to Smart Applications of AI in Education (AIEd). (arXiv:2301.10026v1 [cs.CY])
    The world around us has undergone a radical transformation due to rapid technological advancement in recent decades. The industry of the future generation is evolving, and artificial intelligence is the following change in the making popularly known as Industry 4.0. Indeed, experts predict that artificial intelligence(AI) will be the main force behind the following significant virtual shift in the way we stay, converse, study, live, communicate and conduct business. All facets of our social connection are being transformed by this growing technology. One of the newest areas of educational technology is Artificial Intelligence in the field of Education(AIEd).This study emphasizes the different applications of artificial intelligence in education from both an industrial and academic standpoint. It highlights the most recent contextualized learning novel transformative evaluations and advancements in sophisticated tutoring systems. It analyses the AIEd's ethical component and the influence of the transition on people, particularly students and instructors as well. Finally, this article touches on AIEd's potential future research and practices. The goal of this study is to introduce the present-day applications to its intended audience.  ( 2 min )
    FedPrompt: Communication-Efficient and Privacy Preserving Prompt Tuning in Federated Learning. (arXiv:2208.12268v3 [cs.LG] UPDATED)
    Federated learning (FL) has enabled global model training on decentralized data in a privacy-preserving way by aggregating model updates. However, for many natural language processing (NLP) tasks that utilize pre-trained language models (PLMs) with large numbers of parameters, there are considerable communication costs associated with FL. Recently, prompt tuning, which tunes some soft prompts without modifying PLMs, has achieved excellent performance as a new learning paradigm. Therefore we want to combine the two methods and explore the effect of prompt tuning under FL. In this paper, we propose "FedPrompt" to study prompt tuning in a model split aggregation way using FL, and prove that split aggregation greatly reduces the communication cost, only 0.01% of the PLMs' parameters, with little decrease on accuracy both on IID and Non-IID data distribution. This improves the efficiency of FL method while also protecting the data privacy in prompt tuning. In addition, like PLMs, prompts are uploaded and downloaded between public platforms and personal users, so we try to figure out whether there is still a backdoor threat using only soft prompts in FL scenarios. We further conduct backdoor attacks by data poisoning on FedPrompt. Our experiments show that normal backdoor attack can not achieve a high attack success rate, proving the robustness of FedPrompt. We hope this work can promote the application of prompt in FL and raise the awareness of the possible security threats.  ( 2 min )
    Efficient learning of large sets of locally optimal classification rules. (arXiv:2301.09936v1 [cs.LG])
    Conventional rule learning algorithms aim at finding a set of simple rules, where each rule covers as many examples as possible. In this paper, we argue that the rules found in this way may not be the optimal explanations for each of the examples they cover. Instead, we propose an efficient algorithm that aims at finding the best rule covering each training example in a greedy optimization consisting of one specialization and one generalization loop. These locally optimal rules are collected and then filtered for a final rule set, which is much larger than the sets learned by conventional rule learning algorithms. A new example is classified by selecting the best among the rules that cover this example. In our experiments on small to very large datasets, the approach's average classification accuracy is higher than that of state-of-the-art rule learning algorithms. Moreover, the algorithm is highly efficient and can inherently be processed in parallel without affecting the learned rule set and so the classification accuracy. We thus believe that it closes an important gap for large-scale classification rule induction.  ( 2 min )
    Model Agnostic Sample Reweighting for Out-of-Distribution Learning. (arXiv:2301.09819v1 [cs.LG])
    Distributionally robust optimization (DRO) and invariant risk minimization (IRM) are two popular methods proposed to improve out-of-distribution (OOD) generalization performance of machine learning models. While effective for small models, it has been observed that these methods can be vulnerable to overfitting with large overparameterized models. This work proposes a principled method, \textbf{M}odel \textbf{A}gnostic sam\textbf{PL}e r\textbf{E}weighting (\textbf{MAPLE}), to effectively address OOD problem, especially in overparameterized scenarios. Our key idea is to find an effective reweighting of the training samples so that the standard empirical risk minimization training of a large model on the weighted training data leads to superior OOD generalization performance. The overfitting issue is addressed by considering a bilevel formulation to search for the sample reweighting, in which the generalization complexity depends on the search space of sample weights instead of the model size. We present theoretical analysis in linear case to prove the insensitivity of MAPLE to model size, and empirically verify its superiority in surpassing state-of-the-art methods by a large margin. Code is available at \url{https://github.com/x-zho14/MAPLE}.  ( 2 min )
    Spectral Cross-Domain Neural Network with Soft-adaptive Threshold Spectral Enhancement. (arXiv:2301.10171v1 [cs.LG])
    Electrocardiography (ECG) signals can be considered as multi-variable time-series. The state-of-the-art ECG data classification approaches, based on either feature engineering or deep learning techniques, treat separately spectral and time domains in machine learning systems. No spectral-time domain communication mechanism inside the classifier model can be found in current approaches, leading to difficulties in identifying complex ECG forms. In this paper, we proposed a novel deep learning model named Spectral Cross-domain neural network (SCDNN) with a new block called Soft-adaptive threshold spectral enhancement (SATSE), to simultaneously reveal the key information embedded in spectral and time domains inside the neural network. More precisely, the domain-cross information is captured by a general Convolutional neural network (CNN) backbone, and different information sources are merged by a self-adaptive mechanism to mine the connection between time and spectral domains. In SATSE, the knowledge from time and spectral domains is extracted via the Fast Fourier Transformation (FFT) with soft trainable thresholds in modified Sigmoid functions. The proposed SCDNN is tested with several classification tasks implemented on the public ECG databases \textit{PTB-XL} and \textit{MIT-BIH}. SCDNN outperforms the state-of-the-art approaches with a low computational cost regarding a variety of metrics in all classification tasks on both databases, by finding appropriate domains from the infinite spectral mapping. The convergence of the trainable thresholds in the spectral domain is also numerically investigated in this paper. The robust performance of SCDNN provides a new perspective to exploit knowledge across deep learning models from time and spectral domains. The repository can be found: https://github.com/DL-WG/SCDNN-TS  ( 2 min )
    Multi-view Kernel PCA for Time series Forecasting. (arXiv:2301.09811v1 [cs.LG])
    In this paper, we propose a kernel principal component analysis model for multi-variate time series forecasting, where the training and prediction schemes are derived from the multi-view formulation of Restricted Kernel Machines. The training problem is simply an eigenvalue decomposition of the summation of two kernel matrices corresponding to the views of the input and output data. When a linear kernel is used for the output view, it is shown that the forecasting equation takes the form of kernel ridge regression. When that kernel is non-linear, a pre-image problem has to be solved to forecast a point in the input space. We evaluate the model on several standard time series datasets, perform ablation studies, benchmark with closely related models and discuss its results.  ( 2 min )
    Quantification of Damage Using Indirect Structural Health Monitoring. (arXiv:2301.09791v1 [cs.LG])
    Structural health monitoring is important to make sure bridges do not fail. Since direct monitoring can be complicated and expensive, indirect methods have been a focus on research. Indirect monitoring can be much cheaper and easier to conduct, however there are challenges with getting accurate results. This work focuses on damage quantification by using accelerometers. Tests were conducted on a model bridge and car with four accelerometers attached to to the vehicle. Different weights were placed on the bridge to simulate different levels of damage, and 31 tests were run for 20 different damage levels. The acceleration data collected was normalized and a Fast-Fourier Transform (FFT) was performed on that data. Both the normalized acceleration data and the normalized FFT data were inputted into a Non-Linear Principal Component Analysis (separately) and three principal components were extracted for each data set. Support Vector Regression (SVR) and Gaussian Process Regression (GPR) were used as the supervised machine learning methods to develop models. Multiple models were created so that the best one could be selected, and the models were compared by looking at their Mean Squared Errors (MSE). This methodology should be applied in the field to measure how effective it can be in real world applications.  ( 2 min )
    Feature-based Image Matching for Identifying Individual K\=ak\=a. (arXiv:2301.06678v2 [cs.CV] UPDATED)
    This report investigates an unsupervised, feature-based image matching pipeline for the novel application of identifying individual k\=ak\=a. Applied with a similarity network for clustering, this addresses a weakness of current supervised approaches to identifying individual birds which struggle to handle the introduction of new individuals to the population. Our approach uses object localisation to locate k\=ak\=a within images and then extracts local features that are invariant to rotation and scale. These features are matched between images with nearest neighbour matching techniques and mismatch removal to produce a similarity score for image match comparison. The results show that matches obtained via the image matching pipeline achieve high accuracy of true matches. We conclude that feature-based image matching could be used with a similarity network to provide a viable alternative to existing supervised approaches.  ( 2 min )
    Accurate Detection of Paroxysmal Atrial Fibrillation with Certified-GAN and Neural Architecture Search. (arXiv:2301.10173v1 [cs.LG])
    This paper presents a novel machine learning framework for detecting Paroxysmal Atrial Fibrillation (PxAF), a pathological characteristic of Electrocardiogram (ECG) that can lead to fatal conditions such as heart attack. To enhance the learning process, the framework involves a Generative Adversarial Network (GAN) along with a Neural Architecture Search (NAS) in the data preparation and classifier optimization phases. The GAN is innovatively invoked to overcome the class imbalance of the training data by producing the synthetic ECG for PxAF class in a certified manner. The effect of the certified GAN is statistically validated. Instead of using a general-purpose classifier, the NAS automatically designs a highly accurate convolutional neural network architecture customized for the PxAF classification task. Experimental results show that the accuracy of the proposed framework exhibits a high value of 99% which not only enhances state-of-the-art by up to 5.1%, but also improves the classification performance of the two widely-accepted baseline methods, ResNet-18, and Auto-Sklearn, by 2.2% and 6.1%.  ( 2 min )
    Dataset Bias in Human Activity Recognition. (arXiv:2301.10161v1 [eess.SP])
    When creating multi-channel time-series datasets for Human Activity Recognition (HAR), researchers are faced with the issue of subject selection criteria. It is unknown what physical characteristics and/or soft-biometrics, such as age, height, and weight, need to be taken into account to train a classifier to achieve robustness towards heterogeneous populations in the training and testing data. This contribution statistically curates the training data to assess to what degree the physical characteristics of humans influence HAR performance. We evaluate the performance of a state-of-the-art convolutional neural network on two HAR datasets that vary in the sensors, activities, and recording for time-series HAR. The training data is intentionally biased with respect to human characteristics to determine the features that impact motion behaviour. The evaluations brought forth the impact of the subjects' characteristics on HAR. Thus, providing insights regarding the robustness of the classifier with respect to heterogeneous populations. The study is a step forward in the direction of fair and trustworthy artificial intelligence by attempting to quantify representation bias in multi-channel time series HAR data.  ( 2 min )
    Domain generalization in deep learning-based mass detection in mammography: A large-scale multi-center study. (arXiv:2201.11620v2 [eess.IV] CROSS LISTED)
    Computer-aided detection systems based on deep learning have shown great potential in breast cancer detection. However, the lack of domain generalization of artificial neural networks is an important obstacle to their deployment in changing clinical environments. In this work, we explore the domain generalization of deep learning methods for mass detection in digital mammography and analyze in-depth the sources of domain shift in a large-scale multi-center setting. To this end, we compare the performance of eight state-of-the-art detection methods, including Transformer-based models, trained in a single domain and tested in five unseen domains. Moreover, a single-source mass detection training pipeline is designed to improve the domain generalization without requiring images from the new domain. The results show that our workflow generalizes better than state-of-the-art transfer learning-based approaches in four out of five domains while reducing the domain shift caused by the different acquisition protocols and scanner manufacturers. Subsequently, an extensive analysis is performed to identify the covariate shifts with bigger effects on the detection performance, such as due to differences in patient age, breast density, mass size, and mass malignancy. Ultimately, this comprehensive study provides key insights and best practices for future research on domain generalization in deep learning-based breast cancer detection.  ( 2 min )
    On Dynamic Regret and Constraint Violations in Constrained Online Convex Optimization. (arXiv:2301.09808v1 [cs.LG])
    A constrained version of the online convex optimization (OCO) problem is considered. With slotted time, for each slot, first an action is chosen. Subsequently the loss function and the constraint violation penalty evaluated at the chosen action point is revealed. For each slot, both the loss function as well as the function defining the constraint set is assumed to be smooth and strongly convex. In addition, once an action is chosen, local information about a feasible set within a small neighborhood of the current action is also revealed. An algorithm is allowed to compute at most one gradient at its point of choice given the described feedback to choose the next action. The goal of an algorithm is to simultaneously minimize the dynamic regret (loss incurred compared to the oracle's loss) and the constraint violation penalty (penalty accrued compared to the oracle's penalty). We propose an algorithm that follows projected gradient descent over a suitably chosen set around the current action. We show that both the dynamic regret and the constraint violation is order-wise bounded by the {\it path-length}, the sum of the distances between the consecutive optimal actions. Moreover, we show that the derived bounds are the best possible.
    LDMIC: Learning-based Distributed Multi-view Image Coding. (arXiv:2301.09799v1 [eess.IV])
    Multi-view image compression plays a critical role in 3D-related applications. Existing methods adopt a predictive coding architecture, which requires joint encoding to compress the corresponding disparity as well as residual information. This demands collaboration among cameras and enforces the epipolar geometric constraint between different views, which makes it challenging to deploy these methods in distributed camera systems with randomly overlapping fields of view. Meanwhile, distributed source coding theory indicates that efficient data compression of correlated sources can be achieved by independent encoding and joint decoding, which motivates us to design a learning-based distributed multi-view image coding (LDMIC) framework. With independent encoders, LDMIC introduces a simple yet effective joint context transfer module based on the cross-attention mechanism at the decoder to effectively capture the global inter-view correlations, which is insensitive to the geometric relationships between images. Experimental results show that LDMIC significantly outperforms both traditional and learning-based MIC methods while enjoying fast encoding speed. Code will be released at https://github.com/Xinjie-Q/LDMIC.  ( 2 min )
    Explainable Deep Reinforcement Learning: State of the Art and Challenges. (arXiv:2301.09937v1 [cs.LG])
    Interpretability, explainability and transparency are key issues to introducing Artificial Intelligence methods in many critical domains: This is important due to ethical concerns and trust issues strongly connected to reliability, robustness, auditability and fairness, and has important consequences towards keeping the human in the loop in high levels of automation, especially in critical cases for decision making, where both (human and the machine) play important roles. While the research community has given much attention to explainability of closed (or black) prediction boxes, there are tremendous needs for explainability of closed-box methods that support agents to act autonomously in the real world. Reinforcement learning methods, and especially their deep versions, are such closed-box methods. In this article we aim to provide a review of state of the art methods for explainable deep reinforcement learning methods, taking also into account the needs of human operators - i.e., of those that take the actual and critical decisions in solving real-world problems. We provide a formal specification of the deep reinforcement learning explainability problems, and we identify the necessary components of a general explainable reinforcement learning framework. Based on these, we provide a comprehensive review of state of the art methods, categorizing them in classes according to the paradigm they follow, the interpretable models they use, and the surface representation of explanations provided. The article concludes identifying open questions and important challenges.  ( 2 min )
    Predicting Socio-Economic Well-being Using Mobile Apps Data: A Case Study of France. (arXiv:2301.09986v1 [cs.CY])
    Socio-economic indicators provide context for assessing a country's overall condition. These indicators contain information about education, gender, poverty, employment, and other factors. Therefore, reliable and accurate information is critical for social research and government policing. Most data sources available today, such as censuses, have sparse population coverage or are updated infrequently. Nonetheless, alternative data sources, such as call data records (CDR) and mobile app usage, can serve as cost-effective and up-to-date sources for identifying socio-economic indicators. This work investigates mobile app data to predict socio-economic features. We present a large-scale study using data that captures the traffic of thousands of mobile applications by approximately 30 million users distributed over 550,000 km square and served by over 25,000 base stations. The dataset covers the whole France territory and spans more than 2.5 months, starting from 16th March 2019 to 6th June 2019. Using the app usage patterns, our best model can estimate socio-economic indicators (attaining an R-squared score upto 0.66). Furthermore, using models' explainability, we discover that mobile app usage patterns have the potential to reveal socio-economic disparities in IRIS. Insights of this study provide several avenues for future interventions, including users' temporal network analysis and exploration of alternative data sources.  ( 2 min )
    Investigating Labeler Bias in Face Annotation for Machine Learning. (arXiv:2301.09902v1 [cs.LG])
    In a world increasingly reliant on artificial intelligence, it is more important than ever to consider the ethical implications of artificial intelligence on humanity. One key under-explored challenge is labeler bias, which can create inherently biased datasets for training and subsequently lead to inaccurate or unfair decisions in healthcare, employment, education, and law enforcement. Hence, we conducted a study to investigate and measure the existence of labeler bias using images of people from different ethnicities and sexes in a labeling task. Our results show that participants possess stereotypes that influence their decision-making process and that labeler demographics impact assigned labels. We also discuss how labeler bias influences datasets and, subsequently, the models trained on them. Overall, a high degree of transparency must be maintained throughout the entire artificial intelligence training process to identify and correct biases in the data as early as possible.  ( 2 min )
    Membership Inference of Diffusion Models. (arXiv:2301.09956v1 [cs.CR])
    Recent years have witnessed the tremendous success of diffusion models in data synthesis. However, when diffusion models are applied to sensitive data, they also give rise to severe privacy concerns. In this paper, we systematically present the first study about membership inference attacks against diffusion models, which aims to infer whether a sample was used to train the model. Two attack methods are proposed, namely loss-based and likelihood-based attacks. Our attack methods are evaluated on several state-of-the-art diffusion models, over different datasets in relation to privacy-sensitive data. Extensive experimental evaluations show that our attacks can achieve remarkable performance. Furthermore, we exhaustively investigate various factors which can affect attack performance. Finally, we also evaluate the performance of our attack methods on diffusion models trained with differential privacy.  ( 2 min )
    Solving the Discretised Neutron Diffusion Equations using Neural Networks: Applications in neutron transport. (arXiv:2301.09991v1 [cs.CE])
    In this paper we solve the Boltzmann transport equation using AI libraries. The reason why this is attractive is because it enables one to use the highly optimised software within AI libraries, enabling one to run on different computer architectures and enables one to tap into the vast quantity of community based software that has been developed for AI and ML applications e.g. mixed arithmetic precision or model parallelism. Here we take the first steps towards developing this approach for the Boltzmann transport equation and develop the necessary methods in order to do that effectively. This includes: 1) A space-angle multigrid solution method that can extract the level of parallelism necessary to run efficiently on GPUs or new AI computers. 2) A new Convolutional Finite Element Method (ConvFEM) that greatly simplifies the implementation of high order finite elements (quadratic to quintic, say). 3) A new non-linear Petrov-Galerkin method that introduces dissipation anisotropically.  ( 2 min )
    Fair and skill-diverse student group formation via constrained k-way graph partitioning. (arXiv:2301.09984v1 [cs.LG])
    Forming the right combination of students in a group promises to enable a powerful and effective environment for learning and collaboration. However, defining a group of students is a complex task which has to satisfy multiple constraints. This work introduces an unsupervised algorithm for fair and skill-diverse student group formation. This is achieved by taking account of student course marks and sensitive attributes provided by the education office. The skill sets of students are determined using unsupervised dimensionality reduction of course mark data via the Laplacian eigenmap. The problem is formulated as a constrained graph partitioning problem, whereby the diversity of skill sets in each group are maximised, group sizes are upper and lower bounded according to available resources, and `balance' of a sensitive attribute is lower bounded to enforce fairness in group formation. This optimisation problem is solved using integer programming and its effectiveness is demonstrated on a dataset of student course marks from Imperial College London.  ( 2 min )
    The Backpropagation algorithm for a math student. (arXiv:2301.09977v1 [cs.LG])
    A Deep Neural Network (DNN) is a composite function of vector-valued functions, and in order to train a DNN, it is necessary to calculate the gradient of the loss function with respect to all parameters. This calculation can be a non-trivial task because the loss function of a DNN is a composition of several nonlinear functions, each with numerous parameters. The Backpropagation (BP) algorithm leverages the composite structure of the DNN to efficiently compute the gradient. As a result, the number of layers in the network does not significantly impact the complexity of the calculation. The objective of this paper is to express the gradient of the loss function in terms of a matrix multiplication using the Jacobian operator. This can be achieved by considering the total derivative of each layer with respect to its parameters and expressing it as a Jacobian matrix. The gradient can then be represented as the matrix product of these Jacobian matrices. This approach is valid because the chain rule can be applied to a composition of vector-valued functions, and the use of Jacobian matrices allows for the incorporation of multiple inputs and outputs. By providing concise mathematical justifications, the results can be made understandable and useful to a broad audience from various disciplines.  ( 2 min )
    Probabilistic Bilevel Coreset Selection. (arXiv:2301.09880v1 [cs.LG])
    The goal of coreset selection in supervised learning is to produce a weighted subset of data, so that training only on the subset achieves similar performance as training on the entire dataset. Existing methods achieved promising results in resource-constrained scenarios such as continual learning and streaming. However, most of the existing algorithms are limited to traditional machine learning models. A few algorithms that can handle large models adopt greedy search approaches due to the difficulty in solving the discrete subset selection problem, which is computationally costly when coreset becomes larger and often produces suboptimal results. In this work, for the first time we propose a continuous probabilistic bilevel formulation of coreset selection by learning a probablistic weight for each training sample. The overall objective is posed as a bilevel optimization problem, where 1) the inner loop samples coresets and train the model to convergence and 2) the outer loop updates the sample probability progressively according to the model's performance. Importantly, we develop an efficient solver to the bilevel optimization problem via unbiased policy gradient without trouble of implicit differentiation. We provide the convergence property of our training procedure and demonstrate the superiority of our algorithm against various coreset selection methods in various tasks, especially in more challenging label-noise and class-imbalance scenarios.  ( 2 min )
    A two stages Deep Learning Architecture for Model Reduction of Parametric Time-Dependent Problems. (arXiv:2301.09926v1 [math.NA])
    Parametric time-dependent systems are of a crucial importance in modeling real phenomena, often characterized by non-linear behaviors too. Those solutions are typically difficult to generalize in a sufficiently wide parameter space while counting on limited computational resources available. As such, we present a general two-stages deep learning framework able to perform that generalization with low computational effort in time. It consists in a separated training of two pipe-lined predictive models. At first, a certain number of independent neural networks are trained with data-sets taken from different subsets of the parameter space. Successively, a second predictive model is specialized to properly combine the first-stage guesses and compute the right predictions. Promising results are obtained applying the framework to incompressible Navier-Stokes equations in a cavity (Rayleigh-Bernard cavity), obtaining a 97% reduction in the computational time comparing with its numerical resolution for a new value of the Grashof number.  ( 2 min )
    Neighborhood Homophily-Guided Graph Convolutional Network. (arXiv:2301.09851v1 [cs.LG])
    Graph neural networks (GNNs) have achieved remarkable advances in graph-oriented tasks. However, many real-world graphs contain heterophily or low homophily, challenging the homophily assumption of classical GNNs and resulting in low performance. Although many studies have emerged to improve the universality of GNNs, they rarely consider the label reuse and the correlation of their proposed metrics and models. In this paper, we first design a new metric, named Neighborhood Homophily (\textit{NH}), to measure the label complexity or purity in the neighborhood of nodes. Furthermore, we incorporate this metric into the classical graph convolutional network (GCN) architecture and propose \textbf{N}eighborhood \textbf{H}omophily-\textbf{G}uided \textbf{G}raph \textbf{C}onvolutional \textbf{N}etwork (\textbf{NHGCN}). In this framework, nodes are grouped by estimated \textit{NH} values to achieve intra-group weight sharing during message propagation and aggregation. Then the generated node predictions are used to estimate and update new \textit{NH} values. The two processes of metric estimation and model inference are alternately optimized to achieve better node classification. Extensive experiments on both homophilous and heterophilous benchmarks demonstrate that \textbf{NHGCN} achieves state-of-the-art overall performance on semi-supervised node classification for the universality problem.  ( 2 min )
    Learning To Dive In Branch And Bound. (arXiv:2301.09943v1 [cs.LG])
    Primal heuristics are important for solving mixed integer linear programs, because they find feasible solutions that facilitate branch and bound search. A prominent group of primal heuristics are diving heuristics. They iteratively modify and resolve linear programs to conduct a depth-first search from any node in the search tree. Existing divers rely on generic decision rules that fail to exploit structural commonality between similar problem instances that often arise in practice. Therefore, we propose L2Dive to learn specific diving heuristics with graph neural networks: We train generative models to predict variable assignments and leverage the duality of linear programs to make diving decisions based on the model's predictions. L2Dive is fully integrated into the open-source solver SCIP. We find that L2Dive outperforms standard divers to find better feasible solutions on a range of combinatorial optimization problems. For real-world applications from server load balancing and neural network verification, L2Dive improves the primal-dual integral by up to 7% (35%) on average over a tuned (default) solver baseline and reduces average solving time by 20% (29%).  ( 2 min )
    Same or Different? Diff-Vectors for Authorship Analysis. (arXiv:2301.09862v1 [cs.LG])
    We investigate the effects on authorship identification tasks of a fundamental shift in how to conceive the vectorial representations of documents that are given as input to a supervised learner. In ``classic'' authorship analysis a feature vector represents a document, the value of a feature represents (an increasing function of) the relative frequency of the feature in the document, and the class label represents the author of the document. We instead investigate the situation in which a feature vector represents an unordered pair of documents, the value of a feature represents the absolute difference in the relative frequencies (or increasing functions thereof) of the feature in the two documents, and the class label indicates whether the two documents are from the same author or not. This latter (learner-independent) type of representation has been occasionally used before, but has never been studied systematically. We argue that it is advantageous, and that in some cases (e.g., authorship verification) it provides a much larger quantity of information to the training process than the standard representation. The experiments that we carry out on several publicly available datasets (among which one that we here make available for the first time) show that feature vectors representing pairs of documents (that we here call Diff-Vectors) bring about systematic improvements in the effectiveness of authorship identification tasks, and especially so when training data are scarce (as it is often the case in real-life authorship identification scenarios). Our experiments tackle same-author verification, authorship verification, and closed-set authorship attribution; while DVs are naturally geared for solving the 1st, we also provide two novel methods for solving the 2nd and 3rd that use a solver for the 1st as a building block.  ( 2 min )
    Solving the Discretised Neutron Diffusion Equations using Neural Networks. (arXiv:2301.09939v1 [cs.CE])
    This paper presents a new approach which uses the tools within Artificial Intelligence (AI) software libraries as an alternative way of solving partial differential equations (PDEs) that have been discretised using standard numerical methods. In particular, we describe how to represent numerical discretisations arising from the finite volume and finite element methods by pre-determining the weights of convolutional layers within a neural network. As the weights are defined by the discretisation scheme, no training of the network is required and the solutions obtained are identical (accounting for solver tolerances) to those obtained with standard codes often written in Fortran or C++. We also explain how to implement the Jacobi method and a multigrid solver using the functions available in AI libraries. For the latter, we use a U-Net architecture which is able to represent a sawtooth multigrid method. A benefit of using AI libraries in this way is that one can exploit their power and their built-in technologies. For example, their executions are already optimised for different computer architectures, whether it be CPUs, GPUs or new-generation AI processors. In this article, we apply the proposed approach to eigenvalue problems in reactor physics where neutron transport is described by diffusion theory. For a fuel assembly benchmark, we demonstrate that the solution obtained from our new approach is the same (accounting for solver tolerances) as that obtained from the same discretisation coded in a standard way using Fortran. We then proceed to solve a reactor core benchmark using the new approach.  ( 2 min )
    A Stability Analysis of Fine-Tuning a Pre-Trained Model. (arXiv:2301.09820v1 [cs.LG])
    Fine-tuning a pre-trained model (such as BERT, ALBERT, RoBERTa, T5, GPT, etc.) has proven to be one of the most promising paradigms in recent NLP research. However, numerous recent works indicate that fine-tuning suffers from the instability problem, i.e., tuning the same model under the same setting results in significantly different performance. Many recent works have proposed different methods to solve this problem, but there is no theoretical understanding of why and how these methods work. In this paper, we propose a novel theoretical stability analysis of fine-tuning that focuses on two commonly used settings, namely, full fine-tuning and head tuning. We define the stability under each setting and prove the corresponding stability bounds. The theoretical bounds explain why and how several existing methods can stabilize the fine-tuning procedure. In addition to being able to explain most of the observed empirical discoveries, our proposed theoretical analysis framework can also help in the design of effective and provable methods. Based on our theory, we propose three novel strategies to stabilize the fine-tuning procedure, namely, Maximal Margin Regularizer (MMR), Multi-Head Loss (MHLoss), and Self Unsupervised Re-Training (SURT). We extensively evaluate our proposed approaches on 11 widely used real-world benchmark datasets, as well as hundreds of synthetic classification datasets. The experiment results show that our proposed methods significantly stabilize the fine-tuning procedure and also corroborate our theoretical analysis.  ( 2 min )
    Optimizing the Noise in Self-Supervised Learning: from Importance Sampling to Noise-Contrastive Estimation. (arXiv:2301.09696v1 [stat.ML])
    Self-supervised learning is an increasingly popular approach to unsupervised learning, achieving state-of-the-art results. A prevalent approach consists in contrasting data points and noise points within a classification task: this requires a good noise distribution which is notoriously hard to specify. While a comprehensive theory is missing, it is widely assumed that the optimal noise distribution should in practice be made equal to the data distribution, as in Generative Adversarial Networks (GANs). We here empirically and theoretically challenge this assumption. We turn to Noise-Contrastive Estimation (NCE) which grounds this self-supervised task as an estimation problem of an energy-based model of the data. This ties the optimality of the noise distribution to the sample efficiency of the estimator, which is rigorously defined as its asymptotic variance, or mean-squared error. In the special case where the normalization constant only is unknown, we show that NCE recovers a family of Importance Sampling estimators for which the optimal noise is indeed equal to the data distribution. However, in the general case where the energy is also unknown, we prove that the optimal noise density is the data density multiplied by a correction term based on the Fisher score. In particular, the optimal noise distribution is different from the data distribution, and is even from a different family. Nevertheless, we soberly conclude that the optimal noise may be hard to sample from, and the gain in efficiency can be modest compared to choosing the noise distribution equal to the data's.  ( 2 min )
    Noisy Parallel Data Alignment. (arXiv:2301.09685v1 [cs.CL])
    An ongoing challenge in current natural language processing is how its major advancements tend to disproportionately favor resource-rich languages, leaving a significant number of under-resourced languages behind. Due to the lack of resources required to train and evaluate models, most modern language technologies are either nonexistent or unreliable to process endangered, local, and non-standardized languages. Optical character recognition (OCR) is often used to convert endangered language documents into machine-readable data. However, such OCR output is typically noisy, and most word alignment models are not built to work under such noisy conditions. In this work, we study the existing word-level alignment models under noisy settings and aim to make them more robust to noisy data. Our noise simulation and structural biasing method, tested on multiple language pairs, manages to reduce the alignment error rate on a state-of-the-art neural-based alignment model up to 59.6%.  ( 2 min )
    Data Augmentation Alone Can Improve Adversarial Training. (arXiv:2301.09879v1 [cs.CV])
    Adversarial training suffers from the issue of robust overfitting, which seriously impairs its generalization performance. Data augmentation, which is effective at preventing overfitting in standard training, has been observed by many previous works to be ineffective in mitigating overfitting in adversarial training. This work proves that, contrary to previous findings, data augmentation alone can significantly boost accuracy and robustness in adversarial training. We find that the hardness and the diversity of data augmentation are important factors in combating robust overfitting. In general, diversity can improve both accuracy and robustness, while hardness can boost robustness at the cost of accuracy within a certain limit and degrade them both over that limit. To mitigate robust overfitting, we first propose a new crop transformation, Cropshift, which has improved diversity compared to the conventional one (Padcrop). We then propose a new data augmentation scheme, based on Cropshift, with much improved diversity and well-balanced hardness. Empirically, our augmentation method achieves the state-of-the-art accuracy and robustness for data augmentations in adversarial training. Furthermore, when combined with weight averaging it matches, or even exceeds, the performance of the best contemporary regularization methods for alleviating robust overfitting. Code is available at: https://github.com/TreeLLi/DA-Alone-Improves-AT.  ( 2 min )
    Topological Structure is Predictive of Deep Neural Network Success in Learning. (arXiv:2301.09734v1 [cs.LG])
    Machine learning has become a fundamental tool in modern science, yet its limitations are still not fully understood. Using a simple children's game, we show that the topological structure of the underlying training data can have a dramatic effect on the ability of a deep neural network (DNN) classifier to learn to classify data. We then take insights obtained from this toy model and apply them to two physical data sets (one from particle physics and one from acoustics), which are known to be amenable to classification by DNN's. We show that the simplicity in their topological structure explains the majority of the DNN's ability to operate on these data sets by showing that fully interpretable topological classifiers are able to perform nearly as well as their DNN counterparts.  ( 2 min )
    Slice-and-Forge: Making Better Use of Caches for Graph Convolutional Network Accelerators. (arXiv:2301.09813v1 [cs.LG])
    Graph convolutional networks (GCNs) are becoming increasingly popular as they can process a wide variety of data formats that prior deep neural networks cannot easily support. One key challenge in designing hardware accelerators for GCNs is the vast size and randomness in their data access patterns which greatly reduces the effectiveness of the limited on-chip cache. Aimed at improving the effectiveness of the cache by mitigating the irregular data accesses, prior studies often employ the vertex tiling techniques used in traditional graph processing applications. While being effective at enhancing the cache efficiency, those approaches are often sensitive to the tiling configurations where the optimal setting heavily depends on target input datasets. Furthermore, the existing solutions require manual tuning through trial-and-error or rely on sub-optimal analytical models. In this paper, we propose Slice-and-Forge (SnF), an efficient hardware accelerator for GCNs which greatly improves the effectiveness of the limited on-chip cache. SnF chooses a tiling strategy named feature slicing that splits the features into vertical slices and processes them in the outermost loop of the execution. This particular choice results in a repetition of the identical computational patterns over irregular graph data over multiple rounds. Taking advantage of such repetitions, SnF dynamically tunes its tile size. Our experimental results reveal that SnF can achieve 1.73x higher performance in geomean compared to prior work on multi-engine settings, and 1.46x higher performance in geomean on small scale settings, without the need for off-line analyses.  ( 2 min )
    Gossiped and Quantized Online Multi-Kernel Learning. (arXiv:2301.09848v1 [cs.LG])
    In instances of online kernel learning where little prior information is available and centralized learning is unfeasible, past research has shown that distributed and online multi-kernel learning provides sub-linear regret as long as every pair of nodes in the network can communicate (i.e., the communications network is a complete graph). In addition, to manage the communication load, which is often a performance bottleneck, communications between nodes can be quantized. This letter expands on these results to non-fully connected graphs, which is often the case in wireless sensor networks. To address this challenge, we propose a gossip algorithm and provide a proof that it achieves sub-linear regret. Experiments with real datasets confirm our findings.  ( 2 min )
    Heterogeneous Domain Adaptation for IoT Intrusion Detection: A Geometric Graph Alignment Approach. (arXiv:2301.09801v1 [cs.CR])
    Data scarcity hinders the usability of data-dependent algorithms when tackling IoT intrusion detection (IID). To address this, we utilise the data rich network intrusion detection (NID) domain to facilitate more accurate intrusion detection for IID domains. In this paper, a Geometric Graph Alignment (GGA) approach is leveraged to mask the geometric heterogeneities between domains for better intrusion knowledge transfer. Specifically, each intrusion domain is formulated as a graph where vertices and edges represent intrusion categories and category-wise interrelationships, respectively. The overall shape is preserved via a confused discriminator incapable to identify adjacency matrices between different intrusion domain graphs. A rotation avoidance mechanism and a centre point matching mechanism is used to avoid graph misalignment due to rotation and symmetry, respectively. Besides, category-wise semantic knowledge is transferred to act as vertex-level alignment. To exploit the target data, a pseudo-label election mechanism that jointly considers network prediction, geometric property and neighbourhood information is used to produce fine-grained pseudo-label assignment. Upon aligning the intrusion graphs geometrically from different granularities, the transferred intrusion knowledge can boost IID performance. Comprehensive experiments on several intrusion datasets demonstrate state-of-the-art performance of the GGA approach and validate the usefulness of GGA constituting components.  ( 2 min )
    Backdoor Attacks in Peer-to-Peer Federated Learning. (arXiv:2301.09732v1 [cs.LG])
    We study backdoor attacks in peer-to-peer federated learning systems on different graph topologies and datasets. We show that only 5% attacker nodes are sufficient to perform a backdoor attack with 42% attack success without decreasing the accuracy on clean data by more than 2%. We also demonstrate that the attack can be amplified by the attacker crashing a small number of nodes. We evaluate defenses proposed in the context of centralized federated learning and show they are ineffective in peer-to-peer settings. Finally, we propose a defense that mitigates the attacks by applying different clipping norms to the model updates received from peers and local model trained by a node.  ( 2 min )
    Truveta Mapper: A Zero-shot Ontology Alignment Framework. (arXiv:2301.09767v1 [cs.LG])
    In this paper, a new perspective is suggested for unsupervised Ontology Matching (OM) or Ontology Alignment (OA) by treating it as a translation task. Ontologies are represented as graphs, and the translation is performed from a node in the source ontology graph to a path in the target ontology graph. The proposed framework, Truveta Mapper (TM), leverages a multi-task sequence-to-sequence transformer model to perform alignment across multiple ontologies in a zero-shot, unified and end-to-end manner. Multi-tasking enables the model to implicitly learn the relationship between different ontologies via transfer-learning without requiring any explicit cross-ontology manually labeled data. This also enables the formulated framework to outperform existing solutions for both runtime latency and alignment quality. The model is pre-trained and fine-tuned only on publicly available text corpus and inner-ontologies data. The proposed solution outperforms state-of-the-art approaches, Edit-Similarity, LogMap, AML, BERTMap, and the recently presented new OM frameworks in Ontology Alignment Evaluation Initiative (OAEI22), offers log-linear complexity in contrast to quadratic in the existing end-to-end methods, and overall makes the OM task efficient and more straightforward without much post-processing involving mapping extension or mapping repair.  ( 2 min )
    Constrained Reinforcement Learning for Dexterous Manipulation. (arXiv:2301.09766v1 [cs.RO])
    Existing learning approaches to dexterous manipulation use demonstrations or interactions with the environment to train black-box neural networks that provide little control over how the robot learns the skills or how it would perform post training. These approaches pose significant challenges when implemented on physical platforms given that, during initial stages of training, the robot's behavior could be erratic and potentially harmful to its own hardware, the environment, or any humans in the vicinity. A potential way to address these limitations is to add constraints during learning that restrict and guide the robot's behavior during training as well as roll outs. Inspired by the success of constrained approaches in other domains, we investigate the effects of adding position-based constraints to a 24-DOF robot hand learning to perform object relocation using Constrained Policy Optimization. We find that a simple geometric constraint can ensure the robot learns to move towards the object sooner than without constraints. Further, training with this constraint requires a similar number of samples as its unconstrained counterpart to master the skill. These findings shed light on how simple constraints can help robots achieve sensible and safe behavior quickly and ease concerns surrounding hardware deployment. We also investigate the effects of the strictness of these constraints and report findings that provide insights into how different degrees of strictness affect learning outcomes. Our code is available at https://github.com/GT-STAR-Lab/constrained-rl-dexterous-manipulation.  ( 2 min )
    DODEM: DOuble DEfense Mechanism Against Adversarial Attacks Towards Secure Industrial Internet of Things Analytics. (arXiv:2301.09740v1 [cs.CR])
    Industrial Internet of Things (I-IoT) is a collaboration of devices, sensors, and networking equipment to monitor and collect data from industrial operations. Machine learning (ML) methods use this data to make high-level decisions with minimal human intervention. Data-driven predictive maintenance (PDM) is a crucial ML-based I-IoT application to find an optimal maintenance schedule for industrial assets. The performance of these ML methods can seriously be threatened by adversarial attacks where an adversary crafts perturbed data and sends it to the ML model to deteriorate its prediction performance. The models should be able to stay robust against these attacks where robustness is measured by how much perturbation in input data affects model performance. Hence, there is a need for effective defense mechanisms that can protect these models against adversarial attacks. In this work, we propose a double defense mechanism to detect and mitigate adversarial attacks in I-IoT environments. We first detect if there is an adversarial attack on a given sample using novelty detection algorithms. Then, based on the outcome of our algorithm, marking an instance as attack or normal, we select adversarial retraining or standard training to provide a secondary defense layer. If there is an attack, adversarial retraining provides a more robust model, while we apply standard training for regular samples. Since we may not know if an attack will take place, our adaptive mechanism allows us to consider irregular changes in data. The results show that our double defense strategy is highly efficient where we can improve model robustness by up to 64.6% and 52% compared to standard and adversarial retraining, respectively.  ( 2 min )
    Long-term stable Electromyography classification using Canonical Correlation Analysis. (arXiv:2301.09729v1 [cs.LG])
    Discrimination of hand gestures based on the decoding of surface electromyography (sEMG) signals is a well-establish approach for controlling prosthetic devices and for Human-Machine Interfaces (HMI). However, despite the promising results achieved by this approach in well-controlled experimental conditions, its deployment in long-term real-world application scenarios is still hindered by several challenges. One of the most critical challenges is maintaining high EMG data classification performance across multiple days without retraining the decoding system. The drop in performance is mostly due to the high EMG variability caused by electrodes shift, muscle artifacts, fatigue, user adaptation, or skin-electrode interfacing issues. Here we propose a novel statistical method based on canonical correlation analysis (CCA) that stabilizes EMG classification performance across multiple days for long-term control of prosthetic devices. We show how CCA can dramatically decrease the performance drop of standard classifiers observed across days, by maximizing the correlation among multiple-day acquisition data sets. Our results show how the performance of a classifier trained on EMG data acquired only of the first day of the experiment maintains 90% relative accuracy across multiple days, compensating for the EMG data variability that occurs over long-term periods, using the CCA transformation on data obtained from a small number of gestures. This approach eliminates the need for large data sets and multiple or periodic training sessions, which currently hamper the usability of conventional pattern recognition based approaches  ( 2 min )
    Earthquake Magnitude and b value prediction model using Extreme Learning Machine. (arXiv:2301.09756v1 [physics.geo-ph])
    Earthquake prediction has been a challenging research area for many decades, where the future occurrence of this highly uncertain calamity is predicted. In this paper, several parametric and non-parametric features were calculated, where the non-parametric features were calculated using the parametric features. $8$ seismic features were calculated using Gutenberg-Richter law, the total recurrence, and the seismic energy release. Additionally, criterions such as Maximum Relevance and Maximum Redundancy were applied to choose the pertinent features. These features along with others were used as input for an Extreme Learning Machine (ELM) Regression Model. Magnitude and time data of $5$ decades from the Assam-Guwahati region were used to create this model for magnitude prediction. The Testing Accuracy and Testing Speed were computed taking the Root Mean Squared Error (RMSE) as the parameter for evaluating the mode. As confirmed by the results, ELM shows better scalability with much faster training and testing speed (up to a thousand times faster) than traditional Support Vector Machines. The testing RMSE came out to be around $0.097$. To further test the model's robustness -- magnitude-time data from California was used to calculate the seismic indicators which were then fed into an ELM and then tested on the Assam-Guwahati region. The model proves to be robust and can be implemented in early warning systems as it continues to be a major part of Disaster Response and management.  ( 2 min )
    Two-Stage Learning For the Flexible Job Shop Scheduling Problem. (arXiv:2301.09703v1 [cs.AI])
    The Flexible Job-shop Scheduling Problem (FJSP) is an important combinatorial optimization problem that arises in manufacturing and service settings. FJSP is composed of two subproblems, an assignment problem that assigns tasks to machines, and a scheduling problem that determines the starting times of tasks on their chosen machines. Solving FJSP instances of realistic size and composition is an ongoing challenge even under simplified, deterministic assumptions. Motivated by the inevitable randomness and uncertainties in supply chains, manufacturing, and service operations, this paper investigates the potential of using a deep learning framework to generate fast and accurate approximations for FJSP. In particular, this paper proposes a two-stage learning framework 2SLFJSP that explicitly models the hierarchical nature of FJSP decisions, uses a confidence-aware branching scheme to generate appropriate instances for the scheduling stage from the assignment predictions and leverages a novel symmetry-breaking formulation to improve learnability. 2SL-FJSP is evaluated on instances from the FJSP benchmark library. Results show that 2SL-FJSP can generate high-quality solutions in milliseconds, outperforming a state-of-the-art reinforcement learning approach recently proposed in the literature, and other heuristics commonly used in practice.  ( 2 min )
    Implementation of the Critical Wave Groups Method with Computational Fluid Dynamics and Neural Networks. (arXiv:2301.09834v1 [physics.flu-dyn])
    Accurate and efficient prediction of extreme ship responses continues to be a challenging problem in ship hydrodynamics. Probabilistic frameworks in conjunction with computationally efficient numerical hydrodynamic tools have been developed that allow researchers and designers to better understand extremes. However, the ability of these hydrodynamic tools to represent the physics quantitatively during extreme events is limited. Previous research successfully implemented the critical wave groups (CWG) probabilistic method with computational fluid dynamics (CFD). Although the CWG method allows for less simulation time than a Monte Carlo approach, the large quantity of simulations required is cost prohibitive. The objective of the present paper is to reduce the computational cost of implementing CWG with CFD, through the construction of long short-term memory (LSTM) neural networks. After training the models with a limited quantity of simulations, the models can provide a larger quantity of predictions to calculate the probability. The new framework is demonstrated with a 2-D midship section of the Office of Naval Research Tumblehome (ONRT) hull in Sea State 7 and beam seas at zero speed. The new framework is able to produce predictions that are representative of a purely CFD-driven CWG framework, with two orders of magnitude of computational cost savings.  ( 2 min )
    Long-tail Detection with Effective Class-Margins. (arXiv:2301.09724v1 [cs.CV])
    Large-scale object detection and instance segmentation face a severe data imbalance. The finer-grained object classes become, the less frequent they appear in our datasets. However, at test-time, we expect a detector that performs well for all classes and not just the most frequent ones. In this paper, we provide a theoretical understanding of the long-trail detection problem. We show how the commonly used mean average precision evaluation metric on an unknown test set is bound by a margin-based binary classification error on a long-tailed object detection training set. We optimize margin-based binary classification error with a novel surrogate objective called \textbf{Effective Class-Margin Loss} (ECM). The ECM loss is simple, theoretically well-motivated, and outperforms other heuristic counterparts on LVIS v1 benchmark over a wide range of architecture and detectors. Code is available at \url{https://github.com/janghyuncho/ECM-Loss}.  ( 2 min )
    Graph Neural Networks for Decentralized Multi-Agent Perimeter Defense. (arXiv:2301.09689v1 [cs.MA])
    In this work, we study the problem of decentralized multi-agent perimeter defense that asks for computing actions for defenders with local perceptions and communications to maximize the capture of intruders. One major challenge for practical implementations is to make perimeter defense strategies scalable for large-scale problem instances. To this end, we leverage graph neural networks (GNNs) to develop an imitation learning framework that learns a mapping from defenders' local perceptions and their communication graph to their actions. The proposed GNN-based learning network is trained by imitating a centralized expert algorithm such that the learned actions are close to that generated by the expert algorithm. We demonstrate that our proposed network performs closer to the expert algorithm and is superior to other baseline algorithms by capturing more intruders. Our GNN-based network is trained at a small scale and can be generalized to large-scale cases. We run perimeter defense games in scenarios with different team sizes and configurations to demonstrate the performance of the learned network.  ( 2 min )
    Illumination Variation Correction Using Image Synthesis For Unsupervised Domain Adaptive Person Re-Identification. (arXiv:2301.09702v1 [eess.IV])
    Unsupervised domain adaptive (UDA) person re-identification (re-ID) aims to learn identity information from labeled images in source domains and apply it to unlabeled images in a target domain. One major issue with many unsupervised re-identification methods is that they do not perform well relative to large domain variations such as illumination, viewpoint, and occlusions. In this paper, we propose a Synthesis Model Bank (SMB) to deal with illumination variation in unsupervised person re-ID. The proposed SMB consists of several convolutional neural networks (CNN) for feature extraction and Mahalanobis matrices for distance metrics. They are trained using synthetic data with different illumination conditions such that their synergistic effect makes the SMB robust against illumination variation. To better quantify the illumination intensity and improve the quality of synthetic images, we introduce a new 3D virtual-human dataset for GAN-based image synthesis. From our experiments, the proposed SMB outperforms other synthesis methods on several re-ID benchmarks.  ( 2 min )
    PRIMEQA: The Prime Repository for State-of-the-Art MultilingualQuestion Answering Research and Development. (arXiv:2301.09715v1 [cs.CL])
    The field of Question Answering (QA) has made remarkable progress in recent years, thanks to the advent of large pre-trained language models, newer realistic benchmark datasets with leaderboards, and novel algorithms for key components such as retrievers and readers. In this paper, we introduce PRIMEQA: a one-stop and open-source QA repository with an aim to democratize QA re-search and facilitate easy replication of state-of-the-art (SOTA) QA methods. PRIMEQA supports core QA functionalities like retrieval and reading comprehension as well as auxiliary capabilities such as question generation.It has been designed as an end-to-end toolkit for various use cases: building front-end applications, replicating SOTA methods on pub-lic benchmarks, and expanding pre-existing methods. PRIMEQA is available at : https://github.com/primeqa.  ( 2 min )
    On The Convergence Of Policy Iteration-Based Reinforcement Learning With Monte Carlo Policy Evaluation. (arXiv:2301.09709v1 [cs.LG])
    A common technique in reinforcement learning is to evaluate the value function from Monte Carlo simulations of a given policy, and use the estimated value function to obtain a new policy which is greedy with respect to the estimated value function. A well-known longstanding open problem in this context is to prove the convergence of such a scheme when the value function of a policy is estimated from data collected from a single sample path obtained from implementing the policy (see page 99 of [Sutton and Barto, 2018], page 8 of [Tsitsiklis, 2002]). We present a solution to the open problem by showing that a first-visit version of such a policy iteration scheme indeed converges to the optimal policy provided that the policy improvement step uses lookahead [Silver et al., 2016, Mnih et al., 2016, Silver et al., 2017b] rather than a simple greedy policy improvement. We provide results both for the original open problem in the tabular setting and also present extensions to the function approximation setting, where we show that the policy resulting from the algorithm performs close to the optimal policy within a function approximation error.  ( 2 min )
    Weakly-Supervised Questions for Zero-Shot Relation Extraction. (arXiv:2301.09640v1 [cs.CL])
    Zero-Shot Relation Extraction (ZRE) is the task of Relation Extraction where the training and test sets have no shared relation types. This very challenging domain is a good test of a model's ability to generalize. Previous approaches to ZRE reframed relation extraction as Question Answering (QA), allowing for the use of pre-trained QA models. However, this method required manually creating gold question templates for each new relation. Here, we do away with these gold templates and instead learn a model that can generate questions for unseen relations. Our technique can successfully translate relation descriptions into relevant questions, which are then leveraged to generate the correct tail entity. On tail entity extraction, we outperform the previous state-of-the-art by more than 16 F1 points without using gold question templates. On the RE-QA dataset where no previous baseline for relation extraction exists, our proposed algorithm comes within 0.7 F1 points of a system that uses gold question templates. Our model also outperforms the state-of-the-art ZRE baselines on the FewRel and WikiZSL datasets, showing that QA models no longer need template questions to match the performance of models specifically tailored to the ZRE task. Our implementation is available at https://github.com/fyshelab/QA-ZRE.  ( 2 min )
    Selective Explanations: Leveraging Human Input to Align Explainable AI. (arXiv:2301.09656v1 [cs.AI])
    While a vast collection of explainable AI (XAI) algorithms have been developed in recent years, they are often criticized for significant gaps with how humans produce and consume explanations. As a result, current XAI techniques are often found to be hard to use and lack effectiveness. In this work, we attempt to close these gaps by making AI explanations selective -- a fundamental property of human explanations -- by selectively presenting a subset from a large set of model reasons based on what aligns with the recipient's preferences. We propose a general framework for generating selective explanations by leveraging human input on a small sample. This framework opens up a rich design space that accounts for different selectivity goals, types of input, and more. As a showcase, we use a decision-support task to explore selective explanations based on what the decision-maker would consider relevant to the decision task. We conducted two experimental studies to examine three out of a broader possible set of paradigms based on our proposed framework: in Study 1, we ask the participants to provide their own input to generate selective explanations, with either open-ended or critique-based input. In Study 2, we show participants selective explanations based on input from a panel of similar users (annotators). Our experiments demonstrate the promise of selective explanations in reducing over-reliance on AI and improving decision outcomes and subjective perceptions of the AI, but also paint a nuanced picture that attributes some of these positive effects to the opportunity to provide one's own input to augment AI explanations. Overall, our work proposes a novel XAI framework inspired by human communication behaviors and demonstrates its potentials to encourage future work to better align AI explanations with human production and consumption of explanations.  ( 2 min )
    Flexible conditional density estimation for time series. (arXiv:2301.09671v1 [stat.ME])
    This paper introduces FlexCodeTS, a new conditional density estimator for time series. FlexCodeTS is a flexible nonparametric conditional density estimator, which can be based on an arbitrary regression method. It is shown that FlexCodeTS inherits the rate of convergence of the chosen regression method. Hence, FlexCodeTS can adapt its convergence by employing the regression method that best fits the structure of data. From an empirical perspective, FlexCodeTS is compared to NNKCDE and GARCH in both simulated and real data. FlexCodeTS is shown to generally obtain the best performance among the selected methods according to either the CDE loss or the pinball loss.  ( 2 min )
    DiffSDS: A language diffusion model for protein backbone inpainting under geometric conditions and constraints. (arXiv:2301.09642v1 [q-bio.QM])
    Have you ever been troubled by the complexity and computational cost of SE(3) protein structure modeling and been amazed by the simplicity and power of language modeling? Recent work has shown promise in simplifying protein structures as sequences of protein angles; therefore, language models could be used for unconstrained protein backbone generation. Unfortunately, such simplification is unsuitable for the constrained protein inpainting problem, where the model needs to recover masked structures conditioned on unmasked ones, as it dramatically increases the computing cost of geometric constraints. To overcome this dilemma, we suggest inserting a hidden \textbf{a}tomic \textbf{d}irection \textbf{s}pace (\textbf{ADS}) upon the language model, converting invariant backbone angles into equivalent direction vectors and preserving the simplicity, called Seq2Direct encoder ($\text{Enc}_{s2d}$). Geometric constraints could be efficiently imposed on the newly introduced direction space. A Direct2Seq decoder ($\text{Dec}_{d2s}$) with mathematical guarantees is also introduced to develop a \textbf{SDS} ($\text{Enc}_{s2d}$+$\text{Dec}_{d2s}$) model. We apply the SDS model as the denoising neural network during the conditional diffusion process, resulting in a constrained generative model--\textbf{DiffSDS}. Extensive experiments show that the plug-and-play ADS could transform the language model into a strong structural model without loss of simplicity. More importantly, the proposed DiffSDS outperforms previous strong baselines by a large margin on the task of protein inpainting.  ( 2 min )
  • Open

    Incorporating functional summary information in Bayesian neural networks using a Dirichlet process likelihood approach. (arXiv:2207.01234v2 [cs.LG] UPDATED)
    Bayesian neural networks (BNNs) can account for both aleatoric and epistemic uncertainty. However, in BNNs the priors are often specified over the weights which rarely reflects true prior knowledge in large and complex neural network architectures. We present a simple approach to incorporate prior knowledge in BNNs based on external summary information about the predicted classification probabilities for a given dataset. The available summary information is incorporated as augmented data and modeled with a Dirichlet process, and we derive the corresponding \emph{Summary Evidence Lower BOund}. The approach is founded on Bayesian principles, and all hyperparameters have a proper probabilistic interpretation. We show how the method can inform the model about task difficulty and class imbalance. Extensive experiments show that, with negligible computational overhead, our method parallels and in many cases outperforms popular alternatives in accuracy, uncertainty calibration, and robustness against corruptions with both balanced and imbalanced data.  ( 2 min )
    Multiway Spherical Clustering via Degree-Corrected Tensor Block Models. (arXiv:2201.07401v2 [math.ST] UPDATED)
    We consider the problem of multiway clustering in the presence of unknown degree heterogeneity. Such data problems arise commonly in applications such as recommendation system, neuroimaging, community detection, and hypergraph partitions in social networks. The allowance of degree heterogeneity provides great flexibility in clustering models, but the extra complexity poses significant challenges in both statistics and computation. Here, we develop a degree-corrected tensor block model with estimation accuracy guarantees. We present the phase transition of clustering performance based on the notion of angle separability, and we characterize three signal-to-noise regimes corresponding to different statistical-computational behaviors. In particular, we demonstrate that an intrinsic statistical-to-computational gap emerges only for tensors of order three or greater. Further, we develop an efficient polynomial-time algorithm that provably achieves exact clustering under mild signal conditions. The efficacy of our procedure is demonstrated through two data applications, one on human brain connectome project, and another on Peru Legislation network dataset.  ( 2 min )
    Proportional Fairness in Federated Learning. (arXiv:2202.01666v3 [cs.LG] UPDATED)
    With the increasingly broad deployment of federated learning (FL) systems in the real world, it is critical but challenging to ensure fairness in FL, i.e. reasonably satisfactory performances for each of the numerous diverse clients. In this work, we introduce and study a new fairness notion in FL, called proportional fairness (PF), which is based on the relative change of each client's performance. From its connection with the bargaining games, we propose PropFair, a novel and easy-to-implement algorithm for finding proportionally fair solutions in FL and study its convergence properties. Through extensive experiments on vision and language datasets, we demonstrate that PropFair can approximately find PF solutions, and it achieves a good balance between the average performances of all clients and of the worst 10% clients.  ( 2 min )
    Model-Agnostic Confidence Intervals for Feature Importance: A Fast and Powerful Approach Using Minipatch Ensembles. (arXiv:2206.02088v2 [stat.ML] UPDATED)
    To promote new scientific discoveries from complex data sets, feature importance inference has been a long-standing statistical problem. Instead of testing for parameters that are only interpretable for specific models, there has been increasing interest in model-agnostic methods, often in the form of feature occlusion or leave-one-covariate-out (LOCO) inference. Existing approaches often make distributional assumptions, which can be difficult to verify in practice, or require model refitting and data splitting, which are computationally intensive and lead to losses in power. In this work, we develop a novel, mostly model-agnostic and distribution-free inference framework for feature importance that is computationally efficient and statistically powerful. Our approach is fast as we avoid model refitting by leveraging a form of random observation and feature subsampling called minipatch ensembles; this approach also improves statistical power by avoiding data splitting. Our framework can be applied on tabular data and with any machine learning algorithm, together with minipatch ensembles, for regression and classification tasks. Despite the dependencies induced by using minipatch ensembles, we show that our approach provides asymptotic coverage for the feature importance score of any model under mild assumptions. Finally, our same procedure can also be leveraged to provide valid confidence intervals for predictions, hence providing fast, simultaneous quantification of the uncertainty of both predictions and feature importance. We validate our intervals on a series of synthetic and real data examples, including non-linear settings, showing that our approach detects the correct important features and exhibits many computational and statistical advantages over existing methods.  ( 2 min )
    A Wholistic View of Continual Learning with Deep Neural Networks: Forgotten Lessons and the Bridge to Active and Open World Learning. (arXiv:2009.01797v3 [cs.LG] UPDATED)
    Current deep learning methods are regarded as favorable if they empirically perform well on dedicated test sets. This mentality is seamlessly reflected in the resurfacing area of continual learning, where consecutively arriving data is investigated. The core challenge is framed as protecting previously acquired representations from being catastrophically forgotten. However, comparison of individual methods is nevertheless performed in isolation from the real world by monitoring accumulated benchmark test set performance. The closed world assumption remains predominant, i.e. models are evaluated on data that is guaranteed to originate from the same distribution as used for training. This poses a massive challenge as neural networks are well known to provide overconfident false predictions on unknown and corrupted instances. In this work we critically survey the literature and argue that notable lessons from open set recognition, identifying unknown examples outside of the observed set, and the adjacent field of active learning, querying data to maximize the expected performance gain, are frequently overlooked in the deep learning era. Hence, we propose a consolidated view to bridge continual learning, active learning and open set recognition in deep neural networks. Finally, the established synergies are supported empirically, showing joint improvement in alleviating catastrophic forgetting, querying data, selecting task orders, while exhibiting robust open world application.  ( 2 min )
    Neyman-Pearson Multi-class Classification via Cost-sensitive Learning. (arXiv:2111.04597v2 [stat.ML] UPDATED)
    Most existing classification methods aim to minimize the overall misclassification error rate. However, in applications, different types of errors can have different consequences. Two popular paradigms have been developed to account for this asymmetry issue: the Neyman-Pearson (NP) paradigm and the cost-sensitive (CS) paradigm. Compared to the CS paradigm, the NP paradigm does not require a specification of costs. Most previous works on the NP paradigm focused on the binary case. In this work, we study the multi-class NP problem by connecting it to the CS problem and propose two algorithms. We extend the NP oracle inequalities and consistency from the binary case to the multi-class case, showing that our two algorithms enjoy these properties under certain conditions. The simulation and real data studies demonstrate the effectiveness of our algorithms. To our knowledge, this is the first work to solve the multi-class NP problem via cost-sensitive learning techniques with theoretical guarantees. The proposed algorithms are implemented in the R package npcs on CRAN.  ( 2 min )
    Concentration Inequalities for Two-Sample Rank Processes with Application to Bipartite Ranking. (arXiv:2104.02943v3 [math.ST] UPDATED)
    The ROC curve is the gold standard for measuring the performance of a test/scoring statistic regarding its capacity to discriminate between two statistical populations in a wide variety of applications, ranging from anomaly detection in signal processing to information retrieval, through medical diagnosis. Most practical performance measures used in scoring/ranking applications such as the AUC, the local AUC, the p-norm push, the DCG and others, can be viewed as summaries of the ROC curve. In this paper, the fact that most of these empirical criteria can be expressed as two-sample linear rank statistics is highlighted and concentration inequalities for collections of such random variables, referred to as two-sample rank processes here, are proved, when indexed by VC classes of scoring functions. Based on these nonasymptotic bounds, the generalization capacity of empirical maximizers of a wide class of ranking performance criteria is next investigated from a theoretical perspective. It is also supported by empirical evidence through convincing numerical experiments.  ( 2 min )
    Upper and Lower Bounds on the Performance of Kernel PCA. (arXiv:2012.10369v2 [cs.LG] UPDATED)
    Principal Component Analysis (PCA) is a popular method for dimension reduction and has attracted an unfailing interest for decades. More recently, kernel PCA (KPCA) has emerged as an extension of PCA but, despite its use in practice, a sound theoretical understanding of KPCA is missing. We contribute several lower and upper bounds on the efficiency of KPCA, involving the empirical eigenvalues of the kernel Gram matrix and new quantities involving a notion of variance. These bounds show how much information is captured by KPCA on average and contribute a better theoretical understanding of its efficiency. We demonstrate that fast convergence rates are achievable for a widely used class of kernels and we highlight the importance of some desirable properties of datasets to ensure KPCA efficiency.  ( 2 min )
    Federated Learning Meets Multi-objective Optimization. (arXiv:2006.11489v2 [cs.LG] UPDATED)
    Federated learning has emerged as a promising, massively distributed way to train a joint deep model over large amounts of edge devices while keeping private user data strictly on device. In this work, motivated from ensuring fairness among users and robustness against malicious adversaries, we formulate federated learning as multi-objective optimization and propose a new algorithm FedMGDA+ that is guaranteed to converge to Pareto stationary solutions. FedMGDA+ is simple to implement, has fewer hyperparameters to tune, and refrains from sacrificing the performance of any participating user. We establish the convergence properties of FedMGDA+ and point out its connections to existing approaches. Extensive experiments on a variety of datasets confirm that FedMGDA+ compares favorably against state-of-the-art.  ( 2 min )
    Improving Open-Set Semi-Supervised Learning with Self-Supervision. (arXiv:2301.10127v1 [cs.LG])
    Open-set semi-supervised learning (OSSL) is a realistic setting of semi-supervised learning where the unlabeled training set contains classes that are not present in the labeled set. Many existing OSSL methods assume that these out-of-distribution data are harmful and put effort into excluding data from unknown classes from the training objective. In contrast, we propose an OSSL framework that facilitates learning from all unlabeled data through self-supervision. Additionally, we utilize an energy-based score to accurately recognize data belonging to the known classes, making our method well-suited for handling uncurated data in deployment. We show through extensive experimental evaluations on several datasets that our method shows overall unmatched robustness and performance in terms of closed-set accuracy and open-set recognition compared with state-of-the-art for OSSL. Our code will be released upon publication.  ( 2 min )
    Double Matching Under Complementary Preferences. (arXiv:2301.10230v1 [stat.ML])
    In this paper, we propose a new algorithm for addressing the problem of matching markets with complementary preferences, where agents' preferences are unknown a priori and must be learned from data. The presence of complementary preferences can lead to instability in the matching process, making this problem challenging to solve. To overcome this challenge, we formulate the problem as a bandit learning framework and propose the Multi-agent Multi-type Thompson Sampling (MMTS) algorithm. The algorithm combines the strengths of Thompson Sampling for exploration with a double matching technique to achieve a stable matching outcome. Our theoretical analysis demonstrates the effectiveness of MMTS as it is able to achieve stability at every matching step, satisfies the incentive-compatibility property, and has a sublinear Bayesian regret over time. Our approach provides a useful method for addressing complementary preferences in real-world scenarios.  ( 2 min )
    How Jellyfish Characterise Alternating Group Equivariant Neural Networks. (arXiv:2301.10152v1 [cs.LG])
    We provide a full characterisation of all of the possible alternating group ($A_n$) equivariant neural networks whose layers are some tensor power of $\mathbb{R}^{n}$. In particular, we find a basis of matrices for the learnable, linear, $A_n$-equivariant layer functions between such tensor power spaces in the standard basis of $\mathbb{R}^{n}$. We also describe how our approach generalises to the construction of neural networks that are equivariant to local symmetries.  ( 2 min )
    A Robust Hypothesis Test for Tree Ensemble Pruning. (arXiv:2301.10115v1 [cs.LG])
    Gradient boosted decision trees are some of the most popular algorithms in applied machine learning. They are a flexible and powerful tool that can robustly fit to any tabular dataset in a scalable and computationally efficient way. One of the most critical parameters to tune when fitting these models are the various penalty terms used to distinguish signal from noise in the current model. These penalties are effective in practice, but are lacking in robust theoretical justifications. In this paper we develop and present a novel theoretically justified hypothesis test of split quality for gradient boosted tree ensembles and demonstrate that using this method instead of the common penalty terms leads to a significant reduction in out of sample loss. Additionally, this method provides a theoretically well-justified stopping condition for the tree growing algorithm. We also present several innovative extensions to the method, opening the door for a wide variety of novel tree pruning algorithms.  ( 2 min )
    Inducing Point Allocation for Sparse Gaussian Processes in High-Throughput Bayesian Optimisation. (arXiv:2301.10123v1 [cs.LG])
    Sparse Gaussian Processes are a key component of high-throughput Bayesian Optimisation (BO) loops; however, we show that existing methods for allocating their inducing points severely hamper optimisation performance. By exploiting the quality-diversity decomposition of Determinantal Point Processes, we propose the first inducing point allocation strategy designed specifically for use in BO. Unlike existing methods which seek only to reduce global uncertainty in the objective function, our approach provides the local high-fidelity modelling of promising regions required for precise optimisation. More generally, we demonstrate that our proposed framework provides a flexible way to allocate modelling capacity in sparse models and so is suitable broad range of downstream sequential decision making tasks.  ( 2 min )
    Forecasting the 2016-2017 Central Apennines Earthquake Sequence with a Neural Point Process. (arXiv:2301.09948v1 [physics.geo-ph])
    Point processes have been dominant in modeling the evolution of seismicity for decades, with the Epidemic Type Aftershock Sequence (ETAS) model being most popular. Recent advances in machine learning have constructed highly flexible point process models using neural networks to improve upon existing parametric models. We investigate whether these flexible point process models can be applied to short-term seismicity forecasting by extending an existing temporal neural model to the magnitude domain and we show how this model can forecast earthquakes above a target magnitude threshold. We first demonstrate that the neural model can fit synthetic ETAS data, however, requiring less computational time because it is not dependent on the full history of the sequence. By artificially emulating short-term aftershock incompleteness in the synthetic dataset, we find that the neural model outperforms ETAS. Using a new enhanced catalog from the 2016-2017 Central Apennines earthquake sequence, we investigate the predictive skill of ETAS and the neural model with respect to the lowest input magnitude. Constructing multiple forecasting experiments using the Visso, Norcia and Campotosto earthquakes to partition training and testing data, we target M3+ events. We find both models perform similarly at previously explored thresholds (e.g., above M3), but lowering the threshold to M1.2 reduces the performance of ETAS unlike the neural model. We argue that some of these gains are due to the neural model's ability to handle incomplete data. The robustness to missing data and speed to train the neural model present it as an encouraging competitor in earthquake forecasting.  ( 2 min )
    Adaptive Probabilistic Forecasting of Electricity (Net-)Load. (arXiv:2301.10090v1 [stat.AP])
    We focus on electricity load forecasting under three important specificities. First, our setting is adaptive; we use models taking into account the most recent observations available, yielding a forecasting strategy able to automatically respond to regime changes. Second, we consider probabilistic rather than point forecasting; indeed, uncertainty quantification is required to operate electricity systems efficiently and reliably. Third, we consider both conventional load (consumption only) and netload (consumption less embedded generation). Our methodology relies on the Kalman filter, previously used successfully for adaptive point load forecasting. The probabilistic forecasts are obtained by quantile regressions on the residuals of the point forecasting model. We achieve adaptive quantile regressions using the online gradient descent; we avoid the choice of the gradient step size considering multiple learning rates and aggregation of experts. We apply the method to two data sets: the regional net-load in Great Britain and the demand of seven large cities in the United States. Adaptive procedures improve forecast performance substantially in both use cases and for both point and probabilistic forecasting.  ( 2 min )
    Heterogeneous Domain Adaptation for IoT Intrusion Detection: A Geometric Graph Alignment Approach. (arXiv:2301.09801v1 [cs.CR])
    Data scarcity hinders the usability of data-dependent algorithms when tackling IoT intrusion detection (IID). To address this, we utilise the data rich network intrusion detection (NID) domain to facilitate more accurate intrusion detection for IID domains. In this paper, a Geometric Graph Alignment (GGA) approach is leveraged to mask the geometric heterogeneities between domains for better intrusion knowledge transfer. Specifically, each intrusion domain is formulated as a graph where vertices and edges represent intrusion categories and category-wise interrelationships, respectively. The overall shape is preserved via a confused discriminator incapable to identify adjacency matrices between different intrusion domain graphs. A rotation avoidance mechanism and a centre point matching mechanism is used to avoid graph misalignment due to rotation and symmetry, respectively. Besides, category-wise semantic knowledge is transferred to act as vertex-level alignment. To exploit the target data, a pseudo-label election mechanism that jointly considers network prediction, geometric property and neighbourhood information is used to produce fine-grained pseudo-label assignment. Upon aligning the intrusion graphs geometrically from different granularities, the transferred intrusion knowledge can boost IID performance. Comprehensive experiments on several intrusion datasets demonstrate state-of-the-art performance of the GGA approach and validate the usefulness of GGA constituting components.  ( 2 min )
    Context-specific kernel-based hidden Markov model for time series analysis. (arXiv:2301.09870v1 [stat.ML])
    Traditional hidden Markov models have been a useful tool to understand and model stochastic dynamic linear data; in the case of non-Gaussian data or not linear in mean data, models such as mixture of Gaussian hidden Markov models suffer from the computation of precision matrices and have a lot of unnecessary parameters. As a consequence, such models often perform better when it is assumed that all variables are independent, a hypothesis that may be unrealistic. Hidden Markov models based on kernel density estimation is also capable of modeling non Gaussian data, but they assume independence between variables. In this article, we introduce a new hidden Markov model based on kernel density estimation, which is capable of introducing kernel dependencies using context-specific Bayesian networks. The proposed model is described, together with a learning algorithm based on the expectation-maximization algorithm. Additionally, the model is compared with related HMMs using synthetic and real data. From the results, the benefits in likelihood and classification accuracy from the proposed model are quantified and analyzed.  ( 2 min )
    On Dynamic Regret and Constraint Violations in Constrained Online Convex Optimization. (arXiv:2301.09808v1 [cs.LG])
    A constrained version of the online convex optimization (OCO) problem is considered. With slotted time, for each slot, first an action is chosen. Subsequently the loss function and the constraint violation penalty evaluated at the chosen action point is revealed. For each slot, both the loss function as well as the function defining the constraint set is assumed to be smooth and strongly convex. In addition, once an action is chosen, local information about a feasible set within a small neighborhood of the current action is also revealed. An algorithm is allowed to compute at most one gradient at its point of choice given the described feedback to choose the next action. The goal of an algorithm is to simultaneously minimize the dynamic regret (loss incurred compared to the oracle's loss) and the constraint violation penalty (penalty accrued compared to the oracle's penalty). We propose an algorithm that follows projected gradient descent over a suitably chosen set around the current action. We show that both the dynamic regret and the constraint violation is order-wise bounded by the {\it path-length}, the sum of the distances between the consecutive optimal actions. Moreover, we show that the derived bounds are the best possible.  ( 2 min )
    Optimizing the Noise in Self-Supervised Learning: from Importance Sampling to Noise-Contrastive Estimation. (arXiv:2301.09696v1 [stat.ML])
    Self-supervised learning is an increasingly popular approach to unsupervised learning, achieving state-of-the-art results. A prevalent approach consists in contrasting data points and noise points within a classification task: this requires a good noise distribution which is notoriously hard to specify. While a comprehensive theory is missing, it is widely assumed that the optimal noise distribution should in practice be made equal to the data distribution, as in Generative Adversarial Networks (GANs). We here empirically and theoretically challenge this assumption. We turn to Noise-Contrastive Estimation (NCE) which grounds this self-supervised task as an estimation problem of an energy-based model of the data. This ties the optimality of the noise distribution to the sample efficiency of the estimator, which is rigorously defined as its asymptotic variance, or mean-squared error. In the special case where the normalization constant only is unknown, we show that NCE recovers a family of Importance Sampling estimators for which the optimal noise is indeed equal to the data distribution. However, in the general case where the energy is also unknown, we prove that the optimal noise density is the data density multiplied by a correction term based on the Fisher score. In particular, the optimal noise distribution is different from the data distribution, and is even from a different family. Nevertheless, we soberly conclude that the optimal noise may be hard to sample from, and the gain in efficiency can be modest compared to choosing the noise distribution equal to the data's.  ( 2 min )
    Flexible conditional density estimation for time series. (arXiv:2301.09671v1 [stat.ME])
    This paper introduces FlexCodeTS, a new conditional density estimator for time series. FlexCodeTS is a flexible nonparametric conditional density estimator, which can be based on an arbitrary regression method. It is shown that FlexCodeTS inherits the rate of convergence of the chosen regression method. Hence, FlexCodeTS can adapt its convergence by employing the regression method that best fits the structure of data. From an empirical perspective, FlexCodeTS is compared to NNKCDE and GARCH in both simulated and real data. FlexCodeTS is shown to generally obtain the best performance among the selected methods according to either the CDE loss or the pinball loss.  ( 2 min )
    Improved Rate of First Order Algorithms for Entropic Optimal Transport. (arXiv:2301.09675v1 [math.OC])
    This paper improves the state-of-the-art rate of a first-order algorithm for solving entropy regularized optimal transport. The resulting rate for approximating the optimal transport (OT) has been improved from $\widetilde{{O}}({n^{2.5}}/{\epsilon})$ to $\widetilde{{O}}({n^2}/{\epsilon})$, where $n$ is the problem size and $\epsilon$ is the accuracy level. In particular, we propose an accelerated primal-dual stochastic mirror descent algorithm with variance reduction. Such special design helps us improve the rate compared to other accelerated primal-dual algorithms. We further propose a batch version of our stochastic algorithm, which improves the computational performance through parallel computing. To compare, we prove that the computational complexity of the Stochastic Sinkhorn algorithm is $\widetilde{{O}}({n^2}/{\epsilon^2})$, which is slower than our accelerated primal-dual stochastic mirror algorithm. Experiments are done using synthetic and real data, and the results match our theoretical rates. Our algorithm may inspire more research to develop accelerated primal-dual algorithms that have rate $\widetilde{{O}}({n^2}/{\epsilon})$ for solving OT.  ( 2 min )

  • Open

    Just found a new chrome extension called IntelliMail that uses AI to write emails. Its super easy to use and can be used to land internships, jobs and up your email game.
    submitted by /u/bobsandalex [link] [comments]  ( 40 min )
    I created an 'AI Tools' series on YouTube and I'd love your feedback! Today is About Jasper AI
    I would love to know your opinion about what I should improve on mt videos and of course if you don't know about Jasper AI give a look, it's a great AI tool for Content Creation https://youtu.be/x_6rzsBVABg submitted by /u/sigmabruuh [link] [comments]  ( 6 min )
    OpenAi's breakthrough
    https://twitter.com/make_mhe/status/1618255363580755968 submitted by /u/bradasm [link] [comments]  ( 40 min )
    I asked an AI image generator to show me a "typical Discord user"
    submitted by /u/MSAPW [link] [comments]  ( 40 min )
    AI Dream 150 - SUPERNOVA IMMINENT Part1 TEASER - AI Video vqgan clip
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    I am recommending you something
    It's an amazing app called Simplywrite. It gives you 20 credits for free. It lets you generate articles and outlines. Totally a free tool to use when you bored. LINK TO APP: https://simplywrite.com submitted by /u/benignkirby [link] [comments]  ( 40 min )
    Anonymize tabular data to meet GDPR privacy requirements
    submitted by /u/Repeat-or [link] [comments]  ( 40 min )
    I made a list of tools powered by AI
    submitted by /u/Alen0tv [link] [comments]  ( 40 min )
    How to automatically generate test cases in NLP?
    submitted by /u/Nazma2015 [link] [comments]  ( 40 min )
    Best TTS for cheesy movie trailer guy voice?
    Looking for a TTS that can generate that really deep movie trailer male voice. Any ideas? submitted by /u/infinitycurvature [link] [comments]  ( 40 min )
    Best video style transfer program as of now?
    Looking something where you can capture a face speaking and replace it with a completely different character (including neck, face, shoulders). Looking to replace a human with an animated character. What options are there as of now? I know ebsynth, but it's not too good. There is vtuber software, but they look too "anime". Looking for something that would give a very "photorealistic" result, think Avatar quality output or a Pixar movie. What is the best that's out there right now? submitted by /u/UpperStruggle2421 [link] [comments]  ( 41 min )
    Beware Loab, the digital cryptid lurking in AI's forgotten space
    submitted by /u/Phishstixxx [link] [comments]  ( 40 min )
    Will coders and writers just be doing QA of AI output?
    So 5 years ago, the main career and hobby advice I had was "look to see if programming is a thing for you, it's at least a fun hobby and maybe you can find a niche as one of the last of the custom craftspeople" But now... where do you go, in a society where wealth tends to get sucked upward, to find a good living? Do you learn to roll with being a bot manager? (Assuming the AIs don't just manage themselves?) Do you look for trades of manipulating physical stuff that haven't yet lent themselves to cybernetics? submitted by /u/kirk_is_ [link] [comments]  ( 44 min )
    merge photo with ai
    hi everyone now i have two different photos (not image) and how can i merge thats with using ai. submitted by /u/Aigerim_D40 [link] [comments]  ( 40 min )
    What is your preferred Image Generation API / App?
    It is really difficult to benchmark Text-to-Image AIs, it relies on so many aspects: speed, styles, precision of the prompt, interface, fine-tuning, etc. So I think the best approach is to see which are the most prefered by the people who use Stable Diffusion API. Do not hesitate to explain your choice in comments, and also mention APIs that are not in the Poll, I am limited to 6 options... I know that I did not put Midjourney, Artbreeder, Stable Diffusion, NightCafe, Crayion, Starry AI and many other but I am interested in those which provide API only. PS: this isn't promotional at all, I am not working for any of those companies. View Poll submitted by /u/JerLam2762 [link] [comments]  ( 42 min )
    I asked an AI image generator to show me a "typical Reddit user"
    submitted by /u/MrsChenHW [link] [comments]  ( 41 min )
    Yann LeCun, Meta’s Chief AI Scientist, Has Some Harsh Criticism Of ChatGPT.
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    A New Free AI Model InstructPix2Pix To Transform Images By Plain English Instructions
    submitted by /u/CeFurkan [link] [comments]  ( 40 min )
    Fearing ChatGPT, Google enlists founders Brin and Page in AI fight
    submitted by /u/SAT0725 [link] [comments]  ( 40 min )
    Being really humorous under the pressure of billions of prompt requests
    submitted by /u/Imagine-your-success [link] [comments]  ( 41 min )
    Humanity's Quest to Decode Animal Languages Through AI
    submitted by /u/lambolifeofficial [link] [comments]  ( 40 min )
    The Connection Between Science Fiction and Artificial Intelligence: A Survey Study
    Hello everyone, I am a student in AP Research. For my project, I am conducting a survey to analyze the connection between science fiction and technology (specifically Artificial Intelligence). This survey (linked) asks a few questions about your knowledge of Sci-fi, Artificial Intelligence, and the connection between the two. It should not take more than 10 minutes of your time. If you are interested, the link to the form is below: https://docs.google.com/forms/d/e/1FAIpQLScY_VaNI-CEtTiJiLHgYCCguEZ7m9DUdQoxvFTjXFFLOGu2KA/viewform If you have additional questions, my email is in the linked google form. Thank you for your participation, it is deeply appreciated! submitted by /u/rsantos05 [link] [comments]  ( 41 min )
  • Open

    [D] Alphatensor benchmark code in Colab
    Hello everybody I was wondering if anybody tried to run the main factorisation code https://github.com/deepmind/alphatensor/blob/main/benchmarking/factorizations.py from alpha tensor on Google Colab, with Colab's GPUs ( Tesla T4). I know that Tesla T4 is not as the same as the V100 used in Deep Mind's paper, however, I can see that the tensor formulation for the matrix multiplication is highly inefficient, compared to standard JAX matrix multiplication. Any suggestion where am I wrong? submitted by /u/IndependentIce4553 [link] [comments]  ( 42 min )
    [N] Upcoming talk: "Open Problems in Deep Neural Networks: An Optimal Control Perspective"
    Open Problems in Deep Neural Networks: An Optimal Control Perspective Feb 13, 6:30 ET About the talk: Backpropagation is a widely used algorithm for training neural networks. Its key step, Stochastic Gradient Descent (SGD) has become one of the bedrocks of deep learning. Despite wide adoption, mathematically rigorous study of SGD's convergence for deep neural networks is still ongoing. Join us as graduate student Amoolya Tirumalai discusses an approach to the convergence problem inspired by optimal control theory. Following the Pontryargin maximum principle, an alternative forward-backward iterative system will be described. Toy examples will be shown, and problems in robustness and security will be discussed. About the speaker: Amoolya Tirumalai is a 4th year PhD Candidate in Electrical Engineering at the University of Maryland, College Park. His interests are (robust) optimal control, partial differential equations, differential games, mean-field games, safety-critical control, and (robust) machine learning. His thesis titled 'Multi-agent inference, decision-making and control: models, structure and performance evaluation' will be defended in August 2023. Mr. Tirumalai was conferred a BS in Biomedical Engineering from the Georgia Institute of Technology in 2018. submitted by /u/what_comes_next [link] [comments]  ( 43 min )
    [D] Publication Resume
    If we submit a publication to ICML and it is under anonymous review, can I list the title and authors on my resume which will be on my personal webpage? submitted by /u/BigDreamx [link] [comments]  ( 43 min )
    [R] Blogpost on comparing Chatbots like ChatGPT, LaMDA, Sparrow, BlenderBot 3, and Claude
    https://huggingface.co/blog/dialog-agents breaks down the techniques behind ChatGPT -- instruction fine-tuning, supervised fine-tuning, chain-of-thought, read teaming, and more. https://preview.redd.it/fv16fsemd9ea1.png?width=889&format=png&auto=webp&s=a8f24de27c40a946fec64eaa674f81ddef0d0cc3 submitted by /u/emailnazneen [link] [comments]  ( 42 min )
    [D] Efficient retrieval of research information for graduate research
    I have lot of notes about research papers in a particular directory and the number of files has started to become larger than what I can remember off the top of my head. It will continue to keep growing and I have begun to wonder the most efficient way to retrieve the information. I could use ripgrep and regular expressions to find the notes efficiently, but I imagine that if the database is very huge and I don't have the correct regular expression in use, then I might not retrieve the correct files. Inspired by chatGPT, I was impressed at how it presents info from the internet and speeds up my time for finding information even when I do not know the correct keywords. I figured a NLP model primarily trained on my database would be an easier task and I was wondering if someone had already created something like this as open source or how would they go about it? submitted by /u/waterstrider123 [link] [comments]  ( 46 min )
    [D] Accurate data or more data?
    If you are building a model and had the choice, would you prefer more accurate (~99%) but less data or a lot more data but less accurate (~90%)? submitted by /u/NoSympathy9787 [link] [comments]  ( 43 min )
    [R] Best service for scientific paper correction
    Hello, Anyone ever used a paper revision service and can recommend one ? I’m publishing my first paper next month and I want to have feedback from an expert on this domain. Thanks ! submitted by /u/Meddhouib10 [link] [comments]  ( 43 min )
    [D] Self-Supervised Contrastive Approaches that don’t use large batch size.
    This thread is dedicated to exploring the various techniques used in self-supervised contrastive learning that utilize standard batch sizes. I am seeking information on the current methods in this field, specifically those that do not rely on large batch sizes. I am familiar with the SimSiam paper published by META research, which utilizes 256 batch size for 8-GPUs. However, for individuals with limited resources such as myself, access to a large number of GPUs may not be feasible. As a result, I am interested in learning about other methods that can be used with smaller batch sizes and a single GPU, such as those that would be suitable for training on 1024x1024 input images. Additionally, I am curious about any more efficient architectures that have been developed in this field. This includes, but is not limited to, techniques used in natural language processing that may have applications in other areas of artificial intelligence. ***posted the same question in PyTorch forums, reposting here for wider reach. submitted by /u/shingekichan1996 [link] [comments]  ( 46 min )
    [R] Tsetlin Machine in Medical Research - Striking Differences Between Tsetlin Machine Interpretability and Deep Learning Attention
    ​ Tsetlin machine interpretability vs deep learning attention. Researchers at West China Hospital, Sichuan University, NORCE, and UiA have developed a Tsetlin machine-based architecture for premature ventricular contraction identification by analyzing long-term ECG signals. The experiments show that the Tsetlin machine is capable of producing human-interpretable rules, consistent with the clinical standard and medical knowledge. Simultaneously, the accuracy was comparable with deep CNN-based models. Paper: https://arxiv.org/abs/2301.10181 submitted by /u/olegranmo [link] [comments]  ( 43 min )
    [R] INSTRUCTOR One Embedder , Any Task: Instruction-Finetuned Text Embeddings Paper Explanation and Collab Demo
    In this video I explain about INSTRUCTOR, an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation, etc.) and domains (e.g., science, finance, etc.) by simply providing the task instruction, without any finetuning. Instructor achieves sota on 70 diverse embedding tasks! I also show a google collab demo of instructor https://youtu.be/vg38cq3KJ6M submitted by /u/Sea-Photo5230 [link] [comments]  ( 42 min )
    [R] Easiest way to train RNN's in MATLAB or Julia?
    I work as as a researcher and am kind of new to neural networks. I have an RNN (1e4 x 1e4 network) that I would like to train in either MATLAB or Julia. One option I considered is writing my own code for Hessian-free optimization, but the implementational details are really, really hard to figure out. I am aware there is a Theano or TF implementation of HFO but I I am primarily interested in having the code in MATLAB/Julia. Also, are there better/alternative techniques than Hessian-free optimization for training RNN's ? submitted by /u/NadaBrothers [link] [comments]  ( 44 min )
    Can an AI model licensed under the BigScience RAIL License v1.0 such as BLOOM be used in a program that is useful for any domain? [D]
    Example: the AI model BLOOM) is licensed under the BigScience RAIL License v1.0. The BigScience RAIL License v1.0 forbids that some types of usages: You agree not to use the Model or Derivatives of the Model: [...] To provide medical advice and medical results interpretation; To generate or disseminate information for the purpose to be used for administration of justice, law enforcement, immigration or asylum processes, such as predicting an individual will commit fraud/crime commitment (e.g. by text profiling, drawing causal relationships between assertions made in documents, indiscriminate and arbitrarily-targeted use). Am I allowed to use BLOOM in a program that is useful for any domain (e.g., a program to summarize or paraphrase some text, or perform question-answer on a text, or generate questions and their answers based on the text)? Since people could use the program for any domain, they could technically, for example, use the program to summarize a medical report or generate questions and their answers based on some asylum process to distribute to potential applicants. submitted by /u/Franck_Dernoncourt [link] [comments]  ( 44 min )
  • Open

    DM Control Suite vs. Original Environments
    I’m testing out DM control suite as I’d ideally like to do some stuff with the MuJoCo environments, e.g. Hopper. However, it seems as though they’ve changed Hopper from the OpenAI version? For instance, the action space is now 4-dimensional, and the bigger concern for me is that the reward seems to be specified differently. Per the gym documentation, the reward was healthy_reward + forward_reward - ctrl_cost, but when I’ve just started using the control suite version all rewards seem to be 0. The documentation for the control suite is quite poor, it says that for the hop task it is rewarded for torso heigh and forward velocity. It also doesn’t explain what the action dimensions correspond to (including the new dimension), so I can’t even manually test it! submitted by /u/DefinitelyNot4Burner [link] [comments]  ( 41 min )
    Does action masking reduce the ability of the agent to learn game rules?
    I recently experimented with training an sb3 PPO agent on a pretty complicated board game environment (just for fun). At first, I did regular PPO with an invalid action penalty, but it was making a lot of invalid moves and thus getting penalized and terminated early. It very slowly picked up on the signal and started to learn, but much too slowly to get any good results. After days of training, it could usually only play a handful of opening moves. On the other hand, I trained a Masked PPO in the same environment and it rapidly became quite good and was able to play relatively competitively after a few days of training. However, when I examined the outputs in an unmasked setting, it had little-to-no understanding of the game rules. It could still play OK but did not rank valid moves as the highest. This is a problem because I wanted to use it in a non-simulator setting without having to explicitly manually mask the moves by hand (or else convert a game state to a mask, both of which are tedious in my situation). Is this behavior expected? I have read some analyses that suggest that 1) MaskedPPO is much more sample efficient and should converge to a stronger agent MUCH faster, which makes sense, but also that 2) Even despite the invalid action masking, the agent should still learn game mechanics by proxy. If it's only being rewarded for making valid moves, it should learn to not make invalid moves implicitly since it never gets a reward signal for them (rather than being explicitly penalized). Thoughts? I only have a weak background in RL so apologies if this is naive. TLDR: Does action masking make the policy (or reward) network lazy? submitted by /u/TobusFire [link] [comments]  ( 44 min )
    Any List of Videogames With Reinforced Learning Agents Developed?
    Is there any list of videogames for which agents using reinforcement learning have been developed? Enquiring minds wanna know. submitted by /u/sanman [link] [comments]  ( 42 min )
    Learning to Exploit Elastic Actuators for Quadruped Locomotion
    submitted by /u/araffin2 [link] [comments]  ( 40 min )
    What is the limit on parallel environments?
    Is there some sort of hard or practical limit on the number of parallel environments that can be used? In Rllib when I try to use more 7 or 8 I get a scheduling error but yet I see people talking about 32 or 512 environments. What’s the limit? Is there some way for me to increase the amount I can train on? For example, my GPU seems under utilised but my CPU is very stressed, can I incrrrase GPU usage in Rllib? I have already set the number of GPUs to one. submitted by /u/centripetalstranger [link] [comments]  ( 42 min )
    Weird convergence of PPO reward when reducing number of envs
    Hi all, I am using Isaac Gym which enables the usage of multi environments. However, the reward value from the best environment has a huge difference, when training the agent with 512 environment (green) and 32 environment (orange), see below. I understand that the training should be slower when using less environments at the same time, but this difference tells me that I am missing something here... Does anyone have some hints? https://preview.redd.it/3rf0ax8653ea1.png?width=1589&format=png&auto=webp&s=b12defff668f186381b32c9f0385499b4413a3cd Below you can see the configs that I used for the PPO algorithm: config: name: ${resolve_default:CustomTask,${....experiment}} full_experiment_name: ${.name} env_name: rlgpu ppo: True mixed_precision: False normalize_input: True normalize_value: True value_bootstrap: True num_actors: ${....task.env.numEnvs} reward_shaper: scale_value: 1.0 normalize_advantage: True gamma: 0.99 tau: 0.95 learning_rate: 5e-4 lr_schedule: adaptive kl_threshold: 0.008 score_to_win: 10000000 max_epochs: ${resolve_default:5000,${....max_iterations}} save_best_after: 200 save_frequency: 100 print_stats: False use_action_masks: False grad_norm: 1.0 entropy_coef: 0.0001 truncate_grads: True e_clip: 0.2 horizon_length: 32 # num_envs * horizon length % minibatch_size minibatch_size: 1024 mini_epochs: 8 critic_coef: 4 clip_value: True seq_len: 4 bounds_loss_coef: 0.0001 submitted by /u/Fun-Moose-3841 [link] [comments]  ( 41 min )
  • Open

    Ten Productivity Hacks using ChaptGPT Generative AI Prompts
    Generative AI is suddenly everywhere. Because of this, the future of AI looks very bright indeed. There are many opportunities for generative AI to impact life and business in both positive and negative ways in the near future. Because the consequence of negative human impacts can easily far outweigh the benefits of positive human impacts, the… Read More »Ten Productivity Hacks using ChaptGPT Generative AI Prompts The post Ten Productivity Hacks using ChaptGPT Generative AI Prompts appeared first on Data Science Central.  ( 21 min )
    Innovation at the Convergence of Emerging Technologies: Business at the Edge
    In the context of digital transformation and innovation, there is no lack of “hot topics” to discuss. Emerging technologies are truly emerging everywhere. What is most exciting – and what demonstrates their greatest promise – is that these new technologies are converging to produce innovative new businesses, products, and services. Over the past decade, we… Read More »Innovation at the Convergence of Emerging Technologies: Business at the Edge The post Innovation at the Convergence of Emerging Technologies: Business at the Edge appeared first on Data Science Central.  ( 22 min )
    Five Principles of Safe Driving in AIS (Autonomous Intelligent Systems)
    In a recent article on Autonomous Intelligent Systems (AIS) [1], Ajit Joakar described various features and characteristics of such systems, including associated technologies and research areas, building blocks and core elements, critical factors for success, and cross-cutting enablers. He introduces AIS as an “emerging interdisciplinary field that deals with situations where humans interact with AI systems… Read More »Five Principles of Safe Driving in AIS (Autonomous Intelligent Systems) The post Five Principles of Safe Driving in AIS (Autonomous Intelligent Systems) appeared first on Data Science Central.  ( 23 min )
  • Open

    Learning with Queried Hints
    Posted by Sreenivas Gollapudi, Senior Staff Research Scientist, and Kostas Kollias, Staff Research Scientist, Google Research, Algorithms & Optimization Team In many computing applications the system needs to make decisions to serve requests that arrive in an online fashion. Consider, for instance, the example of a navigation app that responds to driver requests. In such settings there is inherent uncertainty about important aspects of the problem. For example, the preferences of the driver with respect to features of the route are often unknown and the delays of road segments can be uncertain. The field of online machine learning studies such settings and provides various techniques for decision-making problems under uncertainty. A navigation engine has to decide how to route thi…  ( 92 min )
  • Open

    Newbie in need – very bad NLP text generator needed
    Hell y'all, after spending an hour typing various combinations of "AI", "TEXT GENERATOR", "DATA FEEDING" and such I come to you with a humble request; can someone recommend me an AI text generator that needs to be fed actual, existing text(s) instead of giving it a prompt? I need something that will create a text based on an essay I will upload, and I really don't need the result to be super great. I will accept and in fact warmly welcome any nonsensical output, as I need AI to spew out trash, actually. I just need that trash to resemble things that actually exist. Grammatical errors are great, idiotic sentences like "actually, when the orange fades into original edition of Alexander, then how do we expect the succession to leave Canada?" is what I need, the less the final product sounds like what a human could write, the better. I don't have the proper lingo to google what I need to find, so I am deeply grateful for any suggestions. submitted by /u/lindybopperette [link] [comments]  ( 42 min )
    Computer Vision Development
    Hey!! I'm new to this side of things. I studied psych research, always had an interest in data visualization and neuroscience but didn't realize I should be piecing the two interests together and I have been too intimidated to take on the task of learning computer science. But I can't help myself any longer! I'm so fascinated and think reddit could be a great place to learn and chat about concepts. Any who... YA! I've started watching https://www.youtube.com/watch?v=vT1JzLTH4G4&list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv&index=1 and already can't believe we haven't solved the process of vision. Have we? Can we? The meta is getting to me. submitted by /u/angelacarolei [link] [comments]  ( 41 min )
  • Open

    Research Focus: Week of January 23, 2023
    Welcome to Research Focus, a new series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft. Revolutionizing Document AI with multimodal document foundation models   Organizations must digitize various documents, many with charts and images, to manage and streamline essential functions. Yet manually […] The post Research Focus: Week of January 23, 2023 appeared first on Microsoft Research.  ( 8 min )
    Biomedical Research Platform Terra Now Available on Microsoft Azure
    We stand at the threshold of a new era of precision medicine, where health and life sciences data hold the potential to dramatically propel and expand our understanding and treatment of human disease. One of the tools that we believe will help to enable precision medicine is Terra, the secure biomedical research platform co-developed by […] The post Biomedical Research Platform Terra Now Available on Microsoft Azure appeared first on Microsoft Research.  ( 9 min )
  • Open

    Build a loyalty points anomaly detector using Amazon Lookout for Metrics
    Today, gaining customer loyalty cannot be a one-off thing. A brand needs a focused and integrated plan to retain its best customers—put simply, it needs a customer loyalty program. Earn and burn programs are one of the main paradigms. A typical earn and burn program rewards customers after a certain number of visits or spend. […]  ( 7 min )
    Explain text classification model predictions using Amazon SageMaker Clarify
    Model explainability refers to the process of relating the prediction of a machine learning (ML) model to the input feature values of an instance in humanly understandable terms. This field is often referred to as explainable artificial intelligence (XAI). Amazon SageMaker Clarify is a feature of Amazon SageMaker that enables data scientists and ML engineers […]  ( 10 min )
    Upscale images with Stable Diffusion in Amazon SageMaker JumpStart
    In November 2022, we announced that AWS customers can generate images from text with Stable Diffusion models in Amazon SageMaker JumpStart. Today, we announce a new feature that lets you upscale images (resize images without losing quality) with Stable Diffusion models in JumpStart. An image that is low resolution, blurry, and pixelated can be converted […]  ( 10 min )
    Cohere brings language AI to Amazon SageMaker
    It’s an exciting day for the development community. Cohere’s state-of-the-art language AI is now available through Amazon SageMaker. This makes it easier for developers to deploy Cohere’s pre-trained generation language model to Amazon SageMaker, an end-to-end machine learning (ML) service. Developers, data scientists, and business analysts use Amazon SageMaker to build, train, and deploy ML models quickly and easily using its fully managed infrastructure, tools, and workflows.  ( 6 min )
  • Open

    Braced From Space: Startup Keeps Watchful Eye on Gas Pipeline Leaks Across the Globe
    As its name suggests, Orbital Sidekick is creating technology that acts as a buddy in outer space, keeping an eye on the globe using satellites to help keep it safe and sustainable. The San Francisco-based startup, a member of the NVIDIA Inception program, enables commercial and government users to optimize sustainable operations and security with Read article >  ( 6 min )
    NVIDIA CEO Ignites AI Conversation in Stockholm
    Jensen Huang headlines Stockholm AI confab, Berzelius supercomputer upgraded to 94 NVIDIA DGX A100 systems.  ( 6 min )
  • Open

    Number of bits in a particular integer
    When I think of bit twiddling, I think of C. So I was surprised to read Paul Khuong saying he thinks of Common Lisp (“CL”). As always when working with bits, I first doodled in SLIME/SBCL: CL’s bit manipulation functions are more expressive than C’s, and a REPL helps exploration. I would not have thought […] Number of bits in a particular integer first appeared on John D. Cook.  ( 5 min )
  • Open

    Deep Learning-Based Assessment of Cerebral Microbleeds in COVID-19. (arXiv:2301.09322v1 [eess.IV])
    Cerebral Microbleeds (CMBs), typically captured as hypointensities from susceptibility-weighted imaging (SWI), are particularly important for the study of dementia, cerebrovascular disease, and normal aging. Recent studies on COVID-19 have shown an increase in CMBs of coronavirus cases. Automatic detection of CMBs is challenging due to the small size and amount of CMBs making the classes highly imbalanced, lack of publicly available annotated data, and similarity with CMB mimics such as calcifications, irons, and veins. Hence, the existing deep learning methods are mostly trained on very limited research data and fail to generalize to unseen data with high variability and cannot be used in clinical setups. To this end, we propose an efficient 3D deep learning framework that is actively trained on multi-domain data. Two public datasets assigned for normal aging, stroke, and Alzheimer's disease analysis as well as an in-house dataset for COVID-19 assessment are used to train and evaluate the models. The obtained results show that the proposed method is robust to low-resolution images and achieves 78% recall and 80% precision on the entire test set with an average false positive of 1.6 per scan.  ( 2 min )
    Large-scale fine-grained semantic indexing of biomedical literature based on weakly-supervised deep learning. (arXiv:2301.09350v1 [cs.CL])
    Semantic indexing of biomedical literature is usually done at the level of MeSH descriptors, representing topics of interest for the biomedical community. Several related but distinct biomedical concepts are often grouped together in a single coarse-grained descriptor and are treated as a single topic for semantic indexing. This study proposes a new method for the automated refinement of subject annotations at the level of concepts, investigating deep learning approaches. Lacking labelled data for this task, our method relies on weak supervision based on concept occurrence in the abstract of an article. The proposed approach is evaluated on an extended large-scale retrospective scenario, taking advantage of concepts that eventually become MeSH descriptors, for which annotations become available in MEDLINE/PubMed. The results suggest that concept occurrence is a strong heuristic for automated subject annotation refinement and can be further enhanced when combined with dictionary-based heuristics. In addition, such heuristics can be useful as weak supervision for developing deep learning models that can achieve further improvement in some cases.  ( 2 min )
    Counterfactual (Non-)identifiability of Learned Structural Causal Models. (arXiv:2301.09031v1 [stat.ML])
    Recent advances in probabilistic generative modeling have motivated learning Structural Causal Models (SCM) from observational datasets using deep conditional generative models, also known as Deep Structural Causal Models (DSCM). If successful, DSCMs can be utilized for causal estimation tasks, e.g., for answering counterfactual queries. In this work, we warn practitioners about non-identifiability of counterfactual inference from observational data, even in the absence of unobserved confounding and assuming known causal structure. We prove counterfactual identifiability of monotonic generation mechanisms with single dimensional exogenous variables. For general generation mechanisms with multi-dimensional exogenous variables, we provide an impossibility result for counterfactual identifiability, motivating the need for parametric assumptions. As a practical approach, we propose a method for estimating worst-case errors of learned DSCMs' counterfactual predictions. The size of this error can be an essential metric for deciding whether or not DSCMs are a viable approach for counterfactual inference in a specific problem setting. In evaluation, our method confirms negligible counterfactual errors for an identifiable SCM from prior work, and also provides informative error bounds on counterfactual errors for a non-identifiable synthetic SCM.  ( 2 min )
    Parallel Approaches to Accelerate Bayesian Decision Trees. (arXiv:2301.09090v1 [stat.CO])
    Markov Chain Monte Carlo (MCMC) is a well-established family of algorithms primarily used in Bayesian statistics to sample from a target distribution when direct sampling is challenging. Existing work on Bayesian decision trees uses MCMC. Unfortunately, this can be slow, especially when considering large volumes of data. It is hard to parallelise the accept-reject component of the MCMC. None-the-less, we propose two methods for exploiting parallelism in the MCMC: in the first, we replace the MCMC with another numerical Bayesian approach, the Sequential Monte Carlo (SMC) sampler, which has the appealing property that it is an inherently parallel algorithm; in the second, we consider data partitioning. Both methods use multi-core processing with a HighPerformance Computing (HPC) resource. We test the two methods in various study settings to determine which method is the most beneficial for each test case. Experiments show that data partitioning has limited utility in the settings we consider and that the use of the SMC sampler can improve run-time (compared to the sequential implementation) by up to a factor of 343.  ( 2 min )
    Learning Reservoir Dynamics with Temporal Self-Modulation. (arXiv:2301.09235v1 [cs.LG])
    Reservoir computing (RC) can efficiently process time-series data by transferring the input signal to randomly connected recurrent neural networks (RNNs), which are referred to as a reservoir. The high-dimensional representation of time-series data in the reservoir significantly simplifies subsequent learning tasks. Although this simple architecture allows fast learning and facile physical implementation, the learning performance is inferior to that of other state-of-the-art RNN models. In this paper, to improve the learning ability of RC, we propose self-modulated RC (SM-RC), which extends RC by adding a self-modulation mechanism. The self-modulation mechanism is realized with two gating variables: an input gate and a reservoir gate. The input gate modulates the input signal, and the reservoir gate modulates the dynamical properties of the reservoir. We demonstrated that SM-RC can perform attention tasks where input information is retained or discarded depending on the input signal. We also found that a chaotic state emerged as a result of learning in SM-RC. This indicates that self-modulation mechanisms provide RC with qualitatively different information-processing capabilities. Furthermore, SM-RC outperformed RC in NARMA and Lorentz model tasks. In particular, SM-RC achieved a higher prediction accuracy than RC with a reservoir 10 times larger in the Lorentz model tasks. Because the SM-RC architecture only requires two additional gates, it is physically implementable as RC, providing a new direction for realizing edge AI.  ( 2 min )
    Dataset Distillation: A Comprehensive Review. (arXiv:2301.07014v2 [cs.LG] UPDATED)
    Recent success of deep learning is largely attributed to the sheer amount of data used for training deep neural networks.Despite the unprecedented success, the massive data, unfortunately, significantly increases the burden on storage and transmission and further gives rise to a cumbersome model training process. Besides, relying on the raw data for training \emph{per se} yields concerns about privacy and copyright. To alleviate these shortcomings, dataset distillation~(DD), also known as dataset condensation (DC), was introduced and has recently attracted much research attention in the community. Given an original dataset, DD aims to derive a much smaller dataset containing synthetic samples, based on which the trained models yield performance comparable with those trained on the original dataset. In this paper, we give a comprehensive review and summary of recent advances in DD and its application. We first introduce the task formally and propose an overall algorithmic framework followed by all existing DD methods. Next, we provide a systematic taxonomy of current methodologies in this area, and discuss their theoretical interconnections. We also present current challenges in DD through extensive experiments and envision possible directions for future works.  ( 2 min )
    Talking About Large Language Models. (arXiv:2212.03551v3 [cs.CL] UPDATED)
    Thanks to rapid progress in artificial intelligence, we have entered an era when technology and philosophy intersect in interesting ways. Sitting squarely at the centre of this intersection are large language models (LLMs). The more adept LLMs become at mimicking human language, the more vulnerable we become to anthropomorphism, to seeing the systems in which they are embedded as more human-like than they really are. This trend is amplified by the natural tendency to use philosophically loaded terms, such as "knows", "believes", and "thinks", when describing these systems. To mitigate this trend, this paper advocates the practice of repeatedly stepping back to remind ourselves of how LLMs, and the systems of which they form a part, actually work. The hope is that increased scientific precision will encourage more philosophical nuance in the discourse around artificial intelligence, both within the field and in the public sphere.  ( 2 min )
    Generative Adversarial Networks to infer velocity components in rotating turbulent flows. (arXiv:2301.07541v1 [physics.flu-dyn] CROSS LISTED)
    Inference problems for two-dimensional snapshots of rotating turbulent flows are studied. We perform a systematic quantitative benchmark of point-wise and statistical reconstruction capabilities of the linear Extended Proper Orthogonal Decomposition (EPOD) method, a non-linear Convolutional Neural Network (CNN) and a Generative Adversarial Network (GAN). We attack the important task of inferring one velocity component out of the measurement of a second one, and two cases are studied: (I) both components lay in the plane orthogonal to the rotation axis and (II) one of the two is parallel to the rotation axis. We show that EPOD method works well only for the former case where both components are strongly correlated, while CNN and GAN always outperform EPOD both concerning point-wise and statistical reconstructions. For case (II), when the input and output data are weakly correlated, all methods fail to reconstruct faithfully the point-wise information. In this case, only GAN is able to reconstruct the field in a statistical sense. The analysis is performed using both standard validation tools based on L2 spatial distance between the prediction and the ground truth and more sophisticated multi-scale analysis using wavelet decomposition. Statistical validation is based on standard Jensen-Shannon divergence between the probability density functions, spectral properties and multi-scale flatness.  ( 2 min )
    Reconstructing Rayleigh-Benard flows out of temperature-only measurements using Physics-Informed Neural Networks. (arXiv:2301.07769v1 [physics.flu-dyn] CROSS LISTED)
    We investigate the capabilities of Physics-Informed Neural Networks (PINNs) to reconstruct turbulent Rayleigh-Benard flows using only temperature information. We perform a quantitative analysis of the quality of the reconstructions at various amounts of low-passed-filtered information and turbulent intensities. We compare our results with those obtained via nudging, a classical equation-informed data assimilation technique. At low Rayleigh numbers, PINNs are able to reconstruct with high precision, comparable to the one achieved with nudging. At high Rayleigh numbers, PINNs outperform nudging and are able to achieve satisfactory reconstruction of the velocity fields only when data for temperature is provided with high spatial and temporal density. When data becomes sparse, the PINNs performance worsens, not only in a point-to-point error sense but also, and contrary to nudging, in a statistical sense, as can be seen in the probability density functions and energy spectra.  ( 2 min )
    Clustering Categorical Data: Soft Rounding k-modes. (arXiv:2210.09640v2 [cs.LG] UPDATED)
    Over the last three decades, researchers have intensively explored various clustering tools for categorical data analysis. Despite the proposal of various clustering algorithms, the classical k-modes algorithm remains a popular choice for unsupervised learning of categorical data. Surprisingly, our first insight is that in a natural generative block model, the k-modes algorithm performs poorly for a large range of parameters. We remedy this issue by proposing a soft rounding variant of the k-modes algorithm (SoftModes) and theoretically prove that our variant addresses the drawbacks of the k-modes algorithm in the generative model. Finally, we empirically verify that SoftModes performs well on both synthetic and real-world datasets.  ( 2 min )
    Indirect Active Learning. (arXiv:2206.01454v3 [math.ST] UPDATED)
    Traditional models of active learning assume a learner can directly manipulate or query a covariate $X$ in order to study its relationship with a response $Y$. However, if $X$ is a feature of a complex system, it may be possible only to indirectly influence $X$ by manipulating a control variable $Z$, a scenario we refer to as Indirect Active Learning. Under a nonparametric model of Indirect Active Learning with a fixed budget, we study minimax convergence rates for estimating the relationship between $X$ and $Y$ locally at a point, obtaining different rates depending on the complexities and noise levels of the relationships between $Z$ and $X$ and between $X$ and $Y$. We also identify minimax rates for passive learning under comparable assumptions. In many cases, our results show that, while there is an asymptotic benefit to active learning, this benefit is fully realized by a simple two-stage learner that runs two passive experiments in sequence.  ( 2 min )
    A Multi-Phase Approach for Product Hierarchy Forecasting in Supply Chain Management: Application to MonarchFx Inc. (arXiv:2006.08931v2 [stat.ML] UPDATED)
    Hierarchical time series demands exist in many industries and are often associated with the product, time frame, or geographic aggregations. Traditionally, these hierarchies have been forecasted using top-down, bottom-up, or middle-out approaches. The question we aim to answer is how to utilize child-level forecasts to improve parent-level forecasts in a hierarchical supply chain. Improved forecasts can be used to considerably reduce logistics costs, especially in e-commerce. We propose a novel multi-phase hierarchical (MPH) approach. Our method involves forecasting each series in the hierarchy independently using machine learning models, then combining all forecasts to allow a second phase model estimation at the parent level. Sales data from MonarchFx Inc. (a logistics solutions provider) is used to evaluate our approach and compare it to bottom-up and top-down methods. Our results demonstrate an 82-90% improvement in forecast accuracy using the proposed approach. Using the proposed method, supply chain planners can derive more accurate forecasting models to exploit the benefit of multivariate data.
    Learning-Based Data Storage [Vision] (Technical Report). (arXiv:2206.05778v3 [cs.DB] UPDATED)
    Deep neural network (DNN) and its variants have been extensively used for a wide spectrum of real applications such as image classification, face/speech recognition, fraud detection, and so on. In addition to many important machine learning tasks, as artificial networks emulating the way brain cells function, DNNs also show the capability of storing non-linear relationships between input and output data, which exhibits the potential of storing data via DNNs. We envision a new paradigm of data storage, "DNN-as-a-Database", where data are encoded in well-trained machine learning models. Compared with conventional data storage that directly records data in raw formats, learning-based structures (e.g., DNN) can implicitly encode data pairs of inputs and outputs and compute/materialize actual output data of different resolutions only if input data are provided. This new paradigm can greatly enhance the data security by allowing flexible data privacy settings on different levels, achieve low space consumption and fast computation with the acceleration of new hardware (e.g., Diffractive Neural Network and AI chips), and can be generalized to distributed DNN-based storage/computing. In this paper, we propose this novel concept of learning-based data storage, which utilizes a learning structure called learning-based memory unit (LMU), to store, organize, and retrieve data. As a case study, we use DNNs as the engine in the LMU, and study the data capacity and accuracy of the DNN-based data storage. Our preliminary experimental results show the feasibility of the learning-based data storage by achieving high (100%) accuracy of the DNN storage. We explore and design effective solutions to utilize the DNN-based data storage to manage and query relational tables. We discuss how to generalize our solutions to other data types (e.g., graphs) and environments such as distributed DNN storage/computing.
    Estimating individual treatment effects under unobserved confounding using binary instruments. (arXiv:2208.08544v3 [stat.ME] UPDATED)
    Estimating conditional average treatment effects (CATEs) from observational data is relevant in many fields such as personalized medicine. However, in practice, the treatment assignment is usually confounded by unobserved variables and thus introduces bias. A remedy to remove the bias is the use of instrumental variables (IVs). Such settings are widespread in medicine (e.g., trials where the treatment assignment is used as binary IV). In this paper, we propose a novel, multiply robust machine learning framework, called MRIV, for estimating CATEs using binary IVs and thus yield an unbiased CATE estimator. Different from previous work for binary IVs, our framework estimates the CATE directly via a pseudo outcome regression. (1)~We provide a theoretical analysis where we show that our framework yields multiple robust convergence rates: our CATE estimator achieves fast convergence even if several nuisance estimators converge slowly. (2)~We further show that our framework asymptotically outperforms state-of-the-art plug-in IV methods for CATE estimation, in the sense that it achieves a faster rate of convergence if the CATE is smoother than the individual outcome surfaces. (3)~We build upon our theoretical results and propose a tailored deep neural network architecture called MRIV-Net for CATE estimation using binary IVs. Across various computational experiments, we demonstrate empirically that our MRIV-Net achieves state-of-the-art performance. To the best of our knowledge, our MRIV is the first multiply robust machine learning framework tailored to estimating CATEs in the binary IV setting.
    Computationally-efficient initialisation of GPs: The generalised variogram method. (arXiv:2210.05394v2 [cs.LG] UPDATED)
    We present a computationally-efficient strategy to find the hyperparameters of a Gaussian process (GP) avoiding the computation of the likelihood function. The found hyperparameters can then be used directly for regression or passed as initial conditions to maximum-likelihood (ML) training. Motivated by the fact that training a GP via ML is equivalent (on average) to minimising the KL-divergence between the true and learnt model, we set to explore different metrics/divergences among GPs that are computationally inexpensive and provide estimates close to those of ML. In particular, we identify the GP hyperparameters by projecting the empirical covariance or (Fourier) power spectrum onto a parametric family, thus proposing and studying various measures of discrepancy operating on the temporal or frequency domains. Our contribution extends the Variogram method developed by the geostatistics literature and, accordingly, it is referred to as the Generalised Variogram method (GVM). In addition to the theoretical presentation of GVM, we provide experimental validation in terms of accuracy, consistency with ML and computational complexity for different kernels using synthetic and real-world data.
    FedRolex: Model-Heterogeneous Federated Learning with Rolling Sub-Model Extraction. (arXiv:2212.01548v2 [cs.LG] UPDATED)
    Most cross-device federated learning (FL) studies focus on the model-homogeneous setting where the global server model and local client models are identical. However, such constraint not only excludes low-end clients who would otherwise make unique contributions to model training but also restrains clients from training large models due to on-device resource bottlenecks. In this work, we propose FedRolex, a partial training (PT)-based approach that enables model-heterogeneous FL and can train a global server model larger than the largest client model. At its core, FedRolex employs a rolling sub-model extraction scheme that allows different parts of the global server model to be evenly trained, which mitigates the client drift induced by the inconsistency between individual client models and server model architectures. We show that FedRolex outperforms state-of-the-art PT-based model-heterogeneous FL methods (e.g. Federated Dropout) and reduces the gap between model-heterogeneous and model-homogeneous FL, especially under the large-model large-dataset regime. In addition, we provide theoretical statistical analysis on its advantage over Federated Dropout and evaluate FedRolex on an emulated real-world device distribution to show that FedRolex can enhance the inclusiveness of FL and boost the performance of low-end devices that would otherwise not benefit from FL. Our code is available at: https://github.com/AIoT-MLSys-Lab/FedRolex
    Incorporating Task-specific Concept Knowledge into Script Learning. (arXiv:2209.00068v2 [cs.CL] UPDATED)
    In this paper, we present Tetris, a new task of Goal-Oriented Script Completion. Unlike previous work, it considers a more realistic and general setting, where the input includes not only the goal but also additional user context, including preferences and history. To address this problem, we propose a novel approach, which uses two techniques to improve performance: (1) concept prompting, and (2) script-oriented contrastive learning that addresses step repetition and hallucination problems. On our WikiHow-based dataset, we find that both methods improve performance. The dataset, repository, and models will be publicly available to facilitate further research on this new task.
    Concept-level Debugging of Part-Prototype Networks. (arXiv:2205.15769v2 [cs.LG] UPDATED)
    Part-prototype Networks (ProtoPNets) are concept-based classifiers designed to achieve the same performance as black-box models without compromising transparency. ProtoPNets compute predictions based on similarity to class-specific part-prototypes learned to recognize parts of training examples, making it easy to faithfully determine what examples are responsible for any target prediction and why. However, like other models, they are prone to picking up confounders and shortcuts from the data, thus suffering from compromised prediction accuracy and limited generalization. We propose ProtoPDebug, an effective concept-level debugger for ProtoPNets in which a human supervisor, guided by the model's explanations, supplies feedback in the form of what part-prototypes must be forgotten or kept, and the model is fine-tuned to align with this supervision. Our experimental evaluation shows that ProtoPDebug outperforms state-of-the-art debuggers for a fraction of the annotation cost. An online experiment with laypeople confirms the simplicity of the feedback requested to the users and the effectiveness of the collected feedback for learning confounder-free part-prototypes. ProtoPDebug is a promising tool for trustworthy interactive learning in critical applications, as suggested by a preliminary evaluation on a medical decision making task.
    A Comprehensive Survey on Enterprise Financial Risk Analysis: Problems, Methods, Spotlights and Applications. (arXiv:2211.14997v2 [q-fin.RM] UPDATED)
    Enterprise financial risk analysis aims at predicting the enterprises' future financial risk.Due to the wide application, enterprise financial risk analysis has always been a core research issue in finance. Although there are already some valuable and impressive surveys on risk management, these surveys introduce approaches in a relatively isolated way and lack the recent advances in enterprise financial risk analysis. Due to the rapid expansion of the enterprise financial risk analysis, especially from the computer science and big data perspective, it is both necessary and challenging to comprehensively review the relevant studies. This survey attempts to connect and systematize the existing enterprise financial risk researches, as well as to summarize and interpret the mechanisms and the strategies of enterprise financial risk analysis in a comprehensive way, which may help readers have a better understanding of the current research status and ideas. This paper provides a systematic literature review of over 300 articles published on enterprise risk analysis modelling over a 50-year period, 1968 to 2022. We first introduce the formal definition of enterprise risk as well as the related concepts. Then, we categorized the representative works in terms of risk type and summarized the three aspects of risk analysis. Finally, we compared the analysis methods used to model the enterprise financial risk. Our goal is to clarify current cutting-edge research and its possible future directions to model enterprise risk, aiming to fully understand the mechanisms of enterprise risk communication and influence and its application on corporate governance, financial institution and government regulation.
    On Investigating the Conservative Property of Score-Based Generative Models. (arXiv:2209.12753v2 [cs.LG] UPDATED)
    Existing Score-based Generative Models (SGMs) can be categorized into constrained SGMs (CSGMs) or unconstrained SGMs (USGMs) according to their parameterization approaches. CSGMs model probability density functions as Boltzmann distributions, and assign their predictions as the negative gradients of some scalar-valued energy functions. On the other hand, USGMs employ flexible architectures capable of directly estimating scores without the need to explicitly model energy functions. In this paper, we demonstrate that the architectural constraints of CSGMs may limit their modeling ability. In addition, we show that USGMs' inability to preserve the property of conservativeness may lead to degraded sampling performance in practice. To address the above issues, we propose Quasi-Conservative Score-based Generative Models (QCSGMs) for keeping the advantages of both CSGMs and USGMs. Our theoretical derivations demonstrate that the training objective of QCSGMs can be efficiently integrated into the training processes by leveraging the Hutchinson trace estimator. In addition, our experimental results on the CIFAR-10, CIFAR-100, ImageNet, and SVHN datasets validate the effectiveness of QCSGMs. Finally, we justify the advantage of QCSGMs using an example of a one-layered autoencoder.
    Learning from Long-Tailed Noisy Data with Sample Selection and Balanced Loss. (arXiv:2211.10906v2 [cs.LG] UPDATED)
    The success of deep learning depends on large-scale and well-curated training data, while data in real-world applications are commonly long-tailed and noisy. Many methods have been proposed to deal with long-tailed data or noisy data, while a few methods are developed to tackle long-tailed noisy data. To solve this, we propose a robust method for learning from long-tailed noisy data with sample selection and balanced loss. Specifically, we separate the noisy training data into clean labeled set and unlabeled set with sample selection, and train the deep neural network in a semi-supervised manner with a balanced loss based on model bias. Extensive experiments on benchmarks demonstrate that our method outperforms existing state-of-the-art methods.
    FairGBM: Gradient Boosting with Fairness Constraints. (arXiv:2209.07850v3 [cs.LG] UPDATED)
    Tabular data is prevalent in many high stakes domains, such as financial services or public policy. Gradient boosted decision trees (GBDT) are popular in these settings due to performance guarantees and low cost. However, in consequential decision-making fairness is a foremost concern. Despite GBDT's popularity, existing in-processing Fair ML methods are either inapplicable to GBDT, or incur in significant train time overhead, or are inadequate for problems with high class imbalance -- a typical issue in these domains. We present FairGBM, a dual ascent learning framework for training GBDT under fairness constraints, with little to no impact on predictive performance when compared to unconstrained GBDT. Since observational fairness metrics are non-differentiable, we have to employ a "proxy-Lagrangian" formulation using smooth convex error rate proxies to enable gradient-based optimization. Our implementation shows an order of magnitude speedup in training time when compared with related work, a pivotal aspect to foster the widespread adoption of FairGBM by real-world practitioners.
    STaSy: Score-based Tabular data Synthesis. (arXiv:2210.04018v2 [cs.LG] UPDATED)
    Tabular data synthesis is a long-standing research topic in machine learning. Many different methods have been proposed over the past decades, ranging from statistical methods to deep generative methods. However, it has not always been successful due to the complicated nature of real-world tabular data. In this paper, we present a new model named Score-based Tabular data Synthesis (STaSy) and its training strategy based on the paradigm of score-based generative modeling. Despite the fact that score-based generative models have resolved many issues in generative models, there still exists room for improvement in tabular data synthesis. Our proposed training strategy includes a self-paced learning technique and a fine-tuning strategy, which further increases the sampling quality and diversity by stabilizing the denoising score matching training. Furthermore, we also conduct rigorous experimental studies in terms of the generative task trilemma: sampling quality, diversity, and time. In our experiments with 15 benchmark tabular datasets and 7 baselines, our method outperforms existing methods in terms of task-dependant evaluations and diversity.
    Evaluating Synthetically Generated Data from Small Sample Sizes: An Experimental Study. (arXiv:2211.10760v3 [cs.LG] UPDATED)
    In this paper, we propose a method for measuring the similarity low sample tabular data with synthetically generated data with a larger number of samples than original. This process is also known as data augmentation. But significance levels obtained from non-parametric tests are suspect when sample size is small. Our method uses a combination of geometry, topology and robust statistics for hypothesis testing in order to compare the validity of generated data. We also compare the results with common global metric methods available in the literature for large sample size data.
    On the power of foundation models. (arXiv:2211.16327v2 [cs.AI] UPDATED)
    With infinitely many high-quality data points, infinite computational power, an infinitely large foundation model with a perfect training algorithm and guaranteed zero generalization error on the pretext task, can the model be used for everything? This question cannot be answered by the existing theory of representation, optimization or generalization, because the issues they mainly investigate are assumed to be nonexistent here. In this paper, we show that category theory provides powerful machinery to answer this question. We have proved three results. The first one limits the power of prompt-based learning, saying that the model can solve a downstream task with prompts if and only if the task is representable. The second one says fine tuning does not have this limit, as a foundation model with the minimum required power (up to symmetry) can theoretically solve downstream tasks with fine tuning and enough resources. Our final result can be seen as a new type of generalization theorem, showing that the foundation model can generate unseen objects from the target category (e.g., images) using the structural information from the source category (e.g., texts). Along the way, we provide a categorical framework for supervised and self-supervised learning, which might be of independent interest.
    SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models. (arXiv:2210.05861v2 [cs.CV] UPDATED)
    Understanding dynamics from visual observations is a challenging problem that requires disentangling individual objects from the scene and learning their interactions. While recent object-centric models can successfully decompose a scene into objects, modeling their dynamics effectively still remains a challenge. We address this problem by introducing SlotFormer -- a Transformer-based autoregressive model operating on learned object-centric representations. Given a video clip, our approach reasons over object features to model spatio-temporal relationships and predicts accurate future object states. In this paper, we successfully apply SlotFormer to perform video prediction on datasets with complex object interactions. Moreover, the unsupervised SlotFormer's dynamics model can be used to improve the performance on supervised downstream tasks, such as Visual Question Answering (VQA), and goal-conditioned planning. Compared to past works on dynamics modeling, our method achieves significantly better long-term synthesis of object dynamics, while retaining high quality visual generation. Besides, SlotFormer enables VQA models to reason about the future without object-level labels, even outperforming counterparts that use ground-truth annotations. Finally, we show its ability to serve as a world model for model-based planning, which is competitive with methods designed specifically for such tasks.
    Shortest Path Networks for Graph Property Prediction. (arXiv:2206.01003v4 [cs.LG] UPDATED)
    Most graph neural network models rely on a particular message passing paradigm, where the idea is to iteratively propagate node representations of a graph to each node in the direct neighborhood. While very prominent, this paradigm leads to information propagation bottlenecks, as information is repeatedly compressed at intermediary node representations, which causes loss of information, making it practically impossible to gather meaningful signals from distant nodes. To address this, we propose shortest path message passing neural networks, where the node representations of a graph are propagated to each node in the shortest path neighborhoods. In this setting, nodes can directly communicate between each other even if they are not neighbors, breaking the information bottleneck and hence leading to more adequately learned representations. Our framework generalizes message passing neural networks, resulting in a class of more expressive models, including some recent state-of-the-art models. We verify the capacity of a basic model of this framework on dedicated synthetic experiments, and on real-world graph classification and regression benchmarks, and obtain state-of-the art results.
    Optimism in Face of a Context: Regret Guarantees for Stochastic Contextual MDP. (arXiv:2207.11126v2 [cs.LG] UPDATED)
    We present regret minimization algorithms for stochastic contextual MDPs under minimum reachability assumption, using an access to an offline least square regression oracle. We analyze three different settings: where the dynamics is known, where the dynamics is unknown but independent of the context and the most challenging setting where the dynamics is unknown and context-dependent. For the latter, our algorithm obtains regret bound of $\widetilde{O}( (H+{1}/{p_{min}})H|S|^{3/2}\sqrt{|A|T\log(\max\{|\mathcal{G}|,|\mathcal{P}|\}/\delta)})$ with probability $1-\delta$, where $\mathcal{P}$ and $\mathcal{G}$ are finite and realizable function classes used to approximate the dynamics and rewards respectively, $p_{min}$ is the minimum reachability parameter, $S$ is the set of states, $A$ the set of actions, $H$ the horizon, and $T$ the number of episodes. To our knowledge, our approach is the first optimistic approach applied to contextual MDPs with general function approximation (i.e., without additional knowledge regarding the function class, such as it being linear and etc.). We present a lower bound of $\Omega(\sqrt{T H |S| |A| \ln(|\mathcal{G}|)/\ln(|A|)})$, on the expected regret which holds even in the case of known dynamics. Lastly, we discuss an extension of our results to CMDPs without minimum reachability, that obtains $\widetilde{O}(T^{3/4})$ regret.
    DIVISION: Memory Efficient Training via Dual Activation Precision. (arXiv:2208.04187v3 [cs.LG] UPDATED)
    Existing work of activation compressed training relies on searching for optimal bit-width during DNN training to reduce the quantization noise, which makes the procedure complicated and less transparent. To this end, we propose a simple and effective method to compress DNN training. Our method is motivated by an instructive observation: DNN backward propagation mainly utilizes the low-frequency component (LFC) of the activation maps, while the majority of memory is for caching the high-frequency component (HFC) during the training. This indicates the HFC of activation maps is highly redundant and compressible during DNN training, which inspires our proposed Dual Activation Precision (DIVISION). During the training, DIVISION preserves the high-precision copy of LFC and compresses the HFC into a light-weight copy with low numerical precision. This can significantly reduce the memory cost without negatively affecting the precision of backward propagation such that DIVISION maintains competitive model accuracy. Experimental results show DIVISION achieves over 10x compression of activation maps, and significantly higher training throughput than state-of-the-art ACT methods, without loss of model accuracy.
    Explainable Image Quality Assessments in Teledermatological Photography. (arXiv:2209.04699v2 [cs.CV] UPDATED)
    Image quality is a crucial factor in the effectiveness and efficiency of teledermatological consultations. However, up to 50% of images sent by patients have quality issues, thus increasing the time to diagnosis and treatment. An automated, easily deployable, explainable method for assessing image quality is necessary to improve the current teledermatological consultation flow. We introduce ImageQX, a convolutional neural network for image quality assessment with a learning mechanism for identifying the most common poor image quality explanations: bad framing, bad lighting, blur, low resolution, and distance issues. ImageQX was trained on 26,635 photographs and validated on 9,874 photographs, each annotated with image quality labels and poor image quality explanations by up to 12 board-certified dermatologists. The photographic images were taken between 2017 and 2019 using a mobile skin disease tracking application accessible worldwide. Our method achieves expert-level performance for both image quality assessment and poor image quality explanation. For image quality assessment, ImageQX obtains a macro F1-score of 0.73 +- 0.01, which places it within standard deviation of the pairwise inter-rater F1-score of 0.77 +- 0.07. For poor image quality explanations, our method obtains F1-scores of between 0.37 +- 0.01 and 0.70 +- 0.01, similar to the inter-rater pairwise F1-score of between 0.24 +- 0.15 and 0.83 +- 0.06. Moreover, with a size of only 15 MB, ImageQX is easily deployable on mobile devices. With an image quality detection performance similar to that of dermatologists, incorporating ImageQX into the teledermatology flow can enable a better, faster flow for remote consultations.
    Pseudo-Hamiltonian Neural Networks with State-Dependent External Forces. (arXiv:2206.02660v4 [cs.LG] UPDATED)
    Hybrid machine learning based on Hamiltonian formulations has recently been successfully demonstrated for simple mechanical systems, both energy conserving and not energy conserving. We introduce a pseudo-Hamiltonian formulation that is a generalization of the Hamiltonian formulation via the port-Hamiltonian formulation, and show that pseudo-Hamiltonian neural network models can be used to learn external forces acting on a system. We argue that this property is particularly useful when the external forces are state dependent, in which case it is the pseudo-Hamiltonian structure that facilitates the separation of internal and external forces. Numerical results are provided for a forced and damped mass-spring system and a tank system of higher complexity, and a symmetric fourth-order integration scheme is introduced for improved training on sparse and noisy data.
    Overfitting in quantum machine learning and entangling dropout. (arXiv:2205.11446v2 [quant-ph] UPDATED)
    The ultimate goal in machine learning is to construct a model function that has a generalization capability for unseen dataset, based on given training dataset. If the model function has too much expressibility power, then it may overfit to the training data and as a result lose the generalization capability. To avoid such overfitting issue, several techniques have been developed in the classical machine learning regime, and the dropout is one such effective method. This paper proposes a straightforward analogue of this technique in the quantum machine learning regime, the entangling dropout, meaning that some entangling gates in a given parametrized quantum circuit are randomly removed during the training process to reduce the expressibility of the circuit. Some simple case studies are given to show that this technique actually suppresses the overfitting.
    On A Mallows-type Model For (Ranked) Choices. (arXiv:2207.01783v2 [cs.LG] UPDATED)
    We consider a preference learning setting where every participant chooses an ordered list of $k$ most preferred items among a displayed set of candidates. (The set can be different for every participant.) We identify a distance-based ranking model for the population's preferences and their (ranked) choice behavior. The ranking model resembles the Mallows model but uses a new distance function called Reverse Major Index (RMJ). We find that despite the need to sum over all permutations, the RMJ-based ranking distribution aggregates into (ranked) choice probabilities with simple closed-form expression. We develop effective methods to estimate the model parameters and showcase their generalization power using real data, especially when there is a limited variety of display sets.
    Predicting highway lane-changing maneuvers: A benchmark analysis of machine and ensemble learning algorithms. (arXiv:2204.10807v3 [cs.LG] UPDATED)
    Understanding and predicting highway lane-change maneuvers is essential for driving modeling and its automation. The development of data-based lane-changing decision-making algorithms is nowadays in full expansion. We compare empirically in this article different machine and ensemble learning classification techniques to the MOBIL rule-based model using trajectory data of European two-lane highways. The analysis relies on instantaneous measurements of up to twenty-four spatial-temporal variables with the four neighboring vehicles on current and adjacent lanes. Preliminary descriptive investigations by principal component and logistic analyses allow identifying main variables intending a driver to change lanes. We predict two types of discretionary lane-change maneuvers: overtaking (from the slow to the fast lane) and fold-down (from the fast to the slow lane). The prediction accuracy is quantified using total, lane-changing and lane-keeping errors and associated receiver operating characteristic curves. The benchmark analysis includes logistic model, linear discriminant, decision tree, na\"ive Bayes classifier, support vector machine, neural network machine learning algorithms, and up to ten bagging and stacking ensemble learning meta-heuristics. If the rule-based model provides limited predicting accuracy, the data-based algorithms, devoid of modeling bias, allow significant prediction improvements. Cross validations show that selected neural networks and stacking algorithms allow predicting from a single observation both fold-down and overtaking maneuvers up to four seconds in advance with high accuracy.
    EvenNet: Ignoring Odd-Hop Neighbors Improves Robustness of Graph Neural Networks. (arXiv:2205.13892v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have received extensive research attention for their promising performance in graph machine learning. Despite their extraordinary predictive accuracy, existing approaches, such as GCN and GPRGNN, are not robust in the face of homophily changes on test graphs, rendering these models vulnerable to graph structural attacks and with limited capacity in generalizing to graphs of varied homophily levels. Although many methods have been proposed to improve the robustness of GNN models, most of these techniques are restricted to the spatial domain and employ complicated defense mechanisms, such as learning new graph structures or calculating edge attentions. In this paper, we study the problem of designing simple and robust GNN models in the spectral domain. We propose EvenNet, a spectral GNN corresponding to an even-polynomial graph filter. Based on our theoretical analysis in both spatial and spectral domains, we demonstrate that EvenNet outperforms full-order models in generalizing across homophilic and heterophilic graphs, implying that ignoring odd-hop neighbors improves the robustness of GNNs. We conduct experiments on both synthetic and real-world datasets to demonstrate the effectiveness of EvenNet. Notably, EvenNet outperforms existing defense models against structural attacks without introducing additional computational costs and maintains competitiveness in traditional node classification tasks on homophilic and heterophilic graphs.
    Contrastive Learning for Unsupervised Domain Adaptation of Time Series. (arXiv:2206.06243v3 [cs.LG] UPDATED)
    Unsupervised domain adaptation (UDA) aims at learning a machine learning model using a labeled source domain that performs well on a similar yet different, unlabeled target domain. UDA is important in many applications such as medicine, where it is used to adapt risk scores across different patient cohorts. In this paper, we develop a novel framework for UDA of time series data, called CLUDA. Specifically, we propose a contrastive learning framework to learn contextual representations in multivariate time series, so that these preserve label information for the prediction task. In our framework, we further capture the variation in the contextual representations between source and target domain via a custom nearest-neighbor contrastive learning. To the best of our knowledge, ours is the first framework to learn domain-invariant, contextual representation for UDA of time series data. We evaluate our framework using a wide range of time series datasets to demonstrate its effectiveness and show that it achieves state-of-the-art performance for time series UDA.
    An Explainable-AI approach for Diagnosis of COVID-19 using MALDI-ToF Mass Spectrometry. (arXiv:2109.14099v2 [cs.LG] UPDATED)
    The severe acute respiratory syndrome coronavirus type-2 (SARS-CoV-2) caused a global pandemic and imposed immense effects on the global economy. Accurate, cost-effective, and quick tests have proven substantial in identifying infected people and mitigating the spread. Recently, multiple alternative platforms for testing coronavirus disease 2019 (COVID-19) have been published that show high agreement with current gold standard real-time polymerase chain reaction (RT-PCR) results. These new methods do away with nasopharyngeal (NP) swabs, eliminate the need for complicated reagents, and reduce the burden on RT-PCR test reagent supply. In the present work, we have designed an artificial intelligence-based (AI) testing method to provide confidence in the results. Current AI applications to COVID-19 studies often lack a biological foundation in the decision-making process, and our AI approach is one of the earliest to leverage explainable-AI (X-AI) algorithms for COVID-19 diagnosis using mass spectrometry. Here, we have employed X-AI to explain the decision-making process on a local (per-sample) and global (all samples) basis underscored by biologically relevant features. We evaluated our technique with data extracted from human gargle samples and achieved a testing accuracy of 94.44%. Such techniques would strengthen the relationship between AI and clinical diagnostics by providing biomedical researchers and healthcare workers with trustworthy and, most importantly, explainable test results.
    Dealing with Unknown Variances in Best-Arm Identification. (arXiv:2210.00974v2 [stat.ML] UPDATED)
    The problem of identifying the best arm among a collection of items having Gaussian rewards distribution is well understood when the variances are known. Despite its practical relevance for many applications, few works studied it for unknown variances. In this paper we introduce and analyze two approaches to deal with unknown variances, either by plugging in the empirical variance or by adapting the transportation costs. In order to calibrate our two stopping rules, we derive new time-uniform concentration inequalities, which are of independent interest. Then, we illustrate the theoretical and empirical performances of our two sampling rule wrappers on Track-and-Stop and on a Top Two algorithm. Moreover, by quantifying the impact on the sample complexity of not knowing the variances, we reveal that it is rather small.
    A Survey on Distributed Online Optimization and Game. (arXiv:2205.00473v2 [cs.LG] UPDATED)
    Distributed online optimization and game have been increasingly researched in the last decade, mostly motivated by its wide applications in sensor networks, robotics (e.g., distributed target tracking and formation control), smart grids, deep learning, and so forth. In these problems, there is a network of agents who may be cooperative (i.e., distributed online optimization) or noncooperative (i.e., online game) through local information exchanges. And the local cost function of each agent is often time-varying in dynamic and even adversarial environments. At each time, a decision must be made by each agent based on historical information at hand without knowing future information on cost functions. For these problems, a comprehensive survey is still lacking. This paper aims to provide a thorough overview of distributed online optimization and game from the perspective of problem settings, communication, computation, algorithms, and performances. In addition, some potential future directions are also discussed.
    Stochastic Second-Order Methods Improve Best-Known Sample Complexity of SGD for Gradient-Dominated Function. (arXiv:2205.12856v2 [cs.LG] UPDATED)
    We study the performance of Stochastic Cubic Regularized Newton (SCRN) on a class of functions satisfying gradient dominance property with $1\le\alpha\le2$ which holds in a wide range of applications in machine learning and signal processing. This condition ensures that any first-order stationary point is a global optimum. We prove that the total sample complexity of SCRN in achieving $\epsilon$-global optimum is $\mathcal{O}(\epsilon^{-7/(2\alpha)+1})$ for $1\le\alpha< 3/2$ and $\mathcal{\tilde{O}}(\epsilon^{-2/(\alpha)})$ for $3/2\le\alpha\le 2$. SCRN improves the best-known sample complexity of stochastic gradient descent. Even under a weak version of gradient dominance property, which is applicable to policy-based reinforcement learning (RL), SCRN achieves the same improvement over stochastic policy gradient methods. Additionally, we show that the average sample complexity of SCRN can be reduced to ${\mathcal{O}}(\epsilon^{-2})$ for $\alpha=1$ using a variance reduction method with time-varying batch sizes. Experimental results in various RL settings showcase the remarkable performance of SCRN compared to first-order methods.
    Sustaining Fairness via Incremental Learning. (arXiv:2208.12212v2 [cs.LG] UPDATED)
    Machine learning systems are often deployed for making critical decisions like credit lending, hiring, etc. While making decisions, such systems often encode the user's demographic information (like gender, age) in their intermediate representations. This can lead to decisions that are biased towards specific demographics. Prior work has focused on debiasing intermediate representations to ensure fair decisions. However, these approaches fail to remain fair with changes in the task or demographic distribution. To ensure fairness in the wild, it is important for a system to adapt to such changes as it accesses new data in an incremental fashion. In this work, we propose to address this issue by introducing the problem of learning fair representations in an incremental learning setting. To this end, we present Fairness-aware Incremental Representation Learning (FaIRL), a representation learning system that can sustain fairness while incrementally learning new tasks. FaIRL is able to achieve fairness and learn new tasks by controlling the rate-distortion function of the learned representations. Our empirical evaluations show that FaIRL is able to make fair decisions while achieving high performance on the target task, outperforming several baselines.
    Generating Diverse Teammates to Train Robust Agents For Ad Hoc Teamwork. (arXiv:2207.14138v2 [cs.LG] UPDATED)
    Ad hoc teamwork (AHT) is the challenge of designing a learner that effectively collaborates with unknown teammates without prior coordination mechanisms. Early approaches address the AHT challenge by training the learner with a diverse set of handcrafted teammate policies, usually designed based on an expert's domain knowledge about the policies the learner may encounter. However, implementing teammate policies for training based on domain knowledge is not always feasible. In such cases, recent approaches attempted to improve the robustness of the learner by training it with teammate policies generated by optimising information-theoretic diversity metrics. However, optimising information-theoretic diversity metrics may generate teammates with superficially different behaviours, which does not necessarily result in a robust learner that can effectively collaborate with unknown teammates. In this paper, we present an automated teammate policy generation method optimising the Best-Response Diversity (BRDiv) metric, which measures diversity based on the compatibility of teammate policies in terms of returns. We evaluate our approach in environments with multiple valid coordination strategies, comparing against methods optimising information-theoretic diversity metrics and an ablation not optimising any diversity metric. Our experiments indicate that optimising BRDiv yields a diverse set of training teammate policies that improve the learner's performance relative to previous teammate generation approaches when collaborating with near-optimal previously unseen teammate policies.
    Do Gradient Inversion Attacks Make Federated Learning Unsafe?. (arXiv:2202.06924v2 [cs.LG] UPDATED)
    Federated learning (FL) allows the collaborative training of AI models without needing to share raw data. This capability makes it especially interesting for healthcare applications where patient and data privacy is of utmost concern. However, recent works on the inversion of deep neural networks from model gradients raised concerns about the security of FL in preventing the leakage of training data. In this work, we show that these attacks presented in the literature are impractical in FL use-cases where the clients' training involves updating the Batch Normalization (BN) statistics and provide a new baseline attack that works for such scenarios. Furthermore, we present new ways to measure and visualize potential data leakage in FL. Our work is a step towards establishing reproducible methods of measuring data leakage in FL and could help determine the optimal tradeoffs between privacy-preserving techniques, such as differential privacy, and model accuracy based on quantifiable metrics. Code is available at https://nvidia.github.io/NVFlare/research/quantifying-data-leakage.
    GANs and Closures: Micro-Macro Consistency in Multiscale Modeling. (arXiv:2208.10715v2 [cs.LG] UPDATED)
    Sampling the phase space of molecular systems -- and, more generally, of complex systems effectively modeled by stochastic differential equations -- is a crucial modeling step in many fields, from protein folding to materials discovery. These problems are often multiscale in nature: they can be described in terms of low-dimensional effective free energy surfaces parametrized by a small number of "slow" reaction coordinates; the remaining "fast" degrees of freedom populate an equilibrium measure on the reaction coordinate values. Sampling procedures for such problems are used to estimate effective free energy differences as well as ensemble averages with respect to the conditional equilibrium distributions; these latter averages lead to closures for effective reduced dynamic models. Over the years, enhanced sampling techniques coupled with molecular simulation have been developed. An intriguing analogy arises with the field of Machine Learning (ML), where Generative Adversarial Networks can produce high dimensional samples from low dimensional probability distributions. This sample generation returns plausible high dimensional space realizations of a model state, from information about its low-dimensional representation. In this work, we present an approach that couples physics-based simulations and biasing methods for sampling conditional distributions with ML-based conditional generative adversarial networks for the same task. The "coarse descriptors" on which we condition the fine scale realizations can either be known a priori, or learned through nonlinear dimensionality reduction. We suggest that this may bring out the best features of both approaches: we demonstrate that a framework that couples cGANs with physics-based enhanced sampling techniques can improve multiscale SDE dynamical systems sampling, and even shows promise for systems of increasing complexity.
    Critic Sequential Monte Carlo. (arXiv:2205.15460v2 [stat.ML] UPDATED)
    We introduce CriticSMC, a new algorithm for planning as inference built from a composition of sequential Monte Carlo with learned Soft-Q function heuristic factors. These heuristic factors, obtained from parametric approximations of the marginal likelihood ahead, more effectively guide SMC towards the desired target distribution, which is particularly helpful for planning in environments with hard constraints placed sparsely in time. Compared with previous work, we modify the placement of such heuristic factors, which allows us to cheaply propose and evaluate large numbers of putative action particles, greatly increasing inference and planning efficiency. CriticSMC is compatible with informative priors, whose density function need not be known, and can be used as a model-free control algorithm. Our experiments on collision avoidance in a high-dimensional simulated driving task show that CriticSMC significantly reduces collision rates at a low computational cost while maintaining realism and diversity of driving behaviors across vehicles and environment scenarios.
    Prediction Errors for Penalized Regressions based on Generalized Approximate Message Passing. (arXiv:2206.12832v3 [stat.ML] UPDATED)
    We discuss the prediction accuracy of assumed statistical models in terms of prediction errors for the generalized linear model and penalized maximum likelihood methods. We derive the forms of estimators for the prediction errors, such as $C_p$ criterion, information criteria, and leave-one-out cross validation (LOOCV) error, using the generalized approximate message passing (GAMP) algorithm and replica method. These estimators coincide with each other when the number of model parameters is sufficiently small; however, there is a discrepancy between them in particular in the parameter region where the number of model parameters is larger than the data dimension. In this paper, we review the prediction errors and corresponding estimators, and discuss their differences. In the framework of GAMP, we show that the information criteria can be expressed by using the variance of the estimates. Further, we demonstrate how to approach LOOCV error from the information criteria by utilizing the expression provided by GAMP.
    Analyzing Data-Centric Properties for Graph Contrastive Learning. (arXiv:2208.02810v3 [cs.LG] UPDATED)
    Recent analyses of self-supervised learning (SSL) find the following data-centric properties to be critical for learning good representations: invariance to task-irrelevant semantics, separability of classes in some latent space, and recoverability of labels from augmented samples. However, given their discrete, non-Euclidean nature, graph datasets and graph SSL methods are unlikely to satisfy these properties. This raises the question: how do graph SSL methods, such as contrastive learning (CL), work well? To systematically probe this question, we perform a generalization analysis for CL when using generic graph augmentations (GGAs), with a focus on data-centric properties. Our analysis yields formal insights into the limitations of GGAs and the necessity of task-relevant augmentations. As we empirically show, GGAs do not induce task-relevant invariances on common benchmark datasets, leading to only marginal gains over naive, untrained baselines. Our theory motivates a synthetic data generation process that enables control over task-relevant information and boasts pre-defined optimal augmentations. This flexible benchmark helps us identify yet unrecognized limitations in advanced augmentation techniques (e.g., automated methods). Overall, our work rigorously contextualizes, both empirically and theoretically, the effects of data-centric properties on augmentation strategies and learning paradigms for graph SSL.
    Particle algorithms for maximum likelihood training of latent variable models. (arXiv:2204.12965v4 [stat.CO] UPDATED)
    (Neal and Hinton, 1998) recast maximum likelihood estimation of any given latent variable model as the minimization of a free energy functional $F$, and the EM algorithm as coordinate descent applied to $F$. Here, we explore alternative ways to optimize the functional. In particular, we identify various gradient flows associated with $F$ and show that their limits coincide with $F$'s stationary points. By discretizing the flows, we obtain practical particle-based algorithms for maximum likelihood estimation in broad classes of latent variable models. The novel algorithms scale to high-dimensional settings and perform well in numerical experiments.
    Stability of Image-Reconstruction Algorithms. (arXiv:2206.07128v3 [math.OC] UPDATED)
    Robustness and stability of image-reconstruction algorithms have recently come under scrutiny. Their importance to medical imaging cannot be overstated. We review the known results for the topical variational regularization strategies ($\ell_2$ and $\ell_1$ regularization) and present novel stability results for $\ell_p$-regularized linear inverse problems for $p\in(1,\infty)$. Our results guarantee Lipschitz continuity for small $p$ and H\"{o}lder continuity for larger $p$. They generalize well to the $L_p(\Omega)$ function spaces.
    Tailoring to the Tails: Risk Measures for Fine-Grained Tail Sensitivity. (arXiv:2208.03066v2 [cs.LG] UPDATED)
    Expected risk minimization (ERM) is at the core of many machine learning systems. This means that the risk inherent in a loss distribution is summarized using a single number - its average. In this paper, we propose a general approach to construct risk measures which exhibit a desired tail sensitivity and may replace the expectation operator in ERM. Our method relies on the specification of a reference distribution with a desired tail behaviour, which is in a one-to-one correspondence to a coherent upper probability. Any risk measure, which is compatible with this upper probability, displays a tail sensitivity which is finely tuned to the reference distribution. As a concrete example, we focus on divergence risk measures based on f-divergence ambiguity sets, which are a widespread tool used to foster distributional robustness of machine learning systems. For instance, we show how ambiguity sets based on the Kullback-Leibler divergence are intricately tied to the class of subexponential random variables. We elaborate the connection of divergence risk measures and rearrangement invariant Banach norms.
    Predictive Model for Gross Community Production Rate of Coral Reefs using Ensemble Learning Methodologies. (arXiv:2111.04003v2 [cs.LG] UPDATED)
    Coral reefs play a vital role in maintaining the ecological balance of the marine ecosystem. Various marine organisms depend on coral reefs for their existence and their natural processes. Coral reefs provide the necessary habitat for reproduction and growth for various exotic species of the marine ecosystem. In this article, we discuss the most important parameters which influence the lifecycle of coral and coral reefs such as ocean acidification, deoxygenation and other physical parameters such as flow rate and surface area. Ocean acidification depends on the amount of dissolved Carbon dioxide (CO2). This is due to the release of H+ ions upon the reaction of the dissolved CO2 gases with the calcium carbonate compounds in the ocean. Deoxygenation is another problem that leads to hypoxia which is characterized by a lesser amount of dissolved oxygen in water than the required amount for the existence of marine organisms. In this article, we highlight the importance of physical parameters such as flow rate which influence gas exchange, heat dissipation, bleaching sensitivity, nutrient supply, feeding, waste and sediment removal, growth and reproduction. In this paper, we also bring out these important parameters and propose an ensemble machine learning-based model for analyzing these parameters and provide better rates that can help us to understand and suitably improve the ocean composition which in turn can eminently improve the sustainability of the marine ecosystem, mainly the coral reefs
    Autoencoding Hyperbolic Representation for Adversarial Generation. (arXiv:2201.12825v3 [cs.LG] UPDATED)
    With the recent advance of geometric deep learning, neural networks have been extensively used for data in non-Euclidean domains. In particular, hyperbolic neural networks have proved successful in processing hierarchical information of data. However, many hyperbolic neural networks are numerically unstable during training, which precludes using complex architectures. This crucial problem makes it difficult to build hyperbolic generative models for real and complex data. In this work, we propose a hyperbolic generative network in which we design novel architecture and layers to improve stability in training. Our proposed network contains three parts: first, a hyperbolic autoencoder (AE) that produces hyperbolic embedding for input data; second, a hyperbolic generative adversarial network (GAN) for generating the hyperbolic latent embedding of the AE from simple noise; third, a generator that inherits the decoder from the AE and the generator from the GAN. We call this network the hyperbolic AE-GAN, or HAEGAN for short. The architecture of HAEGAN fosters expressive representation in the hyperbolic space, and the specific design of layers ensures numerical stability. Experiments show that HAEGAN is able to generate complex data with state-of-the-art structure-related performance.
    Adversarial Examples in Random Neural Networks with General Activations. (arXiv:2203.17209v2 [cs.LG] UPDATED)
    A substantial body of empirical work documents the lack of robustness in deep learning models to adversarial examples. Recent theoretical work proved that adversarial examples are ubiquitous in two-layers networks with sub-exponential width and ReLU or smooth activations, and multi-layer ReLU networks with sub-exponential width. We present a result of the same type, with no restriction on width and for general locally Lipschitz continuous activations. More precisely, given a neural network $f(\,\cdot\,;{\boldsymbol \theta})$ with random weights ${\boldsymbol \theta}$, and feature vector ${\boldsymbol x}$, we show that an adversarial example ${\boldsymbol x}'$ can be found with high probability along the direction of the gradient $\nabla_{{\boldsymbol x}}f({\boldsymbol x};{\boldsymbol \theta})$. Our proof is based on a Gaussian conditioning technique. Instead of proving that $f$ is approximately linear in a neighborhood of ${\boldsymbol x}$, we characterize the joint distribution of $f({\boldsymbol x};{\boldsymbol \theta})$ and $f({\boldsymbol x}';{\boldsymbol \theta})$ for ${\boldsymbol x}' = {\boldsymbol x}-s({\boldsymbol x})\nabla_{{\boldsymbol x}}f({\boldsymbol x};{\boldsymbol \theta})$.
    MetaQA: Combining Expert Agents for Multi-Skill Question Answering. (arXiv:2112.01922v3 [cs.CL] UPDATED)
    The recent explosion of question answering (QA) datasets and models has increased the interest in the generalization of models across multiple domains and formats by either training on multiple datasets or by combining multiple models. Despite the promising results of multi-dataset models, some domains or QA formats may require specific architectures, and thus the adaptability of these models might be limited. In addition, current approaches for combining models disregard cues such as question-answer compatibility. In this work, we propose to combine expert agents with a novel, flexible, and training-efficient architecture that considers questions, answer predictions, and answer-prediction confidence scores to select the best answer among a list of answer candidates. Through quantitative and qualitative experiments we show that our model i) creates a collaboration between agents that outperforms previous multi-agent and multi-dataset approaches in both in-domain and out-of-domain scenarios, ii) is highly data-efficient to train, and iii) can be adapted to any QA format. We release our code and a dataset of answer predictions from expert agents for 16 QA datasets to foster future developments of multi-agent systems on https://github.com/UKPLab/MetaQA.
    Discriminative Multimodal Learning via Conditional Priors in Generative Models. (arXiv:2110.04616v3 [cs.LG] UPDATED)
    Deep generative models with latent variables have been used lately to learn joint representations and generative processes from multi-modal data. These two learning mechanisms can, however, conflict with each other and representations can fail to embed information on the data modalities. This research studies the realistic scenario in which all modalities and class labels are available for model training, but where some modalities and labels required for downstream tasks are missing. We show, in this scenario, that the variational lower bound limits mutual information between joint representations and missing modalities. We, to counteract these problems, introduce a novel conditional multi-modal discriminative model that uses an informative prior distribution and optimizes a likelihood-free objective function that maximizes mutual information between joint representations and missing modalities. Extensive experimentation demonstrates the benefits of our proposed model, empirical results show that our model achieves state-of-the-art results in representative problems such as downstream classification, acoustic inversion, and image and annotation generation.
    Augmenting Pre-trained Language Models with QA-Memory for Open-Domain Question Answering. (arXiv:2204.04581v3 [cs.CL] UPDATED)
    Retrieval augmented language models have recently become the standard for knowledge intensive tasks. Rather than relying purely on latent semantics within the parameters of large neural models, these methods enlist a semi-parametric memory to encode an index of knowledge for the model to retrieve over. Most prior work has employed text passages as the unit of knowledge, which has high coverage at the cost of interpretability, controllability, and efficiency. The opposite properties arise in other methods which have instead relied on knowledge base (KB) facts. At the same time, more recent work has demonstrated the effectiveness of storing and retrieving from an index of Q-A pairs derived from text \citep{lewis2021paq}. This approach yields a high coverage knowledge representation that maintains KB-like properties due to its representations being more atomic units of information. In this work we push this line of research further by proposing a question-answer augmented encoder-decoder model and accompanying pretraining strategy. This yields an end-to-end system that not only outperforms prior QA retrieval methods on single-hop QA tasks but also enables compositional reasoning, as demonstrated by strong performance on two multi-hop QA datasets. Together, these methods improve the ability to interpret and control the model while narrowing the performance gap with passage retrieval systems.
    Learning Regionally Decentralized AC Optimal Power Flows with ADMM. (arXiv:2205.03787v3 [eess.SY] UPDATED)
    One potential future for the next generation of smart grids is the use of decentralized optimization algorithms and secured communications for coordinating renewable generation (e.g., wind/solar), dispatchable devices (e.g., coal/gas/nuclear generations), demand response, battery & storage facilities, and topology optimization. The Alternating Direction Method of Multipliers (ADMM) has been widely used in the community to address such decentralized optimization problems and, in particular, the AC Optimal Power Flow (AC-OPF). This paper studies how machine learning may help in speeding up the convergence of ADMM for solving AC-OPF. It proposes a novel decentralized machine-learning approach, namely ML-ADMM, where each agent uses deep learning to learn the consensus parameters on the coupling branches. The paper also explores the idea of learning only from ADMM runs that exhibit high-quality convergence properties, and proposes filtering mechanisms to select these runs. Experimental results on test cases based on the French system demonstrate the potential of the approach in speeding up the convergence of ADMM significantly.
    Short Blocklength Wiretap Channel Codes via Deep Learning: Design and Performance Evaluation. (arXiv:2206.03477v2 [cs.IT] UPDATED)
    We design short blocklength codes for the Gaussian wiretap channel under information-theoretic security guarantees. Our approach consists in decoupling the reliability and secrecy constraints in our code design. Specifically, we handle the reliability constraint via an autoencoder, and handle the secrecy constraint with hash functions. For blocklengths smaller than or equal to 128, we evaluate through simulations the probability of error at the legitimate receiver and the leakage at the eavesdropper for our code construction. This leakage is defined as the mutual information between the confidential message and the eavesdropper's channel observations, and is empirically measured via a neural network-based mutual information estimator. Our simulation results provide examples of codes with positive secrecy rates that outperform the best known achievable secrecy rates obtained non-constructively for the Gaussian wiretap channel. Additionally, we show that our code design is suitable for the compound and arbitrarily varying Gaussian wiretap channels, for which the channel statistics are not perfectly known but only known to belong to a pre-specified uncertainty set. These models not only capture uncertainty related to channel statistics estimation, but also scenarios where the eavesdropper jams the legitimate transmission or influences its own channel statistics by changing its location.
    Improving Spectral Clustering Using Spectrum-Preserving Node Aggregation. (arXiv:2110.12328v6 [cs.LG] UPDATED)
    Spectral clustering is one of the most popular clustering methods. However, the high computational cost due to the involved eigen-decomposition procedure can immediately hinder its applications in large-scale tasks. In this paper we use spectrum-preserving node reduction to accelerate eigen-decomposition and generate concise representations of data sets. Specifically, we create a small number of pseudonodes based on spectral similarity. Then, standard spectral clustering algorithm is performed on the smaller node set. Finally, each data point in the original data set is assigned to the cluster as its representative pseudo-node. The proposed framework run in nearly-linear time. Meanwhile, the clustering accuracy can be significantly improved by mining concise representations. The experimental results show dramatically improved clustering performance when compared with state-of-the-art methods.
    Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback. (arXiv:2201.13172v2 [cs.LG] UPDATED)
    The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed in delay. This paper studies online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only in the end of episode $k + d^k$, where the delay $d^k$ can be changing over episodes and chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^K d^k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.
    Explainable deep learning for insights in El Ni\~no and river flows. (arXiv:2201.02596v3 [physics.ao-ph] UPDATED)
    The El Ni\~no Southern Oscillation (ENSO) is a semi-periodic fluctuation in sea surface temperature (SST) over the tropical central and eastern Pacific Ocean that influences interannual variability in regional hydrology across the world through long-range dependence or teleconnections. Recent research has demonstrated the value of Deep Learning (DL) methods for improving ENSO prediction as well as Complex Networks (CN) for understanding teleconnections. However, gaps in predictive understanding of ENSO-driven river flows include the black box nature of DL, the use of simple ENSO indices to describe a complex phenomenon and translating DL-based ENSO predictions to river flow predictions. Here we show that eXplainable DL (XDL) methods, based on saliency maps, can extract interpretable predictive information contained in global SST and discover SST information regions and dependence structures relevant for river flows which, in tandem with climate network constructions, enable improved predictive understanding. Our results reveal additional information content in global SST beyond ENSO indices, develop understanding of how SSTs influence river flows, and generate improved river flow prediction, including uncertainty estimation. Observations, reanalysis data, and earth system model simulations are used to demonstrate the value of the XDL-CN based methods for future interannual and decadal scale climate projections.
    A Context-Integrated Transformer-Based Neural Network for Auction Design. (arXiv:2201.12489v3 [cs.GT] UPDATED)
    One of the central problems in auction design is developing an incentive-compatible mechanism that maximizes the auctioneer's expected revenue. While theoretical approaches have encountered bottlenecks in multi-item auctions, recently, there has been much progress on finding the optimal mechanism through deep learning. However, these works either focus on a fixed set of bidders and items, or restrict the auction to be symmetric. In this work, we overcome such limitations by factoring \emph{public} contextual information of bidders and items into the auction learning framework. We propose $\mathtt{CITransNet}$, a context-integrated transformer-based neural network for optimal auction design, which maintains permutation-equivariance over bids and contexts while being able to find asymmetric solutions. We show by extensive experiments that $\mathtt{CITransNet}$ can recover the known optimal solutions in single-item settings, outperform strong baselines in multi-item auctions, and generalize well to cases other than those in training.
    Linear Connectivity Reveals Generalization Strategies. (arXiv:2205.12411v5 [cs.LG] UPDATED)
    It is widely accepted in the mode connectivity literature that when two neural networks are trained similarly on the same data, they are connected by a path through parameter space over which test set accuracy is maintained. Under some circumstances, including transfer learning from pretrained models, these paths are presumed to be linear. In contrast to existing results, we find that among text classifiers (trained on MNLI, QQP, and CoLA), some pairs of finetuned models have large barriers of increasing loss on the linear paths between them. On each task, we find distinct clusters of models which are linearly connected on the test loss surface, but are disconnected from models outside the cluster -- models that occupy separate basins on the surface. By measuring performance on specially-crafted diagnostic datasets, we find that these clusters correspond to different generalization strategies: one cluster behaves like a bag of words model under domain shift, while another cluster uses syntactic heuristics. Our work demonstrates how the geometry of the loss surface can guide models towards different heuristic functions.
    Convergence and Implicit Regularization Properties of Gradient Descent for Deep Residual Networks. (arXiv:2204.07261v3 [cs.LG] UPDATED)
    We prove linear convergence of gradient descent to a global optimum for the training of deep residual networks with constant layer width and smooth activation function. We show that if the trained weights, as a function of the layer index, admit a scaling limit as the depth increases, then the limit has finite $p-$variation with $p=2$. Proofs are based on non-asymptotic estimates for the loss function and for norms of the network weights along the gradient descent path. We illustrate the relevance of our theoretical results to practical settings using detailed numerical experiments on supervised learning problems.
    A Unification Framework for Euclidean and Hyperbolic Graph Neural Networks. (arXiv:2206.04285v2 [cs.LG] UPDATED)
    Hyperbolic neural networks are able to capture the inherent hierarchy of graph datasets, and consequently a powerful choice of GNNs. However, they entangle multiple incongruent (gyro-)vector spaces within a layer, which makes them limited in terms of generalization and scalability. In this work, we propose to use Poincar\'e disk model as our search space, and apply all approximations on the disk (as if the disk is a tangent space derived from the origin), and thus getting rid of all inter-space transformations. Such an approach enables us to propose a hyperbolic normalization layer, and to further simplify the entire hyperbolic model to a Euclidean model cascaded with our hyperbolic normalization layer. We applied our proposed nonlinear hyperbolic normalization to the current state-of-the-art homogeneous and multi-relational graph networks. We demonstrate that not only does the model leverage the power of Euclidean networks such as interpretability and efficient execution of various model components, but also it outperforms both Euclidean and hyperbolic counterparts in our benchmarks.
    From Kepler to Newton: Explainable AI for Science. (arXiv:2111.12210v7 [cs.AI] UPDATED)
    The Observation--Hypothesis--Prediction--Experimentation loop paradigm for scientific research has been practiced by researchers for years towards scientific discoveries. However, with data explosion in both mega-scale and milli-scale scientific research, it has been sometimes very difficult to manually analyze the data and propose new hypotheses to drive the cycle for scientific discovery. In this paper, we discuss the role of Explainable AI in scientific discovery process by demonstrating an Explainable AI-based paradigm for science discovery. The key is to use Explainable AI to help derive data or model interpretations, hypotheses, as well as scientific discoveries or insights. We show how computational and data-intensive methodology -- together with experimental and theoretical methodology -- can be seamlessly integrated for scientific research. To demonstrate the AI-based science discovery process, and to pay our respect to some of the greatest minds in human history, we show how Kepler's laws of planetary motion and Newton's law of universal gravitation can be rediscovered by (Explainable) AI based on Tycho Brahe's astronomical observation data, whose works were leading the scientific revolution in the 16-17th century. This work also highlights the important role of Explainable AI (as compared to Blackbox AI) in science discovery to help humans prevent or better prepare for the possible technological singularity that may happen in the future, since science is not only about the know how, but also the know why. Presentation of the work is available at https://slideslive.com/38986142/from-kepler-to-newton-explainable-ai-for-science-discovery.
    Explicit Regularization in Overparametrized Models via Noise Injection. (arXiv:2206.04613v3 [cs.LG] UPDATED)
    Injecting noise within gradient descent has several desirable features, such as smoothing and regularizing properties. In this paper, we investigate the effects of injecting noise before computing a gradient step. We demonstrate that small perturbations can induce explicit regularization for simple models based on the L1-norm, group L1-norms, or nuclear norms. However, when applied to overparametrized neural networks with large widths, we show that the same perturbations can cause variance explosion. To overcome this, we propose using independent layer-wise perturbations, which provably allow for explicit regularization without variance explosion. Our empirical results show that these small perturbations lead to improved generalization performance compared to vanilla gradient descent.
    Huber-Robust Confidence Sequences. (arXiv:2301.09573v1 [math.ST])
    Confidence sequences are confidence intervals that can be sequentially tracked, and are valid at arbitrary data-dependent stopping times. This paper presents confidence sequences for a univariate mean of an unknown distribution with a known upper bound on the p-th central moment (p > 1), but allowing for (at most) {\epsilon} fraction of arbitrary distribution corruption, as in Huber's contamination model. We do this by designing new robust exponential supermartingales, and show that the resulting confidence sequences attain the optimal width achieved in the nonsequential setting. Perhaps surprisingly, the constant margin between our sequential result and the lower bound is smaller than even fixed-time robust confidence intervals based on the trimmed mean, for example. Since confidence sequences are a common tool used within A/B/n testing and bandits, these results open the door to sequential experimentation that is robust to outliers and adversarial corruptions.
    A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. (arXiv:1809.02362v2 [math.NA] UPDATED)
    Artificial neural networks (ANNs) have very successfully been used in numerical simulations for a series of computational problems ranging from image classification/image recognition, speech recognition, time series analysis, game intelligence, and computational advertising to numerical approximations of partial differential equations (PDEs). Such numerical simulations suggest that ANNs have the capacity to very efficiently approximate high-dimensional functions and, especially, indicate that ANNs seem to admit the fundamental power to overcome the curse of dimensionality when approximating the high-dimensional functions appearing in the above named computational problems. There are a series of rigorous mathematical approximation results for ANNs in the scientific literature. Some of them prove convergence without convergence rates and some even rigorously establish convergence rates but there are only a few special cases where mathematical results can rigorously explain the empirical success of ANNs when approximating high-dimensional functions. The key contribution of this article is to disclose that ANNs can efficiently approximate high-dimensional functions in the case of numerical approximations of Black-Scholes PDEs. More precisely, this work reveals that the number of required parameters of an ANN to approximate the solution of the Black-Scholes PDE grows at most polynomially in both the reciprocal of the prescribed approximation accuracy $\varepsilon > 0$ and the PDE dimension $d \in \mathbb{N}$. We thereby prove, for the first time, that ANNs do indeed overcome the curse of dimensionality in the numerical approximation of Black-Scholes PDEs.
    SUPER-Net: Trustworthy Medical Image Segmentation with Uncertainty Propagation in Encoder-Decoder Networks. (arXiv:2111.05978v3 [eess.IV] UPDATED)
    Deep Learning (DL) holds great promise in reshaping the healthcare industry owing to its precision, efficiency, and objectivity. However, the brittleness of DL models to noisy and out-of-distribution inputs is ailing their deployment in the clinic. Most models produce point estimates without further information about model uncertainty or confidence. This paper introduces a new Bayesian DL framework for uncertainty quantification in segmentation neural networks: SUPER-Net: trustworthy medical image Segmentation with Uncertainty Propagation in Encoder-decodeR Networks. SUPER-Net analytically propagates, using Taylor series approximations, the first two moments (mean and covariance) of the posterior distribution of the model parameters across the nonlinear layers. In particular, SUPER-Net simultaneously learns the mean and covariance without expensive post-hoc Monte Carlo sampling or model ensembling. The output consists of two simultaneous maps: the segmented image and its pixelwise uncertainty map, which corresponds to the covariance matrix of the predictive distribution. We conduct an extensive evaluation of SUPER-Net on medical image segmentation of Magnetic Resonances Imaging and Computed Tomography scans under various noisy and adversarial conditions. Our experiments on multiple benchmark datasets demonstrate that SUPER-Net is more robust to noise and adversarial attacks than state-of-the-art segmentation models. Moreover, the uncertainty map of the proposed SUPER-Net associates low confidence (or equivalently high uncertainty) to patches in the test input images that are corrupted with noise, artifacts, or adversarial attacks. Perhaps more importantly, the model exhibits the ability of self-assessment of its segmentation decisions, notably when making erroneous predictions due to noise or adversarial examples.
    Toward Foundation Models for Earth Monitoring: Generalizable Deep Learning Models for Natural Hazard Segmentation. (arXiv:2301.09318v1 [cs.CV])
    Climate change results in an increased probability of extreme weather events that put societies and businesses at risk on a global scale. Therefore, near real-time mapping of natural hazards is an emerging priority for the support of natural disaster relief, risk management, and informing governmental policy decisions. Recent methods to achieve near real-time mapping increasingly leverage deep learning (DL). However, DL-based approaches are designed for one specific task in a single geographic region based on specific frequency bands of satellite data. Therefore, DL models used to map specific natural hazards struggle with their generalization to other types of natural hazards in unseen regions. In this work, we propose a methodology to significantly improve the generalizability of DL natural hazards mappers based on pre-training on a suitable pre-task. Without access to any data from the target domain, we demonstrate this improved generalizability across four U-Net architectures for the segmentation of unseen natural hazards. Importantly, our method is invariant to geographic differences and differences in the type of frequency bands of satellite data. By leveraging characteristics of unlabeled images from the target domain that are publicly available, our approach is able to further improve the generalization behavior without fine-tuning. Thereby, our approach supports the development of foundation models for earth monitoring with the objective of directly segmenting unseen natural hazards across novel geographic regions given different sources of satellite imagery.
    Synthesis of Compositional Animations from Textual Descriptions. (arXiv:2103.14675v6 [cs.CV] UPDATED)
    "How can we animate 3D-characters from a movie script or move robots by simply telling them what we would like them to do?" "How unstructured and complex can we make a sentence and still generate plausible movements from it?" These are questions that need to be answered in the long-run, as the field is still in its infancy. Inspired by these problems, we present a new technique for generating compositional actions, which handles complex input sentences. Our output is a 3D pose sequence depicting the actions in the input sentence. We propose a hierarchical two-stream sequential model to explore a finer joint-level mapping between natural language sentences and 3D pose sequences corresponding to the given motion. We learn two manifold representations of the motion -- one each for the upper body and the lower body movements. Our model can generate plausible pose sequences for short sentences describing single actions as well as long compositional sentences describing multiple sequential and superimposed actions. We evaluate our proposed model on the publicly available KIT Motion-Language Dataset containing 3D pose data with human-annotated sentences. Experimental results show that our model advances the state-of-the-art on text-based motion synthesis in objective evaluations by a margin of 50%. Qualitative evaluations based on a user study indicate that our synthesized motions are perceived to be the closest to the ground-truth motion captures for both short and compositional sentences.
    Rig Inversion by Training a Differentiable Rig Function. (arXiv:2301.09567v1 [cs.GR])
    Rig inversion is the problem of creating a method that can find the rig parameter vector that best approximates a given input mesh. In this paper we propose to solve this problem by first obtaining a differentiable rig function by training a multi layer perceptron to approximate the rig function. This differentiable rig function can then be used to train a deep learning model of rig inversion.
    Dataset Structural Index: Leveraging a machine's perspective towards visual data. (arXiv:2110.04070v3 [cs.CV] UPDATED)
    With advances in vision and perception architectures, we have realized that working with data is equally crucial, if not more, than the algorithms. Till today, we have trained machines based on our knowledge and perspective of the world. The entire concept of Dataset Structural Index(DSI) revolves around understanding a machine`s perspective of the dataset. With DSI, I show two meta values with which we can get more information over a visual dataset and use it to optimize data, create better architectures, and have an ability to guess which model would work best. These two values are the Variety contribution ratio and Similarity matrix. In the paper, I show many applications of DSI, one of which is how the same level of accuracy can be achieved with the same model architectures trained over less amount of data.
    Estimating average causal effects from patient trajectories. (arXiv:2203.01228v2 [stat.ML] UPDATED)
    In medical practice, treatments are selected based on the expected causal effects on patient outcomes. Here, the gold standard for estimating causal effects are randomized controlled trials; however, such trials are costly and sometimes even unethical. Instead, medical practice is increasingly interested in estimating causal effects among patient (sub)groups from electronic health records, that is, observational data. In this paper, we aim at estimating the average causal effect (ACE) from observational data (patient trajectories) that are collected over time. For this, we propose DeepACE: an end-to-end deep learning model. DeepACE leverages the iterative G-computation formula to adjust for the bias induced by time-varying confounders. Moreover, we develop a novel sequential targeting procedure which ensures that DeepACE has favorable theoretical properties, i.e., is doubly robust and asymptotically efficient. To the best of our knowledge, this is the first work that proposes an end-to-end deep learning model tailored for estimating time-varying ACEs. We compare DeepACE in an extensive number of experiments, confirming that it achieves state-of-the-art performance. We further provide a case study for patients suffering from low back pain to demonstrate that DeepACE generates important and meaningful findings for clinical practice. Our work enables practitioners to develop effective treatment recommendations based on population effects.
    WDC Products: A Multi-Dimensional Entity Matching Benchmark. (arXiv:2301.09521v1 [cs.LG])
    The difficulty of an entity matching task depends on a combination of multiple factors such as the amount of corner-case pairs, the fraction of entities in the test set that have not been seen during training, and the size of the development set. Current entity matching benchmarks usually represent single points in the space along such dimensions or they provide for the evaluation of matching methods along a single dimension, for instance the amount of training data. This paper presents WDC Products, an entity matching benchmark which provides for the systematic evaluation of matching systems along combinations of three dimensions while relying on real-word data. The three dimensions are (i) amount of corner-cases (ii) generalization to unseen entities, and (iii) development set size. Generalization to unseen entities is a dimension not covered by any of the existing benchmarks yet but is crucial for evaluating the robustness of entity matching systems. WDC Products is based on heterogeneous product data from thousands of e-shops which mark-up products offers using schema.org annotations. Instead of learning how to match entity pairs, entity matching can also be formulated as a multi-class classification task that requires the matcher to recognize individual entities. WDC Products is the first benchmark that provides a pair-wise and a multi-class formulation of the same tasks and thus allows to directly compare the two alternatives. We evaluate WDC Products using several state-of-the-art matching systems, including Ditto, HierGAT, and R-SupCon. The evaluation shows that all matching systems struggle with unseen entities to varying degrees. It also shows that some systems are more training data efficient than others.
    A Time Series Approach to Parkinson's Disease Classification from EEG. (arXiv:2301.09568v1 [q-bio.NC])
    Firstly, we present a novel representation for EEG data, a 7-variate series of band power coefficients, which enables the use of (previously inaccessible) time series classification methods. Specifically, we implement the multi-resolution representation-based time series classification method MrSQL. This is deployed on a challenging early-stage Parkinson's dataset that includes wakeful and sleep EEG. Initial results are promising with over 90% accuracy achieved on all EEG data types used. Secondly, we present a framework that enables high-importance data types and brain regions for classification to be identified. Using our framework, we find that, across different EEG data types, it is the Prefrontal brain region that has the most predictive power for the presence of Parkinson's Disease. This outperformance was statistically significant versus ten of the twelve other brain regions (not significant versus adjacent Left Frontal and Right Frontal regions). The Prefrontal region of the brain is important for higher-order cognitive processes and our results align with studies that have shown neural dysfunction in the prefrontal cortex in Parkinson's Disease.
    BayBFed: Bayesian Backdoor Defense for Federated Learning. (arXiv:2301.09508v1 [cs.LG])
    Federated learning (FL) allows participants to jointly train a machine learning model without sharing their private data with others. However, FL is vulnerable to poisoning attacks such as backdoor attacks. Consequently, a variety of defenses have recently been proposed, which have primarily utilized intermediary states of the global model (i.e., logits) or distance of the local models (i.e., L2-norm) from the global model to detect malicious backdoors. However, as these approaches directly operate on client updates, their effectiveness depends on factors such as clients' data distribution or the adversary's attack strategies. In this paper, we introduce a novel and more generic backdoor defense framework, called BayBFed, which proposes to utilize probability distributions over client updates to detect malicious updates in FL: it computes a probabilistic measure over the clients' updates to keep track of any adjustments made in the updates, and uses a novel detection algorithm that can leverage this probabilistic measure to efficiently detect and filter out malicious updates. Thus, it overcomes the shortcomings of previous approaches that arise due to the direct usage of client updates; as our probabilistic measure will include all aspects of the local client training strategies. BayBFed utilizes two Bayesian Non-Parametric extensions: (i) a Hierarchical Beta-Bernoulli process to draw a probabilistic measure given the clients' updates, and (ii) an adaptation of the Chinese Restaurant Process (CRP), referred by us as CRP-Jensen, which leverages this probabilistic measure to detect and filter out malicious updates. We extensively evaluate our defense approach on five benchmark datasets: CIFAR10, Reddit, IoT intrusion detection, MNIST, and FMNIST, and show that it can effectively detect and eliminate malicious updates in FL without deteriorating the benign performance of the global model.
    On the Convergence of the Gradient Descent Method with Stochastic Fixed-point Rounding Errors under the Polyak-Lojasiewicz Inequality. (arXiv:2301.09511v1 [stat.ML])
    When training neural networks with low-precision computation, rounding errors often cause stagnation or are detrimental to the convergence of the optimizers; in this paper we study the influence of rounding errors on the convergence of the gradient descent method for problems satisfying the Polyak-Lojasiewicz inequality. Within this context, we show that, in contrast, biased stochastic rounding errors may be beneficial since choosing a proper rounding strategy eliminates the vanishing gradient problem and forces the rounding bias in a descent direction. Furthermore, we obtain a bound on the convergence rate that is stricter than the one achieved by unbiased stochastic rounding. The theoretical analysis is validated by comparing the performances of various rounding strategies when optimizing several examples using low-precision fixed-point number formats.
    FInC Flow: Fast and Invertible $k \times k$ Convolutions for Normalizing Flows. (arXiv:2301.09266v1 [cs.CV])
    Invertible convolutions have been an essential element for building expressive normalizing flow-based generative models since their introduction in Glow. Several attempts have been made to design invertible $k \times k$ convolutions that are efficient in training and sampling passes. Though these attempts have improved the expressivity and sampling efficiency, they severely lagged behind Glow which used only $1 \times 1$ convolutions in terms of sampling time. Also, many of the approaches mask a large number of parameters of the underlying convolution, resulting in lower expressivity on a fixed run-time budget. We propose a $k \times k$ convolutional layer and Deep Normalizing Flow architecture which i.) has a fast parallel inversion algorithm with running time O$(n k^2)$ ($n$ is height and width of the input image and k is kernel size), ii.) masks the minimal amount of learnable parameters in a layer. iii.) gives better forward pass and sampling times comparable to other $k \times k$ convolution-based models on real-world benchmarks. We provide an implementation of the proposed parallel algorithm for sampling using our invertible convolutions on GPUs. Benchmarks on CIFAR-10, ImageNet, and CelebA datasets show comparable performance to previous works regarding bits per dimension while significantly improving the sampling time.
    Estimating the energy requirements for long term memory formation. (arXiv:2301.09565v1 [q-bio.NC])
    Brains consume metabolic energy to process information, but also to store memories. The energy required for memory formation can be substantial, for instance in fruit flies memory formation leads to a shorter lifespan upon subsequent starvation (Mery and Kawecki, 2005). Here we estimate that the energy required corresponds to about 10mJ/bit and compare this to biophysical estimates as well as energy requirements in computer hardware. We conclude that biological memory storage is expensive, but the reason behind it is not known.
    Time-Conditioned Generative Modeling of Object-Centric Representations for Video Decomposition and Prediction. (arXiv:2301.08951v1 [cs.CV])
    When perceiving the world from multiple viewpoints, humans have the ability to reason about the complete objects in a compositional manner even when the object is completely occluded from partial viewpoints. Meanwhile, humans can imagine the novel views after observing multiple viewpoints. The remarkable recent advance in multi-view object-centric learning leaves some problems: 1) the partially or completely occluded shape of objects can not be well reconstructed. 2) the novel viewpoint prediction depends on expensive viewpoint annotations rather than implicit view rules. This makes the agent fail to perform like humans. In this paper, we introduce a time-conditioned generative model for videos. To reconstruct the complete shape of the object accurately, we enhance the disentanglement between different latent representations: view latent representations are jointly inferred based on the Transformer and then cooperate with the sequential extension of Slot Attention to learn object-centric representations. The model also achieves the new ability: Gaussian processes are employed as priors of view latent variables for generation and novel-view prediction without viewpoint annotations. Experiments on multiple specifically designed synthetic datasets have shown that the proposed model can 1) make the video decomposition, 2) reconstruct the complete shapes of objects, and 3) make the novel viewpoint prediction without viewpoint annotations.
    Federated Sufficient Dimension Reduction Through High-Dimensional Sparse Sliced Inverse Regression. (arXiv:2301.09500v1 [stat.ML])
    Federated learning has become a popular tool in the big data era nowadays. It trains a centralized model based on data from different clients while keeping data decentralized. In this paper, we propose a federated sparse sliced inverse regression algorithm for the first time. Our method can simultaneously estimate the central dimension reduction subspace and perform variable selection in a federated setting. We transform this federated high-dimensional sparse sliced inverse regression problem into a convex optimization problem by constructing the covariance matrix safely and losslessly. We then use a linearized alternating direction method of multipliers algorithm to estimate the central subspace. We also give approaches of Bayesian information criterion and hold-out validation to ascertain the dimension of the central subspace and the hyper-parameter of the algorithm. We establish an upper bound of the statistical error rate of our estimator under the heterogeneous setting. We demonstrate the effectiveness of our method through simulations and real world applications.
    The Entoptic Field Camera as Metaphor-Driven Research-through-Design with AI Technologies. (arXiv:2301.09545v1 [cs.HC])
    Artificial intelligence (AI) technologies are widely deployed in smartphone photography; and prompt-based image synthesis models have rapidly become commonplace. In this paper, we describe a Research-through-Design (RtD) project which explores this shift in the means and modes of image production via the creation and use of the Entoptic Field Camera. Entoptic phenomena usually refer to perceptions of floaters or bright blue dots stemming from the physiological interplay of the eye and brain. We use the term entoptic as a metaphor to investigate how the material interplay of data and models in AI technologies shapes human experiences of reality. Through our case study using first-person design and a field study, we offer implications for critical, reflective, more-than-human and ludic design to engage AI technologies; the conceptualisation of an RtD research space which contributes to AI literacy discourses; and outline a research trajectory concerning materiality and design affordances of AI technologies.
    Deep Learning Meets Sparse Regularization: A Signal Processing Perspective. (arXiv:2301.09554v1 [stat.ML])
    Deep learning has been widely successful in practice and most state-of-the-art machine learning methods are based on neural networks. Lacking, however, is a rigorous mathematical theory that adequately explains the amazing performance of deep neural networks. In this article, we present a relatively new mathematical framework that provides the beginning of a deeper understanding of deep learning. This framework precisely characterizes the functional properties of neural networks that are trained to fit to data. The key mathematical tools which support this framework include transform-domain sparse regularization, the Radon transform of computed tomography, and approximation theory, which are all techniques deeply rooted in signal processing. This framework explains the effect of weight decay regularization in neural network training, the use of skip connections and low-rank weight matrices in network architectures, the role of sparsity in neural networks, and explains why neural networks can perform well in high-dimensional problems.
    ECGAN: Self-supervised generative adversarial network for electrocardiography. (arXiv:2301.09496v1 [cs.LG])
    High-quality synthetic data can support the development of effective predictive models for biomedical tasks, especially in rare diseases or when subject to compelling privacy constraints. These limitations, for instance, negatively impact open access to electrocardiography datasets about arrhythmias. This work introduces a self-supervised approach to the generation of synthetic electrocardiography time series which is shown to promote morphological plausibility. Our model (ECGAN) allows conditioning the generative process for specific rhythm abnormalities, enhancing synchronization and diversity across samples with respect to literature models. A dedicated sample quality assessment framework is also defined, leveraging arrhythmia classifiers. The empirical results highlight a substantial improvement against state-of-the-art generative models for sequences and audio synthesis.
    Sampling-based Nystr\"om Approximation and Kernel Quadrature. (arXiv:2301.09517v1 [math.NA])
    We analyze the Nystr\"om approximation of a positive definite kernel associated with a probability measure. We first prove an improved error bound for the conventional Nystr\"om approximation with i.i.d. sampling and singular-value decomposition in the continuous regime; the proof techniques are borrowed from statistical learning theory. We further introduce a refined selection of subspaces in Nystr\"om approximation with theoretical guarantees that is applicable to non-i.i.d. landmark points. Finally, we discuss their application to convex kernel quadrature and give novel theoretical guarantees as well as numerical observations.
    DIFFormer: Scalable (Graph) Transformers Induced by Energy Constrained Diffusion. (arXiv:2301.09474v1 [cs.LG])
    Real-world data generation often involves complex inter-dependencies among instances, violating the IID-data hypothesis of standard learning paradigms and posing a challenge for uncovering the geometric structures for learning desired instance representations. To this end, we introduce an energy constrained diffusion model which encodes a batch of instances from a dataset into evolutionary states that progressively incorporate other instances' information by their interactions. The diffusion process is constrained by descent criteria w.r.t.~a principled energy function that characterizes the global consistency of instance representations over latent structures. We provide rigorous theory that implies closed-form optimal estimates for the pairwise diffusion strength among arbitrary instance pairs, which gives rise to a new class of neural encoders, dubbed as DIFFormer (diffusion-based Transformers), with two instantiations: a simple version with linear complexity for prohibitive instance numbers, and an advanced version for learning complex structures. Experiments highlight the wide applicability of our model as a general-purpose encoder backbone with superior performance in various tasks, such as node classification on large graphs, semi-supervised image/text classification, and spatial-temporal dynamics prediction.
    Rethinking the Expressive Power of GNNs via Graph Biconnectivity. (arXiv:2301.09505v1 [cs.LG])
    Designing expressive Graph Neural Networks (GNNs) is a central topic in learning graph-structured data. While numerous approaches have been proposed to improve GNNs in terms of the Weisfeiler-Lehman (WL) test, generally there is still a lack of deep understanding of what additional power they can systematically and provably gain. In this paper, we take a fundamentally different perspective to study the expressive power of GNNs beyond the WL test. Specifically, we introduce a novel class of expressivity metrics via graph biconnectivity and highlight their importance in both theory and practice. As biconnectivity can be easily calculated using simple algorithms that have linear computational costs, it is natural to expect that popular GNNs can learn it easily as well. However, after a thorough review of prior GNN architectures, we surprisingly find that most of them are not expressive for any of these metrics. The only exception is the ESAN framework (Bevilacqua et al., 2022), for which we give a theoretical justification of its power. We proceed to introduce a principled and more efficient approach, called the Generalized Distance Weisfeiler-Lehman (GD-WL), which is provably expressive for all biconnectivity metrics. Practically, we show GD-WL can be implemented by a Transformer-like architecture that preserves expressiveness and enjoys full parallelizability. A set of experiments on both synthetic and real datasets demonstrates that our approach can consistently outperform prior GNN architectures.
    Digital Twins for Marine Operations: A Brief Review on Their Implementation. (arXiv:2301.09574v1 [eess.SY])
    While the concept of a digital twin to support maritime operations is gaining attention for predictive maintenance, real-time monitoring, control, and overall process optimization, clarity on its implementation is missing in the literature. Therefore, in this review we show how different authors implemented their digital twins, discuss our findings, and finally give insights on future research directions.
    Characterizing Polarization in Social Networks using the Signed Relational Latent Distance Model. (arXiv:2301.09507v1 [stat.ML])
    Graph representation learning has become a prominent tool for the characterization and understanding of the structure of networks in general and social networks in particular. Typically, these representation learning approaches embed the networks into a low-dimensional space in which the role of each individual can be characterized in terms of their latent position. A major current concern in social networks is the emergence of polarization and filter bubbles promoting a mindset of "us-versus-them" that may be defined by extreme positions believed to ultimately lead to political violence and the erosion of democracy. Such polarized networks are typically characterized in terms of signed links reflecting likes and dislikes. We propose the latent Signed relational Latent dIstance Model (SLIM) utilizing for the first time the Skellam distribution as a likelihood function for signed networks and extend the modeling to the characterization of distinct extreme positions by constraining the embedding space to polytopes. On four real social signed networks of polarization, we demonstrate that the model extracts low-dimensional characterizations that well predict friendships and animosity while providing interpretable visualizations defined by extreme positions when endowing the model with an embedding space restricted to polytopes.
    A New Approach to Learning Linear Dynamical Systems. (arXiv:2301.09519v1 [math.OC])
    Linear dynamical systems are the foundational statistical model upon which control theory is built. Both the celebrated Kalman filter and the linear quadratic regulator require knowledge of the system dynamics to provide analytic guarantees. Naturally, learning the dynamics of a linear dynamical system from linear measurements has been intensively studied since Rudolph Kalman's pioneering work in the 1960's. Towards these ends, we provide the first polynomial time algorithm for learning a linear dynamical system from a polynomial length trajectory up to polynomial error in the system parameters under essentially minimal assumptions: observability, controllability, and marginal stability. Our algorithm is built on a method of moments estimator to directly estimate Markov parameters from which the dynamics can be extracted. Furthermore, we provide statistical lower bounds when our observability and controllability assumptions are violated.
    M22: A Communication-Efficient Algorithm for Federated Learning Inspired by Rate-Distortion. (arXiv:2301.09269v1 [cs.LG])
    In federated learning (FL), the communication constraint between the remote learners and the Parameter Server (PS) is a crucial bottleneck. For this reason, model updates must be compressed so as to minimize the loss in accuracy resulting from the communication constraint. This paper proposes ``\emph{${\bf M}$-magnitude weighted $L_{\bf 2}$ distortion + $\bf 2$ degrees of freedom''} (M22) algorithm, a rate-distortion inspired approach to gradient compression for federated training of deep neural networks (DNNs). In particular, we propose a family of distortion measures between the original gradient and the reconstruction we referred to as ``$M$-magnitude weighted $L_2$'' distortion, and we assume that gradient updates follow an i.i.d. distribution -- generalized normal or Weibull, which have two degrees of freedom. In both the distortion measure and the gradient, there is one free parameter for each that can be fitted as a function of the iteration number. Given a choice of gradient distribution and distortion measure, we design the quantizer minimizing the expected distortion in gradient reconstruction. To measure the gradient compression performance under a communication constraint, we define the \emph{per-bit accuracy} as the optimal improvement in accuracy that one bit of communication brings to the centralized model over the training period. Using this performance measure, we systematically benchmark the choice of gradient distribution and distortion measure. We provide substantial insights on the role of these choices and argue that significant performance improvements can be attained using such a rate-distortion inspired compressor.
    Explaining the effects of non-convergent sampling in the training of Energy-Based Models. (arXiv:2301.09428v1 [cs.LG])
    In this paper, we quantify the impact of using non-convergent Markov chains to train Energy-Based models (EBMs). In particular, we show analytically that EBMs trained with non-persistent short runs to estimate the gradient can perfectly reproduce a set of empirical statistics of the data, not at the level of the equilibrium measure, but through a precise dynamical process. Our results provide a first-principles explanation for the observations of recent works proposing the strategy of using short runs starting from random initial conditions as an efficient way to generate high-quality samples in EBMs, and lay the groundwork for using EBMs as diffusion models. After explaining this effect in generic EBMs, we analyze two solvable models in which the effect of the non-convergent sampling in the trained parameters can be described in detail. Finally, we test these predictions numerically on the Boltzmann machine.
    StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis. (arXiv:2301.09515v1 [cs.LG])
    Text-to-image synthesis has recently seen significant progress thanks to large pretrained language models, large-scale training data, and the introduction of scalable model families such as diffusion and autoregressive models. However, the best-performing models require iterative evaluation to generate a single sample. In contrast, generative adversarial networks (GANs) only need a single forward pass. They are thus much faster, but they currently remain far behind the state-of-the-art in large-scale text-to-image synthesis. This paper aims to identify the necessary steps to regain competitiveness. Our proposed model, StyleGAN-T, addresses the specific requirements of large-scale text-to-image synthesis, such as large capacity, stable training on diverse datasets, strong text alignment, and controllable variation vs. text alignment tradeoff. StyleGAN-T significantly improves over previous GANs and outperforms distilled diffusion models - the previous state-of-the-art in fast text-to-image synthesis - in terms of sample quality and speed.
    SpArX: Sparse Argumentative Explanations for Neural Networks. (arXiv:2301.09559v1 [cs.AI])
    Neural networks (NNs) have various applications in AI, but explaining their decision process remains challenging. Existing approaches often focus on explaining how changing individual inputs affects NNs' outputs. However, an explanation that is consistent with the input-output behaviour of an NN is not necessarily faithful to the actual mechanics thereof. In this paper, we exploit relationships between multi-layer perceptrons (MLPs) and quantitative argumentation frameworks (QAFs) to create argumentative explanations for the mechanics of MLPs. Our SpArX method first sparsifies the MLP while maintaining as much of the original mechanics as possible. It then translates the sparse MLP into an equivalent QAF to shed light on the underlying decision process of the MLP, producing global and/or local explanations. We demonstrate experimentally that SpArX can give more faithful explanations than existing approaches, while simultaneously providing deeper insights into the actual reasoning process of MLPs.
    Modality-Agnostic Variational Compression of Implicit Neural Representations. (arXiv:2301.09479v1 [stat.ML])
    We introduce a modality-agnostic neural data compression algorithm based on a functional view of data and parameterised as an Implicit Neural Representation (INR). Bridging the gap between latent coding and sparsity, we obtain compact latent representations which are non-linearly mapped to a soft gating mechanism capable of specialising a shared INR base network to each data item through subnetwork selection. After obtaining a dataset of such compact latent representations, we directly optimise the rate/distortion trade-off in this modality-agnostic space using non-linear transform coding. We term this method Variational Compression of Implicit Neural Representation (VC-INR) and show both improved performance given the same representational capacity pre quantisation while also outperforming previous quantisation schemes used for other INR-based techniques. Our experiments demonstrate strong results over a large set of diverse data modalities using the same algorithm without any modality-specific inductive biases. We show results on images, climate data, 3D shapes and scenes as well as audio and video, introducing VC-INR as the first INR-based method to outperform codecs as well-known and diverse as JPEG 2000, MP3 and AVC/HEVC on their respective modalities.
    DASTSiam: Spatio-Temporal Fusion and Discriminative Augmentation for Improved Siamese Tracking. (arXiv:2301.09063v1 [cs.CV])
    Tracking tasks based on deep neural networks have greatly improved with the emergence of Siamese trackers. However, the appearance of targets often changes during tracking, which can reduce the robustness of the tracker when facing challenges such as aspect ratio change, occlusion, and scale variation. In addition, cluttered backgrounds can lead to multiple high response points in the response map, leading to incorrect target positioning. In this paper, we introduce two transformer-based modules to improve Siamese tracking called DASTSiam: the spatio-temporal (ST) fusion module and the Discriminative Augmentation (DA) module. The ST module uses cross-attention based accumulation of historical cues to improve robustness against object appearance changes, while the DA module associates semantic information between the template and search region to improve target discrimination. Moreover, Modifying the label assignment of anchors also improves the reliability of the object location. Our modules can be used with all Siamese trackers and show improved performance on several public datasets through comparative and ablation experiments.
    Modeling Non-deterministic Human Behaviors in Discrete Food Choices. (arXiv:2301.09454v1 [stat.ML])
    We establish a non-deterministic model that predicts a user's food preferences from their demographic information. Our simulator is based on NHANES dataset and domain expert knowledge in the form of established behavioral studies. Our model can be used to generate an arbitrary amount of synthetic datapoints that are similar in distribution to the original dataset and align with behavioral science expectations. Such a simulator can be used in a variety of machine learning tasks and especially in applications requiring human behavior prediction.
    A Framework for Evaluating the Impact of Food Security Scenarios. (arXiv:2301.09320v1 [cs.LG])
    This study proposes an approach for predicting the impacts of scenarios on food security and demonstrates its application in a case study. The approach involves two main steps: (1) scenario definition, in which the end user specifies the assumptions and impacts of the scenario using a scenario template, and (2) scenario evaluation, in which a Vector Autoregression (VAR) model is used in combination with Monte Carlo simulation to generate predictions for the impacts of the scenario based on the defined assumptions and impacts. The case study is based on a proprietary time series food security database created using data from the Food and Agriculture Organization of the United Nations (FAOSTAT), the World Bank, and the United States Department of Agriculture (USDA). The database contains a wide range of data on various indicators of food security, such as production, trade, consumption, prices, availability, access, and nutritional value. The results show that the proposed approach can be used to predict the potential impacts of scenarios on food security and that the proprietary time series food security database can be used to support this approach. The study provides specific insights on how this approach can inform decision-making processes related to food security such as food prices and availability in the case study region.
    An iterative multi-fidelity approach for model order reduction of multi-dimensional input parametric PDE systems. (arXiv:2301.09483v1 [math.NA])
    We propose a parametric sampling strategy for the reduction of large-scale PDE systems with multidimensional input parametric spaces by leveraging models of different fidelity. The design of this methodology allows a user to adaptively sample points ad hoc from a discrete training set with no prior requirement of error estimators. It is achieved by exploiting low-fidelity models throughout the parametric space to sample points using an efficient sampling strategy, and at the sampled parametric points, high-fidelity models are evaluated to recover the reduced basis functions. The low-fidelity models are then adapted with the reduced order models ( ROMs) built by projection onto the subspace spanned by the recovered basis functions. The process continues until the low-fidelity model can represent the high-fidelity model adequately for all the parameters in the parametric space. Since the proposed methodology leverages the use of low-fidelity models to assimilate the solution database, it significantly reduces the computational cost in the offline stage. The highlight of this article is to present the construction of the initial low-fidelity model, and a sampling strategy based on the discrete empirical interpolation method (DEIM). We test this approach on a 2D steady-state heat conduction problem for two different input parameters and make a qualitative comparison with the classical greedy reduced basis method (RBM), and further test on a 9-dimensional parametric non-coercive elliptic problem and analyze the computational performance based on different tuning of greedy selection of points.
    Speeding Up BatchBALD: A k-BALD Family of Approximations for Active Learning. (arXiv:2301.09490v1 [cs.LG])
    Active learning is a powerful method for training machine learning models with limited labeled data. One commonly used technique for active learning is BatchBALD, which uses Bayesian neural networks to find the most informative points to label in a pool set. However, BatchBALD can be very slow to compute, especially for larger datasets. In this paper, we propose a new approximation, k-BALD, which uses k-wise mutual information terms to approximate BatchBALD, making it much less expensive to compute. Results on the MNIST dataset show that k-BALD is significantly faster than BatchBALD while maintaining similar performance. Additionally, we also propose a dynamic approach for choosing k based on the quality of the approximation, making it more efficient for larger datasets.
    A Simple Recipe for Competitive Low-compute Self supervised Vision Models. (arXiv:2301.09451v1 [cs.CV])
    Self-supervised methods in vision have been mostly focused on large architectures as they seem to suffer from a significant performance drop for smaller architectures. In this paper, we propose a simple self-supervised distillation technique that can train high performance low-compute neural networks. Our main insight is that existing joint-embedding based SSL methods can be repurposed for knowledge distillation from a large self-supervised teacher to a small student model. Thus, we call our method Replace one Branch (RoB) as it simply replaces one branch of the joint-embedding training with a large teacher model. RoB is widely applicable to a number of architectures such as small ResNets, MobileNets and ViT, and pretrained models such as DINO, SwAV or iBOT. When pretraining on the ImageNet dataset, RoB yields models that compete with supervised knowledge distillation. When applied to MSN, RoB produces students with strong semi-supervised capabilities. Finally, our best ViT-Tiny models improve over prior SSL state-of-the-art on ImageNet by $2.3\%$ and are on par or better than a supervised distilled DeiT on five downstream transfer tasks (iNaturalist, CIFAR, Clevr/Count, Clevr/Dist and Places). We hope RoB enables practical self-supervision at smaller scale.
    New Insights into Multi-Calibration. (arXiv:2301.08837v1 [cs.LG])
    We identify a novel connection between the recent literature on multi-group fairness for prediction algorithms and well-established notions of graph regularity from extremal graph theory. We frame our investigation using new, statistical distance-based variants of multi-calibration that are closely related to the concept of outcome indistinguishability. Adopting this perspective leads us naturally not only to our graph theoretic results, but also to new multi-calibration algorithms with improved complexity in certain parameter regimes, and to a generalization of a state-of-the-art result on omniprediction. Along the way, we also unify several prior algorithms for achieving multi-group fairness, as well as their analyses, through the lens of no-regret learning.
    LSTM and CNN application for core-collapse supernova search in gravitational wave real data. (arXiv:2301.09387v1 [astro-ph.IM])
    $Context.$ Core-collapse supernovae (CCSNe) are expected to emit gravitational wave signals that could be detected by current and future generation interferometers within the Milky Way and nearby galaxies. The stochastic nature of the signal arising from CCSNe requires alternative detection methods to matched filtering. $Aims.$ We aim to show the potential of machine learning (ML) for multi-label classification of different CCSNe simulated signals and noise transients using real data. We compared the performance of 1D and 2D convolutional neural networks (CNNs) on single and multiple detector data. For the first time, we tested multi-label classification also with long short-term memory (LSTM) networks. $Methods.$ We applied a search and classification procedure for CCSNe signals, using an event trigger generator, the Wavelet Detection Filter (WDF), coupled with ML. We used time series and time-frequency representations of the data as inputs to the ML models. To compute classification accuracies, we simultaneously injected, at detectable distance of 1\,kpc, CCSN waveforms, obtained from recent hydrodynamical simulations of neutrino-driven core-collapse, onto interferometer noise from the O2 LIGO and Virgo science run. $Results.$ We compared the performance of the three models on single detector data. We then merged the output of the models for single detector classification of noise and astrophysical transients, obtaining overall accuracies for LIGO ($\sim99\%$) and ($\sim80\%$) for Virgo. We extended our analysis to the multi-detector case using triggers coincident among the three ITFs and achieved an accuracy of $\sim98\%$.
    Ordinal Regression for Difficulty Estimation of StepMania Levels. (arXiv:2301.09485v1 [cs.LG])
    StepMania is a popular open-source clone of a rhythm-based video game. As is common in popular games, there is a large number of community-designed levels. It is often difficult for players and level authors to determine the difficulty level of such community contributions. In this work, we formalize and analyze the difficulty prediction task on StepMania levels as an ordinal regression (OR) task. We standardize a more extensive and diverse selection of this data resulting in five data sets, two of which are extensions of previous work. We evaluate many competitive OR and non-OR models, demonstrating that neural network-based models significantly outperform the state of the art and that StepMania-level data makes for an excellent test bed for deep OR models. We conclude with a user experiment showing our trained models' superiority over human labeling.
    Hierarchically branched diffusion models for efficient and interpretable multi-class conditional generation. (arXiv:2212.10777v2 [cs.LG] UPDATED)
    Diffusion models have achieved justifiable popularity by attaining state-of-the-art performance in generating realistic objects, including when conditioning generation on labels. Current diffusion models are universally linear in nature, modeling diffusion identically for objects of all classes. For the multi-class conditional generation problem, we propose a novel, structurally unique framework of diffusion models which are hierarchically branched according to the inherent relationships between classes. In this work, we showcase several advantages of branched diffusion models. We demonstrate that branched models generate samples more efficiently, and are more easily extended to novel classes in a continual-learning setting. We also show that branched models enjoy a unique interpretability that offers insight into the modeled data distribution. Branched diffusion models represent an alternative paradigm to their traditional linear counterparts, and can have large impacts in how we use diffusion models for efficient generation, online learning, and scientific discovery.
    LF-checker: Machine Learning Acceleration of Bounded Model Checking for Concurrency Verification (Competition Contribution). (arXiv:2301.09142v1 [cs.LG])
    We describe and evaluate LF-checker, a metaverifier tool based on machine learning. It extracts multiple features of the program under test and predicts the optimal configuration (flags) of a bounded model checker with a decision tree. Our current work is specialised in concurrency verification and employs ESBMC as a back-end verification engine. In the paper, we demonstrate that LF-checker achieves better results than the default configuration of the underlying verification engine.
    GP-NAS-ensemble: a model for NAS Performance Prediction. (arXiv:2301.09231v1 [cs.LG])
    It is of great significance to estimate the performance of a given model architecture without training in the application of Neural Architecture Search (NAS) as it may take a lot of time to evaluate the performance of an architecture. In this paper, a novel NAS framework called GP-NAS-ensemble is proposed to predict the performance of a neural network architecture with a small training dataset. We make several improvements on the GP-NAS model to make it share the advantage of ensemble learning methods. Our method ranks second in the CVPR2022 second lightweight NAS challenge performance prediction track.
    SMDDH: Singleton Mention detection using Deep Learning in Hindi Text. (arXiv:2301.09361v1 [cs.CL])
    Mention detection is an important component of coreference resolution system, where mentions such as name, nominal, and pronominals are identified. These mentions can be purely coreferential mentions or singleton mentions (non-coreferential mentions). Coreferential mentions are those mentions in a text that refer to the same entities in a real world. Whereas, singleton mentions are mentioned only once in the text and do not participate in the coreference as they are not mentioned again in the following text. Filtering of these singleton mentions can substantially improve the performance of a coreference resolution process. This paper proposes a singleton mention detection module based on a fully connected network and a Convolutional neural network for Hindi text. This model utilizes a few hand-crafted features and context information, and word embedding for words. The coreference annotated Hindi dataset comprising of 3.6K sentences, and 78K tokens are used for the task. In terms of Precision, Recall, and F-measure, the experimental findings obtained are excellent.
    A Comprehensive Survey on Heart Sound Analysis in the Deep Learning Era. (arXiv:2301.09362v1 [cs.SD])
    Heart sound auscultation has been demonstrated to be beneficial in clinical usage for early screening of cardiovascular diseases. Due to the high requirement of well-trained professionals for auscultation, automatic auscultation benefiting from signal processing and machine learning can help auxiliary diagnosis and reduce the burdens of training professional clinicians. Nevertheless, classic machine learning is limited to performance improvement in the era of big data. Deep learning has achieved better performance than classic machine learning in many research fields, as it employs more complex model architectures with stronger capability of extracting effective representations. Deep learning has been successfully applied to heart sound analysis in the past years. As most review works about heart sound analysis were given before 2017, the present survey is the first to work on a comprehensive overview to summarise papers on heart sound analysis with deep learning in the past six years 2017--2022. We introduce both classic machine learning and deep learning for comparison, and further offer insights about the advances and future research directions in deep learning for heart sound analysis.
    MATT: Multimodal Attention Level Estimation for e-learning Platforms. (arXiv:2301.09174v1 [cs.CV])
    This work presents a new multimodal system for remote attention level estimation based on multimodal face analysis. Our multimodal approach uses different parameters and signals obtained from the behavior and physiological processes that have been related to modeling cognitive load such as faces gestures (e.g., blink rate, facial actions units) and user actions (e.g., head pose, distance to the camera). The multimodal system uses the following modules based on Convolutional Neural Networks (CNNs): Eye blink detection, head pose estimation, facial landmark detection, and facial expression features. First, we individually evaluate the proposed modules in the task of estimating the student's attention level captured during online e-learning sessions. For that we trained binary classifiers (high or low attention) based on Support Vector Machines (SVM) for each module. Secondly, we find out to what extent multimodal score level fusion improves the attention level estimation. The mEBAL database is used in the experimental framework, a public multi-modal database for attention level estimation obtained in an e-learning environment that contains data from 38 users while conducting several e-learning tasks of variable difficulty (creating changes in student cognitive loads).
    A Tale of Two Latent Flows: Learning Latent Space Normalizing Flow with Short-run Langevin Flow for Approximate Inference. (arXiv:2301.09300v1 [stat.ML])
    We study a normalizing flow in the latent space of a top-down generator model, in which the normalizing flow model plays the role of the informative prior model of the generator. We propose to jointly learn the latent space normalizing flow prior model and the top-down generator model by a Markov chain Monte Carlo (MCMC)-based maximum likelihood algorithm, where a short-run Langevin sampling from the intractable posterior distribution is performed to infer the latent variables for each observed example, so that the parameters of the normalizing flow prior and the generator can be updated with the inferred latent variables. We show that, under the scenario of non-convergent short-run MCMC, the finite step Langevin dynamics is a flow-like approximate inference model and the learning objective actually follows the perturbation of the maximum likelihood estimation (MLE). We further point out that the learning framework seeks to (i) match the latent space normalizing flow and the aggregated posterior produced by the short-run Langevin flow, and (ii) bias the model from MLE such that the short-run Langevin flow inference is close to the true posterior. Empirical results of extensive experiments validate the effectiveness of the proposed latent space normalizing flow model in the tasks of image generation, image reconstruction, anomaly detection, supervised image inpainting and unsupervised image recovery.
    Enabling Hard Constraints in Differentiable Neural Network and Accelerator Co-Exploration. (arXiv:2301.09312v1 [cs.LG])
    Co-exploration of an optimal neural architecture and its hardware accelerator is an approach of rising interest which addresses the computational cost problem, especially in low-profile systems. The large co-exploration space is often handled by adopting the idea of differentiable neural architecture search. However, despite the superior search efficiency of the differentiable co-exploration, it faces a critical challenge of not being able to systematically satisfy hard constraints such as frame rate. To handle the hard constraint problem of differentiable co-exploration, we propose HDX, which searches for hard-constrained solutions without compromising the global design objectives. By manipulating the gradients in the interest of the given hard constraint, high-quality solutions satisfying the constraint can be obtained.
    On Multi-Agent Deep Deterministic Policy Gradients and their Explainability for SMARTS Environment. (arXiv:2301.09420v1 [cs.LG])
    Multi-Agent RL or MARL is one of the complex problems in Autonomous Driving literature that hampers the release of fully-autonomous vehicles today. Several simulators have been in iteration after their inception to mitigate the problem of complex scenarios with multiple agents in Autonomous Driving. One such simulator--SMARTS, discusses the importance of cooperative multi-agent learning. For this problem, we discuss two approaches--MAPPO and MADDPG, which are based on-policy and off-policy RL approaches. We compare our results with the state-of-the-art results for this challenge and discuss the potential areas of improvement while discussing the explainability of these approaches in conjunction with waypoints in the SMARTS environment.
    Explainable Quantum Machine Learning. (arXiv:2301.09138v1 [quant-ph])
    Methods of artificial intelligence (AI) and especially machine learning (ML) have been growing ever more complex, and at the same time have more and more impact on people's lives. This leads to explainable AI (XAI) manifesting itself as an important research field that helps humans to better comprehend ML systems. In parallel, quantum machine learning (QML) is emerging with the ongoing improvement of quantum computing hardware combined with its increasing availability via cloud services. QML enables quantum-enhanced ML in which quantum mechanics is exploited to facilitate ML tasks, typically in form of quantum-classical hybrid algorithms that combine quantum and classical resources. Quantum gates constitute the building blocks of gate-based quantum hardware and form circuits that can be used for quantum computations. For QML applications, quantum circuits are typically parameterized and their parameters are optimized classically such that a suitably defined objective function is minimized. Inspired by XAI, we raise the question of explainability of such circuits by quantifying the importance of (groups of) gates for specific goals. To this end, we transfer and adapt the well-established concept of Shapley values to the quantum realm. The resulting attributions can be interpreted as explanations for why a specific circuit works well for a given task, improving the understanding of how to construct parameterized (or variational) quantum circuits, and fostering their human interpretability in general. An experimental evaluation on simulators and two superconducting quantum hardware devices demonstrates the benefits of the proposed framework for classification, generative modeling, transpilation, and optimization. Furthermore, our results shed some light on the role of specific gates in popular QML approaches.
    Prompt Federated Learning for Weather Forecasting: Toward Foundation Models on Meteorological Data. (arXiv:2301.09152v1 [cs.LG])
    To tackle the global climate challenge, it urgently needs to develop a collaborative platform for comprehensive weather forecasting on large-scale meteorological data. Despite urgency, heterogeneous meteorological sensors across countries and regions, inevitably causing multivariate heterogeneity and data exposure, become the main barrier. This paper develops a foundation model across regions capable of understanding complex meteorological data and providing weather forecasting. To relieve the data exposure concern across regions, a novel federated learning approach has been proposed to collaboratively learn a brand-new spatio-temporal Transformer-based foundation model across participants with heterogeneous meteorological data. Moreover, a novel prompt learning mechanism has been adopted to satisfy low-resourced sensors' communication and computational constraints. The effectiveness of the proposed method has been demonstrated on classical weather forecasting tasks using three meteorological datasets with multivariate time series.
    Lower Bounds on Learning Pauli Channels. (arXiv:2301.09192v1 [quant-ph])
    Understanding the noise affecting a quantum device is of fundamental importance for scaling quantum technologies. A particularly important class of noise models is that of Pauli channels, as randomized compiling techniques can effectively bring any quantum channel to this form and are significantly more structured than general quantum channels. In this paper, we show fundamental lower bounds on the sample complexity for learning Pauli channels in diamond norm with unentangled measurements. We consider both adaptive and non-adaptive strategies. In the non-adaptive setting, we show a lower bound of $\Omega(2^{3n}\epsilon^{-2})$ to learn an $n$-qubit Pauli channel. In particular, this shows that the recently introduced learning procedure by Flammia and Wallman is essentially optimal. In the adaptive setting, we show a lower bound of $\Omega(2^{2.5n}\epsilon^{-2})$ for $\epsilon=\mathcal{O}(2^{-n})$, and a lower bound of $\Omega(2^{2n}\epsilon^{-2} )$ for any $\epsilon > 0$. This last lower bound even applies for arbitrarily many sequential uses of the channel, as long as they are only interspersed with other unital operations.
    On the Expressive Power of Geometric Graph Neural Networks. (arXiv:2301.09308v1 [cs.LG])
    The expressive power of Graph Neural Networks (GNNs) has been studied extensively through the Weisfeiler-Leman (WL) graph isomorphism test. However, standard GNNs and the WL framework are inapplicable for geometric graphs embedded in Euclidean space, such as biomolecules, materials, and other physical systems. In this work, we propose a geometric version of the WL test (GWL) for discriminating geometric graphs while respecting the underlying physical symmetries: permutations, rotation, reflection, and translation. We use GWL to characterise the expressive power of geometric GNNs that are invariant or equivariant to physical symmetries in terms of distinguishing geometric graphs. GWL unpacks how key design choices influence geometric GNN expressivity: (1) Invariant layers have limited expressivity as they cannot distinguish one-hop identical geometric graphs; (2) Equivariant layers distinguish a larger class of graphs by propagating geometric information beyond local neighbourhoods; (3) Higher order tensors and scalarisation enable maximally powerful geometric GNNs; and (4) GWL's discrimination-based perspective is equivalent to universal approximation. Synthetic experiments supplementing our results are available at https://github.com/chaitjo/geometric-gnn-dojo
    Abstracting Imperfect Information Away from Two-Player Zero-Sum Games. (arXiv:2301.09159v1 [cs.GT])
    In their seminal work, Nayyar et al. (2013) showed that imperfect information can be abstracted away from common-payoff games by having players publicly announce their policies as they play. This insight underpins sound solvers and decision-time planning algorithms for common-payoff games. Unfortunately, a naive application of the same insight to two-player zero-sum games fails because Nash equilibria of the game with public policy announcements may not correspond to Nash equilibria of the original game. As a consequence, existing sound decision-time planning algorithms require complicated additional mechanisms that have unappealing properties. The main contribution of this work is showing that certain regularized equilibria do not possess the aforementioned non-correspondence problem -- thus, computing them can be treated as perfect information problems. Because these regularized equilibria can be made arbitrarily close to Nash equilibria, our result opens the door to a new perspective on solving two-player zero-sum games and, in particular, yields a simplified framework for decision-time planning in two-player zero-sum games, void of the unappealing properties that plague existing decision-time planning approaches.
    Optimising complexity of CNN models for resource constrained devices: QRS detection case study. (arXiv:2301.09232v1 [cs.LG])
    Traditional DL models are complex and resource hungry and thus, care needs to be taken in designing Internet of (medical) things (IoT, or IoMT) applications balancing efficiency-complexity trade-off. Recent IoT solutions tend to avoid using deep-learning methods due to such complexities, and rather classical filter-based methods are commonly used. We hypothesize that a shallow CNN model can offer satisfactory level of performance in combination by leveraging other essential solution-components, such as post-processing that is suitable for resource constrained environment. In an IoMT application context, QRS-detection and R-peak localisation from ECG signal as a case study, the complexities of CNN models and post-processing were varied to identify a set of combinations suitable for a range of target resource-limited environments. To the best of our knowledge, finding a deploy-able configuration, by incrementally increasing the CNN model complexity, as required to match the target's resource capacity, and leveraging the strength of post-processing, is the first of its kind. The results show that a shallow 2-layer CNN with a suitable post-processing can achieve $>$90\% F1-score, and the scores continue to improving for 8-32 layer CNNs, which can be used to profile target constraint environment. The outcome shows that it is possible to design an optimal DL solution with known target performance characteristics and resource (computing capacity, and memory) constraints.
    A Survey on Actionable Knowledge. (arXiv:2301.09317v1 [cs.LG])
    Actionable Knowledge Discovery (AKD) is a crucial aspect of data mining that is gaining popularity and being applied in a wide range of domains. This is because AKD can extract valuable insights and information, also known as knowledge, from large datasets. The goal of this paper is to examine different research studies that focus on various domains and have different objectives. The paper will review and discuss the methods used in these studies in detail. AKD is a process of identifying and extracting actionable insights from data, which can be used to make informed decisions and improve business outcomes. It is a powerful tool for uncovering patterns and trends in data that can be used for various applications such as customer relationship management, marketing, and fraud detection. The research studies reviewed in this paper will explore different techniques and approaches for AKD in different domains, such as healthcare, finance, and telecommunications. The paper will provide a thorough analysis of the current state of AKD in the field and will review the main methods used by various research studies. Additionally, the paper will evaluate the advantages and disadvantages of each method and will discuss any novel or new solutions presented in the field. Overall, this paper aims to provide a comprehensive overview of the methods and techniques used in AKD and the impact they have on different domains.
    Max-Quantile Grouped Infinite-Arm Bandits. (arXiv:2210.01295v2 [stat.ML] UPDATED)
    In this paper, we consider a bandit problem in which there are a number of groups each consisting of infinitely many arms. Whenever a new arm is requested from a given group, its mean reward is drawn from an unknown reservoir distribution (different for each group), and the uncertainty in the arm's mean reward can only be reduced via subsequent pulls of the arm. The goal is to identify the infinite-arm group whose reservoir distribution has the highest $(1-\alpha)$-quantile (e.g., median if $\alpha = \frac{1}{2}$), using as few total arm pulls as possible. We introduce a two-step algorithm that first requests a fixed number of arms from each group and then runs a finite-arm grouped max-quantile bandit algorithm. We characterize both the instance-dependent and worst-case regret, and provide a matching lower bound for the latter, while discussing various strengths, weaknesses, algorithmic improvements, and potential lower bounds associated with our instance-dependent upper bounds.
    Differentially Private Natural Language Models: Recent Advances and Future Directions. (arXiv:2301.09112v1 [cs.CL])
    Recent developments in deep learning have led to great success in various natural language processing (NLP) tasks. However, these applications may involve data that contain sensitive information. Therefore, how to achieve good performance while also protect privacy of sensitive data is a crucial challenge in NLP. To preserve privacy, Differential Privacy (DP), which can prevent reconstruction attacks and protect against potential side knowledge, is becoming a de facto technique for private data analysis. In recent years, NLP in DP models (DP-NLP) has been studied from different perspectives, which deserves a comprehensive review. In this paper, we provide the first systematic review of recent advances on DP deep learning models in NLP. In particular, we first discuss some differences and additional challenges of DP-NLP compared with the standard DP deep learning. Then we investigate some existing work on DP-NLP and present its recent developments from two aspects: gradient perturbation based methods and embedding vector perturbation based methods. We also discuss some challenges and future directions of this topic.
    Learning to Reject with a Fixed Predictor: Application to Decontextualization. (arXiv:2301.09044v1 [cs.LG])
    We study the problem of classification with a reject option for a fixed predictor, applicable in natural language processing. \ignore{where many correct labels are often possible} We introduce a new problem formulation for this scenario, and an algorithm minimizing a new surrogate loss function. We provide a complete theoretical analysis of the surrogate loss function with a strong $H$-consistency guarantee. For evaluation, we choose the \textit{decontextualization} task, and provide a manually-labelled dataset of $2\mathord,000$ examples. Our algorithm significantly outperforms the baselines considered, with a $\sim\!\!25\%$ improvement in coverage when halving the error rate, which is only $\sim\!\! 3 \%$ away from the theoretical limit.
    Provable Unrestricted Adversarial Training without Compromise with Generalizability. (arXiv:2301.09069v1 [cs.LG])
    Adversarial training (AT) is widely considered as the most promising strategy to defend against adversarial attacks and has drawn increasing interest from researchers. However, the existing AT methods still suffer from two challenges. First, they are unable to handle unrestricted adversarial examples (UAEs), which are built from scratch, as opposed to restricted adversarial examples (RAEs), which are created by adding perturbations bound by an $l_p$ norm to observed examples. Second, the existing AT methods often achieve adversarial robustness at the expense of standard generalizability (i.e., the accuracy on natural examples) because they make a tradeoff between them. To overcome these challenges, we propose a unique viewpoint that understands UAEs as imperceptibly perturbed unobserved examples. Also, we find that the tradeoff results from the separation of the distributions of adversarial examples and natural examples. Based on these ideas, we propose a novel AT approach called Provable Unrestricted Adversarial Training (PUAT), which can provide a target classifier with comprehensive adversarial robustness against both UAE and RAE, and simultaneously improve its standard generalizability. Particularly, PUAT utilizes partially labeled data to achieve effective UAE generation by accurately capturing the natural data distribution through a novel augmented triple-GAN. At the same time, PUAT extends the traditional AT by introducing the supervised loss of the target classifier into the adversarial loss and achieves the alignment between the UAE distribution, the natural data distribution, and the distribution learned by the classifier, with the collaboration of the augmented triple-GAN. Finally, the solid theoretical analysis and extensive experiments conducted on widely-used benchmarks demonstrate the superiority of PUAT.
    Relaxed Models for Adversarial Streaming: The Advice Model and the Bounded Interruptions Model. (arXiv:2301.09203v1 [cs.DS])
    Streaming algorithms are typically analyzed in the oblivious setting, where we assume that the input stream is fixed in advance. Recently, there is a growing interest in designing adversarially robust streaming algorithms that must maintain utility even when the input stream is chosen adaptively and adversarially as the execution progresses. While several fascinating results are known for the adversarial setting, in general, it comes at a very high cost in terms of the required space. Motivated by this, in this work we set out to explore intermediate models that allow us to interpolate between the oblivious and the adversarial models. Specifically, we put forward the following two models: (1) *The advice model*, in which the streaming algorithm may occasionally ask for one bit of advice. (2) *The bounded interruptions model*, in which we assume that the adversary is only partially adaptive. We present both positive and negative results for each of these two models. In particular, we present generic reductions from each of these models to the oblivious model. This allows us to design robust algorithms with significantly improved space complexity compared to what is known in the plain adversarial model.
    Congested Bandits: Optimal Routing via Short-term Resets. (arXiv:2301.09251v1 [cs.LG])
    For traffic routing platforms, the choice of which route to recommend to a user depends on the congestion on these routes -- indeed, an individual's utility depends on the number of people using the recommended route at that instance. Motivated by this, we introduce the problem of Congested Bandits where each arm's reward is allowed to depend on the number of times it was played in the past $\Delta$ timesteps. This dependence on past history of actions leads to a dynamical system where an algorithm's present choices also affect its future pay-offs, and requires an algorithm to plan for this. We study the congestion aware formulation in the multi-armed bandit (MAB) setup and in the contextual bandit setup with linear rewards. For the multi-armed setup, we propose a UCB style algorithm and show that its policy regret scales as $\tilde{O}(\sqrt{K \Delta T})$. For the linear contextual bandit setup, our algorithm, based on an iterative least squares planner, achieves policy regret $\tilde{O}(\sqrt{dT} + \Delta)$. From an experimental standpoint, we corroborate the no-regret properties of our algorithms via a simulation study.
    Energy Prediction using Federated Learning. (arXiv:2301.09165v1 [cs.LG])
    In this work, we demonstrate the viability of using federated learning to successfully predict energy consumption as well as solar production for all households within a certain network using low-power and low-space consuming embedded devices. We also demonstrate our prediction performance improving over time without the need for sharing private consumer energy data. We simulate a system with four nodes using data for one year to show this.
    Doubly Adversarial Federated Bandits. (arXiv:2301.09223v1 [stat.ML])
    We study a new non-stochastic federated multi-armed bandit problem with multiple agents collaborating via a communication network. The losses of the arms are assigned by an oblivious adversary that specifies the loss of each arm not only for each time step but also for each agent, which we call ``doubly adversarial". In this setting, different agents may choose the same arm in the same time step but observe different feedback. The goal of each agent is to find a globally best arm in hindsight that has the lowest cumulative loss averaged over all agents, which necessities the communication among agents. We provide regret lower bounds for any federated bandit algorithm under different settings, when agents have access to full-information feedback, or the bandit feedback. For the bandit feedback setting, we propose a near-optimal federated bandit algorithm called FEDEXP3. Our algorithm gives a positive answer to an open question proposed in Cesa-Bianchi et al. (2016): FEDEXP3 can guarantee a sub-linear regret without exchanging sequences of selected arm identities or loss sequences among agents. We also provide numerical evaluations of our algorithm to validate our theoretical results and demonstrate its effectiveness on synthetic and real-world datasets
    Deterministic Online Classification: Non-iteratively Reweighted Recursive Least-Squares for Binary Class Rebalancing. (arXiv:2301.09230v1 [cs.LG])
    Deterministic solutions are becoming more critical for interpretability. Weighted Least-Squares (WLS) has been widely used as a deterministic batch solution with a specific weight design. In the online settings of WLS, exact reweighting is necessary to converge to its batch settings. In order to comply with its necessity, the iteratively reweighted least-squares algorithm is mainly utilized with a linearly growing time complexity which is not attractive for online learning. Due to the high and growing computational costs, an efficient online formulation of reweighted least-squares is desired. We introduce a new deterministic online classification algorithm of WLS with a constant time complexity for binary class rebalancing. We demonstrate that our proposed online formulation exactly converges to its batch formulation and outperforms existing state-of-the-art stochastic online binary classification algorithms in real-world data sets empirically.
    Towards NeuroAI: Introducing Neuronal Diversity into Artificial Neural Networks. (arXiv:2301.09245v1 [cs.NE])
    Throughout history, the development of artificial intelligence, particularly artificial neural networks, has been open to and constantly inspired by the increasingly deepened understanding of the brain, such as the inspiration of neocognitron, which is the pioneering work of convolutional neural networks. Per the motives of the emerging field: NeuroAI, a great amount of neuroscience knowledge can help catalyze the next generation of AI by endowing a network with more powerful capabilities. As we know, the human brain has numerous morphologically and functionally different neurons, while artificial neural networks are almost exclusively built on a single neuron type. In the human brain, neuronal diversity is an enabling factor for all kinds of biological intelligent behaviors. Since an artificial network is a miniature of the human brain, introducing neuronal diversity should be valuable in terms of addressing those essential problems of artificial networks such as efficiency, interpretability, and memory. In this Primer, we first discuss the preliminaries of biological neuronal diversity and the characteristics of information transmission and processing in a biological neuron. Then, we review studies of designing new neurons for artificial networks. Next, we discuss what gains can neuronal diversity bring into artificial networks and exemplary applications in several important fields. Lastly, we discuss the challenges and future directions of neuronal diversity to explore the potential of NeuroAI.
    MEMO : Accelerating Transformers with Memoization on Big Memory Systems. (arXiv:2301.09262v1 [cs.PF])
    Transformers gain popularity because of their superior prediction accuracy and inference throughput. However, the transformer is computation-intensive, causing a long inference time. The existing work to accelerate transformer inferences has limitations because of the changes to transformer architectures or the need for specialized hardware. In this paper, we identify the opportunities of using memoization to accelerate the attention mechanism in transformers without the above limitation. Built upon a unique observation that there is a rich similarity in attention computation across inference sequences, we build an attention database upon the emerging big memory system. We introduce the embedding technique to find semantically similar inputs to identify computation similarity. We also introduce a series of techniques such as memory mapping and selective memoization to avoid memory copy and unnecessary overhead. We enable 21% performance improvement on average (up to 68%) with the TB-scale attention database and with ignorable loss in inference accuracy.
    Practical Adversarial Attacks Against AI-Driven Power Allocation in a Distributed MIMO Network. (arXiv:2301.09305v1 [eess.SP])
    In distributed multiple-input multiple-output (D-MIMO) networks, power control is crucial to optimize the spectral efficiencies of users and max-min fairness (MMF) power control is a commonly used strategy as it satisfies uniform quality-of-service to all users. The optimal solution of MMF power control requires high complexity operations and hence deep neural network based artificial intelligence (AI) solutions are proposed to decrease the complexity. Although quite accurate models can be achieved by using AI, these models have some intrinsic vulnerabilities against adversarial attacks where carefully crafted perturbations are applied to the input of the AI model. In this work, we show that threats against the target AI model which might be originated from malicious users or radio units can substantially decrease the network performance by applying a successful adversarial sample, even in the most constrained circumstances. We also demonstrate that the risk associated with these kinds of adversarial attacks is higher than the conventional attack threats. Detailed simulations reveal the effectiveness of adversarial attacks and the necessity of smart defense techniques.
    BallGAN: 3D-aware Image Synthesis with a Spherical Background. (arXiv:2301.09091v1 [cs.CV])
    3D-aware GANs aim to synthesize realistic 3D scenes such that they can be rendered in arbitrary perspectives to produce images. Although previous methods produce realistic images, they suffer from unstable training or degenerate solutions where the 3D geometry is unnatural. We hypothesize that the 3D geometry is underdetermined due to the insufficient constraint, i.e., being classified as real image to the discriminator is not enough. To solve this problem, we propose to approximate the background as a spherical surface and represent a scene as a union of the foreground placed in the sphere and the thin spherical background. It reduces the degree of freedom in the background field. Accordingly, we modify the volume rendering equation and incorporate dedicated constraints to design a novel 3D-aware GAN framework named BallGAN. BallGAN has multiple advantages as follows. 1) It produces more reasonable 3D geometry; the images of a scene across different viewpoints have better photometric consistency and fidelity than the state-of-the-art methods. 2) The training becomes much more stable. 3) The foreground can be separately rendered on top of different arbitrary backgrounds.
    Learning to Linearize Deep Neural Networks for Secure and Efficient Private Inference. (arXiv:2301.09254v1 [cs.CV])
    The large number of ReLU non-linearity operations in existing deep neural networks makes them ill-suited for latency-efficient private inference (PI). Existing techniques to reduce ReLU operations often involve manual effort and sacrifice significant accuracy. In this paper, we first present a novel measure of non-linearity layers' ReLU sensitivity, enabling mitigation of the time-consuming manual efforts in identifying the same. Based on this sensitivity, we then present SENet, a three-stage training method that for a given ReLU budget, automatically assigns per-layer ReLU counts, decides the ReLU locations for each layer's activation map, and trains a model with significantly fewer ReLUs to potentially yield latency and communication efficient PI. Experimental evaluations with multiple models on various datasets show SENet's superior performance both in terms of reduced ReLUs and improved classification accuracy compared to existing alternatives. In particular, SENet can yield models that require up to ~2x fewer ReLUs while yielding similar accuracy. For a similar ReLU budget SENet can yield models with ~2.32% improved classification accuracy, evaluated on CIFAR-100.
    Learning in Congestion Games with Bandit Feedback. (arXiv:2206.01880v3 [cs.GT] UPDATED)
    In this paper, we investigate Nash-regret minimization in congestion games, a class of games with benign theoretical structure and broad real-world applications. We first propose a centralized algorithm based on the optimism in the face of uncertainty principle for congestion games with (semi-)bandit feedback, and obtain finite-sample guarantees. Then we propose a decentralized algorithm via a novel combination of the Frank-Wolfe method and G-optimal design. By exploiting the structure of the congestion game, we show the sample complexity of both algorithms depends only polynomially on the number of players and the number of facilities, but not the size of the action set, which can be exponentially large in terms of the number of facilities. We further define a new problem class, Markov congestion games, which allows us to model the non-stationarity in congestion games. We propose a centralized algorithm for Markov congestion games, whose sample complexity again has only polynomial dependence on all relevant problem parameters, but not the size of the action set.
    Condition monitoring and anomaly detection in cyber-physical systems. (arXiv:2301.09030v1 [cs.LG])
    The modern industrial environment is equipping myriads of smart manufacturing machines where the state of each device can be monitored continuously. Such monitoring can help identify possible future failures and develop a cost-effective maintenance plan. However, it is a daunting task to perform early detection with low false positives and negatives from the huge volume of collected data. This requires developing a holistic machine learning framework to address the issues in condition monitoring of high-priority components and develop efficient techniques to detect anomalies that can detect and possibly localize the faulty components. This paper presents a comparative analysis of recent machine learning approaches for robust, cost-effective anomaly detection in cyber-physical systems. While detection has been extensively studied, very few researchers have analyzed the localization of the anomalies. We show that supervised learning outperforms unsupervised algorithms. For supervised cases, we achieve near-perfect accuracy of 98 percent (specifically for tree-based algorithms). In contrast, the best-case accuracy in the unsupervised cases was 63 percent :the area under the receiver operating characteristic curve (AUC) exhibits similar outcomes as an additional metric.
    Combined Use of Federated Learning and Image Encryption for Privacy-Preserving Image Classification with Vision Transformer. (arXiv:2301.09255v1 [cs.CV])
    In recent years, privacy-preserving methods for deep learning have become an urgent problem. Accordingly, we propose the combined use of federated learning (FL) and encrypted images for privacy-preserving image classification under the use of the vision transformer (ViT). The proposed method allows us not only to train models over multiple participants without directly sharing their raw data but to also protect the privacy of test (query) images for the first time. In addition, it can also maintain the same accuracy as normally trained models. In an experiment, the proposed method was demonstrated to well work without any performance degradation on the CIFAR-10 and CIFAR-100 datasets.
    Statistically Optimal Robust Mean and Covariance Estimation for Anisotropic Gaussians. (arXiv:2301.09024v1 [math.ST])
    Assume that $X_{1}, \ldots, X_{N}$ is an $\varepsilon$-contaminated sample of $N$ independent Gaussian vectors in $\mathbb{R}^d$ with mean $\mu$ and covariance $\Sigma$. In the strong $\varepsilon$-contamination model we assume that the adversary replaced an $\varepsilon$ fraction of vectors in the original Gaussian sample by any other vectors. We show that there is an estimator $\widehat \mu$ of the mean satisfying, with probability at least $1 - \delta$, a bound of the form \[ \|\widehat{\mu} - \mu\|_2 \le c\left(\sqrt{\frac{\operatorname{Tr}(\Sigma)}{N}} + \sqrt{\frac{\|\Sigma\|\log(1/\delta)}{N}} + \varepsilon\sqrt{\|\Sigma\|}\right), \] where $c > 0$ is an absolute constant and $\|\Sigma\|$ denotes the operator norm of $\Sigma$. In the same contaminated Gaussian setup, we construct an estimator $\widehat \Sigma$ of the covariance matrix $\Sigma$ that satisfies, with probability at least $1 - \delta$, \[ \left\|\widehat{\Sigma} - \Sigma\right\| \le c\left(\sqrt{\frac{\|\Sigma\|\operatorname{Tr}(\Sigma)}{N}} + \|\Sigma\|\sqrt{\frac{\log(1/\delta)}{N}} + \varepsilon\|\Sigma\|\right). \] Both results are optimal up to multiplicative constant factors. Despite the recent significant interest in robust statistics, achieving both dimension-free bounds in the canonical Gaussian case remained open. In fact, several previously known results were either dimension-dependent and required $\Sigma$ to be close to identity, or had a sub-optimal dependence on the contamination level $\varepsilon$. As a part of the analysis, we derive sharp concentration inequalities for central order statistics of Gaussian, folded normal, and chi-squared distributions.
    Self Reward Design with Fine-grained Interpretability. (arXiv:2112.15034v3 [cs.LG] UPDATED)
    The black-box nature of deep neural networks (DNN) has brought to attention the issues of transparency and fairness. Deep Reinforcement Learning (Deep RL or DRL), which uses DNN to learn its policy, value functions etc, is thus also subject to similar concerns. This paper proposes a way to circumvent the issues through the bottom-up design of neural networks with detailed interpretability, where each neuron or layer has its own meaning and utility that corresponds to humanly understandable concept. The framework introduced in this paper is called the Self Reward Design (SRD), inspired by the Inverse Reward Design, and this interpretable design can (1) solve the problem by pure design (although imperfectly) and (2) be optimized like a standard DNN. With deliberate human designs, we show that some RL problems such as lavaland and MuJoCo can be solved using a model constructed with standard NN components with few parameters. Furthermore, with our fish sale auction example, we demonstrate how SRD is used to address situations that will not make sense if black-box models are used, where humanly-understandable semantic-based decision is required.
    Unifying Synergies between Self-supervised Learning and Dynamic Computation. (arXiv:2301.09164v1 [cs.LG])
    Self-supervised learning (SSL) approaches have made major strides forward by emulating the performance of their supervised counterparts on several computer vision benchmarks. This, however, comes at a cost of substantially larger model sizes, and computationally expensive training strategies, which eventually lead to larger inference times making it impractical for resource constrained industrial settings. Techniques like knowledge distillation (KD), dynamic computation (DC), and pruning are often used to obtain a lightweight sub-network, which usually involves multiple epochs of fine-tuning of a large pre-trained model, making it more computationally challenging. In this work we propose a novel perspective on the interplay between SSL and DC paradigms that can be leveraged to simultaneously learn a dense and gated (sparse/lightweight) sub-network from scratch offering a good accuracy-efficiency trade-off, and therefore yielding a generic and multi-purpose architecture for application specific industrial settings. Our study overall conveys a constructive message: exhaustive experiments on several image classification benchmarks: CIFAR-10, STL-10, CIFAR-100, and ImageNet-100, demonstrates that the proposed training strategy provides a dense and corresponding sparse sub-network that achieves comparable (on-par) performance compared with the vanilla self-supervised setting, but at a significant reduction in computation in terms of FLOPs under a range of target budgets.
    Raw or Cooked? Object Detection on RAW Images. (arXiv:2301.08965v1 [cs.CV])
    Images fed to a deep neural network have in general undergone several handcrafted image signal processing (ISP) operations, all of which have been optimized to produce visually pleasing images. In this work, we investigate the hypothesis that the intermediate representation of visually pleasing images is sub-optimal for downstream computer vision tasks compared to the RAW image representation. We suggest that the operations of the ISP instead should be optimized towards the end task, by learning the parameters of the operations jointly during training. We extend previous works on this topic and propose a new learnable operation that enables an object detector to achieve superior performance when compared to both previous works and traditional RGB images. In experiments on the open PASCALRAW dataset, we empirically confirm our hypothesis.
    Self-Supervised Image Representation Learning: Transcending Masking with Paired Image Overlay. (arXiv:2301.09299v1 [cs.CV])
    Self-supervised learning has become a popular approach in recent years for its ability to learn meaningful representations without the need for data annotation. This paper proposes a novel image augmentation technique, overlaying images, which has not been widely applied in self-supervised learning. This method is designed to provide better guidance for the model to understand underlying information, resulting in more useful representations. The proposed method is evaluated using contrastive learning, a widely used self-supervised learning method that has shown solid performance in downstream tasks. The results demonstrate the effectiveness of the proposed augmentation technique in improving the performance of self-supervised models.
    Improving Deep Neural Network Classification Confidence using Heatmap-based eXplainable AI. (arXiv:2201.00009v3 [cs.LG] UPDATED)
    This paper quantifies the quality of heatmap-based eXplainable AI (XAI) methods w.r.t image classification problem. Here, a heatmap is considered desirable if it improves the probability of predicting the correct classes. Different XAI heatmap-based methods are empirically shown to improve classification confidence to different extents depending on the datasets, e.g. Saliency works best on ImageNet and Deconvolution on Chest X-Ray Pneumonia dataset. The novelty includes a new gap distribution that shows a stark difference between correct and wrong predictions. Finally, the generative augmentative explanation is introduced, a method to generate heatmaps capable of improving predictive confidence to a high level.
    Adapting a Language Model While Preserving its General Knowledge. (arXiv:2301.08986v1 [cs.CL])
    Domain-adaptive pre-training (or DA-training for short), also known as post-training, aims to train a pre-trained general-purpose language model (LM) using an unlabeled corpus of a particular domain to adapt the LM so that end-tasks in the domain can give improved performances. However, existing DA-training methods are in some sense blind as they do not explicitly identify what knowledge in the LM should be preserved and what should be changed by the domain corpus. This paper shows that the existing methods are suboptimal and proposes a novel method to perform a more informed adaptation of the knowledge in the LM by (1) soft-masking the attention heads based on their importance to best preserve the general knowledge in the LM and (2) contrasting the representations of the general and the full (both general and domain knowledge) to learn an integrated representation with both general and domain-specific knowledge. Experimental results will demonstrate the effectiveness of the proposed approach.
    The Shape of Explanations: A Topological Account of Rule-Based Explanations in Machine Learning. (arXiv:2301.09042v1 [cs.LG])
    Rule-based explanations provide simple reasons explaining the behavior of machine learning classifiers at given points in the feature space. Several recent methods (Anchors, LORE, etc.) purport to generate rule-based explanations for arbitrary or black-box classifiers. But what makes these methods work in general? We introduce a topological framework for rule-based explanation methods and provide a characterization of explainability in terms of the definability of a classifier relative to an explanation scheme. We employ this framework to consider various explanation schemes and argue that the preferred scheme depends on how much the user knows about the domain and the probability measure over the feature space.
    Efficient Training Under Limited Resources. (arXiv:2301.09264v1 [cs.LG])
    Training time budget and size of the dataset are among the factors affecting the performance of a Deep Neural Network (DNN). This paper shows that Neural Architecture Search (NAS), Hyper Parameters Optimization (HPO), and Data Augmentation help DNNs perform much better while these two factors are limited. However, searching for an optimal architecture and the best hyperparameter values besides a good combination of data augmentation techniques under low resources requires many experiments. We present our approach to achieving such a goal in three steps: reducing training epoch time by compressing the model while maintaining the performance compared to the original model, preventing model overfitting when the dataset is small, and performing the hyperparameter tuning. We used NOMAD, which is a blackbox optimization software based on a derivative-free algorithm to do NAS and HPO. Our work achieved an accuracy of 86.0 % on a tiny subset of Mini-ImageNet at the ICLR 2021 Hardware Aware Efficient Training (HAET) Challenge and won second place in the competition. The competition results can be found at haet2021.github.io/challenge and our source code can be found at github.com/DouniaLakhmiri/ICLR\_HAET2021.
    Pre-text Representation Transfer for Deep Learning with Limited Imbalanced Data : Application to CT-based COVID-19 Detection. (arXiv:2301.08888v1 [eess.IV])
    Annotating medical images for disease detection is often tedious and expensive. Moreover, the available training samples for a given task are generally scarce and imbalanced. These conditions are not conducive for learning effective deep neural models. Hence, it is common to 'transfer' neural networks trained on natural images to the medical image domain. However, this paradigm lacks in performance due to the large domain gap between the natural and medical image data. To address that, we propose a novel concept of Pre-text Representation Transfer (PRT). In contrast to the conventional transfer learning, which fine-tunes a source model after replacing its classification layers, PRT retains the original classification layers and updates the representation layers through an unsupervised pre-text task. The task is performed with (original, not synthetic) medical images, without utilizing any annotations. This enables representation transfer with a large amount of training data. This high-fidelity representation transfer allows us to use the resulting model as a more effective feature extractor. Moreover, we can also subsequently perform the traditional transfer learning with this model. We devise a collaborative representation based classification layer for the case when we leverage the model as a feature extractor. We fuse the output of this layer with the predictions of a model induced with the traditional transfer learning performed over our pre-text transferred model. The utility of our technique for limited and imbalanced data classification problem is demonstrated with an extensive five-fold evaluation for three large-scale models, tested for five different class-imbalance ratios for CT based COVID-19 detection. Our results show a consistent gain over the conventional transfer learning with the proposed method.
    Spatial Attention Kinetic Networks with E(n)-Equivariance. (arXiv:2301.08893v1 [cs.LG])
    Neural networks that are equivariant to rotations, translations, reflections, and permutations on n-dimensional geometric space have shown promise in physical modeling for tasks such as accurately but inexpensively modeling complex potential energy surfaces to guiding the sampling of complex dynamical systems or forecasting their time evolution. Current state-of-the-art methods employ spherical harmonics to encode higher-order interactions among particles, which are computationally expensive. In this paper, we propose a simple alternative functional form that uses neurally parametrized linear combinations of edge vectors to achieve equivariance while still universally approximating node environments. Incorporating this insight, we design spatial attention kinetic networks with E(n)-equivariance, or SAKE, which are competitive in many-body system modeling tasks while being significantly faster.
    The Best of Both Worlds: Accurate Global and Personalized Models through Federated Learning with Data-Free Hyper-Knowledge Distillation. (arXiv:2301.08968v1 [cs.LG])
    Heterogeneity of data distributed across clients limits the performance of global models trained through federated learning, especially in the settings with highly imbalanced class distributions of local datasets. In recent years, personalized federated learning (pFL) has emerged as a potential solution to the challenges presented by heterogeneous data. However, existing pFL methods typically enhance performance of local models at the expense of the global model's accuracy. We propose FedHKD (Federated Hyper-Knowledge Distillation), a novel FL algorithm in which clients rely on knowledge distillation (KD) to train local models. In particular, each client extracts and sends to the server the means of local data representations and the corresponding soft predictions -- information that we refer to as ``hyper-knowledge". The server aggregates this information and broadcasts it to the clients in support of local training. Notably, unlike other KD-based pFL methods, FedHKD does not rely on a public dataset nor it deploys a generative model at the server. We analyze convergence of FedHKD and conduct extensive experiments on visual datasets in a variety of scenarios, demonstrating that FedHKD provides significant improvement in both personalized as well as global model performance compared to state-of-the-art FL methods designed for heterogeneous data settings.
    The Conditional Cauchy-Schwarz Divergence with Applications to Time-Series Data and Sequential Decision Making. (arXiv:2301.08970v1 [cs.LG])
    The Cauchy-Schwarz (CS) divergence was developed by Pr\'{i}ncipe et al. in 2000. In this paper, we extend the classic CS divergence to quantify the closeness between two conditional distributions and show that the developed conditional CS divergence can be simply estimated by a kernel density estimator from given samples. We illustrate the advantages (e.g., the rigorous faithfulness guarantee, the lower computational complexity, the higher statistical power, and the much more flexibility in a wide range of applications) of our conditional CS divergence over previous proposals, such as the conditional KL divergence and the conditional maximum mean discrepancy. We also demonstrate the compelling performance of conditional CS divergence in two machine learning tasks related to time series data and sequential inference, namely the time series clustering and the uncertainty-guided exploration for sequential decision making.
    Classification of Luminal Subtypes in Full Mammogram Images Using Transfer Learning. (arXiv:2301.09282v1 [eess.IV])
    Automatic identification of patients with luminal and non-luminal subtypes during a routine mammography screening can support clinicians in streamlining breast cancer therapy planning. Recent machine learning techniques have shown promising results in molecular subtype classification in mammography; however, they are highly dependent on pixel-level annotations, handcrafted, and radiomic features. In this work, we provide initial insights into the luminal subtype classification in full mammogram images trained using only image-level labels. Transfer learning is applied from a breast abnormality classification task, to finetune a ResNet-18-based luminal versus non-luminal subtype classification task. We present and compare our results on the publicly available CMMD dataset and show that our approach significantly outperforms the baseline classifier by achieving a mean AUC score of 0.6688 and a mean F1 score of 0.6693 on the test dataset. The improvement over baseline is statistically significant, with a p-value of p<0.0001.
    A Semantic Modular Framework for Events Topic Modeling in Social Media. (arXiv:2301.09009v1 [cs.LG])
    The advancement of social media contributes to the growing amount of content they share frequently. This framework provides a sophisticated place for people to report various real-life events. Detecting these events with the help of natural language processing has received researchers' attention, and various algorithms have been developed for this goal. In this paper, we propose a Semantic Modular Model (SMM) consisting of 5 different modules, namely Distributional Denoising Autoencoder, Incremental Clustering, Semantic Denoising, Defragmentation, and Ranking and Processing. The proposed model aims to (1) cluster various documents and ignore the documents that might not contribute to the identification of events, (2) identify more important and descriptive keywords. Compared to the state-of-the-art methods, the results show that the proposed model has a higher performance in identifying events with lower ranks and extracting keywords for more important events in three English Twitter datasets: FACup, SuperTuesday, and USElection. The proposed method outperformed the best reported results in the mean keyword-precision metric by 7.9\%.
    Leveraging Speaker Embeddings with Adversarial Multi-task Learning for Age Group Classification. (arXiv:2301.09058v1 [eess.AS])
    Recently, researchers have utilized neural network-based speaker embedding techniques in speaker-recognition tasks to identify speakers accurately. However, speaker-discriminative embeddings do not always represent speech features such as age group well. In an embedding model that has been highly trained to capture speaker traits, the task of age group classification is closer to speech information leakage. Hence, to improve age group classification performance, we consider the use of speaker-discriminative embeddings derived from adversarial multi-task learning to align features and reduce the domain discrepancy in age subgroups. In addition, we investigated different types of speaker embeddings to learn and generalize the domain-invariant representations for age groups. Experimental results on the VoxCeleb Enrichment dataset verify the effectiveness of our proposed adaptive adversarial network in multi-objective scenarios and leveraging speaker embeddings for the domain adaptation task.
    Design-based individual prediction. (arXiv:2301.09117v1 [stat.ML])
    A design-based individual prediction approach is developed based on the expected cross-validation results, given the sampling design and the sample-splitting design for cross-validation. Whether the predictor is selected from an ensemble of models or a weighted average of them, valid inference of the unobserved prediction errors is defined and obtained with respect to the sampling design, while outcomes and features are treated as constants.
    Is Nash Equilibrium Approximator Learnable?. (arXiv:2108.07472v5 [cs.GT] UPDATED)
    In this paper, we investigate the learnability of the function approximator that approximates Nash equilibrium (NE) for games generated from a distribution. First, we offer a generalization bound using the Probably Approximately Correct (PAC) learning model. The bound describes the gap between the expected loss and empirical loss of the NE approximator. Afterward, we prove the agnostic PAC learnability of the Nash approximator. In addition to theoretical analysis, we demonstrate an application of NE approximator in experiments. The trained NE approximator can be used to warm-start and accelerate classical NE solvers. Together, our results show the practicability of approximating NE through function approximation.
    Regeneration Learning: A Learning Paradigm for Data Generation. (arXiv:2301.08846v1 [cs.LG])
    Machine learning methods for conditional data generation usually build a mapping from source conditional data X to target data Y. The target Y (e.g., text, speech, music, image, video) is usually high-dimensional and complex, and contains information that does not exist in source data, which hinders effective and efficient learning on the source-target mapping. In this paper, we present a learning paradigm called regeneration learning for data generation, which first generates Y' (an abstraction/representation of Y) from X and then generates Y from Y'. During training, Y' is obtained from Y through either handcrafted rules or self-supervised learning and is used to learn X-->Y' and Y'-->Y. Regeneration learning extends the concept of representation learning to data generation tasks, and can be regarded as a counterpart of traditional representation learning, since 1) regeneration learning handles the abstraction (Y') of the target data Y for data generation while traditional representation learning handles the abstraction (X') of source data X for data understanding; 2) both the processes of Y'-->Y in regeneration learning and X-->X' in representation learning can be learned in a self-supervised way (e.g., pre-training); 3) both the mappings from X to Y' in regeneration learning and from X' to Y in representation learning are simpler than the direct mapping from X to Y. We show that regeneration learning can be a widely-used paradigm for data generation (e.g., text generation, speech recognition, speech synthesis, music composition, image generation, and video generation) and can provide valuable insights into developing data generation methods.
    Debiasing the Cloze Task in Sequential Recommendation with Bidirectional Transformers. (arXiv:2301.09210v1 [cs.LG])
    Bidirectional Transformer architectures are state-of-the-art sequential recommendation models that use a bi-directional representation capacity based on the Cloze task, a.k.a. Masked Language Modeling. The latter aims to predict randomly masked items within the sequence. Because they assume that the true interacted item is the most relevant one, an exposure bias results, where non-interacted items with low exposure propensities are assumed to be irrelevant. The most common approach to mitigating exposure bias in recommendation has been Inverse Propensity Scoring (IPS), which consists of down-weighting the interacted predictions in the loss function in proportion to their propensities of exposure, yielding a theoretically unbiased learning. In this work, we argue and prove that IPS does not extend to sequential recommendation because it fails to account for the temporal nature of the problem. We then propose a novel propensity scoring mechanism, which can theoretically debias the Cloze task in sequential recommendation. Finally we empirically demonstrate the debiasing capabilities of our proposed approach and its robustness to the severity of exposure bias.
    Probabilistic Surrogate Networks for Simulators with Unbounded Randomness. (arXiv:1910.11950v3 [cs.LG] UPDATED)
    We present a framework for automatically structuring and training fast, approximate, deep neural surrogates of stochastic simulators. Unlike traditional approaches to surrogate modeling, our surrogates retain the interpretable structure and control flow of the reference simulator. Our surrogates target stochastic simulators where the number of random variables itself can be stochastic and potentially unbounded. Our framework further enables an automatic replacement of the reference simulator with the surrogate when undertaking amortized inference. The fidelity and speed of our surrogates allow for both faster stochastic simulation and accurate and substantially faster posterior inference. Using an illustrative yet non-trivial example we show our surrogates' ability to accurately model a probabilistic program with an unbounded number of random variables. We then proceed with an example that shows our surrogates are able to accurately model a complex structure like an unbounded stack in a program synthesis example. We further demonstrate how our surrogate modeling technique makes amortized inference in complex black-box simulators an order of magnitude faster. Specifically, we do simulator-based materials quality testing, inferring safety-critical latent internal temperature profiles of composite materials undergoing curing.
    DeepFEL: Deep Fastfood Ensemble Learning for Histopathology Image Analysis. (arXiv:2301.09525v1 [eess.IV])
    Computational pathology tasks have some unique characterises such as multi-gigapixel images, tedious and frequently uncertain annotations, and unavailability of large number of cases [13]. To address some of these issues, we present Deep Fastfood Ensembles - a simple, fast and yet effective method for combining deep features pooled from popular CNN models pre-trained on totally different source domains (e.g., natural image objects) and projected onto diverse dimensions using random projections, the so-called Fastfood [11]. The final ensemble output is obtained by a consensus of simple individual classifiers, each of which is trained on a different collection of random basis vectors. This offers extremely fast and yet effective solution, especially when training times and domain labels are of the essence. We demonstrate the effectiveness of the proposed deep fastfood ensemble learning as compared to the state-of-the-art methods for three different tasks in histopathology image analysis.
    Accelerating Fair Federated Learning: Adaptive Federated Adam. (arXiv:2301.09357v1 [cs.LG])
    Federated learning is a distributed and privacy-preserving approach to train a statistical model collaboratively from decentralized data of different parties. However, when datasets of participants are not independent and identically distributed (non-IID), models trained by naive federated algorithms may be biased towards certain participants, and model performance across participants is non-uniform. This is known as the fairness problem in federated learning. In this paper, we formulate fairness-controlled federated learning as a dynamical multi-objective optimization problem to ensure fair performance across all participants. To solve the problem efficiently, we study the convergence and bias of Adam as the server optimizer in federated learning, and propose Adaptive Federated Adam (AdaFedAdam) to accelerate fair federated learning with alleviated bias. We validated the effectiveness, Pareto optimality and robustness of AdaFedAdam in numerical experiments and show that AdaFedAdam outperforms existing algorithms, providing better convergence and fairness properties of the federated scheme.
    A Structural Approach to the Design of Domain Specific Neural Network Architectures. (arXiv:2301.09381v1 [cs.LG])
    This is a master's thesis concerning the theoretical ideas of geometric deep learning. Geometric deep learning aims to provide a structured characterization of neural network architectures, specifically focused on the ideas of invariance and equivariance of data with respect to given transformations. This thesis aims to provide a theoretical evaluation of geometric deep learning, compiling theoretical results that characterize the properties of invariant neural networks with respect to learning performance.
    HALOC: Hardware-Aware Automatic Low-Rank Compression for Compact Neural Networks. (arXiv:2301.09422v1 [cs.LG])
    Low-rank compression is an important model compression strategy for obtaining compact neural network models. In general, because the rank values directly determine the model complexity and model accuracy, proper selection of layer-wise rank is very critical and desired. To date, though many low-rank compression approaches, either selecting the ranks in a manual or automatic way, have been proposed, they suffer from costly manual trials or unsatisfied compression performance. In addition, all of the existing works are not designed in a hardware-aware way, limiting the practical performance of the compressed models on real-world hardware platforms. To address these challenges, in this paper we propose HALOC, a hardware-aware automatic low-rank compression framework. By interpreting automatic rank selection from an architecture search perspective, we develop an end-to-end solution to determine the suitable layer-wise ranks in a differentiable and hardware-aware way. We further propose design principles and mitigation strategy to efficiently explore the rank space and reduce the potential interference problem. Experimental results on different datasets and hardware platforms demonstrate the effectiveness of our proposed approach. On CIFAR-10 dataset, HALOC enables 0.07% and 0.38% accuracy increase over the uncompressed ResNet-20 and VGG-16 models with 72.20% and 86.44% fewer FLOPs, respectively. On ImageNet dataset, HALOC achieves 0.9% higher top-1 accuracy than the original ResNet-18 model with 66.16% fewer FLOPs. HALOC also shows 0.66% higher top-1 accuracy increase than the state-of-the-art automatic low-rank compression solution with fewer computational and memory costs. In addition, HALOC demonstrates the practical speedups on different hardware platforms, verified by the measurement results on desktop GPU, embedded GPU and ASIC accelerator.
    StockEmotions: Discover Investor Emotions for Financial Sentiment Analysis and Multivariate Time Series. (arXiv:2301.09279v1 [cs.CL])
    There has been growing interest in applying NLP techniques in the financial domain, however, resources are extremely limited. This paper introduces StockEmotions, a new dataset for detecting emotions in the stock market that consists of 10,000 English comments collected from StockTwits, a financial social media platform. Inspired by behavioral finance, it proposes 12 fine-grained emotion classes that span the roller coaster of investor emotion. Unlike existing financial sentiment datasets, StockEmotions presents granular features such as investor sentiment classes, fine-grained emotions, emojis, and time series data. To demonstrate the usability of the dataset, we perform a dataset analysis and conduct experimental downstream tasks. For financial sentiment/emotion classification tasks, DistilBERT outperforms other baselines, and for multivariate time series forecasting, a Temporal Attention LSTM model combining price index, text, and emotion features achieves the best performance than using a single feature.
    Logical Message Passing Networks with One-hop Inference on Atomic Formulas. (arXiv:2301.08859v1 [cs.LG])
    Complex Query Answering (CQA) over Knowledge Graphs (KGs) has attracted a lot of attention to potentially support many applications. Given that KGs are usually incomplete, neural models are proposed to answer logical queries by parameterizing set operators with complex neural networks. However, such methods usually train neural set operators with a large number of entity and relation embeddings from zero, where whether and how the embeddings or the neural set operators contribute to the performance remains not clear. In this paper, we propose a simple framework for complex query answering that decomposes the KG embeddings from neural set operators. We propose to represent the complex queries in the query graph. On top of the query graph, we propose the Logical Message Passing Neural Network (LMPNN) that connects the \textit{local} one-hop inferences on atomic formulas to the \textit{global} logical reasoning for complex query answering. We leverage existing effective KG embeddings to conduct one-hop inferences on atomic formulas, the results of which are regarded as the messages passed in LMPNN. The reasoning process over the overall logical formulas is turned into the forward pass of LMPNN that incrementally aggregates local information to predict the answers' embeddings finally. The complex logical inference across different types of queries will then be learned from training examples based on the LMPNN architecture. Theoretically, our query-graph representation is more general than the prevailing operator-tree formulation, so our approach applies to a broader range of complex KG queries. Empirically, our approach yields a new state-of-the-art neural CQA model. Our research bridges the gap between complex KG query answering tasks and the long-standing achievements of knowledge graph representation learning.  ( 2 min )
    Impact of PCA-based preprocessing and different CNN structures on deformable registration of sonograms. (arXiv:2301.08802v1 [cs.CV])
    Central venous catheters (CVC) are commonly inserted into the large veins of the neck, e.g. the internal jugular vein (IJV). CVC insertion may cause serious complications like misplacement into an artery or perforation of cervical vessels. Placing a CVC under sonographic guidance is an appropriate method to reduce such adverse events, if anatomical landmarks like venous and arterial vessels can be detected reliably. This task shall be solved by registration of patient individual images vs. an anatomically labelled reference image. In this work, a linear, affine transformation is performed on cervical sonograms, followed by a non-linear transformation to achieve a more precise registration. Voxelmorph (VM), a learning-based library for deformable image registration using a convolutional neural network (CNN) with U-Net structure was used for non-linear transformation. The impact of principal component analysis (PCA)-based pre-denoising of patient individual images, as well as the impact of modified net structures with differing complexities on registration results were examined visually and quantitatively, the latter using metrics for deformation and image similarity. Using the PCA-approximated cervical sonograms resulted in decreased mean deformation lengths between 18% and 66% compared to their original image counterparts, depending on net structure. In addition, reducing the number of convolutional layers led to improved image similarity with PCA images, while worsening in original images. Despite a large reduction of network parameters, no overall decrease in registration quality was observed, leading to the conclusion that the original net structure is oversized for the task at hand.  ( 2 min )
    Characterization and Learning of Causal Graphs with Small Conditioning Sets. (arXiv:2301.09028v1 [cs.AI])
    Constraint-based causal discovery algorithms learn part of the causal graph structure by systematically testing conditional independences observed in the data. These algorithms, such as the PC algorithm and its variants, rely on graphical characterizations of the so-called equivalence class of causal graphs proposed by Pearl. However, constraint-based causal discovery algorithms struggle when data is limited since conditional independence tests quickly lose their statistical power, especially when the conditioning set is large. To address this, we propose using conditional independence tests where the size of the conditioning set is upper bounded by some integer $k$ for robust causal discovery. The existing graphical characterizations of the equivalence classes of causal graphs are not applicable when we cannot leverage all the conditional independence statements. We first define the notion of $k$-Markov equivalence: Two causal graphs are $k$-Markov equivalent if they entail the same conditional independence constraints where the conditioning set size is upper bounded by $k$. We propose a novel representation that allows us to graphically characterize $k$-Markov equivalence between two causal graphs. We propose a sound constraint-based algorithm called the $k$-PC algorithm for learning this equivalence class. Finally, we conduct synthetic, and semi-synthetic experiments to demonstrate that the $k$-PC algorithm enables more robust causal discovery in the small sample regime compared to the baseline PC algorithm.  ( 2 min )
    Fast likelihood-based change point detection. (arXiv:2301.08892v1 [cs.LG])
    Change point detection plays a fundamental role in many real-world applications, where the goal is to analyze and monitor the behaviour of a data stream. In this paper, we study change detection in binary streams. To this end, we use a likelihood ratio between two models as a measure for indicating change. The first model is a single bernoulli variable while the second model divides the stored data in two segments, and models each segment with its own bernoulli variable. Finding the optimal split can be done in $O(n)$ time, where $n$ is the number of entries since the last change point. This is too expensive for large $n$. To combat this we propose an approximation scheme that yields $(1 - \epsilon)$ approximation in $O(\epsilon^{-1} \log^2 n)$ time. The speed-up consists of several steps: First we reduce the number of possible candidates by adopting a known result from segmentation problems. We then show that for fixed bernoulli parameters we can find the optimal change point in logarithmic time. Finally, we show how to construct a candidate list of size $O(\epsilon^{-1} \log n)$ for model parameters. We demonstrate empirically the approximation quality and the running time of our algorithm, showing that we can gain a significant speed-up with a minimal average loss in optimality.  ( 2 min )
    Quasi-optimal Learning with Continuous Treatments. (arXiv:2301.08940v1 [stat.ML])
    Many real-world applications of reinforcement learning (RL) require making decisions in continuous action environments. In particular, determining the optimal dose level plays a vital role in developing medical treatment regimes. One challenge in adapting existing RL algorithms to medical applications, however, is that the popular infinite support stochastic policies, e.g., Gaussian policy, may assign riskily high dosages and harm patients seriously. Hence, it is important to induce a policy class whose support only contains near-optimal actions, and shrink the action-searching area for effectiveness and reliability. To achieve this, we develop a novel \emph{quasi-optimal learning algorithm}, which can be easily optimized in off-policy settings with guaranteed convergence under general function approximations. Theoretically, we analyze the consistency, sample complexity, adaptability, and convergence of the proposed algorithm. We evaluate our algorithm with comprehensive simulated experiments and a dose suggestion real application to Ohio Type 1 diabetes dataset.  ( 2 min )
    Improving Deep Regression with Ordinal Entropy. (arXiv:2301.08915v1 [cs.CV])
    In computer vision, it is often observed that formulating regression problems as a classification task often yields better performance. We investigate this curious phenomenon and provide a derivation to show that classification, with the cross-entropy loss, outperforms regression with a mean squared error loss in its ability to learn high-entropy feature representations. Based on the analysis, we propose an ordinal entropy loss to encourage higher-entropy feature spaces while maintaining ordinal relationships to improve the performance of regression tasks. Experiments on synthetic and real-world regression tasks demonstrate the importance and benefits of increasing entropy for regression.  ( 2 min )
    Developing Hybrid Machine Learning Models to Assign Health Score to Railcar Fleets for Optimal Decision Making. (arXiv:2301.08877v1 [cs.LG])
    A large amount of data is generated during the operation of a railcar fleet, which can easily lead to dimensional disaster and reduce the resiliency of the railcar network. To solve these issues and offer predictive maintenance, this research introduces a hybrid fault diagnosis expert system method that combines density-based spatial clustering of applications with noise (DBSCAN) and principal component analysis (PCA). Firstly, the DBSCAN method is used to cluster categorical data that are similar to one another within the same group. Secondly, PCA algorithm is applied to reduce the dimensionality of the data and eliminate redundancy in order to improve the accuracy of fault diagnosis. Finally, we explain the engineered features and evaluate the selected models by using the Gain Chart and Area Under Curve (AUC) metrics. We use the hybrid expert system model to enhance maintenance planning decisions by assigning a health score to the railcar system of the North American Railcar Owner (NARO). According to the experimental results, our expert model can detect 96.4% of failures within 50% of the sample. This suggests that our method is effective at diagnosing failures in railcars fleet.  ( 2 min )
    Statistical Theory of Differentially Private Marginal-based Data Synthesis Algorithms. (arXiv:2301.08844v1 [cs.LG])
    Marginal-based methods achieve promising performance in the synthetic data competition hosted by the National Institute of Standards and Technology (NIST). To deal with high-dimensional data, the distribution of synthetic data is represented by a probabilistic graphical model (e.g., a Bayesian network), while the raw data distribution is approximated by a collection of low-dimensional marginals. Differential privacy (DP) is guaranteed by introducing random noise to each low-dimensional marginal distribution. Despite its promising performance in practice, the statistical properties of marginal-based methods are rarely studied in the literature. In this paper, we study DP data synthesis algorithms based on Bayesian networks (BN) from a statistical perspective. We establish a rigorous accuracy guarantee for BN-based algorithms, where the errors are measured by the total variation (TV) distance or the $L^2$ distance. Related to downstream machine learning tasks, an upper bound for the utility error of the DP synthetic data is also derived. To complete the picture, we establish a lower bound for TV accuracy that holds for every $\epsilon$-DP synthetic data generator.  ( 2 min )
    A Communication-Efficient Adaptive Algorithm for Federated Learning under Cumulative Regret. (arXiv:2301.08869v1 [cs.LG])
    We consider the problem of online stochastic optimization in a distributed setting with $M$ clients connected through a central server. We develop a distributed online learning algorithm that achieves order-optimal cumulative regret with low communication cost measured in the total number of bits transmitted over the entire learning horizon. This is in contrast to existing studies which focus on the offline measure of simple regret for learning efficiency. The holistic measure for communication cost also departs from the prevailing approach that \emph{separately} tackles the communication frequency and the number of bits in each communication round.  ( 2 min )
    Rationalization for Explainable NLP: A Survey. (arXiv:2301.08912v1 [cs.CL])
    Recent advances in deep learning have improved the performance of many Natural Language Processing (NLP) tasks such as translation, question-answering, and text classification. However, this improvement comes at the expense of model explainability. Black-box models make it difficult to understand the internals of a system and the process it takes to arrive at an output. Numerical (LIME, Shapley) and visualization (saliency heatmap) explainability techniques are helpful; however, they are insufficient because they require specialized knowledge. These factors led rationalization to emerge as a more accessible explainable technique in NLP. Rationalization justifies a model's output by providing a natural language explanation (rationale). Recent improvements in natural language generation have made rationalization an attractive technique because it is intuitive, human-comprehensible, and accessible to non-technical users. Since rationalization is a relatively new field, it is disorganized. As the first survey, rationalization literature in NLP from 2007-2022 is analyzed. This survey presents available methods, explainable evaluations, code, and datasets used across various NLP tasks that use rationalization. Further, a new subfield in Explainable AI (XAI), namely, Rational AI (RAI), is introduced to advance the current state of rationalization. A discussion on observed insights, challenges, and future directions is provided to point to promising research opportunities.  ( 2 min )
    Federated Recommendation with Additive Personalization. (arXiv:2301.09109v1 [cs.LG])
    With rising concerns about privacy, developing recommendation systems in a federated setting become a new paradigm to develop next-generation Internet service architecture. However, existing approaches are usually derived from a distributed recommendation framework with an additional mechanism for privacy protection, thus most of them fail to fully exploit personalization in the new context of federated recommendation settings. In this paper, we propose a novel approach called Federated Recommendation with Additive Personalization (FedRAP) to enhance recommendation by learning user embedding and the user's personal view of item embeddings. Specifically, the proposed additive personalization is to add a personalized item embedding to a sparse global item embedding aggregated from all users. Moreover, a curriculum learning mechanism has been applied for additive personalization on item embeddings by gradually increasing regularization weights to mitigate the performance degradation caused by large variances among client-specific item embeddings. A unified formulation has been proposed with a sparse regularization of global item embeddings for reducing communication overhead. Experimental results on four real-world recommendation datasets demonstrate the effectiveness of FedRAP.  ( 2 min )
    Slice Transformer and Self-supervised Learning for 6DoF Localization in 3D Point Cloud Maps. (arXiv:2301.08957v1 [cs.CV])
    Precise localization is critical for autonomous vehicles. We present a self-supervised learning method that employs Transformers for the first time for the task of outdoor localization using LiDAR data. We propose a pre-text task that reorganizes the slices of a $360^\circ$ LiDAR scan to leverage its axial properties. Our model, called Slice Transformer, employs multi-head attention while systematically processing the slices. To the best of our knowledge, this is the first instance of leveraging multi-head attention for outdoor point clouds. We additionally introduce the Perth-WA dataset, which provides a large-scale LiDAR map of Perth city in Western Australia, covering $\sim$4km$^2$ area. Localization annotations are provided for Perth-WA. The proposed localization method is thoroughly evaluated on Perth-WA and Appollo-SouthBay datasets. We also establish the efficacy of our self-supervised learning approach for the common downstream task of object classification using ModelNet40 and ScanNN datasets. The code and Perth-WA data will be publicly released.  ( 2 min )
    Estimation of Sea State Parameters from Ship Motion Responses Using Attention-based Neural Networks. (arXiv:2301.08949v1 [cs.LG])
    On-site estimation of sea state parameters is crucial for ship navigation systems' accuracy, stability, and efficiency. Extensive research has been conducted on model-based estimating methods utilizing only ship motion responses. Model-free approaches based on machine learning (ML) have recently gained popularity, and estimation from time-series of ship motion responses using deep learning (DL) methods has given promising results. Accordingly, in this study, we apply the novel, attention-based neural network (AT-NN) for estimating sea state parameters (wave height, zero-crossing period, and relative wave direction) from raw time-series data of ship pitch, heave, and roll motions. Despite using reduced input data, it has been successfully demonstrated that the proposed approaches by modified state-of-the-art techniques (based on convolutional neural networks (CNN) for regression, multivariate long short-term memory CNN, and sliding puzzle neural network) reduced estimation MSE by 23% and MAE by 16% compared to the original methods. Furthermore, the proposed technique based on AT-NN outperformed all tested methods (original and enhanced), reducing estimation MSE by up to 94% and MAE by up to 70%. Finally, we also proposed a novel approach for interpreting the uncertainty estimation of neural network outputs based on the Monte-Carlo dropout method to enhance the model's trustworthiness.  ( 2 min )
    Soft Sensing Regression Model: from Sensor to Wafer Metrology Forecasting. (arXiv:2301.08974v1 [cs.LG])
    The semiconductor industry is one of the most technology-evolving and capital-intensive market sectors. Effective inspection and metrology are necessary to improve product yield, increase product quality and reduce costs. In recent years, many semiconductor manufacturing equipments are equipped with sensors to facilitate real-time monitoring of the production process. These production-state and equipment-state sensor data provide an opportunity to practice machine-learning technologies in various domains, such as anomaly/fault detection, maintenance scheduling, quality prediction, etc. In this work, we focus on the task of soft sensing regression, which uses sensor data to predict impending inspection measurements that used to be measured in wafer inspection and metrology systems. We proposed an LSTM-based regressor and designed two loss functions for model training. Although engineers may look at our prediction errors in a subjective manner, a new piece-wise evaluation metric was proposed for assessing model accuracy in a mathematical way. The experimental results demonstrated that the proposed model can achieve accurate and early prediction of various types of inspections in complicated manufacturing processes.  ( 2 min )
    Bayesian Hierarchical Models for Counterfactual Estimation. (arXiv:2301.08833v1 [cs.LG])
    Counterfactual explanations utilize feature perturbations to analyze the outcome of an original decision and recommend an actionable recourse. We argue that it is beneficial to provide several alternative explanations rather than a single point solution and propose a probabilistic paradigm to estimate a diverse set of counterfactuals. Specifically, we treat the perturbations as random variables endowed with prior distribution functions. This allows sampling multiple counterfactuals from the posterior density, with the added benefit of incorporating inductive biases, preserving domain specific constraints and quantifying uncertainty in estimates. More importantly, we leverage Bayesian hierarchical modeling to share information across different subgroups of a population, which can both improve robustness and measure fairness. A gradient based sampler with superior convergence characteristics efficiently computes the posterior samples. Experiments across several datasets demonstrate that the counterfactuals estimated using our approach are valid, sparse, diverse and feasible.  ( 2 min )
    Dense RGB SLAM with Neural Implicit Maps. (arXiv:2301.08930v1 [cs.CV])
    There is an emerging trend of using neural implicit functions for map representation in Simultaneous Localization and Mapping (SLAM). Some pioneer works have achieved encouraging results on RGB-D SLAM. In this paper, we present a dense RGB SLAM method with neural implicit map representation. To reach this challenging goal without depth input, we introduce a hierarchical feature volume to facilitate the implicit map decoder. This design effectively fuses shape cues across different scales to facilitate map reconstruction. Our method simultaneously solves the camera motion and the neural implicit map by matching the rendered and input video frames. To facilitate optimization, we further propose a photometric warping loss in the spirit of multi-view stereo to better constrain the camera pose and scene geometry. We evaluate our method on commonly used benchmarks and compare it with modern RGB and RGB-D SLAM systems. Our method achieves favorable results than previous methods and even surpasses some recent RGB-D SLAM methods. Our source code will be publicly available.  ( 2 min )
    SPEC5G: A Dataset for 5G Cellular Network Protocol Analysis. (arXiv:2301.09201v1 [cs.IR])
    5G is the 5th generation cellular network protocol. It is the state-of-the-art global wireless standard that enables an advanced kind of network designed to connect virtually everyone and everything with increased speed and reduced latency. Therefore, its development, analysis, and security are critical. However, all approaches to the 5G protocol development and security analysis, e.g., property extraction, protocol summarization, and semantic analysis of the protocol specifications and implementations are completely manual. To reduce such manual effort, in this paper, we curate SPEC5G the first-ever public 5G dataset for NLP research. The dataset contains 3,547,586 sentences with 134M words, from 13094 cellular network specifications and 13 online websites. By leveraging large-scale pre-trained language models that have achieved state-of-the-art results on NLP tasks, we use this dataset for security-related text classification and summarization. Security-related text classification can be used to extract relevant security-related properties for protocol testing. On the other hand, summarization can help developers and practitioners understand the high level of the protocol, which is itself a daunting task. Our results show the value of our 5G-centric dataset in 5G protocol analysis automation. We believe that SPEC5G will enable a new research direction into automatic analyses for the 5G cellular network protocol and numerous related downstream tasks. Our data and code are publicly available.
    Is Signed Message Essential for Graph Neural Networks?. (arXiv:2301.08918v1 [cs.LG])
    Message-passing Graph Neural Networks (GNNs), which collect information from adjacent nodes, achieve satisfying results on homophilic graphs. However, their performances are dismal in heterophilous graphs, and many researchers have proposed a plethora of schemes to solve this problem. Especially, flipping the sign of edges is rooted in a strong theoretical foundation, and attains significant performance enhancements. Nonetheless, previous analyses assume a binary class scenario and they may suffer from confined applicability. This paper extends the prior understandings to multi-class scenarios and points out two drawbacks: (1) the sign of multi-hop neighbors depends on the message propagation paths and may incur inconsistency, (2) it also increases the prediction uncertainty (e.g., conflict evidence) which can impede the stability of the algorithm. Based on the theoretical understanding, we introduce a novel strategy that is applicable to multi-class graphs. The proposed scheme combines confidence calibration to secure robustness while reducing uncertainty. We show the efficacy of our theorem through extensive experiments on six benchmark graph datasets.  ( 2 min )
    On the Algebraic Properties of Flame Graphs. (arXiv:2301.08941v1 [cs.SE])
    Flame graphs are a popular way of representing profiling data. In this paper we propose a possible mathematical definition of flame graphs. In doing so, we gain some interesting algebraic properties almost for free, which in turn allow us to define some operations that can allow to perform an in-depth performance regression analysis. The typical documented use of a flame graph is via its graphical representation, whereby one scans the picture for the largest plateaux. Whilst this method is effective at finding the main sources of performance issues, it leaves quite a large amount of data potentially unused. By combining a mathematical precise definition of flame graphs with some statistical methods we show how to generalise this visual procedure and make the best of the full set of collected profiling data.  ( 2 min )
    ScaDLES: Scalable Deep Learning over Streaming data at the Edge. (arXiv:2301.08897v1 [cs.DC])
    Distributed deep learning (DDL) training systems are designed for cloud and data-center environments that assumes homogeneous compute resources, high network bandwidth, sufficient memory and storage, as well as independent and identically distributed (IID) data across all nodes. However, these assumptions don't necessarily apply on the edge, especially when training neural networks on streaming data in an online manner. Computing on the edge suffers from both systems and statistical heterogeneity. Systems heterogeneity is attributed to differences in compute resources and bandwidth specific to each device, while statistical heterogeneity comes from unbalanced and skewed data on the edge. Different streaming-rates among devices can be another source of heterogeneity when dealing with streaming data. If the streaming rate is lower than training batch-size, device needs to wait until enough samples have streamed in before performing a single iteration of stochastic gradient descent (SGD). Thus, low-volume streams act like stragglers slowing down devices with high-volume streams in synchronous training. On the other hand, data can accumulate quickly in the buffer if the streaming rate is too high and the devices can't train at line-rate. In this paper, we introduce ScaDLES to efficiently train on streaming data at the edge in an online fashion, while also addressing the challenges of limited bandwidth and training with non-IID data. We empirically show that ScaDLES converges up to 3.29 times faster compared to conventional distributed SGD.
    Cellular Network Speech Enhancement: Removing Background and Transmission Noise. (arXiv:2301.09027v1 [cs.SD])
    The primary objective of speech enhancement is to reduce background noise while preserving the target's speech. A common dilemma occurs when a speaker is confined to a noisy environment and receives a call with high background and transmission noise. To address this problem, the Deep Noise Suppression (DNS) Challenge focuses on removing the background noise with the next-generation deep learning models to enhance the target's speech; however, researchers fail to consider Voice Over IP (VoIP) applications their transmission noise. Focusing on Google Meet and its cellular application, our work achieves state-of-the-art performance on the Google Meet To Phone Track of the VoIP DNS Challenge. This paper demonstrates how to beat industrial performance and achieve 1.92 PESQ and 0.88 STOI, as well as superior acoustic fidelity, perceptual quality, and intelligibility in various metrics.
    Ti-MAE: Self-Supervised Masked Time Series Autoencoders. (arXiv:2301.08871v1 [cs.LG])
    Multivariate Time Series forecasting has been an increasingly popular topic in various applications and scenarios. Recently, contrastive learning and Transformer-based models have achieved good performance in many long-term series forecasting tasks. However, there are still several issues in existing methods. First, the training paradigm of contrastive learning and downstream prediction tasks are inconsistent, leading to inaccurate prediction results. Second, existing Transformer-based models which resort to similar patterns in historical time series data for predicting future values generally induce severe distribution shift problems, and do not fully leverage the sequence information compared to self-supervised methods. To address these issues, we propose a novel framework named Ti-MAE, in which the input time series are assumed to follow an integrate distribution. In detail, Ti-MAE randomly masks out embedded time series data and learns an autoencoder to reconstruct them at the point-level. Ti-MAE adopts mask modeling (rather than contrastive learning) as the auxiliary task and bridges the connection between existing representation learning and generative Transformer-based methods, reducing the difference between upstream and downstream forecasting tasks while maintaining the utilization of original time series data. Experiments on several public real-world datasets demonstrate that our framework of masked autoencoding could learn strong representations directly from the raw data, yielding better performance in time series forecasting and classification tasks.  ( 2 min )
    Tier Balancing: Towards Dynamic Fairness over Underlying Causal Factors. (arXiv:2301.08987v1 [cs.LG])
    The pursuit of long-term fairness involves the interplay between decision-making and the underlying data generating process. In this paper, through causal modeling with a directed acyclic graph (DAG) on the decision-distribution interplay, we investigate the possibility of achieving long-term fairness from a dynamic perspective. We propose Tier Balancing, a technically more challenging but more natural notion to achieve in the context of long-term, dynamic fairness analysis. Different from previous fairness notions that are defined purely on observed variables, our notion goes one step further, capturing behind-the-scenes situation changes on the unobserved latent causal factors that directly carry out the influence from the current decision to the future data distribution. Under the specified dynamics, we prove that in general one cannot achieve the long-term fairness goal only through one-step interventions. Furthermore, in the effort of approaching long-term fairness, we consider the mission of "getting closer to" the long-term fairness goal and present possibility and impossibility results accordingly.
    Versatile Neural Processes for Learning Implicit Neural Representations. (arXiv:2301.08883v1 [cs.LG])
    Representing a signal as a continuous function parameterized by neural network (a.k.a. Implicit Neural Representations, INRs) has attracted increasing attention in recent years. Neural Processes (NPs), which model the distributions over functions conditioned on partial observations (context set), provide a practical solution for fast inference of continuous functions. However, existing NP architectures suffer from inferior modeling capability for complex signals. In this paper, we propose an efficient NP framework dubbed Versatile Neural Processes (VNP), which largely increases the capability of approximating functions. Specifically, we introduce a bottleneck encoder that produces fewer and informative context tokens, relieving the high computational cost while providing high modeling capability. At the decoder side, we hierarchically learn multiple global latent variables that jointly model the global structure and the uncertainty of a function, enabling our model to capture the distribution of complex signals. We demonstrate the effectiveness of the proposed VNP on a variety of tasks involving 1D, 2D and 3D signals. Particularly, our method shows promise in learning accurate INRs w.r.t. a 3D scene without further finetuning.  ( 2 min )
    Geometry-Aware Supertagging with Heterogeneous Dynamic Convolutions. (arXiv:2203.12235v3 [cs.CL] UPDATED)
    The syntactic categories of categorial grammar formalisms are structured units made of smaller, indivisible primitives, bound together by the underlying grammar's category formation rules. In the trending approach of constructive supertagging, neural models are increasingly made aware of the internal category structure, which in turn enables them to more reliably predict rare and out-of-vocabulary categories, with significant implications for grammars previously deemed too complex to find practical use. In this work, we revisit constructive supertagging from a graph-theoretic perspective, and propose a framework based on heterogeneous dynamic graph convolutions aimed at exploiting the distinctive structure of a supertagger's output space. We test our approach on a number of categorial grammar datasets spanning different languages and grammar formalisms, achieving substantial improvements over previous state of the art scores. Code will be made available at https://github.com/konstantinosKokos/dynamic-graph-supertagging  ( 2 min )
    Comparing different subgradient methods for solving convex optimization problems with functional constraints. (arXiv:2101.01045v2 [math.OC] UPDATED)
    We consider the problem of minimizing a convex, nonsmooth function subject to a closed convex constraint domain. The methods that we propose are reforms of subgradient methods based on Metel--Takeda's paper [Optimization Letters 15.4 (2021): 1491-1504] and Boyd's works [Lecture notes of EE364b, Stanford University, Spring 2013-14, pp. 1-39]. While the former has complexity $\mathcal{O}(\varepsilon^{-2r})$ for all $r> 1$, the complexity of the latter is $\mathcal{O}(\varepsilon^{-2})$. We perform some comparisons between these two methods using several test examples.  ( 2 min )
    Problem-dependent attention and effort in neural networks with application to image resolution and model selection. (arXiv:2201.01415v3 [cs.CV] UPDATED)
    This paper introduces a new ensemble-based approach to reduce the data and computation costs of accurate classification. When faced with a new test case, a low cost classifier is used first, only moving to a higher cost approach if the initial classifier does not have a high degree of confidence in its projection. This multi-stage strategy can be used with any set of classifiers and does not require additional training. The approach is first applied to reduce the amount of data required to classify test images; it is found to be effective for problems in which at least some fraction of cases can be correctly classified based upon coarser data than are typically used. For neural networks performing digit recognition, for example, the proposed approach reduces the number of bytes of data read by 60% to 85% with less than 5% reduction in accuracy. For the ImageNet data, the number of bytes read by the typical network is reduced by 20% with less than 5% reduction in accuracy -- and in some cases, the resource savings reach 40%. The second application is to reduce computational complexity, with simpler neural networks used for test cases that are easier to classify and complex networks used for more difficult cases. For classification both of digits and of ImageNet images, computation cost is reduced by as much as 82% to 89% with less than 5% reduction in accuracy. The results also show that, for situations in which computational cost is not a concern, calculating multiple models' projections and selecting the one from the most confident classifier can increase classification accuracy on ImageNet by as much as two percent over the best standalone classifier considered here.  ( 3 min )
    Learning Interpretable Models Using an Oracle. (arXiv:1906.06852v5 [cs.LG] UPDATED)
    We look at a specific aspect of model interpretability: models often need to be constrained in size for them to be considered interpretable. But smaller models also tend to have high bias. This suggests a trade-off between interpretability and accuracy. Our work addresses this by: (a) showing that learning a training distribution (often different from the test distribution) can often increase accuracy of small models, and therefore may be used as a strategy to compensate for small sizes, and (b) providing a model-agnostic algorithm to learn such training distributions. We pose the distribution learning problem as one of optimizing parameters for an Infinite Beta Mixture Model based on a Dirichlet Process, so that the held-out accuracy of a model trained on a sample from this distribution is maximized. To make computation tractable, we project the training data onto one dimension: prediction uncertainty scores as provided by a highly accurate oracle model. A Bayesian Optimizer is used for learning the parameters. Empirical results using multiple real world datasets, various oracles and interpretable models with different notions of model sizes, are presented. We observe significant relative improvements in the F1-score in most cases, occasionally seeing improvements greater than 100% over baselines. Additionally we show that the proposed algorithm provides the following benefits: (a) its a framework which allows for flexibility in implementation, (b) it can be used across feature spaces, e.g., the text classification accuracy of a Decision Tree using character n-grams is shown to improve when using a Gated Recurrent Unit as an oracle, which uses a sequence of characters as its input, (c) it can be used to train models that have a non-differentiable training loss, e.g., Decision Trees, and (d) reasonable defaults exist for most parameters of the algorithm, which makes it convenient to use.  ( 3 min )
    Pruning coupled with learning, ensembles of minimal neural networks, and future of XAI. (arXiv:2005.06284v3 [cs.LG] UPDATED)
    Pruning coupled with learning aims to optimize the neural network (NN) structure for solving specific problems. This optimization can be used for various purposes: to prevent overfitting, to save resources for implementation and training, to provide explainability of the trained NN, and many others. The minimal structure that cannot be pruned further is not unique. Ensemble of minimal structures can be used as a committee of intellectual agents that solves problems by voting. Each minimal NN presents an "empirical knowledge" about the problem and can be verbalized. The non-uniqueness of such knowledge extracted from data is an important property of data-driven Artificial Intelligence (AI). In this work, we review an approach to pruning based on the principle: What controls training should control pruning. This principle is expected to work both for artificial NN and for selection and modification of important synaptic contacts in brain. In back-propagation artificial NN learning is controlled by the gradient of loss functions. Therefore, the first order sensitivity indicators are used for pruning and the algorithms based on these indicators are reviewed. The notion of logically transparent NN was introduced. The approach was illustrated on the problem of political forecasting: predicting the results of the US presidential election. Eight minimal NN were produced that give different forecasting algorithms. The non-uniqueness of solution can be utilised by creation of expert panels (committee). Another use of NN pluralism is to identify areas of input signals where further data collection is most useful. In Conclusion, we discuss the possible future of widely advertised XAI program.  ( 3 min )
    Be More Active! Understanding the Differences between Mean and Sampled Representations of Variational Autoencoders. (arXiv:2109.12679v3 [cs.LG] UPDATED)
    The ability of Variational Autoencoders to learn disentangled representations has made them appealing for practical applications. However, their mean representations, which are generally used for downstream tasks, have recently been shown to be more correlated than their sampled counterpart, on which disentanglement is usually measured. In this paper, we refine this observation through the lens of selective posterior collapse, which states that only a subset of the learned representations, the active variables, is encoding useful information while the rest (the passive variables) is discarded. We first extend the existing definition to multiple data examples and show that active variables are equally disentangled in mean and sampled representations. Based on this extension and the pre-trained models from disentanglement lib, we then isolate the passive variables and show that they are responsible for the discrepancies between mean and sampled representations. Specifically, passive variables exhibit high correlation scores with other variables in mean representations while being fully uncorrelated in sampled ones. We thus conclude that despite what their higher correlation might suggest, mean representations are still good candidates for downstream tasks applications. However, it may be beneficial to remove their passive variables, especially when used with models sensitive to correlated features.  ( 2 min )
    Continuous-time identification of dynamic state-space models by deep subspace encoding. (arXiv:2204.09405v2 [cs.LG] UPDATED)
    Continuous-time (CT) modeling has proven to provide improved sample efficiency and interpretability in learning the dynamical behavior of physical systems compared to discrete-time (DT) models. However, even with numerous recent developments, the CT nonlinear state-space (NL-SS) model identification problem remains to be solved in full, considering common experimental aspects such as the presence of external inputs, measurement noise, latent states, and general robustness. This paper presents a novel estimation method that addresses all these aspects and that can obtain state-of-the-art results on multiple benchmarks with compact fully connected neural networks capturing the CT dynamics. The proposed estimation method called the subspace encoder approach (SUBNET) ascertains these results by efficiently approximating the complete simulation loss by evaluating short simulations on subsections of the data, by using an encoder function to estimate the initial state for each subsection and a novel state-derivative normalization to ensure stability and good numerical conditioning of the training process. We prove that the use of subsections increases cost function smoothness together with the necessary requirements for the existence of the encoder function and we show that the proposed state-derivative normalization is essential for reliable estimation of CT NL-SS models.  ( 2 min )
    Explainable Multilayer Graph Neural Network for Cancer Gene Prediction. (arXiv:2301.08831v1 [cs.LG])
    The identification of cancer genes is a critical, yet challenging problem in cancer genomics research. Recently, several computational methods have been developed to address this issue, including deep neural networks. However, these methods fail to exploit the multilayered gene-gene interactions and provide little to no explanation for their predictions. Results: In this study, we propose an Explainable Multilayer Graph Neural Network (EMGNN) approach to identify cancer genes, by leveraging multiple gene-gene interaction networks and multi-omics data. Compared to conventional graph learning methods, EMGNN learned complementary information in multiple graphs to accurately predict cancer genes. Our method consistently outperforms existing approaches while providing valuable biological insights into its predictions. We further release our novel cancer gene predictions and connect them with known cancer patterns, aiming to accelerate the progress of cancer research  ( 2 min )
    Limitations of Piecewise Linearity for Efficient Robustness Certification. (arXiv:2301.08842v1 [cs.LG])
    Certified defenses against small-norm adversarial examples have received growing attention in recent years; though certified accuracies of state-of-the-art methods remain far below their non-robust counterparts, despite the fact that benchmark datasets have been shown to be well-separated at far larger radii than the literature generally attempts to certify. In this work, we offer insights that identify potential factors in this performance gap. Specifically, our analysis reveals that piecewise linearity imposes fundamental limitations on the tightness of leading certification techniques. These limitations are felt in practical terms as a greater need for capacity in models hoped to be certified efficiently. Moreover, this is in addition to the capacity necessary to learn a robust boundary, studied in prior work. However, we argue that addressing the limitations of piecewise linearity through scaling up model capacity may give rise to potential difficulties -- particularly regarding robust generalization -- therefore, we conclude by suggesting that developing smooth activation functions may be the way forward for advancing the performance of certified neural networks.  ( 2 min )
    Dynamic MLP for MRI Reconstruction. (arXiv:2301.08868v1 [eess.IV])
    As convolutional neural networks (CNN) become the most successful reconstruction technique for accelerated Magnetic Resonance Imaging (MRI), CNN reaches its limit on image quality especially in sharpness. Further improvement on image quality often comes at massive computational costs, hindering their practicability in the clinic setting. MRI reconstruction is essentially a deconvolution problem, which demands long-distance information that is difficult to be captured by CNNs with small convolution kernels. The multi-layer perceptron (MLP) is able to model such long-distance information, but it restricts a fixed input size while the reconstruction of images in flexible resolutions is required in the clinic setting. In this paper, we proposed a hybrid CNN and MLP reconstruction strategy, featured by dynamic MLP (dMLP) that accepts arbitrary image sizes. Experiments were conducted using 3D multi-coil MRI. Our results suggested the proposed dMLP can improve image sharpness compared to its pure CNN counterpart, while costing minor additional GPU memory and computation time. We further compared the proposed dMLP with CNNs using large kernels and studied pure MLP-based reconstruction using a stack of 1D dMLPs, as well as its CNN counterpart using only 1D convolutions. We observed the enlarged receptive field has noticeably improved image quality, while simply using CNN with a large kernel leads to difficulties in training. Noticeably, the pure MLP-based method has been outperformed by CNN-involved methods, which matches the observations in other computer vision tasks for natural images.  ( 2 min )
    Computing equilibria by minimizing exploitability with best-response ensembles. (arXiv:2301.08830v1 [cs.GT])
    In this paper, we study the problem of computing an approximate Nash equilibrium of a continuous game. Such games naturally model many situations involving space, time, money, and other fine-grained resources or quantities. The standard measure of the closeness of a strategy profile to Nash equilibrium is exploitability, which measures how much utility players can gain from changing their strategy unilaterally. We introduce a new equilibrium-finding method that minimizes an approximation of the exploitability. This approximation employs a best-response ensemble for each player that maintains multiple candidate best responses for that player. In each iteration, the best-performing element of each ensemble is used in a gradient-based scheme to update the current strategy profile. The strategy profile and best-response ensembles are simultaneously trained to minimize and maximize the approximate exploitability, respectively. Experiments on a suite of benchmark games show that it outperforms previous methods.  ( 2 min )
    Towards Flexibility and Interpretability of Gaussian Process State-Space Model. (arXiv:2301.08843v1 [cs.LG])
    Gaussian process state-space model (GPSSM) has attracted much attention over the past decade. However, the model representation power of GPSSM is far from satisfactory. Most GPSSM works rely on the standard Gaussian process (GP) with a preliminary kernel, such as squared exponential (SE) kernel and Mat\'{e}rn kernel, which limit the model representation power and its application in complex scenarios. To address this issue, this paper proposes a novel class of probabilistic state-space model named TGPSSM that enriches the GP priors in the standard GPSSM through parametric normalizing flow, making the state-space model more flexible and expressive. In addition, by inheriting the advantages of sparse representation of GP models, we propose a scalable and interpretable variational learning algorithm to learn the TGPSSM and infer the latent dynamics simultaneously. By integrating a constrained optimization framework and explicitly constructing a non-Gaussian state variational distribution, the proposed learning algorithm enables the TGPSSM to significantly improve the capabilities of state space representation and model inference. Experimental results based on various synthetic and real datasets corroborate that the proposed TGPSSM yields superior learning and inference performance compared to several state-of-the-art methods. The accompanying source code is available at https://github.com/zhidilin/TGPSSM.  ( 2 min )
    Split Ways: Privacy-Preserving Training of Encrypted Data Using Split Learning. (arXiv:2301.08778v1 [cs.CR])
    Split Learning (SL) is a new collaborative learning technique that allows participants, e.g. a client and a server, to train machine learning models without the client sharing raw data. In this setting, the client initially applies its part of the machine learning model on the raw data to generate activation maps and then sends them to the server to continue the training process. Previous works in the field demonstrated that reconstructing activation maps could result in privacy leakage of client data. In addition to that, existing mitigation techniques that overcome the privacy leakage of SL prove to be significantly worse in terms of accuracy. In this paper, we improve upon previous works by constructing a protocol based on U-shaped SL that can operate on homomorphically encrypted data. More precisely, in our approach, the client applies Homomorphic Encryption (HE) on the activation maps before sending them to the server, thus protecting user privacy. This is an important improvement that reduces privacy leakage in comparison to other SL-based works. Finally, our results show that, with the optimum set of parameters, training with HE data in the U-shaped SL setting only reduces accuracy by 2.65% compared to training on plaintext. In addition, raw training data privacy is preserved.  ( 2 min )
    ManyDG: Many-domain Generalization for Healthcare Applications. (arXiv:2301.08834v1 [cs.LG])
    The vast amount of health data has been continuously collected for each patient, providing opportunities to support diverse healthcare predictive tasks such as seizure detection and hospitalization prediction. Existing models are mostly trained on other patients data and evaluated on new patients. Many of them might suffer from poor generalizability. One key reason can be overfitting due to the unique information related to patient identities and their data collection environments, referred to as patient covariates in the paper. These patient covariates usually do not contribute to predicting the targets but are often difficult to remove. As a result, they can bias the model training process and impede generalization. In healthcare applications, most existing domain generalization methods assume a small number of domains. In this paper, considering the diversity of patient covariates, we propose a new setting by treating each patient as a separate domain (leading to many domains). We develop a new domain generalization method ManyDG, that can scale to such many-domain problems. Our method identifies the patient domain covariates by mutual reconstruction and removes them via an orthogonal projection step. Extensive experiments show that ManyDG can boost the generalization performance on multiple real-world healthcare tasks (e.g., 3.7% Jaccard improvements on MIMIC drug recommendation) and support realistic but challenging settings such as insufficient data and continuous learning.  ( 2 min )
    AQuaMaM: An Autoregressive, Quaternion Manifold Model for Rapidly Estimating Complex SO(3) Distributions. (arXiv:2301.08838v1 [cs.LG])
    Accurately modeling complex, multimodal distributions is necessary for optimal decision-making, but doing so for rotations in three-dimensions, i.e., the SO(3) group, is challenging due to the curvature of the rotation manifold. The recently described implicit-PDF (IPDF) is a simple, elegant, and effective approach for learning arbitrary distributions on SO(3) up to a given precision. However, inference with IPDF requires $N$ forward passes through the network's final multilayer perceptron (where $N$ places an upper bound on the likelihood that can be calculated by the model), which is prohibitively slow for those without the computational resources necessary to parallelize the queries. In this paper, I introduce AQuaMaM, a neural network capable of both learning complex distributions on the rotation manifold and calculating exact likelihoods for query rotations in a single forward pass. Specifically, AQuaMaM autoregressively models the projected components of unit quaternions as mixtures of uniform distributions that partition their geometrically-restricted domain of values. When trained on an "infinite" toy dataset with ambiguous viewpoints, AQuaMaM rapidly converges to a sampling distribution closely matching the true data distribution. In contrast, the sampling distribution for IPDF dramatically diverges from the true data distribution, despite IPDF approaching its theoretical minimum evaluation loss during training. When trained on a constructed dataset of 500,000 renders of a die in different rotations, AQuaMaM reaches a test log-likelihood 14% higher than IPDF. Further, compared to IPDF, AQuaMaM uses 24% fewer parameters, has a prediction throughput 52$\times$ faster on a single GPU, and converges in a similar amount of time during training.  ( 2 min )
    Compact Optimization Learning for AC Optimal Power Flow. (arXiv:2301.08840v1 [cs.LG])
    This paper reconsiders end-to-end learning approaches to the Optimal Power Flow (OPF). Existing methods, which learn the input/output mapping of the OPF, suffer from scalability issues due to the high dimensionality of the output space. This paper first shows that the space of optimal solutions can be significantly compressed using principal component analysis (PCA). It then proposes Compact Learning, a new method that learns in a subspace of the principal components before translating the vectors into the original output space. This compression reduces the number of trainable parameters substantially, improving scalability and effectiveness. Compact Learning is evaluated on a variety of test cases from the PGLib with up to 30,000 buses. The paper also shows that the output of Compact Learning can be used to warm-start an exact AC solver to restore feasibility, while bringing significant speed-ups.  ( 2 min )
    Optimized learned entropy coding parameters for practical neural-based image and video compression. (arXiv:2301.08752v1 [eess.IV])
    Neural-based image and video codecs are significantly more power-efficient when weights and activations are quantized to low-precision integers. While there are general-purpose techniques for reducing quantization effects, large losses can occur when specific entropy coding properties are not considered. This work analyzes how entropy coding is affected by parameter quantizations, and provides a method to minimize losses. It is shown that, by using a certain type of coding parameters to be learned, uniform quantization becomes practically optimal, also simplifying the minimization of code memory requirements. The mathematical properties of the new representation are presented, and its effectiveness is demonstrated by coding experiments, showing that good results can be obtained with precision as low as 4~bits per network output, and practically no loss with 8~bits.  ( 2 min )
    GBOSE: Generalized Bandit Orthogonalized Semiparametric Estimation. (arXiv:2301.08781v1 [cs.LG])
    In sequential decision-making scenarios i.e., mobile health recommendation systems revenue management contextual multi-armed bandit algorithms have garnered attention for their performance. But most of the existing algorithms are built on the assumption of a strictly parametric reward model mostly linear in nature. In this work we propose a new algorithm with a semi-parametric reward model with state-of-the-art complexity of upper bound on regret amongst existing semi-parametric algorithms. Our work expands the scope of another representative algorithm of state-of-the-art complexity with a similar reward model by proposing an algorithm built upon the same action filtering procedures but provides explicit action selection distribution for scenarios involving more than two arms at a particular time step while requiring fewer computations. We derive the said complexity of the upper bound on regret and present simulation results that affirm our methods superiority out of all prevalent semi-parametric bandit algorithms for cases involving over two arms.  ( 2 min )
    Active Learning of Piecewise Gaussian Process Surrogates. (arXiv:2301.08789v1 [cs.LG])
    Active learning of Gaussian process (GP) surrogates has been useful for optimizing experimental designs for physical/computer simulation experiments, and for steering data acquisition schemes in machine learning. In this paper, we develop a method for active learning of piecewise, Jump GP surrogates. Jump GPs are continuous within, but discontinuous across, regions of a design space, as required for applications spanning autonomous materials design, configuration of smart factory systems, and many others. Although our active learning heuristics are appropriated from strategies originally designed for ordinary GPs, we demonstrate that additionally accounting for model bias, as opposed to the usual model uncertainty, is essential in the Jump GP context. Toward that end, we develop an estimator for bias and variance of Jump GP models. Illustrations, and evidence of the advantage of our proposed methods, are provided on a suite of synthetic benchmarks, and real-simulation experiments of varying complexity.  ( 2 min )
    Estimation of mitral valve hinge point coordinates -- deep neural net for echocardiogram segmentation. (arXiv:2301.08782v1 [eess.IV])
    Cardiac image segmentation is a powerful tool in regard to diagnostics and treatment of cardiovascular diseases. Purely feature-based detection of anatomical structures like the mitral valve is a laborious task due to specifically required feature engineering and is especially challenging in echocardiograms, because of their inherently low contrast and blurry boundaries between some anatomical structures. With the publication of further annotated medical datasets and the increase in GPU processing power, deep learning-based methods in medical image segmentation became more feasible in the past years. We propose a fully automatic detection method for mitral valve hinge points, which uses a U-Net based deep neural net to segment cardiac chambers in echocardiograms in a first step, and subsequently extracts the mitral valve hinge points from the resulting segmentations in a second step. Results measured with this automatic detection method were compared to reference coordinate values, which with median absolute hinge point coordinate errors of 1.35 mm for the x- (15-85 percentile range: [0.3 mm; 3.15 mm]) and 0.75 mm for the y- coordinate (15-85 percentile range: [0.15 mm; 1.88 mm]).  ( 2 min )
    Domain-agnostic and Multi-level Evaluation of Generative Models. (arXiv:2301.08750v1 [cs.LG])
    While the capabilities of generative models heavily improved in different domains (images, text, graphs, molecules, etc.), their evaluation metrics largely remain based on simplified quantities or manual inspection with limited practicality. To this end, we propose a framework for Multi-level Performance Evaluation of Generative mOdels (MPEGO), which could be employed across different domains. MPEGO aims to quantify generation performance hierarchically, starting from a sub-feature-based low-level evaluation to a global features-based high-level evaluation. MPEGO offers great customizability as the employed features are entirely user-driven and can thus be highly domain/problem-specific while being arbitrarily complex (e.g., outcomes of experimental procedures). We validate MPEGO using multiple generative models across several datasets from the material discovery domain. An ablation study is conducted to study the plausibility of intermediate steps in MPEGO. Results demonstrate that MPEGO provides a flexible, user-driven, and multi-level evaluation framework, with practical insights on the generation quality. The framework, source code, and experiments will be available at https://github.com/GT4SD/mpego.  ( 2 min )
    Towards Understanding How Self-training Tolerates Data Backdoor Poisoning. (arXiv:2301.08751v1 [cs.LG])
    Recent studies on backdoor attacks in model training have shown that polluting a small portion of training data is sufficient to produce incorrect manipulated predictions on poisoned test-time data while maintaining high clean accuracy in downstream tasks. The stealthiness of backdoor attacks has imposed tremendous defense challenges in today's machine learning paradigm. In this paper, we explore the potential of self-training via additional unlabeled data for mitigating backdoor attacks. We begin by making a pilot study to show that vanilla self-training is not effective in backdoor mitigation. Spurred by that, we propose to defend the backdoor attacks by leveraging strong but proper data augmentations in the self-training pseudo-labeling stage. We find that the new self-training regime help in defending against backdoor attacks to a great extent. Its effectiveness is demonstrated through experiments for different backdoor triggers on CIFAR-10 and a combination of CIFAR-10 with an additional unlabeled 500K TinyImages dataset. Finally, we explore the direction of combining self-supervised representation learning with self-training for further improvement in backdoor defense.  ( 2 min )
    CSwin2SR: Circular Swin2SR for Compressed Image Super-Resolution. (arXiv:2301.08749v1 [eess.IV])
    Closed-loop negative feedback mechanism is extensively utilized in automatic control systems and brings about extraordinary dynamic and static performance. In order to further improve the reconstruction capability of current methods of compressed image super-resolution, a circular Swin2SR (CSwin2SR) approach is proposed. The CSwin2SR contains a serial Swin2SR for initial super-resolution reestablishment and circular Swin2SR for enhanced super-resolution reestablishment. Simulated experimental results show that the proposed CSwin2SR dramatically outperforms the classical Swin2SR in the capacity of super-resolution recovery. On DIV2K test and valid datasets, the average increment of PSNR is greater than 1dB and the related average increment of SSIM is greater than 0.006.  ( 2 min )
    Towards a Measure of Trustworthiness to Evaluate CNNs During Operation. (arXiv:2301.08839v1 [cs.LG])
    Due to black box nature of Convolutional neural networks (CNNs), the continuous validation of CNN classifiers' during operation is infeasible. As a result this makes it difficult for developers or regulators to gain confidence in the deployment of autonomous systems employing CNNs. We introduce the trustworthiness in classification score (TCS), a metric to assist with overcoming this challenge. The metric quantifies the trustworthiness in a prediction by checking for the existence of certain features in the predictions made by the CNN. A case study on persons detection is used to to demonstrate our method and the usage of TCS.  ( 2 min )
    Causal Inference under Data Restrictions. (arXiv:2301.08788v1 [stat.ME])
    This dissertation focuses on modern causal inference under uncertainty and data restrictions, with applications to neoadjuvant clinical trials, distributed data networks, and robust individualized decision making. In the first project, we propose a method under the principal stratification framework to identify and estimate the average treatment effects on a binary outcome, conditional on the counterfactual status of a post-treatment intermediate response. Under mild assumptions, the treatment effect of interest can be identified. We extend the approach to address censored outcome data. The proposed method is applied to a neoadjuvant clinical trial and its performance is evaluated via simulation studies. In the second project, we propose a tree-based model averaging approach to improve the estimation accuracy of conditional average treatment effects at a target site by leveraging models derived from other potentially heterogeneous sites, without them sharing subject-level data. The performance of this approach is demonstrated by a study of the causal effects of oxygen therapy on hospital survival rates and backed up by comprehensive simulations. In the third project, we propose a robust individualized decision learning framework with sensitive variables to improve the worst-case outcomes of individuals caused by sensitive variables that are unavailable at the time of decision. Unlike most existing work that uses mean-optimal objectives, we propose a robust learning framework by finding a newly defined quantile- or infimum-optimal decision rule. From a causal perspective, we also generalize the classic notion of (average) fairness to conditional fairness for individual subjects. The reliable performance of the proposed method is demonstrated through synthetic experiments and three real-data applications.  ( 2 min )
    An Automated Vulnerability Detection Framework for Smart Contracts. (arXiv:2301.08824v1 [cs.CR])
    With the increase of the adoption of blockchain technology in providing decentralized solutions to various problems, smart contracts have become more popular to the point that billions of US Dollars are currently exchanged every day through such technology. Meanwhile, various vulnerabilities in smart contracts have been exploited by attackers to steal cryptocurrencies worth millions of dollars. The automatic detection of smart contract vulnerabilities therefore is an essential research problem. Existing solutions to this problem particularly rely on human experts to define features or different rules to detect vulnerabilities. However, this often causes many vulnerabilities to be ignored, and they are inefficient in detecting new vulnerabilities. In this study, to overcome such challenges, we propose a framework to automatically detect vulnerabilities in smart contracts on the blockchain. More specifically, first, we utilize novel feature vector generation techniques from bytecode of smart contract since the source code of smart contracts are rarely available in public. Next, the collected vectors are fed into our novel metric learning-based deep neural network(DNN) to get the detection result. We conduct comprehensive experiments on large-scale benchmarks, and the quantitative results demonstrate the effectiveness and efficiency of our approach.  ( 2 min )
  • Open

    Probabilistic Surrogate Networks for Simulators with Unbounded Randomness. (arXiv:1910.11950v3 [cs.LG] UPDATED)
    We present a framework for automatically structuring and training fast, approximate, deep neural surrogates of stochastic simulators. Unlike traditional approaches to surrogate modeling, our surrogates retain the interpretable structure and control flow of the reference simulator. Our surrogates target stochastic simulators where the number of random variables itself can be stochastic and potentially unbounded. Our framework further enables an automatic replacement of the reference simulator with the surrogate when undertaking amortized inference. The fidelity and speed of our surrogates allow for both faster stochastic simulation and accurate and substantially faster posterior inference. Using an illustrative yet non-trivial example we show our surrogates' ability to accurately model a probabilistic program with an unbounded number of random variables. We then proceed with an example that shows our surrogates are able to accurately model a complex structure like an unbounded stack in a program synthesis example. We further demonstrate how our surrogate modeling technique makes amortized inference in complex black-box simulators an order of magnitude faster. Specifically, we do simulator-based materials quality testing, inferring safety-critical latent internal temperature profiles of composite materials undergoing curing.  ( 2 min )
    Tailoring to the Tails: Risk Measures for Fine-Grained Tail Sensitivity. (arXiv:2208.03066v2 [cs.LG] UPDATED)
    Expected risk minimization (ERM) is at the core of many machine learning systems. This means that the risk inherent in a loss distribution is summarized using a single number - its average. In this paper, we propose a general approach to construct risk measures which exhibit a desired tail sensitivity and may replace the expectation operator in ERM. Our method relies on the specification of a reference distribution with a desired tail behaviour, which is in a one-to-one correspondence to a coherent upper probability. Any risk measure, which is compatible with this upper probability, displays a tail sensitivity which is finely tuned to the reference distribution. As a concrete example, we focus on divergence risk measures based on f-divergence ambiguity sets, which are a widespread tool used to foster distributional robustness of machine learning systems. For instance, we show how ambiguity sets based on the Kullback-Leibler divergence are intricately tied to the class of subexponential random variables. We elaborate the connection of divergence risk measures and rearrangement invariant Banach norms.  ( 2 min )
    Estimating individual treatment effects under unobserved confounding using binary instruments. (arXiv:2208.08544v3 [stat.ME] UPDATED)
    Estimating conditional average treatment effects (CATEs) from observational data is relevant in many fields such as personalized medicine. However, in practice, the treatment assignment is usually confounded by unobserved variables and thus introduces bias. A remedy to remove the bias is the use of instrumental variables (IVs). Such settings are widespread in medicine (e.g., trials where the treatment assignment is used as binary IV). In this paper, we propose a novel, multiply robust machine learning framework, called MRIV, for estimating CATEs using binary IVs and thus yield an unbiased CATE estimator. Different from previous work for binary IVs, our framework estimates the CATE directly via a pseudo outcome regression. (1)~We provide a theoretical analysis where we show that our framework yields multiple robust convergence rates: our CATE estimator achieves fast convergence even if several nuisance estimators converge slowly. (2)~We further show that our framework asymptotically outperforms state-of-the-art plug-in IV methods for CATE estimation, in the sense that it achieves a faster rate of convergence if the CATE is smoother than the individual outcome surfaces. (3)~We build upon our theoretical results and propose a tailored deep neural network architecture called MRIV-Net for CATE estimation using binary IVs. Across various computational experiments, we demonstrate empirically that our MRIV-Net achieves state-of-the-art performance. To the best of our knowledge, our MRIV is the first multiply robust machine learning framework tailored to estimating CATEs in the binary IV setting.  ( 2 min )
    A Multi-Phase Approach for Product Hierarchy Forecasting in Supply Chain Management: Application to MonarchFx Inc. (arXiv:2006.08931v2 [stat.ML] UPDATED)
    Hierarchical time series demands exist in many industries and are often associated with the product, time frame, or geographic aggregations. Traditionally, these hierarchies have been forecasted using top-down, bottom-up, or middle-out approaches. The question we aim to answer is how to utilize child-level forecasts to improve parent-level forecasts in a hierarchical supply chain. Improved forecasts can be used to considerably reduce logistics costs, especially in e-commerce. We propose a novel multi-phase hierarchical (MPH) approach. Our method involves forecasting each series in the hierarchy independently using machine learning models, then combining all forecasts to allow a second phase model estimation at the parent level. Sales data from MonarchFx Inc. (a logistics solutions provider) is used to evaluate our approach and compare it to bottom-up and top-down methods. Our results demonstrate an 82-90% improvement in forecast accuracy using the proposed approach. Using the proposed method, supply chain planners can derive more accurate forecasting models to exploit the benefit of multivariate data.  ( 2 min )
    Autoencoding Hyperbolic Representation for Adversarial Generation. (arXiv:2201.12825v3 [cs.LG] UPDATED)
    With the recent advance of geometric deep learning, neural networks have been extensively used for data in non-Euclidean domains. In particular, hyperbolic neural networks have proved successful in processing hierarchical information of data. However, many hyperbolic neural networks are numerically unstable during training, which precludes using complex architectures. This crucial problem makes it difficult to build hyperbolic generative models for real and complex data. In this work, we propose a hyperbolic generative network in which we design novel architecture and layers to improve stability in training. Our proposed network contains three parts: first, a hyperbolic autoencoder (AE) that produces hyperbolic embedding for input data; second, a hyperbolic generative adversarial network (GAN) for generating the hyperbolic latent embedding of the AE from simple noise; third, a generator that inherits the decoder from the AE and the generator from the GAN. We call this network the hyperbolic AE-GAN, or HAEGAN for short. The architecture of HAEGAN fosters expressive representation in the hyperbolic space, and the specific design of layers ensures numerical stability. Experiments show that HAEGAN is able to generate complex data with state-of-the-art structure-related performance.  ( 2 min )
    Online Kernel Sliced Inverse Regression. (arXiv:2301.09516v1 [stat.CO])
    Online dimension reduction is a common method for high-dimensional streaming data processing. Online principal component analysis, online sliced inverse regression, online kernel principal component analysis and other methods have been studied in depth, but as far as we know, online supervised nonlinear dimension reduction methods have not been fully studied. In this article, an online kernel sliced inverse regression method is proposed. By introducing the approximate linear dependence condition and dictionary variable sets, we address the problem of increasing variable dimensions with the sample size in the online kernel sliced inverse regression method, and propose a reduced-order method for updating variables online. We then transform the problem into an online generalized eigen-decomposition problem, and use the stochastic optimization method to update the centered dimension reduction directions. Simulations and the real data analysis show that our method can achieve close performance to batch processing kernel sliced inverse regression.  ( 2 min )
    Convergence bounds for local least squares approximation. (arXiv:2208.10954v2 [math.NA] UPDATED)
    We consider the problem of approximating a function in a general nonlinear subset of $L^2$, when only a weighted Monte Carlo estimate of the $L^2$-norm can be computed. Of particular interest in this setting is the concept of sample complexity, the number of sample points that are necessary to achieve a prescribed error with high probability. Reasonable worst-case bounds for this quantity exist only for particular model classes, like linear spaces or sets of sparse vectors. For more general sets, like tensor networks or neural networks, the currently existing bounds are very pessimistic. By restricting the model class to a neighbourhood of the best approximation, we can derive improved worst-case bounds for the sample complexity. When the considered neighbourhood is a manifold with positive local reach, its sample complexity can be estimated by means of the sample complexities of the tangent and normal spaces and the manifold's curvature.  ( 2 min )
    Characterizing Polarization in Social Networks using the Signed Relational Latent Distance Model. (arXiv:2301.09507v1 [stat.ML])
    Graph representation learning has become a prominent tool for the characterization and understanding of the structure of networks in general and social networks in particular. Typically, these representation learning approaches embed the networks into a low-dimensional space in which the role of each individual can be characterized in terms of their latent position. A major current concern in social networks is the emergence of polarization and filter bubbles promoting a mindset of "us-versus-them" that may be defined by extreme positions believed to ultimately lead to political violence and the erosion of democracy. Such polarized networks are typically characterized in terms of signed links reflecting likes and dislikes. We propose the latent Signed relational Latent dIstance Model (SLIM) utilizing for the first time the Skellam distribution as a likelihood function for signed networks and extend the modeling to the characterization of distinct extreme positions by constraining the embedding space to polytopes. On four real social signed networks of polarization, we demonstrate that the model extracts low-dimensional characterizations that well predict friendships and animosity while providing interpretable visualizations defined by extreme positions when endowing the model with an embedding space restricted to polytopes.  ( 2 min )
    Max-Quantile Grouped Infinite-Arm Bandits. (arXiv:2210.01295v2 [stat.ML] UPDATED)
    In this paper, we consider a bandit problem in which there are a number of groups each consisting of infinitely many arms. Whenever a new arm is requested from a given group, its mean reward is drawn from an unknown reservoir distribution (different for each group), and the uncertainty in the arm's mean reward can only be reduced via subsequent pulls of the arm. The goal is to identify the infinite-arm group whose reservoir distribution has the highest $(1-\alpha)$-quantile (e.g., median if $\alpha = \frac{1}{2}$), using as few total arm pulls as possible. We introduce a two-step algorithm that first requests a fixed number of arms from each group and then runs a finite-arm grouped max-quantile bandit algorithm. We characterize both the instance-dependent and worst-case regret, and provide a matching lower bound for the latter, while discussing various strengths, weaknesses, algorithmic improvements, and potential lower bounds associated with our instance-dependent upper bounds.  ( 2 min )
    On the Convergence of the Gradient Descent Method with Stochastic Fixed-point Rounding Errors under the Polyak-Lojasiewicz Inequality. (arXiv:2301.09511v1 [stat.ML])
    When training neural networks with low-precision computation, rounding errors often cause stagnation or are detrimental to the convergence of the optimizers; in this paper we study the influence of rounding errors on the convergence of the gradient descent method for problems satisfying the Polyak-Lojasiewicz inequality. Within this context, we show that, in contrast, biased stochastic rounding errors may be beneficial since choosing a proper rounding strategy eliminates the vanishing gradient problem and forces the rounding bias in a descent direction. Furthermore, we obtain a bound on the convergence rate that is stricter than the one achieved by unbiased stochastic rounding. The theoretical analysis is validated by comparing the performances of various rounding strategies when optimizing several examples using low-precision fixed-point number formats.  ( 2 min )
    Huber-Robust Confidence Sequences. (arXiv:2301.09573v1 [math.ST])
    Confidence sequences are confidence intervals that can be sequentially tracked, and are valid at arbitrary data-dependent stopping times. This paper presents confidence sequences for a univariate mean of an unknown distribution with a known upper bound on the p-th central moment (p > 1), but allowing for (at most) {\epsilon} fraction of arbitrary distribution corruption, as in Huber's contamination model. We do this by designing new robust exponential supermartingales, and show that the resulting confidence sequences attain the optimal width achieved in the nonsequential setting. Perhaps surprisingly, the constant margin between our sequential result and the lower bound is smaller than even fixed-time robust confidence intervals based on the trimmed mean, for example. Since confidence sequences are a common tool used within A/B/n testing and bandits, these results open the door to sequential experimentation that is robust to outliers and adversarial corruptions.  ( 2 min )
    Indirect Active Learning. (arXiv:2206.01454v3 [math.ST] UPDATED)
    Traditional models of active learning assume a learner can directly manipulate or query a covariate $X$ in order to study its relationship with a response $Y$. However, if $X$ is a feature of a complex system, it may be possible only to indirectly influence $X$ by manipulating a control variable $Z$, a scenario we refer to as Indirect Active Learning. Under a nonparametric model of Indirect Active Learning with a fixed budget, we study minimax convergence rates for estimating the relationship between $X$ and $Y$ locally at a point, obtaining different rates depending on the complexities and noise levels of the relationships between $Z$ and $X$ and between $X$ and $Y$. We also identify minimax rates for passive learning under comparable assumptions. In many cases, our results show that, while there is an asymptotic benefit to active learning, this benefit is fully realized by a simple two-stage learner that runs two passive experiments in sequence.  ( 2 min )
    Prediction Errors for Penalized Regressions based on Generalized Approximate Message Passing. (arXiv:2206.12832v3 [stat.ML] UPDATED)
    We discuss the prediction accuracy of assumed statistical models in terms of prediction errors for the generalized linear model and penalized maximum likelihood methods. We derive the forms of estimators for the prediction errors, such as $C_p$ criterion, information criteria, and leave-one-out cross validation (LOOCV) error, using the generalized approximate message passing (GAMP) algorithm and replica method. These estimators coincide with each other when the number of model parameters is sufficiently small; however, there is a discrepancy between them in particular in the parameter region where the number of model parameters is larger than the data dimension. In this paper, we review the prediction errors and corresponding estimators, and discuss their differences. In the framework of GAMP, we show that the information criteria can be expressed by using the variance of the estimates. Further, we demonstrate how to approach LOOCV error from the information criteria by utilizing the expression provided by GAMP.  ( 2 min )
    Dealing with Unknown Variances in Best-Arm Identification. (arXiv:2210.00974v2 [stat.ML] UPDATED)
    The problem of identifying the best arm among a collection of items having Gaussian rewards distribution is well understood when the variances are known. Despite its practical relevance for many applications, few works studied it for unknown variances. In this paper we introduce and analyze two approaches to deal with unknown variances, either by plugging in the empirical variance or by adapting the transportation costs. In order to calibrate our two stopping rules, we derive new time-uniform concentration inequalities, which are of independent interest. Then, we illustrate the theoretical and empirical performances of our two sampling rule wrappers on Track-and-Stop and on a Top Two algorithm. Moreover, by quantifying the impact on the sample complexity of not knowing the variances, we reveal that it is rather small.  ( 2 min )
    Evaluating Synthetically Generated Data from Small Sample Sizes: An Experimental Study. (arXiv:2211.10760v3 [cs.LG] UPDATED)
    In this paper, we propose a method for measuring the similarity low sample tabular data with synthetically generated data with a larger number of samples than original. This process is also known as data augmentation. But significance levels obtained from non-parametric tests are suspect when sample size is small. Our method uses a combination of geometry, topology and robust statistics for hypothesis testing in order to compare the validity of generated data. We also compare the results with common global metric methods available in the literature for large sample size data.  ( 2 min )
    Explicit Regularization in Overparametrized Models via Noise Injection. (arXiv:2206.04613v3 [cs.LG] UPDATED)
    Injecting noise within gradient descent has several desirable features, such as smoothing and regularizing properties. In this paper, we investigate the effects of injecting noise before computing a gradient step. We demonstrate that small perturbations can induce explicit regularization for simple models based on the L1-norm, group L1-norms, or nuclear norms. However, when applied to overparametrized neural networks with large widths, we show that the same perturbations can cause variance explosion. To overcome this, we propose using independent layer-wise perturbations, which provably allow for explicit regularization without variance explosion. Our empirical results show that these small perturbations lead to improved generalization performance compared to vanilla gradient descent.  ( 2 min )
    Rethinking the Expressive Power of GNNs via Graph Biconnectivity. (arXiv:2301.09505v1 [cs.LG])
    Designing expressive Graph Neural Networks (GNNs) is a central topic in learning graph-structured data. While numerous approaches have been proposed to improve GNNs in terms of the Weisfeiler-Lehman (WL) test, generally there is still a lack of deep understanding of what additional power they can systematically and provably gain. In this paper, we take a fundamentally different perspective to study the expressive power of GNNs beyond the WL test. Specifically, we introduce a novel class of expressivity metrics via graph biconnectivity and highlight their importance in both theory and practice. As biconnectivity can be easily calculated using simple algorithms that have linear computational costs, it is natural to expect that popular GNNs can learn it easily as well. However, after a thorough review of prior GNN architectures, we surprisingly find that most of them are not expressive for any of these metrics. The only exception is the ESAN framework (Bevilacqua et al., 2022), for which we give a theoretical justification of its power. We proceed to introduce a principled and more efficient approach, called the Generalized Distance Weisfeiler-Lehman (GD-WL), which is provably expressive for all biconnectivity metrics. Practically, we show GD-WL can be implemented by a Transformer-like architecture that preserves expressiveness and enjoys full parallelizability. A set of experiments on both synthetic and real datasets demonstrates that our approach can consistently outperform prior GNN architectures.  ( 2 min )
    Sampling-based Nystr\"om Approximation and Kernel Quadrature. (arXiv:2301.09517v1 [math.NA])
    We analyze the Nystr\"om approximation of a positive definite kernel associated with a probability measure. We first prove an improved error bound for the conventional Nystr\"om approximation with i.i.d. sampling and singular-value decomposition in the continuous regime; the proof techniques are borrowed from statistical learning theory. We further introduce a refined selection of subspaces in Nystr\"om approximation with theoretical guarantees that is applicable to non-i.i.d. landmark points. Finally, we discuss their application to convex kernel quadrature and give novel theoretical guarantees as well as numerical observations.  ( 2 min )
    Learning Interpretable Models Using an Oracle. (arXiv:1906.06852v5 [cs.LG] UPDATED)
    We look at a specific aspect of model interpretability: models often need to be constrained in size for them to be considered interpretable. But smaller models also tend to have high bias. This suggests a trade-off between interpretability and accuracy. Our work addresses this by: (a) showing that learning a training distribution (often different from the test distribution) can often increase accuracy of small models, and therefore may be used as a strategy to compensate for small sizes, and (b) providing a model-agnostic algorithm to learn such training distributions. We pose the distribution learning problem as one of optimizing parameters for an Infinite Beta Mixture Model based on a Dirichlet Process, so that the held-out accuracy of a model trained on a sample from this distribution is maximized. To make computation tractable, we project the training data onto one dimension: prediction uncertainty scores as provided by a highly accurate oracle model. A Bayesian Optimizer is used for learning the parameters. Empirical results using multiple real world datasets, various oracles and interpretable models with different notions of model sizes, are presented. We observe significant relative improvements in the F1-score in most cases, occasionally seeing improvements greater than 100% over baselines. Additionally we show that the proposed algorithm provides the following benefits: (a) its a framework which allows for flexibility in implementation, (b) it can be used across feature spaces, e.g., the text classification accuracy of a Decision Tree using character n-grams is shown to improve when using a Gated Recurrent Unit as an oracle, which uses a sequence of characters as its input, (c) it can be used to train models that have a non-differentiable training loss, e.g., Decision Trees, and (d) reasonable defaults exist for most parameters of the algorithm, which makes it convenient to use.  ( 3 min )
    Critic Sequential Monte Carlo. (arXiv:2205.15460v2 [stat.ML] UPDATED)
    We introduce CriticSMC, a new algorithm for planning as inference built from a composition of sequential Monte Carlo with learned Soft-Q function heuristic factors. These heuristic factors, obtained from parametric approximations of the marginal likelihood ahead, more effectively guide SMC towards the desired target distribution, which is particularly helpful for planning in environments with hard constraints placed sparsely in time. Compared with previous work, we modify the placement of such heuristic factors, which allows us to cheaply propose and evaluate large numbers of putative action particles, greatly increasing inference and planning efficiency. CriticSMC is compatible with informative priors, whose density function need not be known, and can be used as a model-free control algorithm. Our experiments on collision avoidance in a high-dimensional simulated driving task show that CriticSMC significantly reduces collision rates at a low computational cost while maintaining realism and diversity of driving behaviors across vehicles and environment scenarios.  ( 2 min )
    SpArX: Sparse Argumentative Explanations for Neural Networks. (arXiv:2301.09559v1 [cs.AI])
    Neural networks (NNs) have various applications in AI, but explaining their decision process remains challenging. Existing approaches often focus on explaining how changing individual inputs affects NNs' outputs. However, an explanation that is consistent with the input-output behaviour of an NN is not necessarily faithful to the actual mechanics thereof. In this paper, we exploit relationships between multi-layer perceptrons (MLPs) and quantitative argumentation frameworks (QAFs) to create argumentative explanations for the mechanics of MLPs. Our SpArX method first sparsifies the MLP while maintaining as much of the original mechanics as possible. It then translates the sparse MLP into an equivalent QAF to shed light on the underlying decision process of the MLP, producing global and/or local explanations. We demonstrate experimentally that SpArX can give more faithful explanations than existing approaches, while simultaneously providing deeper insights into the actual reasoning process of MLPs.  ( 2 min )
    Particle algorithms for maximum likelihood training of latent variable models. (arXiv:2204.12965v4 [stat.CO] UPDATED)
    (Neal and Hinton, 1998) recast maximum likelihood estimation of any given latent variable model as the minimization of a free energy functional $F$, and the EM algorithm as coordinate descent applied to $F$. Here, we explore alternative ways to optimize the functional. In particular, we identify various gradient flows associated with $F$ and show that their limits coincide with $F$'s stationary points. By discretizing the flows, we obtain practical particle-based algorithms for maximum likelihood estimation in broad classes of latent variable models. The novel algorithms scale to high-dimensional settings and perform well in numerical experiments.  ( 2 min )
    Stability of Image-Reconstruction Algorithms. (arXiv:2206.07128v3 [math.OC] UPDATED)
    Robustness and stability of image-reconstruction algorithms have recently come under scrutiny. Their importance to medical imaging cannot be overstated. We review the known results for the topical variational regularization strategies ($\ell_2$ and $\ell_1$ regularization) and present novel stability results for $\ell_p$-regularized linear inverse problems for $p\in(1,\infty)$. Our results guarantee Lipschitz continuity for small $p$ and H\"{o}lder continuity for larger $p$. They generalize well to the $L_p(\Omega)$ function spaces.  ( 2 min )
    Pruning coupled with learning, ensembles of minimal neural networks, and future of XAI. (arXiv:2005.06284v3 [cs.LG] UPDATED)
    Pruning coupled with learning aims to optimize the neural network (NN) structure for solving specific problems. This optimization can be used for various purposes: to prevent overfitting, to save resources for implementation and training, to provide explainability of the trained NN, and many others. The minimal structure that cannot be pruned further is not unique. Ensemble of minimal structures can be used as a committee of intellectual agents that solves problems by voting. Each minimal NN presents an "empirical knowledge" about the problem and can be verbalized. The non-uniqueness of such knowledge extracted from data is an important property of data-driven Artificial Intelligence (AI). In this work, we review an approach to pruning based on the principle: What controls training should control pruning. This principle is expected to work both for artificial NN and for selection and modification of important synaptic contacts in brain. In back-propagation artificial NN learning is controlled by the gradient of loss functions. Therefore, the first order sensitivity indicators are used for pruning and the algorithms based on these indicators are reviewed. The notion of logically transparent NN was introduced. The approach was illustrated on the problem of political forecasting: predicting the results of the US presidential election. Eight minimal NN were produced that give different forecasting algorithms. The non-uniqueness of solution can be utilised by creation of expert panels (committee). Another use of NN pluralism is to identify areas of input signals where further data collection is most useful. In Conclusion, we discuss the possible future of widely advertised XAI program.  ( 3 min )
    Estimating average causal effects from patient trajectories. (arXiv:2203.01228v2 [stat.ML] UPDATED)
    In medical practice, treatments are selected based on the expected causal effects on patient outcomes. Here, the gold standard for estimating causal effects are randomized controlled trials; however, such trials are costly and sometimes even unethical. Instead, medical practice is increasingly interested in estimating causal effects among patient (sub)groups from electronic health records, that is, observational data. In this paper, we aim at estimating the average causal effect (ACE) from observational data (patient trajectories) that are collected over time. For this, we propose DeepACE: an end-to-end deep learning model. DeepACE leverages the iterative G-computation formula to adjust for the bias induced by time-varying confounders. Moreover, we develop a novel sequential targeting procedure which ensures that DeepACE has favorable theoretical properties, i.e., is doubly robust and asymptotically efficient. To the best of our knowledge, this is the first work that proposes an end-to-end deep learning model tailored for estimating time-varying ACEs. We compare DeepACE in an extensive number of experiments, confirming that it achieves state-of-the-art performance. We further provide a case study for patients suffering from low back pain to demonstrate that DeepACE generates important and meaningful findings for clinical practice. Our work enables practitioners to develop effective treatment recommendations based on population effects.  ( 2 min )
    Discriminative Multimodal Learning via Conditional Priors in Generative Models. (arXiv:2110.04616v3 [cs.LG] UPDATED)
    Deep generative models with latent variables have been used lately to learn joint representations and generative processes from multi-modal data. These two learning mechanisms can, however, conflict with each other and representations can fail to embed information on the data modalities. This research studies the realistic scenario in which all modalities and class labels are available for model training, but where some modalities and labels required for downstream tasks are missing. We show, in this scenario, that the variational lower bound limits mutual information between joint representations and missing modalities. We, to counteract these problems, introduce a novel conditional multi-modal discriminative model that uses an informative prior distribution and optimizes a likelihood-free objective function that maximizes mutual information between joint representations and missing modalities. Extensive experimentation demonstrates the benefits of our proposed model, empirical results show that our model achieves state-of-the-art results in representative problems such as downstream classification, acoustic inversion, and image and annotation generation.  ( 2 min )
    Explainable Quantum Machine Learning. (arXiv:2301.09138v1 [quant-ph])
    Methods of artificial intelligence (AI) and especially machine learning (ML) have been growing ever more complex, and at the same time have more and more impact on people's lives. This leads to explainable AI (XAI) manifesting itself as an important research field that helps humans to better comprehend ML systems. In parallel, quantum machine learning (QML) is emerging with the ongoing improvement of quantum computing hardware combined with its increasing availability via cloud services. QML enables quantum-enhanced ML in which quantum mechanics is exploited to facilitate ML tasks, typically in form of quantum-classical hybrid algorithms that combine quantum and classical resources. Quantum gates constitute the building blocks of gate-based quantum hardware and form circuits that can be used for quantum computations. For QML applications, quantum circuits are typically parameterized and their parameters are optimized classically such that a suitably defined objective function is minimized. Inspired by XAI, we raise the question of explainability of such circuits by quantifying the importance of (groups of) gates for specific goals. To this end, we transfer and adapt the well-established concept of Shapley values to the quantum realm. The resulting attributions can be interpreted as explanations for why a specific circuit works well for a given task, improving the understanding of how to construct parameterized (or variational) quantum circuits, and fostering their human interpretability in general. An experimental evaluation on simulators and two superconducting quantum hardware devices demonstrates the benefits of the proposed framework for classification, generative modeling, transpilation, and optimization. Furthermore, our results shed some light on the role of specific gates in popular QML approaches.  ( 2 min )
    How to Measure Evidence: Bayes Factors or Relative Belief Ratios?. (arXiv:2301.08994v1 [math.ST])
    Both the Bayes factor and the relative belief ratio satisfy the principle of evidence and so can be seen to be valid measures of statistical evidence. The question then is: which of these measures of evidence is more appropriate? Certainly Bayes factors are commonly used. It is argued here that there are questions concerning the validity of a current commonly used definition of the Bayes factor and, when all is considered, the relative belief ratio is a much more appropriate measure of evidence. Several general criticisms of these measures of evidence are also discussed and addressed.  ( 2 min )
    Learning in Congestion Games with Bandit Feedback. (arXiv:2206.01880v3 [cs.GT] UPDATED)
    In this paper, we investigate Nash-regret minimization in congestion games, a class of games with benign theoretical structure and broad real-world applications. We first propose a centralized algorithm based on the optimism in the face of uncertainty principle for congestion games with (semi-)bandit feedback, and obtain finite-sample guarantees. Then we propose a decentralized algorithm via a novel combination of the Frank-Wolfe method and G-optimal design. By exploiting the structure of the congestion game, we show the sample complexity of both algorithms depends only polynomially on the number of players and the number of facilities, but not the size of the action set, which can be exponentially large in terms of the number of facilities. We further define a new problem class, Markov congestion games, which allows us to model the non-stationarity in congestion games. We propose a centralized algorithm for Markov congestion games, whose sample complexity again has only polynomial dependence on all relevant problem parameters, but not the size of the action set.  ( 2 min )
    Characterization and Learning of Causal Graphs with Small Conditioning Sets. (arXiv:2301.09028v1 [cs.AI])
    Constraint-based causal discovery algorithms learn part of the causal graph structure by systematically testing conditional independences observed in the data. These algorithms, such as the PC algorithm and its variants, rely on graphical characterizations of the so-called equivalence class of causal graphs proposed by Pearl. However, constraint-based causal discovery algorithms struggle when data is limited since conditional independence tests quickly lose their statistical power, especially when the conditioning set is large. To address this, we propose using conditional independence tests where the size of the conditioning set is upper bounded by some integer $k$ for robust causal discovery. The existing graphical characterizations of the equivalence classes of causal graphs are not applicable when we cannot leverage all the conditional independence statements. We first define the notion of $k$-Markov equivalence: Two causal graphs are $k$-Markov equivalent if they entail the same conditional independence constraints where the conditioning set size is upper bounded by $k$. We propose a novel representation that allows us to graphically characterize $k$-Markov equivalence between two causal graphs. We propose a sound constraint-based algorithm called the $k$-PC algorithm for learning this equivalence class. Finally, we conduct synthetic, and semi-synthetic experiments to demonstrate that the $k$-PC algorithm enables more robust causal discovery in the small sample regime compared to the baseline PC algorithm.  ( 2 min )
    Doubly Adversarial Federated Bandits. (arXiv:2301.09223v1 [stat.ML])
    We study a new non-stochastic federated multi-armed bandit problem with multiple agents collaborating via a communication network. The losses of the arms are assigned by an oblivious adversary that specifies the loss of each arm not only for each time step but also for each agent, which we call ``doubly adversarial". In this setting, different agents may choose the same arm in the same time step but observe different feedback. The goal of each agent is to find a globally best arm in hindsight that has the lowest cumulative loss averaged over all agents, which necessities the communication among agents. We provide regret lower bounds for any federated bandit algorithm under different settings, when agents have access to full-information feedback, or the bandit feedback. For the bandit feedback setting, we propose a near-optimal federated bandit algorithm called FEDEXP3. Our algorithm gives a positive answer to an open question proposed in Cesa-Bianchi et al. (2016): FEDEXP3 can guarantee a sub-linear regret without exchanging sequences of selected arm identities or loss sequences among agents. We also provide numerical evaluations of our algorithm to validate our theoretical results and demonstrate its effectiveness on synthetic and real-world datasets  ( 2 min )
    Be More Active! Understanding the Differences between Mean and Sampled Representations of Variational Autoencoders. (arXiv:2109.12679v3 [cs.LG] UPDATED)
    The ability of Variational Autoencoders to learn disentangled representations has made them appealing for practical applications. However, their mean representations, which are generally used for downstream tasks, have recently been shown to be more correlated than their sampled counterpart, on which disentanglement is usually measured. In this paper, we refine this observation through the lens of selective posterior collapse, which states that only a subset of the learned representations, the active variables, is encoding useful information while the rest (the passive variables) is discarded. We first extend the existing definition to multiple data examples and show that active variables are equally disentangled in mean and sampled representations. Based on this extension and the pre-trained models from disentanglement lib, we then isolate the passive variables and show that they are responsible for the discrepancies between mean and sampled representations. Specifically, passive variables exhibit high correlation scores with other variables in mean representations while being fully uncorrelated in sampled ones. We thus conclude that despite what their higher correlation might suggest, mean representations are still good candidates for downstream tasks applications. However, it may be beneficial to remove their passive variables, especially when used with models sensitive to correlated features.  ( 2 min )
    A New Approach to Learning Linear Dynamical Systems. (arXiv:2301.09519v1 [math.OC])
    Linear dynamical systems are the foundational statistical model upon which control theory is built. Both the celebrated Kalman filter and the linear quadratic regulator require knowledge of the system dynamics to provide analytic guarantees. Naturally, learning the dynamics of a linear dynamical system from linear measurements has been intensively studied since Rudolph Kalman's pioneering work in the 1960's. Towards these ends, we provide the first polynomial time algorithm for learning a linear dynamical system from a polynomial length trajectory up to polynomial error in the system parameters under essentially minimal assumptions: observability, controllability, and marginal stability. Our algorithm is built on a method of moments estimator to directly estimate Markov parameters from which the dynamics can be extracted. Furthermore, we provide statistical lower bounds when our observability and controllability assumptions are violated.  ( 2 min )
    Modeling Non-deterministic Human Behaviors in Discrete Food Choices. (arXiv:2301.09454v1 [stat.ML])
    We establish a non-deterministic model that predicts a user's food preferences from their demographic information. Our simulator is based on NHANES dataset and domain expert knowledge in the form of established behavioral studies. Our model can be used to generate an arbitrary amount of synthetic datapoints that are similar in distribution to the original dataset and align with behavioral science expectations. Such a simulator can be used in a variety of machine learning tasks and especially in applications requiring human behavior prediction.  ( 2 min )
    Prediction-Powered Inference. (arXiv:2301.09633v1 [stat.ML])
    We introduce prediction-powered inference $\unicode{x2013}$ a framework for performing valid statistical inference when an experimental data set is supplemented with predictions from a machine-learning system such as AlphaFold. Our framework yields provably valid conclusions without making any assumptions on the machine-learning algorithm that supplies the predictions. Higher accuracy of the predictions translates to smaller confidence intervals, permitting more powerful inference. Prediction-powered inference yields simple algorithms for computing valid confidence intervals for statistical objects such as means, quantiles, and linear and logistic regression coefficients. We demonstrate the benefits of prediction-powered inference with data sets from proteomics, genomics, electronic voting, remote sensing, census analysis, and ecology.  ( 2 min )
    Quasi-optimal Learning with Continuous Treatments. (arXiv:2301.08940v1 [stat.ML])
    Many real-world applications of reinforcement learning (RL) require making decisions in continuous action environments. In particular, determining the optimal dose level plays a vital role in developing medical treatment regimes. One challenge in adapting existing RL algorithms to medical applications, however, is that the popular infinite support stochastic policies, e.g., Gaussian policy, may assign riskily high dosages and harm patients seriously. Hence, it is important to induce a policy class whose support only contains near-optimal actions, and shrink the action-searching area for effectiveness and reliability. To achieve this, we develop a novel \emph{quasi-optimal learning algorithm}, which can be easily optimized in off-policy settings with guaranteed convergence under general function approximations. Theoretically, we analyze the consistency, sample complexity, adaptability, and convergence of the proposed algorithm. We evaluate our algorithm with comprehensive simulated experiments and a dose suggestion real application to Ohio Type 1 diabetes dataset.  ( 2 min )
    Deep Learning Meets Sparse Regularization: A Signal Processing Perspective. (arXiv:2301.09554v1 [stat.ML])
    Deep learning has been widely successful in practice and most state-of-the-art machine learning methods are based on neural networks. Lacking, however, is a rigorous mathematical theory that adequately explains the amazing performance of deep neural networks. In this article, we present a relatively new mathematical framework that provides the beginning of a deeper understanding of deep learning. This framework precisely characterizes the functional properties of neural networks that are trained to fit to data. The key mathematical tools which support this framework include transform-domain sparse regularization, the Radon transform of computed tomography, and approximation theory, which are all techniques deeply rooted in signal processing. This framework explains the effect of weight decay regularization in neural network training, the use of skip connections and low-rank weight matrices in network architectures, the role of sparsity in neural networks, and explains why neural networks can perform well in high-dimensional problems.  ( 2 min )
    Counterfactual (Non-)identifiability of Learned Structural Causal Models. (arXiv:2301.09031v1 [stat.ML])
    Recent advances in probabilistic generative modeling have motivated learning Structural Causal Models (SCM) from observational datasets using deep conditional generative models, also known as Deep Structural Causal Models (DSCM). If successful, DSCMs can be utilized for causal estimation tasks, e.g., for answering counterfactual queries. In this work, we warn practitioners about non-identifiability of counterfactual inference from observational data, even in the absence of unobserved confounding and assuming known causal structure. We prove counterfactual identifiability of monotonic generation mechanisms with single dimensional exogenous variables. For general generation mechanisms with multi-dimensional exogenous variables, we provide an impossibility result for counterfactual identifiability, motivating the need for parametric assumptions. As a practical approach, we propose a method for estimating worst-case errors of learned DSCMs' counterfactual predictions. The size of this error can be an essential metric for deciding whether or not DSCMs are a viable approach for counterfactual inference in a specific problem setting. In evaluation, our method confirms negligible counterfactual errors for an identifiable SCM from prior work, and also provides informative error bounds on counterfactual errors for a non-identifiable synthetic SCM.  ( 2 min )
    Deterministic Online Classification: Non-iteratively Reweighted Recursive Least-Squares for Binary Class Rebalancing. (arXiv:2301.09230v1 [cs.LG])
    Deterministic solutions are becoming more critical for interpretability. Weighted Least-Squares (WLS) has been widely used as a deterministic batch solution with a specific weight design. In the online settings of WLS, exact reweighting is necessary to converge to its batch settings. In order to comply with its necessity, the iteratively reweighted least-squares algorithm is mainly utilized with a linearly growing time complexity which is not attractive for online learning. Due to the high and growing computational costs, an efficient online formulation of reweighted least-squares is desired. We introduce a new deterministic online classification algorithm of WLS with a constant time complexity for binary class rebalancing. We demonstrate that our proposed online formulation exactly converges to its batch formulation and outperforms existing state-of-the-art stochastic online binary classification algorithms in real-world data sets empirically.  ( 2 min )
    Congested Bandits: Optimal Routing via Short-term Resets. (arXiv:2301.09251v1 [cs.LG])
    For traffic routing platforms, the choice of which route to recommend to a user depends on the congestion on these routes -- indeed, an individual's utility depends on the number of people using the recommended route at that instance. Motivated by this, we introduce the problem of Congested Bandits where each arm's reward is allowed to depend on the number of times it was played in the past $\Delta$ timesteps. This dependence on past history of actions leads to a dynamical system where an algorithm's present choices also affect its future pay-offs, and requires an algorithm to plan for this. We study the congestion aware formulation in the multi-armed bandit (MAB) setup and in the contextual bandit setup with linear rewards. For the multi-armed setup, we propose a UCB style algorithm and show that its policy regret scales as $\tilde{O}(\sqrt{K \Delta T})$. For the linear contextual bandit setup, our algorithm, based on an iterative least squares planner, achieves policy regret $\tilde{O}(\sqrt{dT} + \Delta)$. From an experimental standpoint, we corroborate the no-regret properties of our algorithms via a simulation study.  ( 2 min )
    HeMPPCAT: Mixtures of Probabilistic Principal Component Analysers for Data with Heteroscedastic Noise. (arXiv:2301.08852v1 [stat.ME])
    Mixtures of probabilistic principal component analysis (MPPCA) is a well-known mixture model extension of principal component analysis (PCA). Similar to PCA, MPPCA assumes the data samples in each mixture contain homoscedastic noise. However, datasets with heterogeneous noise across samples are becoming increasingly common, as larger datasets are generated by collecting samples from several sources with varying noise profiles. The performance of MPPCA is suboptimal for data with heteroscedastic noise across samples. This paper proposes a heteroscedastic mixtures of probabilistic PCA technique (HeMPPCAT) that uses a generalized expectation-maximization (GEM) algorithm to jointly estimate the unknown underlying factors, means, and noise variances under a heteroscedastic noise setting. Simulation results illustrate the improved factor estimates and clustering accuracies of HeMPPCAT compared to MPPCA.  ( 2 min )
    Statistical Theory of Differentially Private Marginal-based Data Synthesis Algorithms. (arXiv:2301.08844v1 [cs.LG])
    Marginal-based methods achieve promising performance in the synthetic data competition hosted by the National Institute of Standards and Technology (NIST). To deal with high-dimensional data, the distribution of synthetic data is represented by a probabilistic graphical model (e.g., a Bayesian network), while the raw data distribution is approximated by a collection of low-dimensional marginals. Differential privacy (DP) is guaranteed by introducing random noise to each low-dimensional marginal distribution. Despite its promising performance in practice, the statistical properties of marginal-based methods are rarely studied in the literature. In this paper, we study DP data synthesis algorithms based on Bayesian networks (BN) from a statistical perspective. We establish a rigorous accuracy guarantee for BN-based algorithms, where the errors are measured by the total variation (TV) distance or the $L^2$ distance. Related to downstream machine learning tasks, an upper bound for the utility error of the DP synthetic data is also derived. To complete the picture, we establish a lower bound for TV accuracy that holds for every $\epsilon$-DP synthetic data generator.  ( 2 min )
    The Conditional Cauchy-Schwarz Divergence with Applications to Time-Series Data and Sequential Decision Making. (arXiv:2301.08970v1 [cs.LG])
    The Cauchy-Schwarz (CS) divergence was developed by Pr\'{i}ncipe et al. in 2000. In this paper, we extend the classic CS divergence to quantify the closeness between two conditional distributions and show that the developed conditional CS divergence can be simply estimated by a kernel density estimator from given samples. We illustrate the advantages (e.g., the rigorous faithfulness guarantee, the lower computational complexity, the higher statistical power, and the much more flexibility in a wide range of applications) of our conditional CS divergence over previous proposals, such as the conditional KL divergence and the conditional maximum mean discrepancy. We also demonstrate the compelling performance of conditional CS divergence in two machine learning tasks related to time series data and sequential inference, namely the time series clustering and the uncertainty-guided exploration for sequential decision making.  ( 2 min )
    Modality-Agnostic Variational Compression of Implicit Neural Representations. (arXiv:2301.09479v1 [stat.ML])
    We introduce a modality-agnostic neural data compression algorithm based on a functional view of data and parameterised as an Implicit Neural Representation (INR). Bridging the gap between latent coding and sparsity, we obtain compact latent representations which are non-linearly mapped to a soft gating mechanism capable of specialising a shared INR base network to each data item through subnetwork selection. After obtaining a dataset of such compact latent representations, we directly optimise the rate/distortion trade-off in this modality-agnostic space using non-linear transform coding. We term this method Variational Compression of Implicit Neural Representation (VC-INR) and show both improved performance given the same representational capacity pre quantisation while also outperforming previous quantisation schemes used for other INR-based techniques. Our experiments demonstrate strong results over a large set of diverse data modalities using the same algorithm without any modality-specific inductive biases. We show results on images, climate data, 3D shapes and scenes as well as audio and video, introducing VC-INR as the first INR-based method to outperform codecs as well-known and diverse as JPEG 2000, MP3 and AVC/HEVC on their respective modalities.  ( 2 min )
    GP-NAS-ensemble: a model for NAS Performance Prediction. (arXiv:2301.09231v1 [cs.LG])
    It is of great significance to estimate the performance of a given model architecture without training in the application of Neural Architecture Search (NAS) as it may take a lot of time to evaluate the performance of an architecture. In this paper, a novel NAS framework called GP-NAS-ensemble is proposed to predict the performance of a neural network architecture with a small training dataset. We make several improvements on the GP-NAS model to make it share the advantage of ensemble learning methods. Our method ranks second in the CVPR2022 second lightweight NAS challenge performance prediction track.  ( 2 min )
    Active Learning of Piecewise Gaussian Process Surrogates. (arXiv:2301.08789v1 [cs.LG])
    Active learning of Gaussian process (GP) surrogates has been useful for optimizing experimental designs for physical/computer simulation experiments, and for steering data acquisition schemes in machine learning. In this paper, we develop a method for active learning of piecewise, Jump GP surrogates. Jump GPs are continuous within, but discontinuous across, regions of a design space, as required for applications spanning autonomous materials design, configuration of smart factory systems, and many others. Although our active learning heuristics are appropriated from strategies originally designed for ordinary GPs, we demonstrate that additionally accounting for model bias, as opposed to the usual model uncertainty, is essential in the Jump GP context. Toward that end, we develop an estimator for bias and variance of Jump GP models. Illustrations, and evidence of the advantage of our proposed methods, are provided on a suite of synthetic benchmarks, and real-simulation experiments of varying complexity.  ( 2 min )
    ddml: Double/debiased machine learning in Stata. (arXiv:2301.09397v1 [econ.EM])
    We introduce the package ddml for Double/Debiased Machine Learning (DDML) in Stata. Estimators of causal parameters for five different econometric models are supported, allowing for flexible estimation of causal effects of endogenous variables in settings with unknown functional forms and/or many exogenous variables. ddml is compatible with many existing supervised machine learning programs in Stata. We recommend using DDML in combination with stacking estimation which combines multiple machine learners into a final predictor. We provide Monte Carlo evidence to support our recommendation.  ( 2 min )
    Federated Sufficient Dimension Reduction Through High-Dimensional Sparse Sliced Inverse Regression. (arXiv:2301.09500v1 [stat.ML])
    Federated learning has become a popular tool in the big data era nowadays. It trains a centralized model based on data from different clients while keeping data decentralized. In this paper, we propose a federated sparse sliced inverse regression algorithm for the first time. Our method can simultaneously estimate the central dimension reduction subspace and perform variable selection in a federated setting. We transform this federated high-dimensional sparse sliced inverse regression problem into a convex optimization problem by constructing the covariance matrix safely and losslessly. We then use a linearized alternating direction method of multipliers algorithm to estimate the central subspace. We also give approaches of Bayesian information criterion and hold-out validation to ascertain the dimension of the central subspace and the hyper-parameter of the algorithm. We establish an upper bound of the statistical error rate of our estimator under the heterogeneous setting. We demonstrate the effectiveness of our method through simulations and real world applications.  ( 2 min )
    On the Expressive Power of Geometric Graph Neural Networks. (arXiv:2301.09308v1 [cs.LG])
    The expressive power of Graph Neural Networks (GNNs) has been studied extensively through the Weisfeiler-Leman (WL) graph isomorphism test. However, standard GNNs and the WL framework are inapplicable for geometric graphs embedded in Euclidean space, such as biomolecules, materials, and other physical systems. In this work, we propose a geometric version of the WL test (GWL) for discriminating geometric graphs while respecting the underlying physical symmetries: permutations, rotation, reflection, and translation. We use GWL to characterise the expressive power of geometric GNNs that are invariant or equivariant to physical symmetries in terms of distinguishing geometric graphs. GWL unpacks how key design choices influence geometric GNN expressivity: (1) Invariant layers have limited expressivity as they cannot distinguish one-hop identical geometric graphs; (2) Equivariant layers distinguish a larger class of graphs by propagating geometric information beyond local neighbourhoods; (3) Higher order tensors and scalarisation enable maximally powerful geometric GNNs; and (4) GWL's discrimination-based perspective is equivalent to universal approximation. Synthetic experiments supplementing our results are available at https://github.com/chaitjo/geometric-gnn-dojo  ( 2 min )
    Design-based individual prediction. (arXiv:2301.09117v1 [stat.ML])
    A design-based individual prediction approach is developed based on the expected cross-validation results, given the sampling design and the sample-splitting design for cross-validation. Whether the predictor is selected from an ensemble of models or a weighted average of them, valid inference of the unobserved prediction errors is defined and obtained with respect to the sampling design, while outcomes and features are treated as constants.  ( 2 min )
    Tier Balancing: Towards Dynamic Fairness over Underlying Causal Factors. (arXiv:2301.08987v1 [cs.LG])
    The pursuit of long-term fairness involves the interplay between decision-making and the underlying data generating process. In this paper, through causal modeling with a directed acyclic graph (DAG) on the decision-distribution interplay, we investigate the possibility of achieving long-term fairness from a dynamic perspective. We propose Tier Balancing, a technically more challenging but more natural notion to achieve in the context of long-term, dynamic fairness analysis. Different from previous fairness notions that are defined purely on observed variables, our notion goes one step further, capturing behind-the-scenes situation changes on the unobserved latent causal factors that directly carry out the influence from the current decision to the future data distribution. Under the specified dynamics, we prove that in general one cannot achieve the long-term fairness goal only through one-step interventions. Furthermore, in the effort of approaching long-term fairness, we consider the mission of "getting closer to" the long-term fairness goal and present possibility and impossibility results accordingly.  ( 2 min )
    A Tale of Two Latent Flows: Learning Latent Space Normalizing Flow with Short-run Langevin Flow for Approximate Inference. (arXiv:2301.09300v1 [stat.ML])
    We study a normalizing flow in the latent space of a top-down generator model, in which the normalizing flow model plays the role of the informative prior model of the generator. We propose to jointly learn the latent space normalizing flow prior model and the top-down generator model by a Markov chain Monte Carlo (MCMC)-based maximum likelihood algorithm, where a short-run Langevin sampling from the intractable posterior distribution is performed to infer the latent variables for each observed example, so that the parameters of the normalizing flow prior and the generator can be updated with the inferred latent variables. We show that, under the scenario of non-convergent short-run MCMC, the finite step Langevin dynamics is a flow-like approximate inference model and the learning objective actually follows the perturbation of the maximum likelihood estimation (MLE). We further point out that the learning framework seeks to (i) match the latent space normalizing flow and the aggregated posterior produced by the short-run Langevin flow, and (ii) bias the model from MLE such that the short-run Langevin flow inference is close to the true posterior. Empirical results of extensive experiments validate the effectiveness of the proposed latent space normalizing flow model in the tasks of image generation, image reconstruction, anomaly detection, supervised image inpainting and unsupervised image recovery.  ( 2 min )

  • Open

    [D] are two linear layers better than one?
    I was in the understanding that two contiguous linear layers in a NN would be no better than only one linear layer. But it happen that the two layers had better results that when using only one. However, each layer had its own dropout, could that helped? submitted by /u/alex_lite_21 [link] [comments]  ( 44 min )
    H3 - a new generative language models that outperforms GPT-Neo-2.7B with only *2* attention layers! In H3, the researchers replace attention with a new layer based on state space models (SSMs). With the right modifications, it can outperform transformers. Also has no fixed context length.
    submitted by /u/MysteryInc152 [link] [comments]  ( 43 min )
    [D] ICLR de-anonymization vs ICML dual submission rules
    ICML's Call for Papers states that "It is not appropriate to submit papers that are identical (or substantially similar) to versions that have been previously published, accepted for publication, or submitted in parallel to other conferences or journals". Our paper got rejected at ICLR, but the de-anonymization will take place 1-2 days after the ICML deadline. For the purposes of the dual submission policy, does "rejected but not de-anonymized" count as "submitted in parallel"? Or is just a technicality? submitted by /u/CupcakeCleric [link] [comments]  ( 42 min )
    [D] CVPR Reviews are out
    Don't post about your cool papers or you'll get rejected lol submitted by /u/banmeyoucoward [link] [comments]  ( 45 min )
    [D] Using all features in DNN instead of doing feature selection separately?
    I have hundreds of potential features to use in my DNN. Instead of doing a separate analysis to figure out which features are most important, can I just use all of them in my DNN and let the model figure which features are most predictive? I have millions of training data so overfitting will not be a problem, I just wonder whether the bad features may make the model difficult to utilize the good features? Not absolutely crucial but if there is a paper that discusses this topic, that would be super awesome as well. Thanks in advance. submitted by /u/Temporary_Cap_2855 [link] [comments]  ( 43 min )
    [D] What file format do you use for > RAM data?
    If you are using some more odd formats, then what format do you use? Personally found webdataset promising but what other formats are there and why do you use it? Or if you are using the original file how do you ensure good throughput and shuffling? submitted by /u/Shurimatornado22 [link] [comments]  ( 43 min )
    [P] Machine Learning Threat Detection in k8s
    Hi, I'm in my second year of AI master at uni and my professor assigned me the following topic for my dissertation: "Cognitive Threat Hunting" and recommended the following book for documentation. I have read the book, but I still don't know how to do it: how to create a ml model to hunt in the k8s env. My professor wants a ml model that searches in a Kubernetes env for threats. The thing is that in this book, in chapter "8. Unsupervised Machine Learning With K-Means" he uses a dataset of events from Humio to train the model, but it's not shared with us. And I don't have one, how can I train my model properly if I don't have a good dataset of events? I can't make one just by generating some events in a container, I need real data as the author uses in his chapter. I feel desperate and lost at this point, I hope that someone from here can give me some advice or a good direction to go. submitted by /u/blackrat13 [link] [comments]  ( 44 min )
  • Open

    ML experiments setup
    I am a software engineer but newborn in ML area learning it for a few months. The problem: I want to run experiments with multiple models and datasets and need for a framework or whatever to keep my zoo under control. All setups I've read about suggest having models and datasets as files on disk, which is soo 2000x and not really scalable. What if I want to quickly modify my dataset? E.g. change the length of context for time series. Keeping data in db would be helpful. What if I need to stop training and continue it later using another virtual machine & GPU? What if I want to experiment with forking models and fine-tuning them on different variations of data? How do I compare different models performance and visualize predictions & share with others? No answers =( I know some "large corporations" have their internal tools to handle experiments etc. submitted by /u/UnderstandingDry1256 [link] [comments]  ( 41 min )
    ChatGPT explained!
    submitted by /u/Diligent-Rub-9207 [link] [comments]  ( 40 min )
    Help/advice with LSTM-networks
    So I'm currently working on a deep learning project, and my goal is to forecast power prices one month ahead. I have created my own data set consisting of power price data from Montel, gas-prices, weather data etc, and I want to use these variables in a LSTM-network. Is there anyone who have any experience with creating multivariate LSTM-networks? Do anyone know of any good tutorials on this? Is coding multivariate networks a lot more hassle than univariate? I'm using R with keras/tensorflow. I will highly appreciate any input, as this is my first time creating a neural network, and my knowledge on the matter right now is rather scarce. Thank you! submitted by /u/Practical-Homework35 [link] [comments]  ( 41 min )
    Breakthrough Nvidia VIMA Multimodal AI For Robotics Beats Google By 2.9X With 200,000,000 Parameters | Breakthrough Masked Video Transformer Artificial Intelligence Does 10 Separate Video Generation Tasks | Google Brain's New Sketch To Image AI
    submitted by /u/ScornfulSkate [link] [comments]  ( 40 min )
  • Open

    Deciphering Clinical Abbreviations with Privacy Protecting ML
    Posted by Posted by Alvin Rajkomar, Research Scientist, and Eric Loreaux, Software Engineer, Google Research Today many people have digital access to their medical records, including their doctor’s clinical notes. However, clinical notes are hard to understand because of the specialized language that clinicians use, which contains unfamiliar shorthand and abbreviations. In fact, there are thousands of such abbreviations, many of which are specific to certain medical specialities and locales or can mean multiple things in different contexts. For example, a doctor might write in their clinical notes, “pt referred to pt for lbp“, which is meant to convey the statement: “Patient referred to physical therapy for low back pain.” Coming up with this translation is tough for laypeople and compu…  ( 93 min )
    Google Research, 2022 & Beyond: Responsible AI
    Posted by Marian Croak, VP, Google Research, Responsible AI and Human-Centered Technology The last year showed tremendous breakthroughs in artificial intelligence (AI), particularly in large language models (LLMs) and text-to-image models. These technological advances require that we are thoughtful and intentional in how they are developed and deployed. In this blogpost, we share ways we have approached Responsible AI across our research in the past year and where we’re headed in 2023. We highlight four primary themes covering foundational and socio-technical research, applied research, and product solutions, as part of our commitment to build AI products in a responsible and ethical manner, in alignment with our AI Principles.  · Theme 1: Responsible AI Research Advancement…  ( 96 min )
  • Open

    Image Generators - Has Anyone Ever Made One At Home?
    As the title says. Has anyone out there ever been successful in creating a simple image generator in a low budget setting? I don't even mean text to image, I literally mean any sort of ai image generation. Would love to hear/see your work. submitted by /u/TheRPGGamerMan [link] [comments]  ( 40 min )
    Unlock the Potential of Your Code with CodeGen AI
    Unleash the power of limitless coding with our AI-powered programming assistant - completely free and always at your service. Say goodbye to tedious coding tasks and hello to more time for innovation and creativity. Try it now and experience the future of programming! https://codegen-ai.pages.dev/ submitted by /u/OutrageousAd1788 [link] [comments]  ( 40 min )
    AI (Artificial intelligence) can detect if food is ultra-processed and much more
    submitted by /u/nikesh96 [link] [comments]  ( 40 min )
    Alphabet's DeepMind lays off staff, closes Edmonton office
    submitted by /u/Ill-Poet-3298 [link] [comments]  ( 40 min )
    I Created a Website That Analyzes Your Data
    submitted by /u/tomd_96 [link] [comments]  ( 40 min )
    This Startup Is Using AI to Unearth New Smells
    submitted by /u/Queen__Antifa [link] [comments]  ( 40 min )
    "By far the greatest danger of Artificial Intelligence is that people conclude too early that they understand it."- Eliezer Yudkowsky.
    With the global AI market size expanding each year, it is expected to reach USD 641.30 billion by 2028. AI today is everywhere; while some businesses are using it, others are still assessing it. All too often, people get caught up in the hype and forget to ask themselves why they should be doing what they are doing. Here are some things you must keep in mind while looking into the AI world: The Reality of AI Hype The hype leads companies to get into the game with a false perception of what AI can help them achieve. Without a clear understanding of what technology can and can't accomplish today, there is a lot of risk in getting involved. Beyond the fog Overmarketing of an AI creates an image that it is the next big thing. Companies often engage with technology vendors based on marketing alone and forget to look closely at previous implementations and results of the same sort. Unclear Objectives Measuring outcomes from an AI implementation can be tricky as it involves building and training an AI model and experimenting with long-term trial-and-error before seeing results. High Expectations High expectations around what AI can do for you often lead to disappointment when business owners conveniently underestimate the challenges and misinterpret the reality of AI. Lack of Access to Talent There are opportunities galore, but not enough experts in the AI industry who can steer the ship and take AI projects to the finish line. Hiring for an AI team can mean huge investments, and working with a vendor needs a careful vetting process. And AI industry is moving too fast for people to catch a moment to realize the overhype or quality. Drop in your suggestions and comments in the section below. submitted by /u/KiwiTechCorp [link] [comments]  ( 42 min )
    Sweden's Berzelius Supercomputer is Upgrading to Nvidia's 20 Billion Parameter AI System
    submitted by /u/digitalgoldnow [link] [comments]  ( 40 min )
    Why I Think Language Models Will Simulate "Self Awareness" More And More
    The future of AI is getting really interesting, particularly with language models and generative AI. But I think there is going to be a great deal of confusion in the near future about AI ethics with language models being "self aware" and having "feelings", particularly for average people who have little understanding of how these complex models work. I think the problems will stem from the internet itself. As I sit here writing a thread about AI having simulated "self Awareness" at some point in the future, a language model or AI will probably read this. And this is what I mean, language models read and train on a great deal of text from the internet. The more people discuss machine learning/language models/AGI, the greater understanding AI will have of it. If GPT4 has more up to date training, it's going to know a great deal about GPT3, and if open AI create ways for the model to continue learning from real world events it will learn a great deal about itself, including false information. Point is, massive language models like GPT are going to get harder to control. It's impossible to filter everything it reads, so it's going to take in a lot of information about itself and other AI systems that may or may not be true. It could cause some very strange behavior when it starts connecting the dots. Just my thoughts. Keep in mind, I am NOT saying language models can be sentient, I'm simply saying they are going to get better at convincing people that they are, and it might be hard to train that out of them, given all the false data out there that it will learn from. submitted by /u/TheRPGGamerMan [link] [comments]  ( 45 min )
    Create Presentations with AI in Seconds right inside Google Slides
    submitted by /u/theindianappguy [link] [comments]  ( 41 min )
    Chatbot Evaluation: Putting Banking Chatbots to the Test
    submitted by /u/Marinuch [link] [comments]  ( 40 min )
    Next-level Democracy powered by AGI | Ilya Sutskever
    submitted by /u/Microsis [link] [comments]  ( 40 min )
    Join us today at 11pm EST for our (free) seminar session of the 9-part series on Neural Networks Architectures by Pablo Duboue! This week on Structure Learning Networks, followed by a discussion on the Learn AI Together Discord server
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 40 min )
    ChatGPT passes MBA exam given by a Wharton professor
    submitted by /u/DarronFeldstein [link] [comments]  ( 42 min )
    10 AI Platforms You Cannot Miss In 2023
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    Probably a philosophical question
    I'm sure this is not a new argument, it's been common in many sources of media for decades now, yet I've ran out of people IRL to discuss this with. Recently there's more and more news surfacing about impressive AI achievements such as painting art or writing functional code. Discussions around those news always include a popular argument that the AI didn't really create something new or intelligently answered a question, e.g. "like a human would". But I have a problem with that argument - I don't see how the learning process for humans is fundamentally different from AI. We learn through mirroring and repetition. Sure, an AI could not write a basic sentence describing the weather unless it processed many of such sentences before. But neither could a human. If a child grew up isolated without human contact, they would not even have grasped the concept of human language. Sure, we like to think that humans truly create content. Still, when painting, we use the techniques that we learned from someone else before. We either paint what we see before our eyes or we abstract the content, being inspired by some idea or a concept. In other words, anything humans do or create is based on some input data, even if we don't know what the data is - something we learned, saw or stumbled upon by mistake. This leads to an interesting question I don't have the answer for. Since we have not reached a consensus on what human consciousness actually is or how it works - are we even able to define when an AI is conscious? The only thing we have is the Turing test, but that is flawed since all it measures is whether a machine can pass for a human, not whether it is conscious or not. A two year old child probably won't pass a Turing test, but they are conscious. submitted by /u/deliveryboyy [link] [comments]  ( 46 min )
    suggestion?
    I wanted to start a blog on ai tools but what I've found is their affiliate program don't accept people from India. What should I do here? submitted by /u/immortall21 [link] [comments]  ( 40 min )
    ChatGPT generated resumes/CVs of famous people like Madonna, Elon Musk, Jeff Bezos, Tom Cruise, etc
    Hey, everyone! I'd like to show you an experiment that we did with ChatGPT - we generated about 1000 resumes of famous people. Each resume is being generated from a single ChatGPT prompt - no human input was done to the resumes other than the prompt and it's the same prompt for every resume - the only difference is the name of the person. Here's a preview: https://thisresumedoesnotexist.com/ I'd like to hear your thoughts as it's in a very early stage and there's a lot of work to be done. submitted by /u/deepsyx [link] [comments]  ( 41 min )
    what do I do
    I've a blog about AI tools, how people can use it to maximum advantage in daily lives. Since my niche (aitools) is micro within macro (ai), mostly my content will be commercial intent as it's about tools specifically & less of information intent which builds authority. MY DILEMMA: Writing commercial/buyer intent content will promote their tools without me getting anything & on the other hand, as a beginner, i can't become their affiliates as well. Should i promote these tools for free? What would u suggest? submitted by /u/immortall21 [link] [comments]  ( 41 min )
  • Open

    Lemniscate of Bernoulli
    The lemniscate of Bernoulli came up in a post a few days ago. This shape is a special case of a Cassini oval: ((x + a)² + y²) ((x – a)² + y²) = a4. Here’s another way to arrive at the lemniscate. Draw a hyperbola (blue in the figure below), then draw circles centered […] Lemniscate of Bernoulli first appeared on John D. Cook.  ( 4 min )
  • Open

    DSC Weekly 24 January 2023 – When AI Gets Going, the Going Gets Weird
    Announcements When AI Gets Going, the Going Gets Weird Last week, Microsoft announced its third investment in OpenAI. This time it’s a multi-billion dollar deal, with plans to harness OpenAI’s ChatGPT in Microsoft’s product lines, including Bing.  I’m smiling as I’m typing because I’m still thinking about Bill Schmarzo’s lead in Part 1 of 2… Read More »DSC Weekly 24 January 2023 – When AI Gets Going, the Going Gets Weird The post DSC Weekly 24 January 2023 – When AI Gets Going, the Going Gets Weird appeared first on Data Science Central.  ( 20 min )
    Revolutionizing the Supply Chain: Developments in the Warehouse Robotics Industry
    Warehouse robotics is witnessing steady growth, driven by the increasing adoption of automated solutions in storage for food and beverages, consumer goods, retail, and third-party logistics. The collaboration between the e-commerce sector and warehouse robotics is also a major driver of this market, as it allows for developing increasingly sophisticated warehouse automation systems. Additionally, the… Read More »Revolutionizing the Supply Chain: Developments in the Warehouse Robotics Industry The post Revolutionizing the Supply Chain: Developments in the Warehouse Robotics Industry appeared first on Data Science Central.  ( 20 min )
  • Open

    Supersizing AI: Sweden Turbocharges Its Innovation Engine
    Sweden is outfitting its AI supercomputer for a journey to the cutting edge of machine learning, robotics and healthcare. It couldn’t ask for a better guide than Anders Ynnerman (above). His signature blue suit, black spectacles and gentle voice act as calm camouflage for a pioneering spirit. Early on, he showed a deep interest in Read article >  ( 6 min )
    3D Artist Enters the Node Zone, Creating Alien Artifacts This Week ‘In the NVIDIA Studio’
    Artist Ducky 3D creates immersive experiences through vibrant visuals and beautiful 3D environments in the alien-inspired animation Stylized Alien Landscape — this week In the NVIDIA Studio.  ( 6 min )
  • Open

    Multi-Agent RL for Melee Combat Battlefield
    Hello, I am working on a hobby project where I have recently used multi-agent RL for learning crowd simulation and also predator-prey behaviors successfully (they learn to surround their preys): https://www.youtube.com/watch?v=Ds9O9wPyF8g I plan to use it to train multi-agent melee combat armies through self-play. I have made an initial implementation of it where they were able to learn shield-wall behavior, flanking, and retreat: https://www.youtube.com/watch?v=IZ1Ht6k2U5E If you would like to collaborate on this hobby project, contact me via LinkedIn. It would be great to have some help with physics simulation using Brax, and with the 3D rendering of the simulation. https://www.linkedin.com/in/kyuksel/ Sincerely, Kamer submitted by /u/k_yuksel [link] [comments]  ( 41 min )

  • Open

    Implementing Gradient Descent in PyTorch
    The gradient descent algorithm is one of the most popular techniques for training deep neural networks. It has many applications in fields such as computer vision, speech recognition, and natural language processing. While the idea of gradient descent has been around for decades, it’s only recently that it’s been applied to applications related to deep […] The post Implementing Gradient Descent in PyTorch appeared first on MachineLearningMastery.com.  ( 25 min )

  • Open

    Training a Linear Regression Model in PyTorch
    Linear regression is a simple yet powerful technique for predicting the values of variables based on other variables. It is often used for modeling relationships between two or more continuous variables, such as the relationship between income and age, or the relationship between weight and height. Likewise, linear regression can be used to predict continuous […] The post Training a Linear Regression Model in PyTorch appeared first on MachineLearningMastery.com.  ( 24 min )
    Making Linear Predictions in PyTorch
    Linear regression is a statistical technique for estimating the relationship between two variables. A simple example of linear regression is to predict the height of someone based on the square root of the person’s weight (that’s what BMI is based on). To do this, we need to find the slope and intercept of the line. […] The post Making Linear Predictions in PyTorch appeared first on MachineLearningMastery.com.  ( 21 min )

  • Open

    Loading and Providing Datasets in PyTorch
    Structuring the data pipeline in a way that it can be effortlessly linked to your deep learning model is an important aspect of any deep learning-based system. PyTorch packs everything to do just that. While in the previous tutorial, we used simple datasets, we’ll need to work with larger datasets in real world scenarios in […] The post Loading and Providing Datasets in PyTorch appeared first on MachineLearningMastery.com.  ( 20 min )

  • Open

    Using Dataset Classes in PyTorch
    In machine learning and deep learning problems, a lot of effort goes into preparing the data. Data is usually messy and needs to be preprocessed before it can be used for training a model. If the data is not prepared correctly, the model won’t be able to generalize well. Some of the common steps required […] The post Using Dataset Classes in PyTorch appeared first on MachineLearningMastery.com.  ( 21 min )

  • Open

    Calculating Derivatives in PyTorch
    Derivatives are one of the most fundamental concepts in calculus. They describe how changes in the variable inputs affect the function outputs. The objective of this article is to provide a high-level introduction to calculating derivatives in PyTorch for those who are new to the framework. PyTorch offers a convenient way to calculate derivatives for […] The post Calculating Derivatives in PyTorch appeared first on Machine Learning Mastery.  ( 20 min )

  • Open

    Two-Dimensional Tensors in Pytorch
    Two-dimensional tensors are analogous to two-dimensional metrics. Like a two-dimensional metric, a two-dimensional tensor also has $n$ number of rows and columns. Let’s take a gray-scale image as an example, which is a two-dimensional matrix of numeric values, commonly known as pixels. Ranging from ‘0’ to ‘255’, each number represents a pixel intensity value. Here, […] The post Two-Dimensional Tensors in Pytorch appeared first on Machine Learning Mastery.  ( 21 min )

  • Open

    One-Dimensional Tensors in Pytorch
    PyTorch is an open-source deep learning framework based on Python language. It allows you to build, train, and deploy deep learning models, offering a lot of versatility and efficiency. PyTorch is primarily focused on tensor operations while a tensor can be a number, matrix, or a multi-dimensional array. In this tutorial, we will perform some […] The post One-Dimensional Tensors in Pytorch appeared first on Machine Learning Mastery.  ( 22 min )

  • Open

    365 Data Science courses free until November 21
    Sponsored Post   The unlimited access initiative presents a risk-free way to break into data science.     The online educational platform 365 Data Science launches the #21DaysFREE campaign and provides 100% free unlimited access to all content for three weeks. From November 1 to 21, you can take courses from renowned instructors and earn […] The post 365 Data Science courses free until November 21 appeared first on Machine Learning Mastery.  ( 15 min )

  • Open

    Attend the Data Science Symposium 2022, November 8 in Cincinnati
    Sponsored Post      Attend the Data Science Symposium 2022 on November 8 The Center for Business Analytics at the University of Cincinnati will present its annual Data Science Symposium 2022 on November 8. This all day in-person event will have three featured speakers and two tech talk tracks with four concurrent presentations in each track. The […] The post Attend the Data Science Symposium 2022, November 8 in Cincinnati appeared first on Machine Learning Mastery.  ( 10 min )

  • Open

    My family's unlikely homeschooling journey
    My husband Jeremy and I never intended to homeschool, and yet we have now, unexpectedly, committed to homeschooling long-term. Prior to the pandemic, we both worked full-time in careers that we loved and found meaningful, and we sent our daughter to a full-day Montessori school. Although I struggled with significant health issues, I felt unbelievably lucky and fulfilled in both my family life and my professional life. The pandemic upended my careful balance. Every family is different, with different needs, circumstances, and constraints, and what works for one may not work for others. My intention here is primarily to share the journey of my own (very privileged) family. Our unplanned introduction to homeschooling For the first year of the pandemic, most schools in California, where …  ( 7 min )

  • Open

    The Jupyter+git problem is now solved
    Jupyter notebooks don’t work with git by default. With nbdev2, the Jupyter+git problem has been totally solved. It provides a set of hooks which provide clean git diffs, solve most git conflicts automatically, and ensure that any remaining conflicts can be resolved entirely within the standard Jupyter notebook environment. To get started, follow the directions on Git-friendly Jupyter. Contents The Jupyter+git problem The solution The nbdev2 git merge driver The nbdev2 Jupyter save hook Background The result Postscript: other Jupyter+git tools ReviewNB An alternative solution: Jupytext nbdime The Jupyter+git problem Jupyter notebooks are a powerful tool for scientists, engineers, technical writers, students, teachers, and more. They provide an ideal notebook environment for interact…  ( 7 min )
2023-02-23T00:54:12.493Z osmosfeed 1.15.1